Mark DePristo
71b4bb12b7
Bug fix for incorrect logic in subsetSamples
...
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Eric Banks
35fc2e13c3
Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all.
2011-12-14 15:31:09 -05:00
Eric Banks
1e90d602a4
Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.
2011-12-14 13:38:20 -05:00
Eric Banks
988d60091f
Forgot to add in the new result class
2011-12-14 13:37:15 -05:00
Eric Banks
106bf13056
Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays.
2011-12-14 12:05:50 -05:00
Eric Banks
7648521718
Add check for mixed genotype so that we don't exception out for a valid record
2011-12-14 11:26:43 -05:00
Eric Banks
9497e9492c
Bug fix for complex records: do not ever reverse clip out a complete allele.
2011-12-14 11:21:28 -05:00
Eric Banks
09a5a9eac0
Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number.
2011-12-14 10:43:52 -05:00
Eric Banks
d3f4a5a901
Fail gracefully when encountering malformed VCFs without enough data columns
2011-12-14 10:37:38 -05:00
Eric Banks
079932ba2a
The log10cache needs to be larger if we want to handle 10K samples in the UG.
2011-12-13 23:36:10 -05:00
Ryan Poplin
7fa1ab1bae
Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test
2011-12-13 17:19:40 -05:00
Eric Banks
e47a113c9f
Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right?
2011-12-12 23:02:45 -05:00
Mauricio Carneiro
5cc1e72fdb
Parallelized SelectVariants
...
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro
a70a0f25fb
Better debug output for SAMDataSource
...
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo
d03425df2f
TODO optimization targets
2011-12-12 17:39:51 -05:00
Laurent Francioli
7cf27bb66e
Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval.
2011-12-12 12:22:43 +01:00
Laurent Francioli
025bdfe2cc
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-12 12:19:44 +01:00
Eric Banks
7b6338c742
Merge branch 'master' into trialleles
2011-12-11 00:28:46 -05:00
Eric Banks
7c4b9338ad
The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now.
2011-12-11 00:23:33 -05:00
Eric Banks
044f211a30
Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly.
2011-12-10 23:57:14 -05:00
Eric Banks
364f1a030b
Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone.
2011-12-09 14:25:28 -05:00
Mauricio Carneiro
8475328b2c
Turning off test that breaks read clipper
...
until we define what is the desired behavior for clipping this particular case.
2011-12-09 11:53:12 -05:00
Roger Zurawicki
4cbd1f0dec
Reorganized the testing code and created ClipReadsTestUtils
...
Tests are more rigorous and includes many more test cases.
We can tests custom cigars and the generated cigars.
*Still needs debugging because code is not working.
Created test classes to be used across several tests.
Some cases are still commented out.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:52:34 -05:00
Roger Zurawicki
0e9c2cefa2
testHardClipSoftClippedBases works with Matches and Deletions
...
Insertions are a problem so cigar cases with "I" are commented out.
The test works with multiple deletions and matches.
This is still not a complete test. A lot of cigar test cases are commented out.
Added insertions to ReadClipperUnitTest
ReadClipper now tests for all indels.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:43:37 -05:00
Eric Banks
64dad13e2d
Don't carry around an extra copy of the code for the Haplotype Caller
2011-12-09 11:09:40 -05:00
Eric Banks
442ceb6ad9
The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors.
2011-12-09 10:16:44 -05:00
Laurent Francioli
a79144f7db
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-09 15:57:24 +01:00
Laurent Francioli
72fbfba97d
Added UnitTests for getFamilies() and getChildrenWithParents()
2011-12-09 15:57:07 +01:00
Laurent Francioli
5a06170804
Corrected bug causing getChildrenWithParents() to not take the last family member into consideration.
2011-12-09 14:51:34 +01:00
Eric Banks
aa4a8c5303
No dynamic programming solution for assignning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes).
2011-12-09 02:25:06 -05:00
Eric Banks
2fe50c64da
Updating md5s
2011-12-09 00:47:01 -05:00
Eric Banks
8777288a9f
Don't throw a UserException if too many alt alleles are trying to be genotyped. Instead, I've added an argument that allows the user to set the max number of alt alleles to genotype and the UG warns and skips any sites with more than that number.
2011-12-09 00:00:20 -05:00
Eric Banks
3e7714629f
Scrapped the whole idea of an int/long as an index into the ACset: with lots of alternate alleles we run into overflow issues. Instead, simply use the ACcounts array as the hash key since it is unique for each AC conformation. To do this, it needed to be wrapped inside an object so hashcode() would work.
2011-12-08 23:50:54 -05:00
Eric Banks
4aebe99445
Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation.
2011-12-08 15:31:02 -05:00
Eric Banks
7750bafb12
Fixed bug where last dependent set index wasn't properly being transferred for sites with many alleles. Adding debugging output.
2011-12-08 13:50:50 -05:00
Guillermo del Angel
252e0f3d0a
Merged bug fix from Stable into Unstable
2011-12-08 13:11:39 -05:00
Guillermo del Angel
1bfe28067f
Don't try to genotype an indel even bigger than the reference window size, or else we'll be out of bounds. Necessary to handle Phase 1 integrated callset with large deletions. Better error indication when validating a GenomeLoc.
2011-12-08 12:54:08 -05:00
Mark DePristo
9def841275
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 13:36:16 -05:00
Mark DePristo
4055877708
Prints 0.0 TiTv not NaN when there are no variants
...
-- Updated md5
2011-12-07 12:07:54 -05:00
Matt Hanna
15533e08df
Fixed issue with RODWalker parallelization.
...
Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making
each ROD shard larger than the size of chr20. Why didn't we see this in Stable? Because the
ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification.
When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part
of the async I/O project, I inadvertently reenabled this specifier.
2011-12-07 11:55:42 -05:00
Mark DePristo
5d2212bc8e
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 09:03:17 -05:00
Mark DePristo
6bf18899df
Fix for variant summary -- now treats all 50 bp deletions or insertions as CNVs
2011-12-07 09:02:49 -05:00
Matt Hanna
c9b2cd8ba5
Fix for chartl's stale null representation issue.
2011-12-06 18:05:17 -05:00
Eric Banks
79d18dc078
Fixing indexing bug on the ACsets. Added unit tests for the Exact model code.
2011-12-06 16:17:18 -05:00
Matt Hanna
f5b977fc88
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-06 10:11:35 -05:00
Matt Hanna
4001c22a11
Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup.
2011-12-06 10:10:38 -05:00
Khalid Shakir
677bea0abd
Right aligning GATKReport numeric columns and updated MD5s in tests.
...
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
2011-12-05 23:22:15 -05:00
Eric Banks
7a0f6feda4
Make sure that too many alternate alleles aren't being passed to the genotyper (10 for now) and exit with a UserError if there are.
2011-12-05 16:18:52 -05:00
Eric Banks
7fac4afab3
Fixed priors (now initialized upon engine startup in a multi-dimensional array) and cell coefficients (properly handles the generalized closed form representation for multiple alleles).
2011-12-05 15:57:25 -05:00
Eric Banks
a7cb941417
The posteriors vector is now 2 dimensional so that it supports multiple alleles (although the UG is still hard-coded to use only array[0] for now); the exact model now collapses probabilities for all conformations over a given AC into the posteriors array (in the appropriate dimension). Fixed a bug where the priors and posteriors were being passed in swapped.
2011-12-04 13:02:53 -05:00
Eric Banks
eab2b76c9b
Added loads of comments for future reference
2011-12-03 23:54:42 -05:00
Eric Banks
29662be3d7
Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them).
2011-12-03 23:12:04 -05:00
Eric Banks
71f793b71b
First partially working version of the multi-allelic version of the Exact AF calculation
2011-12-02 14:13:14 -05:00
David Roazen
d014c7faf9
Queue now properly escapes all shell arguments in generated shell scripts
...
This has implications for both Qscript authors and CommandLineFunction authors.
Qscript authors:
You no longer need to (and in fact must not) manually escape String values to
avoid interpretation by the shell when setting up Walker parameters. Queue will
safely escape all of your Strings for you so that they'll be interpreted literally. Eg.,
Old way:
filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")
New way:
filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")
CommandLineFunction authors:
If you're writing a one-off CommandLineFunction in a Qscript and don't really
care about quoting issues, just keep doing things the direct, simple way:
def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)
If you're writing a CommandLineFunction that will become part of Queue and
will be used by other QScripts, however, it's advisable to do things the
newer, safer way, ie.:
When you construct your commandLine, you should do so ONLY using the API methods
required(), optional(), conditional(), and repeat(). These will manage quoting
and whitespace separation for you, so you shouldn't insert quotes/extraneous
whitespace in your Strings. By default you get both (quoting and whitespace
separation), but you can disable either of these via parameters. Eg.,
override def commandLine = super.commandLine +
required("eff") +
conditional(verbose, "-v") +
optional("-c", config) +
required("-i", "vcf") +
required("-o", "vcf") +
required(genomeVersion) +
required(inVcf) +
required(">", escape=false) + // This will be shell-interpreted
required(outVcf)
I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new
system, so you'll get free shell escaping when you use those in Qscripts just
like with walkers.
2011-12-01 18:13:44 -05:00
Mark DePristo
3060a4a15e
Support for list of known CNVs in VariantEval
...
-- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument
-- Genericizes loading for intervals into interval tree by chromosome
-- GenomeLoc methods for reciprocal overlap detection, with unit tests
2011-11-30 17:05:16 -05:00
Matt Hanna
b65db6a854
First draft of a test script for I/O performance with the new asynchronous I/O processing.
...
Also includes convenience parameters for specifying the IO/CPU threading balance outside of a tag. Will be killed when
Queue gets better support for tagged arguments (hopefully soon).
2011-11-30 13:13:16 -05:00
Laurent Francioli
1d5d200790
Cleaned up unused import statements
2011-11-30 15:30:30 +01:00
Mark DePristo
28b286ad39
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-30 09:11:53 -05:00
Laurent Francioli
20bffe0430
Adapted for the new version of MendelianViolation
2011-11-30 14:46:38 +01:00
Laurent Francioli
1cb5e9e149
Removed outdated (and unused) -familyStr commandline argument
2011-11-30 14:45:04 +01:00
Laurent Francioli
9574be0394
Updated MendelianViolationEvaluator integration test
2011-11-30 14:44:15 +01:00
Laurent Francioli
f49dc5c067
Added functionality to get all children that have both parents (useful when trios are needed)
2011-11-30 14:43:37 +01:00
Laurent Francioli
a4606f9cfe
Merge branch 'MendelianViolation'
...
Conflicts:
public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java
2011-11-30 11:13:15 +01:00
Laurent Francioli
b279ae4ead
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-30 10:10:21 +01:00
Laurent Francioli
7d58db626e
Added MendelianViolationEvaluator integration test
2011-11-30 10:09:20 +01:00
Ryan Poplin
91413cf0d9
Merged bug fix from Stable into Unstable
2011-11-29 14:01:23 -05:00
Ryan Poplin
cb284eebde
Further updating VQSR tutorial wiki docs to reflect the bundle
2011-11-29 14:00:57 -05:00
Ryan Poplin
dcb889665d
Merged bug fix from Stable into Unstable
2011-11-29 09:58:49 -05:00
Ryan Poplin
447e9bff9e
Updating VQSR tutorial wiki docs to reflect the bundle
2011-11-29 09:57:45 -05:00
Ryan Poplin
110298322c
Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it.
2011-11-29 09:29:18 -05:00
Laurent Francioli
ab67011791
Corrected bug introduced in the last update and causing no families to be returned by getFamilies in case the samples were not specified
2011-11-29 11:18:15 +01:00
Eric Banks
d7d8b8e380
Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now.
2011-11-28 14:18:28 -05:00
Laurent Francioli
a09c01fcec
Removed walker argument FamilyStructure as this is now supported by the engine (ped file)
2011-11-28 17:18:11 +01:00
Laurent Francioli
795c99d693
Adapted MendelianViolation to the new ped family representation. Adapted all classes using MendelianViolation too.
...
MendelianViolationEvaluator was added a number of useful metrics on allele transmission and MVs
2011-11-28 17:13:14 +01:00
Laurent Francioli
e877db8f42
Changed visibility of getSampleDB from protected to public as the sampleDB needs to be accessible from Annotators and Evaluators too.
2011-11-28 17:11:30 +01:00
Laurent Francioli
5c2595701c
Added a function to get families only for a given list of samples.
2011-11-28 17:10:33 +01:00
Mark DePristo
3c36428a20
Bug fix for TiTv calculation -- shouldn't be rounding
2011-11-28 10:20:34 -05:00
Eric Banks
436b4dc855
Updated docs
2011-11-28 08:59:48 -05:00
Laurent Francioli
b1dd632d5d
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
...
Conflicts:
public/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
2011-11-25 16:16:44 +01:00
Mark DePristo
e60272975a
Fix for changed MD5 in streaming VCF test
2011-11-23 19:01:33 -05:00
Mark DePristo
12f09d88f9
Removing references to SimpleMetricsByAC
2011-11-23 16:08:18 -05:00
Mark DePristo
e319079c32
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-23 13:02:11 -05:00
Mark DePristo
4107636144
VariantEval updates
...
-- Performance optimizations
-- Tables now are cleanly formatted (floats are %.2f printed)
-- VariantSummary is a standard report now
-- Removed CompEvalGenotypes (it didn't do anything)
-- Deleted unused classes in GenotypeConcordance
-- Updates integration tests as appropriate
2011-11-23 13:02:07 -05:00
David Roazen
e5b85f0a78
A toString() method for IntervalBindings
...
Necessary since we're currently writing things like this to our VCF headers:
intervals=[org.broadinstitute.sting.commandline.IntervalBinding@4ce66f56]
2011-11-23 11:56:12 -05:00
Mark DePristo
5a4856b82e
GATKReports now support a format field per column
...
-- You can tell the table to format your object with "%.2f" for example.
2011-11-23 11:31:04 -05:00
Mark DePristo
c8bf7d2099
Check for null comment
2011-11-23 10:47:21 -05:00
Mark DePristo
6c2555885c
Caching getSimpleName() in VariantEval is a big performance improvement
...
-- Removed the SimpleMetricsByAC table, as one should just use the AlleleCount Stratefication and the upcoming VariantSummary table
2011-11-23 08:34:05 -05:00
Guillermo del Angel
32adbd614f
Solve merge conflict
2011-11-22 22:48:46 -05:00
Guillermo del Angel
941f3784dc
Solve merge conflict
2011-11-22 22:48:03 -05:00
Guillermo del Angel
75d93e6335
Another corner condition fix: skip likelihood computation in case we cut so many bases there's no haplotype or read left
2011-11-22 22:46:12 -05:00
Mark DePristo
a3aef8fa53
Final performance optimization for GenotypesContext
2011-11-22 17:19:30 -05:00
Mark DePristo
990c02e4de
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-22 17:19:11 -05:00
Guillermo del Angel
38a90da92c
Fixed merge conflict to Unstable
2011-11-22 14:39:45 -05:00
Guillermo del Angel
32a77a8a56
Prevent out of bound error in case read span > reference context + indel length. Can happen in RNAseq reads with long N CIGAR operators in the middle.
2011-11-22 13:57:24 -05:00
Eric Banks
5821c11fad
For BAM and Reviewed errors we now check the error message to see if it's actually a 'too many open files' problem and, if so, we generate a User Error instead.
2011-11-22 10:50:22 -05:00
Mark DePristo
7087310373
Embarassing bug fixed
2011-11-22 10:16:36 -05:00
Mark DePristo
e484625594
GenotypesContext now updates cached data for add, set, replace operations when possible
...
-- Involved separately managing the sample -> offset and sample sorted list operations. This should improve performance throughout the system
2011-11-22 08:40:48 -05:00
Mark DePristo
29ca24694a
UG now encoding NO_CALLs as ./. not ./.:.:4:0,0,0
...
A few updated UGs integration tests
2011-11-22 08:22:32 -05:00
Mark DePristo
2b51c01df4
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-21 19:16:06 -05:00
Mark DePristo
5443d3634a
Again, fixing the add call when we really mean replace
...
-- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./. Eric will fix long-standing bug in QD observed from this change
-- VFW MD5s restored to their old correct values. There was a bug in my implementation to caused the genotypes to not be parsed from the lazy output even through the header was incorrect.
2011-11-21 19:15:56 -05:00