gatk-3.8

Commit Graph

Author	SHA1	Message	Date
depristo	d01d4fdeb5	Optimized version of produce beagle tool, along with experimental (hidden) support for combining likelihoods depending on estimate false positive rate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5430 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-12 02:06:28 +00:00
rpoplin	b3464a6031	Initial commit of VQSRv2 that passes the old integration tests. Not ready to be used yet unless your name rhymes with ... oh wait, that's me. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5419 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-11 15:18:34 +00:00
hanna	7a22f19366	More descriptive error when VerifyingSamIterator hits an inconsistent alignment. Also updated case UserException.MalformedBAM to match case of UserException.MissortedBAM for consistency and ease-of-use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5364 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-03 03:55:24 +00:00
asivache	2f2aa339d9	Now makes all pairs, not only the good ones. The logic of selecting the "best" pair when the data are messy (e.g. multiple alignments available for an end) is still very naive git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5303 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-24 16:21:26 +00:00
asivache	abf3fcbb72	Little changes in recognized annotation terms; columns in annotated maf are now prioritized and multiple alternatives do not cause 'i don't know what to do' crash: e.g. if Chromosome and chr columns are both present, then Chromosome is taken (has a priority). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5302 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-24 16:19:06 +00:00
ebanks	318035c147	Fixing up the output system of the Unified Genotyper. Deprecating the -all_bases and -genotype arguments. Adding instead the --output_mode (EMIT_VARIANTS_ONLY, EMIT_ALL_CONFIDENT_SITES, EMIT_ALL_SITES) and --genotyping_mode (DISCOVERY, GENOTYPE_GIVEN_ALLELES) arguments. UG now does the correct thing when passed alleles (bound to the 'alleles' rod) to use for genotyping; added several integration tests to cover this case. This commit will break the batched calls merging script, but Chris knows this and is ready for it... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5288 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-22 06:07:18 +00:00
depristo	1a5d296737	ReplaceReadGroups. Fixes BAM files without read group info. MissingReadGroup points people to this tool now. Please point users on the forum to this tool now. Will migrate to Picard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5284 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-21 14:02:41 +00:00
depristo	aa4a4e515d	Safer interface for ReorderSam. Better error checking. Documentation. Moving into Picard now git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5283 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-20 14:35:44 +00:00
depristo	444bf83acf	A simple utility for reordering a BAM file based on a new reference sequence. This tool can be used to efficiently correct a lexicographically sorted BAM file git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5279 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-19 23:24:32 +00:00
asivache	7a11b4f35d	Another change in variant classification values git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5237 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-14 17:47:58 +00:00
asivache	7f7d7eb2d1	Inconsequential changes, more 'variant classification' values are recognized git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5236 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-14 17:36:39 +00:00
ebanks	4fe0fcd707	Updates to handle CG data, headers, etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5215 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-08 03:16:05 +00:00
fromer	bceb2a9460	Now that Mauricio has updated the PacBio BAM to properly have RG, can use sample name in the walker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5212 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-07 20:26:57 +00:00
asivache	2a04e0d378	Explicitly set logger's level to info - otherwise samtools is too chatty git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5209 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-07 17:08:50 +00:00
depristo	fe4aa58d35	Removing unused class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5197 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-04 22:22:28 +00:00
hanna	5c3198520c	A few minor modifications masquerading as significant changes according to svn's logs: - Copied BAM indexing engine from Picard back into the GATK anticipating shard merging algorithm. Tried to leave most of the building blocks in Picard. If this turns into a logistical nightmare, I'll merge the building blocks into the GATK as well. - Reorganized the org.broadinstitute.sting.gatk.datasources package, giving better separation of query and management functionality for reads, ref, rmd, and samples. - Merged Shard building blocks into org.broadinstitute.sting.gatk.datasources.reads package, indicating its current strong relationship with the reads, rather than the general unifying element I wish this would be. - Collapsed BAMFormatAwareShard into Shard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5184 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-03 17:59:19 +00:00
kiran	cab426f86f	VariantEval 3.0 is now in core. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5139 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-31 17:42:08 +00:00
kiran	e26da9b047	Changed column-key names to not have spaces, as GATKReport gets very upset about this. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5131 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-31 03:31:54 +00:00
asivache	7af0532292	An attempt to have more intelligent sorting of RODs. Tested with maf only so far. Should be able to reference-sort dbsnp, bed and vcf as well, bugs notwithstanding. Very simple, brute-force implementation using SortingCollection. Should I have used tribble indexing machinery instead? git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5118 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 22:10:07 +00:00
asivache	fa8963522b	Ignore header line if it happens to be passed to the codec again, instead of crashing on it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5116 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 21:44:33 +00:00
fromer	f2de39d661	Calculates phase concordance rates between trio and RBP-phasing tracks, stratified by trio status (Het3, non-Het3) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5114 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 20:50:01 +00:00
kiran	9cb1ae384c	Constant precision for floating point numbers. Added integration test - carries over tests from VariantEval with the necessary modifications to command-line arguments and md5s. Disabled use of 'synchronized' keyword because I clearly don't get how that keyword is supposed to work yet... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5107 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 05:19:18 +00:00
asivache	f036a178f1	Added support for MAF features. So far works for MAF Lite only, annotated MAF is NOT TESTED yet AT ALL. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5105 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 03:20:46 +00:00
kiran	3e9f185dad	Fixed issue with GenotypeConcordance being initialized incorrectly when the first seen comptrack had no samples. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5102 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 01:12:27 +00:00
kiran	58f0ecff89	Fixes to support evaluations with TableType elements - each such object now gets a separate entry in the output table. Added codon degeneracy stratification. Handle null elements in reports (useful for debugging). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5101 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-27 22:09:59 +00:00
kiran	2901299ff6	Sets the number of samples to all of the samples in the file when it's not specified on the command-line explicitly. GenotypeConcordance no longer a standard evaluation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5094 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-27 01:38:26 +00:00
asivache	43812a28fc	If among all the multiple alignments for the given read we have 'unmapped' ones (can happen with bwa 0.5.7 and maybe later versions), then discard the latter and keep only the mapped ones. Keep 'unmapped' only if it's the only alignment available. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5090 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-26 20:07:08 +00:00
asivache	63b709d992	When remapping the read, set MAPQ, CIGAR etc to 0/null for unmapped reads. This is not required according to spec but current samtools jdk otherwise dies in STRICT validation mode. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5089 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-26 19:49:07 +00:00
kiran	a97184fddf	Frick! Changed to refer to the playground version of VariantEvaluator. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5087 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-26 19:33:03 +00:00
kiran	a9d0772516	When evaluating JEXL expressions, don't blow up if the eval VC is null git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5085 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-26 18:25:03 +00:00
kiran	22e599ec76	Fixed output report to properly handle evaluation modules with TableType objects. Promoted CpG to a standard stratification. Demoted Filter to a non-standard stratification. Now, if the filter stratification is not specified, VariantEval only evaluates PASSing sites. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5084 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-26 17:38:21 +00:00
ebanks	2bbcc9275a	Committing the fragment-based calling code. Results look great in all datasets (will show this at 1000G this week with Ryan). Note that this is an intermediate commit. The code needs to be cleaned up and the fragmentation code needs to be moved up into LocusIteratorByState. This should all happen later this week, but I don't want Ryan to have to keep running from my own personal Sting directory. The current crappy implementation adds ~10% to the runtime, but that should all go away in the next iteration. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5058 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-23 05:04:17 +00:00
fromer	4bec93e3e4	Permit retrieval of read names for debugging purposes git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5011 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-18 16:09:34 +00:00
kiran	2f4a436719	Throw an exception if no eval rods are specified. If one or more samples are specified, subset the 'all' VariantContext to just the specified samples. This is useful when you want to see what effect dropping certain samples will have on the metrics and you don't want to go through SelectVariants first. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5009 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-17 06:46:10 +00:00
kiran	73acfa654a	Fixed double-counting bug. Fixed issue where evaluation module with an update2() method wasn't getting called if the comp track was null. Added a column to the output report indicating the table name for easy greppability. Fixed an issue where, if sample-level stratification was not required, the sample-level VCs would be generated anyway. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5000 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-14 14:06:43 +00:00
depristo	e3956148ac	removing unused fastqtobam git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4985 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-13 14:29:32 +00:00
fromer	c2dd956888	Moved PrintReferenceVariantsWalker to playground git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4971 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-10 22:07:41 +00:00
kiran	fdc514ded3	Intermediate commit for VariantEval 3.0. Among the changes: * Stratifications (by comp rod, by eval rod, novelty, filter status, etc.) have been generalized. They are very symmetric with evaluators now. Each stratification can have multiple states (e.g. known, novel, all). New stratifications can be added and optionally applied. Some new stratifications include: - by sample - by functional class - by CpG status * Output is to a single file in GATKReport format, rather than having the options of CSV, R, table, etc. * Rather than needing to state up front that the allowable variant type is a SNP or an indel, each eval record is inspected and the appropriate record type is fetched from the comp track. (This will require a bit more testing...) * Evaluation context (basically a single row in a VariantEval report) generation and retrieval has been overhauled. Now, every possible configuration of stratification state is generated recursively and stored in a HashMap. The key of the HashMap is a key that represents that exact state configuration. When examining a comp track and eval track, this key is computed based on the data, providing easy lookup for the appropriate evaluation context. When there are only a handful of stratification configurations, this isn't a big deal. But when operating on a file with hundreds of samples, multiplied by 3 states for novelty, 3 states for filtration, 3 states for CpG status, etc., it becomes a very big deal. There are still some known issues: * When the per-sample stratification is turned off, things are getting overcounted (too many variants are showing up when compared to the VariantEval 2.0 code). It's probably because I break out the VariantContext by sample even when not necessary, and those irrelevant contexts are still being counted. Or my recursion is overaggressively creating evaluation contexts, and they all get added up in a weird way. But that's why I'm committing now - so I can track down this issue without losing my work so far. * The Jexl expressions are sometimes throwing an exception that I don't yet understand (they complain of an incorrect specification on the command-line... after the program has made it through a few thousand records). * The request to have evaluations be smart enough to reject certain stratification states is not implemented yet. There's still some work to do before I can replace VariantEval 2.0 with VariantEval 3.0, but feel free to take a look. I'd love comments on the new code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4946 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-06 15:20:24 +00:00
fromer	4b37710bcd	Added validator for phasing using read information, e.g., PacBio: ReadBasedPhasingValidationWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4940 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 20:05:56 +00:00
hanna	8d2c14b29c	Update Picard / sam-jdk at Tim's request. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-03 02:17:25 +00:00
hanna	cba18116e4	A significant refactoring of the ROD system, done largely to simplify the process of streaming/piping VCFs into the GATK. Notable changes: - Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build RMDTracks and lookup codecs. - RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class are the walkers that feel the need to access the ROD system directly. (We need to stamp out this access pattern.) A few minor warts were introduced as part of this process, labeled with TODOs. These'll be fixed as part of the VCF streaming project. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-31 04:52:22 +00:00
ebanks	dabdeb729e	Eric broke the build. Eric broke the build. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4847 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-15 17:01:38 +00:00
ebanks	5c0b66cb7c	3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-15 16:24:28 +00:00
ebanks	d89e17ec8c	Fare thee well, UGv1. Here come the days UGv2. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4747 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-29 21:51:19 +00:00
ebanks	222cd42ceb	Have the UG engine take care of the GL to PL conversion. Note that we still use GLs for calling (since we are losing precision in high-pass and, even worse, it can affect QD), but we emit PLs in all cases. This means that calculating the GLs, emitting them to VCF, and then calling off of them (a la samtools) is absolutely, positively not ideal. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4745 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-29 20:28:16 +00:00
ebanks	102c8b1f59	Large refactoring of the UGv2 engine so that it is now truly separated into 2 distinct phases: GL calculation and AF calculation, where each can be done independently. This is not yet enabled in UGv2 itself though because I need to work out one last issue or two. Tested on 1Mb of 1000G Aug allPops low-pass and results are identical as before. Also, making BQ capping by MQ mandatory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4744 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-28 21:36:33 +00:00
ebanks	35b90d2295	Don't compute SB for ref calls git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4735 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-26 03:54:26 +00:00
ebanks	ea6e2218c1	1. dbsnp has some massive indels which my left-aligner was barfing on because there isn't enough reference context; fixed. 2. Lower default calling threshold to Q30 for UGv2. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4722 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-23 19:28:33 +00:00
bthomas	374c0deba2	Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-19 19:59:05 +00:00
kshakir	c723db1f4b	Added a -summary jexl argument to VariantEval similar to -validate. Updated the package of ValidationGenotyper to match the file location. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4710 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-19 04:42:46 +00:00
rpoplin	b677080858	Initial checkin of the ValidationGenotyper. Not intended to be used by anybody yet. Only here for archival purposes at this point. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4685 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-15 22:33:49 +00:00
depristo	ef2f6d90d2	VQSR now operates on LOD scores in the INFO field directly, and doesn't adjust the QUAL field. New format for tranches file uses LOD score. Old file format no longer supported. log10sumlog10() function, a very useful utility in MathUtils. No more ExtendedPileupElement! Robust math calculations in GMM so that no infinities are generated! HaplotypeScore refactored to enable use of filtered context. Not yet enabled... InferredContext getDouble and getInteger arguments now parse values from Strings if necessary git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4684 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-15 22:19:22 +00:00
ebanks	28142408ff	Refactoring so that all counting in UGv2 is done on the filtered context. In particular, tests for empty pileups and too many spanning deletions now use the correct counts. Also, -all_bases mode now trumps all; this one is for you, chartl. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4671 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-15 05:01:12 +00:00
delangel	cb1e8ad43a	Temp bug fix for indel genotyper: if there are two or more variant contexts at a site, just choose the first one containing an indel and genotype that. There might be cases where IGv2 emits 2 indel variant contexts at the same ref location which made us fail there. A better solution will be to form underlying haplotypes supported by reads and compute likelihoods of that. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4667 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-14 00:21:54 +00:00
ebanks	69de3e51bf	Better precision for the calculated AF value. Now looks at the total number of samples to determine how much precision is necessary. Also, changing default min BQ used for calling in UGv2 to Q17. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4655 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 08:31:40 +00:00
ebanks	2f6666a988	Correcting traversal statistics git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4652 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-11 22:46:58 +00:00
delangel	2f3be24a00	Improvement in exact allele frequency calculation model (still under test, but this is definitely better than what I had before). Instead of approximating log(10^x+10^y) as max(x,y), approximate full Jacobian formula max(x,y)+log(1+10^-abs(x-y)) with static lookup table for the second term. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4647 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-11 01:22:35 +00:00
hanna	8e36a07bea	Convert GenomeLocParser into an instance variable. This change is required for anything that needs to be simultaneously aware of multiple references, eg Queue's interval sharding code, liftover support, distributed GATK etc. GenomeLocParser instances must now be used to create/parse GenomeLocs. GenomeLocParser instances are available in walkers by calling either -getToolkit().getGenomeLocParser() or -refContext.getGenomeLocParser() This is an intermediate change; GenomeLocParser will eventually be merged with the reference, but we're not clear exactly how to do that yet. This will become clearer when contig aliasing is implemented. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-10 17:59:50 +00:00
ebanks	e05af54f3e	Found the cause of 80% of our non-called FNs: an excess of filtered bases were causing us to choose the wrong alternate allele. More details to dev team. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4634 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-07 03:39:57 +00:00
ebanks	4e109f58bf	In preparation for Ryan's jumping into SLOD: getting rid of bad hack to ensure P(AF=i) is calculated in the strand-specific cases. With Mark's recent changes this is no longer necessary and just makes the code slower. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4620 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-03 03:44:59 +00:00
depristo	23cb399a88	Reasonable first pass at a correct SB calculation. Simple utilities to support it. VariantsToTable no longer prints filtered sites by default. New non-standard variant eval module to print comp sites not present in eval (FN finder) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4601 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-31 12:41:52 +00:00
delangel	30fae5cf18	Major redo of exact AF computation for UnifiedGenotyperV2. Fact of life is, there's no way we can compute an exact QUAL field and keep performing the AF computation in linear probability space. In good sites with lots of samples, the ratio of Pr(AC=K*|D) to Pr(AC=0|D) can be 10^1500 or some ridiculous large number like that, which no double can represent. So, we abandon probability space and work now in log likelihood space, which has several major repercussions: a) Sites are numerically well behaved now, but another hard fact of life is that the AF iteration is defined in linear Pr space, not in log likelihood space, and the math doesn't work out in log space. So, we need to convert back and forth from lin to log space. b) As a consequence of a), the code got a major slowdown, and calling the 629 samples was about 15 times slower than before (sic). c) To solve b), log10 of integers are now cached at init, and numerical approximations are now made. Most importantly, I'm using the approximation that log(exp(a) + exp(b)) ~= max(a,b) which seems almost inconsequential in practical performance but reduces computation time to what it was before. More detailed analyses are forthcoming. This approximation can be refined further on to avoid expensive log-exp conversions if further profiling and analysis deems it necessary. Also, two other issues were solved: a) Strand bias computation was actually wrong in the case where the optimal AC was bigger than max(forward reads,reverse reads). Now the code is exactly as buggy as the grid search model (all bugs are equal, but some are more equal than others) b) Genotype likelihoods are now computed in a better way and if a likelihood < 0 we don't just cap to 0 but do something a bit smarter. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4600 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-31 01:26:04 +00:00
depristo	860de05a7c	Bug fix for PL vs. GL in header. PL now truly default output for UGv2 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4592 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-28 12:39:18 +00:00
depristo	9782dde3dd	Bug fix for PL vs. GL in header. PL now truly default output for UGv2 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4591 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-28 12:38:48 +00:00
depristo	cbce3e3c83	General support for both GL (log10) and PL (phred-scaled) genotype likelihoods. All walkers now use the Tribble GenotypeLikelihoods object for parsing VCFs with genotype likelihood fields. Please use GenotypeLikelihoods object from now on for seamless support for GL and PL tags. UGv2 now uses PL by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4589 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-28 01:48:47 +00:00
delangel	e24f7fec47	Fixed indel genotyper which broke yet again because we can't just call context.getBasePileup() without checking again for its existence in the first place. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4553 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-22 15:17:11 +00:00
ebanks	181f901126	Fix for Ryan: don't pull reference sequence for the portions of reads that extend beyond the contig boundaries git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4551 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-22 14:38:26 +00:00
ebanks	225cf49128	Implementing reference confidence estimate in UGv2 as per UGv1 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4542 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-21 16:57:59 +00:00
delangel	cf9c9ae241	Three important updates for Dindel genotyper: a) Fix it up because it broke with a recent checkin to annotate vcf with unfiltered depth. b) Printout of ref/alt alleles in output vcf was incorrect because the start/stop positions of associated GenomeLoc were incorrectly computed in case of a deletion. c) Redid Beagle input/output walkers so as not to assume that ref was a single base, not to assume that variant was a vcf, and generalized them to be indel-capable, so now the Beagle walkers can be used for indels as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4541 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-21 16:00:16 +00:00
ebanks	91049269c2	Optimizations across the board, with help from Guillermo, Matt, and JProfiler. Too tired to give details now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4535 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 20:47:41 +00:00
ebanks	524cb8257c	Renaming for accuracy git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4519 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 18:11:07 +00:00
ebanks	0fe504b748	Use filtered depth for Exact model (just like grid search) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4518 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 18:08:31 +00:00
ebanks	d54d9880d7	Now that G's new genotyping algorithm is live, I've cleaned up the code to completely separate the grid search from the exact model. AlleleFrequencyCalculationModel is now completely abstract. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4517 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 18:04:06 +00:00
ebanks	80e5ac65b4	CAP_BASE_QUALITY needs to be included in the clone() method for it to be usable in UG git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4516 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 03:11:03 +00:00
ebanks	f78ff08e2b	This is less correct than my previous change but it's what UGv1 does and now is not the right time to start mucking with things. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4506 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 18:56:45 +00:00
ebanks	471c18054f	Fix for SB calculation: the best overall AF might not have any mass when just looking at reads from a single strand. We need to compute the best AF for each stratification. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4505 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 17:51:18 +00:00
ebanks	d41c252b13	Looking over the calling results with Ryan, it's clear that while the grid search optimization (ignoring samples that are clearly ref) can work for assigning genotypes, it cannot be used for calculating P(AF>0). There's too much area under the likelihood curve that gets lost and the QUALs are negatively affected. However, testing showed that this only slightly affects runtime (~15 minutes per 1Mbase for the 1kg allpops). The optimization does remain for genotyping. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4498 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 19:06:32 +00:00
ebanks	cfb33d8e12	Filtering optimizations are now live for UGv2. Instead of re-computing filtered bases at every locus, they are computed just once per read and stored in the read itself. Eyeballing the results on the ~600 sample set from 1kg, we cut out ~40% of the runtime! QUALs are now sometimes different from UGv1 because I noticed a bug in v1 where samples with spanning deletions only were assigned ref calls instead of no-calls which ever so slightly affects the QUAL. Not a big deal though. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4494 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 05:04:28 +00:00
ebanks	fd8351cd49	Get rid of useless test/'optimization' that was carried over from UGv1. New code is (minimally) faster with same results. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4478 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-11 04:04:07 +00:00
ebanks	f28523e7de	Implemented SB for UGv2. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4477 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-11 03:56:01 +00:00
asivache	39e373af6e	deleting accidentally committed junk git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4464 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-08 15:13:01 +00:00
delangel	3838823262	Two ugly hopefully temporary fixes for new genotyping model: a) In Indel genotyper: we can't deal yet with extended events correctly and we are still triggering at each extended event which results in repeated records on a vcf. So, to avoid this, keep track of start position of candidate variants we've visited and if we've visited a variant before we don't do it again. b) Avoid infinite terms in QUAL and in genotype likelihoods which can happen if posterior AF happens to be exactly zero. For now, hard-code a minimum value of each term of the posterior AF likelihood to be -300 (ie 1e-300 in lin space). This can be solved with better and smarter log-to-lin conversions and some precision fixes in AF calculation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4455 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-08 00:53:16 +00:00
ebanks	3c5dc675ab	For Guillermo: only decide that something is a clear reference call if it is at least 10 times as likely as the next best genotype git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4441 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-06 15:16:41 +00:00
ebanks	3d564f4a29	reverting an accidental change from the dindel merge git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4434 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-06 03:08:09 +00:00
ebanks	b5e148140b	Officially fixed the UG priors; updated the default min MQ/BQs to pipeline values of q20 and min calling threshold to Q50 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4431 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-05 18:35:36 +00:00
delangel	d4398f2686	silly bug fix: if I'm to do a short term hack to avoid -infinity likelihoods I might as well do it right. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4403 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-01 18:39:45 +00:00
delangel	e920badcc4	Temporary fix for case where genotype likelihoods are exactly (1,0,0) or (0,1,0) etc. at a site with new indel genotyper: this would make us blow up when converting to log space and try to assign genotypes at a site. A more robust solution is in the works. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4401 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-01 17:43:43 +00:00
delangel	fa9c21c020	More fixes for exact AF calculation model in new unified genotyper: a) Fixed bugs in new dynamic programming-based genotyper b) Fixed up temp hack that handles extended pileups for now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4398 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-01 02:32:50 +00:00
delangel	eb67aee732	bug fix: forgot to uncomment code to compute genotype likelihoods git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4397 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-30 21:38:22 +00:00
delangel	ece694d0af	Next iteration on new UG framework: - Brought over exact AF estimation from branch (which is now dead). Exact model is default in UnifiedGenotyperV2. - Implemented completely new genotyping algorithm given best AF estimate using dynamic programming, which in theory should be better than both greedy search and any HWE-based genotyper. - Integrated and added new Dindel likelihood estimation model. - Corrected annotators that would call readBasePileup: since we can be annotating extended events, best way is to interrogate context for kind of pileup and either readBasePileup or readExtendedEventPileup. All changes above except last one are still in playground since they require more testing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4396 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-30 21:33:59 +00:00
ebanks	2d1265771f	Fix for G: make sure to generate the genotype conformations in the grid for the target frequency when not using grid search for anything except the conformations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4382 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-29 16:44:53 +00:00
delangel	4556e3b273	First iteration in filling up exact AF calculation with new refactored UG. Code computes EM iterations of exact AF spectrum and returns to caller. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4381 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-29 16:21:54 +00:00
ebanks	0d71dff928	Small bug fix to the new UG (need to initialize the entire posteriors array) means that we also get identical results as old UG when calling with 60 samples in the pilot1 data. Now that I'm happier with UGv2, I've transitioned it to use the correct AF priors instead of the busted ones still in the old UG. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4379 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-29 14:24:50 +00:00
ebanks	0ec07ad99a	Initial version of refactored Unified Genotyper. Using SNP genotype likelihoods and GRID_SEARCH AF estimation models, achieves the exact same results as original UG on 1-2 samples with the exception of strand bias (not implemented yet); other than that I have no idea. Needs tons more testing. Do not use. For Guillermo only. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4377 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-29 08:42:25 +00:00
fromer	720aaca8a0	Trying to restore SVN history for phasing git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4372 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-28 23:50:28 +00:00
fromer	dfb5143a41	Restore folder git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4370 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-28 23:46:07 +00:00
fromer	7c909bef82	Moved phasing classes out of playground! The code is still under production, though... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4369 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-28 23:21:28 +00:00
fromer	8d8980e8eb	Fixed phasing algorithm to: 1. More correctly weed out irrelevant reads and sites; 2. Crudely flag sites with large phase discrepancies between reads git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4368 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-28 23:02:53 +00:00
kshakir	edaa278edd	Removed cases where various toolkit functions were accessing GenomeAnalysisEngine.instance. This will allow other programs like Queue to reuse the functionality. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-25 02:49:30 +00:00
hanna	497bcbcbb7	Recent changes to the build system make the build system complain loudly about pieces of core that depend on playground. Most of these have been eliminated by (temporarily) promoting Aaron's report system to core in this checkin. I'll follow up with other changes separately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-24 22:09:12 +00:00


1363 Commits (d534241f35505d757f39d482d45d3316d6dd944e)