gatk-3.8

Commit Graph

Author	SHA1	Message	Date
delangel	0aef5c0074	Totally experimental, possibly useless annotation that logs # of MQ0 reads / total depth, TBD if VQSR can use it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5905 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-30 14:05:39 +00:00
kiran	b4d379584c	Commented out the generation of the GATKReport that I was using for debugging. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5903 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:15:09 +00:00
kiran	2a9c75c5ba	Throw an exception if the programmer tries to access a column that doesn't exist. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5902 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:08:48 +00:00
kiran	f3b38c0d3e	Fixed a bug in my math where I assumed the genotype likelihoods were normalized to 1.0 when they in fact are not. Now genotypes get altered when a different genotype configuration leads to a more consistent answer with regards to inheritance constraints. There's the question of what to do when two configurations are almost equally likely - I should probably filter those events out. But currently there is no threshold on the transmission probability. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5901 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:08:05 +00:00
carneiro	5974675b43	Two intermediate commits, to work over the weekend. ReplicationValidationWalker: Just the skeleton of what will be the implementation of the replication/validation model. dataProcessingV2: Committing an UNTESTED implementation of BWA alignment. I am running tests on it over the weekend. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5900 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:03:08 +00:00
carneiro	69d9b5989f	documenting this walker as it may be useful to others in the future. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5899 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 21:58:51 +00:00
droazen	a50c40ed05	Temporary commit to aid in investigation of recent intermittent IndelRealignerIntegrationTest failures -- yes, it's the classic printf() debugging technique. Will revert in a day or two once I get the data I need :) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5896 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 20:01:57 +00:00
rpoplin	2227f49220	misc cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5893 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 16:49:20 +00:00
rpoplin	9e834391fe	We now skip over all covering RODs in the BQSR as intended instead of just those which can be converted into a VariantContext. All the integration tests change because of subtleties in how certain dbsnp rod records are being converted into VCs. Added integration test which uses a bed file as the list of known polymorphic sites. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5892 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 16:32:17 +00:00
depristo	8ed82e5a08	The previous version of the UG was always creating BAQ'd pileups for the underlying site QUAL calculation. This resulted in some slowdown in the code. But as far as I can tell, the code actually didn't apply the BAQ'd base quality anywhere when the BAQ field wasn't in the read, so this just saves us 20% of the runtime when BAQ isn't enabled from heading into the BAQ subsystem when we don't actually want to get the BAQ'd base qualities. Fixed minor problem with WalkerTest for "" (for parameterization) md5s. Added an explicit integrationtest for BAQ NONE. Now only creates the BAQ'd pileup if the useBAQPileup parameter is provided in initializeAlternateAllele. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5891 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 14:00:52 +00:00
depristo	136c8c7900	ClipReads now supports HARDCLIP_BASES, though in fact this turned out to be not necessary for my desired tests. In the process of developing the HARDCLIP mode, I added some proper ReadUtils unit tests, which would ideally be expanded to include other ReadUtil functions, as added git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5890 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 11:42:22 +00:00
hanna	a77ca2d36a	Incorporating Guillermo's patch to eliminate compile-time dependency of (core) UG indel model on oneoffs. Thanks Guillermo! We'll polish the patch when you free up a bit. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5888 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 02:22:19 +00:00
delangel	6ecbfa9013	OK, this time REALLY fix cut and paste error git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5880 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-26 19:47:12 +00:00
delangel	efe6602827	Fix copy-paste error from previous commit git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5878 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-26 16:02:08 +00:00
delangel	7a43673599	Bug fix: also enclose fetching FS or HRun in a try/catch block or else code will blow up if an annotation is absent (e.g. when there is no evidence for a variant in a vc) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5877 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-26 15:00:36 +00:00
delangel	f7298f4a7f	First of many baby steps to redo way in which we trigger events for indel calling and to eliminate extended events: get rid of SpanningDeletions annotation for indels. It's completely useless, and even more so once we no longer trigger at extended events (because we'll trigger by definition a base before a deletion starts, so deletions present in the current pileup are not informative). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5876 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-26 00:49:23 +00:00
ebanks	bafdd4f8f7	Ask for existence of extended pileup before grabbing it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5874 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-25 17:39:03 +00:00
ebanks	6ed71cf683	Annotation that adds a list of samples who are polymorphic at a site based on the GTs. Very useful if you are looking at rare variants among many samples, esp. in Evoker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5868 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-24 20:12:27 +00:00
depristo	1bd1404aa9	Sometimes md5s can be null git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5867 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-24 19:17:18 +00:00
depristo	e582a92af6	WalkerTest now checks for valid md5s in the integrationtests themselves, so no more stray whitespace errors. Added a WalkerTestTest to ensure that bad MD5s are detected and an error thrown git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5865 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-24 14:34:55 +00:00
hanna	06486c134a	Kill extra space in the md5. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5863 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-24 12:00:31 +00:00
depristo	57e4693e4c	Slightly better error message when failing to create the index on the fly git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5861 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-24 11:04:08 +00:00
depristo	cf3dbfee97	Renamed variantMergeOptions to filteredRecordsMergeType, as this is really what it does. Cleaned up the wiki so that it's clear what this does, as well as included an example of how to create an intersection with CombineVariants and SelectVariants. Added integrationtests of CombineVariants with OMNI and HapMap that deal with the two ways to merge filtered/unfiltered records at the same site. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5860 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-24 01:54:29 +00:00
kiran	653475ce12	Now finds the most likely configuration of genotypes given the genotype likelihoods and inheritance constraints. The parental genotypes are now phased as well (the alleles are ordered as A_transmitted\|A_untransmitted). Rewrote the way the transmission probability is calculated. This will probably move into core soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5859 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-24 01:35:40 +00:00
hanna	4bfec4c55b	Reenabling E.coli ValidatingPileup with MV1994 realigned using the BWA/C bindings. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5856 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 21:32:53 +00:00
chartl	c7f4674fe2	Great! Contracts is working. Fixing some misspecified ones. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5854 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 21:00:52 +00:00
hanna	5dca1e4d2e	Make IntervalIntegrationTest aware of the new alignments in the MV1994.bam testset. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5852 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 19:59:47 +00:00
chartl	7ff5375493	Removing build-killing dependency on a private package. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5851 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 18:13:15 +00:00
chartl	0b07373909	Incorporating old feedback from eric: @deprecated methods should not be @deprecated, but rather protected, and the test's package moved to where it can access those test methods. Also allows for the slightly more awesome name "MWUnitTest" git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5850 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 18:06:05 +00:00
kiran	f8f37a786d	Now emits much more informative filter names and includes all of the other proper VCF header details (filter description line, tag definitions, etc.). Currently rewriting the way the transmission probability is calculated. This is shaping up to be a lovely little piece of code... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5849 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 17:50:59 +00:00
chartl	15dc632570	The U-value can be zero (edge case) z-value can not be NaN (and can't possibly be null) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5847 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 14:15:36 +00:00
chartl	3c31007da4	Stupid brackets. How did this even compile? git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5846 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 14:00:53 +00:00
chartl	480859db50	Contractified version of MannWhitneyU. Some behavior has been changed: - Running a test when there are no observations of at least one of the sets now breaks the MWU contract + MWU returns Pair(Double.NaN,Double.NaN) in these instances to maintain the contract of never returning null + No more Double.Infinity values will appear - RankSumTests now probe the return values for NaNs, and don't annotate if they appear - For small sets where the probability is calculated recursively, the z-value is now the inversion of the error function and not the approximate z-value - UG and Annotator integration tests updated to reflect changes git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5845 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 13:57:15 +00:00
depristo	b814f4bbd6	Contracts for HasGenomeLocation. BAQ iterator variables are all final. Contracts added git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5844 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 02:21:59 +00:00
depristo	43057bd15c	Remove Param annotation and associated broken processing code, as this was never used in the codebase git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5843 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 02:21:15 +00:00
depristo	d005c4bf09	GenomeLocProcessingTracker was using SimpleTimer in a non-thread safe way. No longer providing an interface to time parallel operations. Now issues warning if someone enables distributed GATK, as this is considered an unstable, experimental engine feature. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5842 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 02:10:27 +00:00
depristo	a18b0152df	Contracts for SimpleTimer, as well as UnitTests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5841 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-22 19:45:31 +00:00
depristo	0dc0d586f1	Phasing-specific utilities are now in the Phasing walker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5839 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-22 18:51:35 +00:00
depristo	f608ed6d5a	Removed old (and unused) reporting system, now that Kiran's VE reporting system is working. Refactors dictionary creation error messages into UserExceptions git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5836 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-22 18:42:52 +00:00
rpoplin	4e7ecbdcb2	FS values need to be jittered just like HRun git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5835 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-22 16:44:12 +00:00
depristo	9cc049f80f	Contracted ReferenceContext. Removed deprecated accessors that aren't used in the GATK at all git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5834 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-22 02:41:15 +00:00
depristo	d77f4ebe31	CalibrateGenotypeLikelihoods now emits a molten data set with REF and ALT alleles, so that GL calibration can be evaluated as a function of the REF/ALT bases. DigestTable is a stand-alone Rscript that digests the multi-GB molten data table into a tiny table that shows reported vs. empirical GLs, as a function of a variety of features of the data, like REF/ALT, comp GT, eval GT, and GL itself. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5833 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-21 14:02:30 +00:00
depristo	6a49e8df34	Significant change to the way subsetting by sample works with monomorphic sites. Now keeps the alt allele, even if a record is AC=0 after the subset. Previously, the system dropped the alt allele, which I don't think is the right behavior. If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting. See detailed information below. Right now, if you select a multi-sample VCF file (or one with filters I see) down to a smaller set of samples, and the site isn't polymorphic in that subgroup, then the alt allele is lost. For example, when selecting down NA12878 from the OMNI, I previously received the following VCF: 1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942 Where the first two records lost the ALT allele, because NA12878 is hom-ref at this site. My change results in a VCF that looks like: 1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942 The genotype remains unchanged, but the ALT allele is now preserved. I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant(). isVariant => is there an ALT allele? isPolymorphic => is some sample non-ref in the samples? In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this. Wiki docs updated. Does anyone have critical infrastructure that depends on the previous convention? Let me know so we can coordinate the change. There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-21 13:59:16 +00:00
depristo	8377424089	Basic error checking to ensure incoming arguments are provided correctly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5831 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-21 13:43:48 +00:00
depristo	e234589240	Contracts for GenomeLocParser and GenomeLoc are now fully implemented. GenomeLocs can officially have any start/stop values from -Inf - +Inf. Bounds w.r.t. the reference are enforced, optionally, by GenomeLocParser. General code cleanup throughout the subsystem. All validation code for GLs is now centralized, and all I/O systems now validate their inputs. Because of this, the Picard interval processing code has been changed to examine whether an interval is valid, and only keep the valid intervals. Note that the scatter/gather test was changed, because the original hg18 chr20 interval files were actually malformed (all records for some reason were on chr20). Many interval processing routines were moved to IntervalUtils, as this is their natural home. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5830 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-21 02:01:59 +00:00
kiran	3aa56037af	If asked, filters out triple-het situations too (which cannot be simply phased by transmission). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5829 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-20 18:48:19 +00:00
depristo	e16bc2cbd9	Contracts for Java now written for GenomeLoc and GenomeLocParser. The semantics of GenomeLoc are now much clearer. It is no longer allowed to create invalid GenomeLocs -- you can only create them with well formed start, end, and contigs, with respect to the master dictionary. Where one previously created an invalid GenomeLoc, and asked is this valid, you must now provide the raw arguments to helper functions to assess this. Providing bad arguments to GenomeLoc generates UserExceptions now. Added utility functions contigIsInDictionary and indexIsInDictionary to help with this. Refactored several Interval utilities from GenomeLocParser to IntervalUtils, as one might expect they go. Removed GenomeLoc.clone() method, as this was not correctly implemented, and actually unnecessary, as GenomeLocs are immutable. Several iterator classes have changed to remove their use of clone(). Removed misc. unnecessary imports. Disabled, temporarily, the validating pileup integration test, as it uses reads mapped to a different reference sequence for ecoli, and this now does not satisfy the contracts for GenomeLoc git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5827 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-20 15:43:27 +00:00
kiran	d896a4a9d3	Given genotypes for a trio, phases child by transmission. Computes probability that the determined phase is correct given that the genotypes for mom and dad are correct (useful if you want to use this to compare phasing accuracy, but want to break that comparison down by phasing confidence in the truth set). Optionally filters out sites where the phasing is indeterminate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5824 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-19 21:27:37 +00:00
rpoplin	fe4b40ac2c	Adding new InbreedingCoeff and PercentNBases annotations for Guillermo to use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5823 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-19 19:50:39 +00:00
ebanks	bc98ac1e74	Adding a TODO for future consideration git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5821 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-19 15:02:23 +00:00
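
The isVariant / isPolymorphic distinction described in commit 6a49e8df34 can be illustrated with a small sketch. This is not GATK code: the Site class, the subset() helper, the keep_alt flag, and the second sample name (NA12891) are all hypothetical and chosen only for illustration. It assumes a single ALT allele per site, so no allele-index remapping is needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Toy stand-in for a VCF record (hypothetical, not GATK's VariantContext)."""
    ref: str
    alts: tuple       # ALT alleles; an empty tuple models ALT = "."
    genotypes: dict   # sample name -> tuple of allele indices (0 = REF)

    def is_variant(self):
        # isVariant => is there an ALT allele?
        return len(self.alts) > 0

    def is_polymorphic(self):
        # isPolymorphic => is some sample non-ref among the samples?
        return any(idx != 0 for gt in self.genotypes.values() for idx in gt)


def subset(site, samples, keep_alt=True):
    """Restrict a site to the given samples.

    keep_alt=True models the behavior described in the commit (ALT is preserved
    even when AC=0 after subsetting); keep_alt=False models the earlier behavior
    (ALT is dropped when no remaining sample carries it).
    """
    kept = {name: site.genotypes[name] for name in samples}
    still_non_ref = any(idx != 0 for gt in kept.values() for idx in gt)
    alts = site.alts if (keep_alt or still_non_ref) else ()
    return Site(site.ref, alts, kept)


# First record from the commit message: REF=A, ALT=G, NA12878 is hom-ref.
site = Site("A", ("G",), {"NA12878": (0, 0), "NA12891": (0, 1)})
new_style = subset(site, ["NA12878"])
old_style = subset(site, ["NA12878"], keep_alt=False)
print(new_style.is_variant(), new_style.is_polymorphic())
print(old_style.is_variant(), old_style.is_polymorphic())
```

Under these assumptions the subset record is still a variant site but no longer polymorphic when the ALT allele is kept, whereas the earlier behavior yields a record that is neither.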

4647 Commits (0aef5c0074318610f0bbbb64cd93751ad02440b6)