gatk-3.8

Commit Graph

Author	SHA1	Message	Date
aaron	9076c0b28b	removing unused code git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3958 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-06 14:24:39 +00:00
ebanks	341e752c6c	1) AlleleBalance is no longer a standard annotation, but the Allelic Depth (AD) is for each sample. 2) Small fixes in the VCFWriter: a) Trailing missing values weren't being removed if their count was > 1 (e.g. ".,.") b) We were handling key values that were Lists, but not Arrays. We now handle both. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3956 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-06 12:05:14 +00:00
aaron	72ae81c6de	VariantContext has now moved over to Tribble, and the VCF4 parser is now the only VCF parser in town. Other changes include: - Tribble is included directly in the GATK repo; those who have access to commit to Tribble can now directly commit from the GATK directory from Intellij; command line users can commit from inside the tribble directory. - Hapmap ROD now in Tribble; all mentions have been switched over. - VariantContext does not know about GenomeLoc; use VariantContextUtils.getLocation(VariantContext vc) to get a genome loc. - VariantContext.getSNPSubstitutionType is now in VariantContextUtils. - This does not include the checked-in project files for Intellij; still running into issues with changes to the iml files being marked as changes by SVN. I'll send out an email to GSAMembers with some more details. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3954 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-05 18:47:53 +00:00
rpoplin	a8d37da10b	Checking in everyone's changes to the variant recalibrator. We now calculate the variant quality score as a LOD score between the true and false hypothesis. Allele Count prior is changed to be (1 - 0.5^ac). Known prior breaks out HapMap sites git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3952 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-05 14:12:19 +00:00
ebanks	07addf1187	Fix for Kiran: since the Variant Annotator will re-annotate on top of existing annotations it makes sense to remove old headers if they conflict with the definitions being added by VA. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3951 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-05 06:44:39 +00:00
ebanks	227c4b10f0	Bug fix for Chris: convert comp tracks to VC so that we can respect the filter field. Added an integration test to cover this. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3949 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-05 04:13:16 +00:00
asivache	d53d5ffbf6	A utility class that computes running average and standard deviation for a stream of numbers it is being fed with. Updates mean/stddev on the fly and does not cache the observations, so it uses no memory and also should be stable against overflow/loss of precision. Simple unit test is also provided (does not stress-test the engine with millions of numbers though). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3944 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-04 21:39:02 +00:00
ebanks	8d8acc9fae	Moving G's MyHapScore to replace the old HapScore git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3943 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-04 21:00:54 +00:00
ebanks	340bd0e2c1	Removed hard-coded pointers to references git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3934 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-04 17:59:37 +00:00
ebanks	2307bed742	VariantEval now uses the "standard" modules only by default. You can add other modules with the -E argument and not use all of the standard ones with -noStandard (they can be added back individually with -E). Generalized some of the packaging code from VariantAnnotator. Matt might want to take a look to make this nicer...? git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3925 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-03 16:51:10 +00:00
delangel	5af986e0c1	Add an integration test for Beagle (one for ProduceBeagleInput and one for BeagleOutputToVCFWalker) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3897 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-29 18:49:22 +00:00
ebanks	7dd55fbf13	Archiving git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3882 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-27 02:47:18 +00:00
depristo	19ad44d332	Minor improvements to CombineVariants to handle the complex case from Chris. IntegrationTest of complex case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3876 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-25 13:46:11 +00:00
depristo	e21376219d	Updates to CombineVariants for Tim. -setKey can be null. Integrationtests for -setKey foo and -setKey null. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3870 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 22:35:52 +00:00
delangel	26bb1cd9ce	Fix broken test correctly git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3869 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 20:47:41 +00:00
delangel	4fc1db7aaf	Change interface to VCFWriter add() method to take only 1 byte from reference (since that's the only thing it needs), to prevent bugs like having people call it with ref.addBases() which is wrong (since it provides bases starting from the left of reference context window). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3868 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 20:24:03 +00:00
delangel	5eef15cfdf	a) Bad bug fix to CombineVariants: when indels were being merged, the reference base provided was wrong - ref.getBases()[0] was being used, but this returns the base at the start of the window. Instead, the reference at current locus should be used. b) Cosmetic change to Beagle annotation description. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3861 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 15:13:47 +00:00
depristo	536399eaa0	Improvements to variant combine. Now calculates AC/AN/AF correctly by calling into the VariantAnnotator engine. Automatically removes annotations that are inconsistent across incoming VCs (in simpleMerge). TODO bug fix for Guillermo/Eric. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3858 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 13:33:11 +00:00
aaron	9579aace1f	updates to code dependent on Tribble, as well as the following Tribble changes: - makes writing to disk optional for indexes using the indexCreator classes (allow the user to specify the index file, if null don't write it) - removed some system.out debugging code - fixed version checking in interval tree - made indexes store and return a LinkedHashSet for sequence names (to ensure they've preserved the ordering in the file) - index creators now read the file before creating the index - changed the Index.write() method to take a LEDataStream instead of a file - removed the sequence dictionary code on the header - added utils for getting LEDataStreams - added a base Tribble exception git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3857 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 01:56:10 +00:00
delangel	98caedb5f0	Forgot to update VCF4 unit test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3853 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-22 16:25:51 +00:00
delangel	473ec91633	a) Bug fix in VCFHeader parsing - Info fields were not being parsed properly, with the result that the Count field was not being properly displayed in records (e.g. if Count=0 for a particular field, the INFO tag was still being displayed as ...;Field=x;... instead of ...;Field;...). b) Bug fixes and update to how we represent indels and other complex events in a VariantContext object. Convention is now that all events are left aligned, with the first variant context location marking the common base before an event occurs. However, alleles in a VC don't have the common base in all VC's. Two new functions are now part of VariantContextUtils: CreateVariantContextWithPaddedAlleles and CreateVariantContextWithTrimmedAlleles. Both take a VC as an input and create a VC as an output. Main flow is that a VCF reader would create a VC with trimmed alleles, all walkers would ideally work with these trimmed alleles, and then the VCF writer would pad back the alleles before writing. However, there are special cases where we need to pad alleles like for example when merging/combining VC's. Pending issues: - PED and DBSNP RODs have to be updated to create VC's for indels following the convention above. Changes will go in after Tribble location is moved and things are tested. - Need to verify Indel genotyper and other modules that create VC's with indels. - Wiki page describing convention above and how walkers should interpret indel VC's still needs updating/detailing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3850 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-22 02:36:45 +00:00
aaron	1cba81c16f	updates to tribble with fixes for some bugs I've found in some new indexing code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3842 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 22:08:04 +00:00
ebanks	ff6748d1cd	oops - missed one git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3841 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 18:55:19 +00:00
ebanks	c6ad26e04f	1) When quals/GQs are really integers (x.00), strip off the floating points. 2) Keep track of whether vcf records are unfiltered vs. pass filters in the variant context so we can regenerate the records on output. 3) No more "ID" hard-coded all over the code to set the VariantContext ID. Use a static variable instead. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3840 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 18:01:45 +00:00
ebanks	0db7fab1a9	Fixing genotype filtering for VF and adding integration tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3839 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 07:30:21 +00:00
aaron	2a6c2d3098	re-enable test; I was moving the input file in prep for my last commit around on Eric, so he rightfully removed the test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3838 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 07:14:59 +00:00
aaron	0108517b98	updating the Tribble track loading code to use the new shared locks, updated lots of new tests, add infrastructure for the TreeInterval, and removed the old locking class. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3837 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 07:08:10 +00:00
ebanks	f742980864	1. Refactoring of GenotypeWriters so that parallelization now works again with VCF4.0. We now have just a single reference to the old VCF classes, and that one will be purged soon. 2. Moved Jared's VCFTool code into archive so that everything would compile. 3. Added the vcf reference base (needed for indels) as an attribute to the VariantContext from the reader. 4. TribbleRMDTrackBuilderUnitTest was complaining that a validation file didn't exist, so I commented it out. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3835 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 06:16:45 +00:00
depristo	70b07206a2	CombineVariants tests for Guillermo and Eric to explore the correctness of the in/out reader, writer behavior of the system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3834 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-19 22:41:48 +00:00
depristo	c47a5ff5ab	Official parallel CountCovariates, passes all integration tests. Now poster-child example of parallelism in GATK (Matt H). Apparent general performance improvements throughout too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3833 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-19 22:13:18 +00:00
rpoplin	8e31c01680	Solid processing in base quality recalibrator now has several options for how to handle no calls in the color space. --ignore_nocall_colorspace is removed and replaced by --solid_nocall_strategy. Fixed some of the @Deprecated tags in BaseUtils. LocusWalkers now filter out FailsVendorQualityCheck reads. HLA caller integration test bam file had bad vendor reads so its integration test changed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3831 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-19 19:10:29 +00:00
aaron	f4cfb0f990	The first step in integrating Jim's tree based index scheme: - changed to a better method for getting headers from Codecs - some removal of old commented out code in the GATKArgumentCollection - changes for the rename of FeatureReader to FeatureSource - removed the old Beagle ROD - cleaned up some of the code in SampleUtils git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3826 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-19 04:49:27 +00:00
ebanks	5a1a3fc79a	Fix bad VariantContext creation in unit test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3824 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-18 20:21:01 +00:00
ebanks	693672a461	Refactoring the VCF writer code; now no longer uses VCFRecord or any of its related classes, instead writing directly to the writer. Integration tests pass, but some are actually broken and will be fixed this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3822 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-18 13:19:56 +00:00
ebanks	379584f1bf	Re-enable (most of) these tests. Guillermo will re-enable the other one when the VCF->VC conversion is done for indels git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3821 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-18 03:24:28 +00:00
delangel	55b756f1cc	First step in major cleanup/redo of VCF functionality. Specifically, now: a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read header and parse automatically the version. b) Old VCF codec is deprecated. Reader goes now direct from parsing VCF lines into producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will be eventually deleted. c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format. d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved. e) Several VCF 4.0 writer bugs are now solved. f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data mostly uses now VCF4.0. g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension. Pending issues: a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes. b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-16 22:49:16 +00:00
aaron	36ac73cf9a	comment out broken test until it can be fixed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3810 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-16 20:04:40 +00:00
hanna	96034aee0e	Cleanup for Steve Hershman's issue. In the midst of doing this, I discovered that the semantics for which reads are in an extended event pileup are not clear at this point. Eric and I have planned a future clarification for this and the two of us will discuss who will implement this clarification and when it'll happen. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3809 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-16 18:57:58 +00:00
aaron	ec94cfdf05	remove unit test for VCF writer, it's not applicable now that we produce only VCF4. Guillermo, it's up to you if you want to adapt this or remove it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3803 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-16 14:33:25 +00:00
depristo	b29eda83bb	Parallelized CountCovariates! percent_ref_called_var now a standard genotype concordance module (for validation!). Really much smarter merging of headers for combineVariants. VCF codecs now actually look at the file version and blow up if they are the wrong versions. setHeaderVersion() in VCFHeaderLine. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3802 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-16 14:10:18 +00:00
ebanks	e7e58d7129	The SAM spec has now officially reserved my new tags for original cigar and original alignment start... except that OS has been named OP ('original POS') git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3800 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-16 00:09:36 +00:00
ebanks	a4f8d70d8d	oops, forgot to update this integration test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3788 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-14 11:38:33 +00:00
ebanks	460283f6d2	No more manually converting VariantContexts to VCFRecords. You should be utilizing VCs and not VCFRecords. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3787 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-14 05:21:28 +00:00
ebanks	6b5c88d4d6	The GATK no longer writes vcf3.3; welcome to the world of vcf4.0. Needed to fix a few output bugs to get this to work, but it's looking great. Much more still to come. Guillermo: hopefully this doesn't break your local build too badly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3786 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-14 04:56:58 +00:00
ebanks	9a05e8143d	Move to 4.0 and away from VCFRecord. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3780 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-13 15:54:54 +00:00
ebanks	7e7da75d27	Moving over to 4.0 and away from VCFRecord git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3778 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-13 14:07:10 +00:00
ebanks	d896d03554	Moving VF to vcf 4.0. Still need to fix genotype filters. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3777 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-13 11:39:51 +00:00
ebanks	76b3b39720	Technically, Mark broke this with his commit earlier. But since I had an outstanding broken test, I lose and have to fix this one too... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3776 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-13 03:58:38 +00:00
ebanks	1bef7dd170	git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3775 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-13 00:56:12 +00:00
ebanks	52c534a8f2	Updating to VCF 4.0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3770 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 20:18:30 +00:00
ebanks	e50627a49e	1. Updated tests and added integration test for liftover code. 2. Updated liftover code (and scripts) to emit vcf 4.0 and no longer depend on VCFRecord. 3. Beagle walker now also emits vcf 4.0. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3767 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 17:58:18 +00:00
ebanks	221e01fb27	deleting/archiving as instructed git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3765 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 16:59:45 +00:00
ebanks	e75b3e13bd	updating unit test for previous fix git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3761 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 03:23:53 +00:00
ebanks	fb717fe128	First pass needed to remove old VCF code: moving all VCF-related constants into a single unified class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3759 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-11 07:19:16 +00:00
chartl	ea8fd506bf	Update to PickSequenomProbes: Option to ignore mask sites within X bp of a variant (very useful for indels where dbSNP entries near the indel are almost always false SNP calls). Also fixed an integration test where the variant site itself, being in dbSNP, was represented as [N/C] rather than [A/C]. Added integration test for 1bp no-mask window. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3753 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-09 04:03:19 +00:00
depristo	45fb614296	Fixes to VE for obscure bug, as well as disabled integration test for CombineVariants git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3749 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-09 00:13:07 +00:00
ebanks	6e6ad36523	reallow MNP events through git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3740 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-08 06:26:52 +00:00
ebanks	9a81f1d7ef	Fixed this tool for chartl so that it now properly handles deletions. Added deletion case to integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3737 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-08 04:45:59 +00:00
hanna	9fc05ac2ae	eagerDecode is now false. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3733 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-07 22:51:48 +00:00
ebanks	4bc3ad2194	Shame on me: UG was emitting negative QUALs (-0) in all_bases mode. Thanks, Matt. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3732 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-07 20:30:22 +00:00
ebanks	30714ec8d9	As per quick chat with Richard Durbin, don't increase the mapping quality of realigned reads too much; for now, arbitrarily increase the MQ by 10. We need to figure out a better solution. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3731 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-07 20:12:59 +00:00
aaron	86031f4034	part two: todo's in combine variants, fixes for InferredGeneticContext, and some other tests and clean-up. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3721 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-05 21:07:53 +00:00
ebanks	36edc60ccc	Connected UG to the new comp track annotation system in VA. Also, when emit confidence is lower than call confidence (so that we emit records filtered with LowQual), add a corresponding FILTER header field to the VCF so that the validator doesn't complain. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3720 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-05 13:04:24 +00:00
aaron	3347d1ca7c	part one of combining format and info header lines code into a single abstract class for Mark; plus some 'm' removals from access methods for Eric. Adding fixes for CombineVariants next. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3719 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-05 05:57:58 +00:00
weisburd	9ec393bfce	Updated md5 - vcf header line change git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3714 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-02 21:02:09 +00:00
depristo	61e2b2e39b	Nearly finalize merging capabilities for CombineVariants. Support for dealing with inconsistent indel alleles at loci. Improvements to Allele and removal of addAllele to MutableGenotype. We are close to being able to merge all of 1000 genomes -- snps and indels -- into a single combined vcf git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3710 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-02 13:32:33 +00:00
aaron	3093a20a55	fixing VCF header format and info fields so that they properly emit the unbounded count value for vcf4 or vcf3. Eric, we should update the vcf4 spec page to indicate format fields are allowed to use the unbounded count as well (if this is true). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3707 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-01 22:02:16 +00:00
rpoplin	255b036fb5	Variant Recalibrator MLE EM algorithm is moved over to variational Bayes EM in order to eliminate problems with singularities when clustering in higher than two dimensions. Because of this there is no longer a number of Gaussians parameter. Wiki will be updated shortly with new recommended command. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3704 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-01 18:51:07 +00:00
aaron	43ca595d15	VCF headers now can be set to a particular VCF version after creation, which converts the header lines to the appropriate encoding on output. Plus some clean-up of the code. Also commented out the Tribble index out-of-date tests, the timing seems to be troublesome from the farm. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3702 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-01 05:32:14 +00:00
hanna	4995950d04	IndexedFastaSequenceFile is now in Picard; transitioning to that implementation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3701 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-01 04:40:31 +00:00
ebanks	944dbb94ce	Refactored and generalized the database/comp annotations in VariantAnnotator. Now one can provide comp tracks as with VariantEval (e.g. compHapMap, comp1KG_CEU) and the INFO field will be annotated with the track name (without the 'comp') if the variant record overlaps a comp site (e.g. ...;1KG_CEU;...). This means that you can now pass 1kg calls to the Unified Genotyper and automatically have records annotated with their presence in 1kg. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3684 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-30 16:37:31 +00:00
ebanks	12c0de6170	Added ability to clean using only known indels. Added integration test for it. Fixed vcf->vc conversion for indels which was busted. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3678 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-30 01:20:56 +00:00
aaron	844cb2ed33	fixing a bug that Eric found with RODs for reads, where some records could be omitted. Sorry Eric! Also putting more tolerance into the timing on the Tribble index tests (that check to make sure we're deleting out of date indexes, and not deleting perfectly good indexes). It seems that some of the farm nodes aren't great with a stopwatch. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3674 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-29 21:38:55 +00:00
ebanks	baf9479c35	An addition for Sendu since he can't seem to tell when his CountCovariate jobs die in the middle of writing the CSVs. We now write an EOF marker at the end of the covariates table and look for it when reading in the file in TableRecalibrationWalker. By default, we warn the user if the EOF marker isn't present, but we exception out if the user provides the --fail_with_no_eof_marker option. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3670 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-29 18:50:07 +00:00
ebanks	4a451949ba	add parallel option to target creator for masking out reads with bad mates git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3663 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 22:13:25 +00:00
ebanks	6a23edd911	Fix performance tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3662 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 21:51:48 +00:00
aaron	62d22ff1aa	adding the original allele list to a variant context (as the annotation ORIGINAL_ALLELE_LIST), in the case where the set alleles are the result of clipping. Added tests for both cases. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3658 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 17:23:46 +00:00
ebanks	1292c96e29	The cleaner now adds the OC (original cigar) and OS (original alignment start) tags as appropriate to reads that get realigned; this feature can be turned off. Also, improved integration tests (sorry, Kiran!). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3657 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 16:46:47 +00:00
ebanks	bf5cbad04c	Make the target creator a rod walker (that allows reads) so that we can easily trigger the cleaner on only known indel sites. Adding an integration test to cover this case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3651 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 13:28:37 +00:00
ebanks	8e848ccd84	SAMFileWriters can now write to /dev/null without throwing exceptions, so we can remove the try/catch blocks. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3648 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-27 03:59:10 +00:00
aaron	09ccdf83b2	fixing a broken test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3647 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-25 21:59:00 +00:00
aaron	5f8a3f95ef	The GT field once again reigns supreme (it must be the first genotype field). Thanks for the catch Eric. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3645 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-25 21:03:05 +00:00
aaron	b3edb7dc08	two fixes for the VCF 4 parser: - Allow the "GT" field in genotypes at any point in the genotype string (before we required they be the first key-value pair). - Fix a bug with the phasing value put into the VariantContext, thanks for the catch Guillermo! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3638 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-25 18:01:23 +00:00
weisburd	e15fe6858e	Disabling test - Will need to update big-tables soon.. will re-enable after updating md5 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3637 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-25 15:43:41 +00:00
aaron	682f9b46c6	Two fixes together: 1) Some improvements to the VCF4 parsing, including disabling validation. 2) Reimplemented RefSeq in the new Tribble-style rod system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3630 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-24 22:17:03 +00:00
aaron	62bc7651a8	fix for PSPW with DbSNP mask. Added an integration test for this case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3628 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-24 19:31:32 +00:00
aaron	8a9b2f4256	removing the GLF ROD. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3624 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-23 22:51:45 +00:00
aaron	611d834092	a couple of VCF 4 improvements: -Validation of INFO and FORMAT fields. -Conversion to the correct type for info fields (i.e. allele frequency is now stored as a float instead of a string). -Checks for CNV style alternate allele encodings (i.e. <INS:ME:L1>), right now we exception out. Maybe we should just warn the user? -Tests for the multiple-base polymorphism allele case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3622 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-23 20:21:43 +00:00
ebanks	b6bceb39b0	Fixing up output for performance tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3619 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-23 17:00:17 +00:00
hanna	003dd4de3e	Rev Picard with performance enhancements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3615 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 22:54:23 +00:00
aaron	0cafd3d642	clip VCF alleles for indels: only a single left base, and as many right bases as align before converting to variant context. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3614 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 22:42:38 +00:00
aaron	9872b65803	clip to the null allele on the reference string in VCF 4, instead of stopping to preserve one reference base. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3613 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 20:52:19 +00:00
ebanks	b5df2705c9	-Remove Nway output option -Remove in-memory sorting -Default to name-sorting (although we allow coordinate sorting with the --sortInCoordinateOrderEvenThoughItIsHighlyUnsafe flag). Cleaner, faster code. Wiki has been updated (including how to use FixMateInformation.jar from Picard). More changes coming soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3612 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 20:31:55 +00:00
aaron	a6d3e4bd47	Add code to allow reference alleles with 'N' in VariantContext, but not in the alternate allele(s). Also more updates to the VCF 4 code (fixed parsing for files without genotypes). This check-in will temporarily break the build (I need to see if Bamboo is correctly returning the log file for the failed builds). Will be fixed once Bamboo starts building. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3609 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 18:26:37 +00:00
ebanks	824c2bbac0	Finishing previous checkin git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3608 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 17:21:38 +00:00
aaron	32f324a009	incremental changes to the VCF4 codec, including allele clipping down to the minimum reference allele; adding unit testing for certain aspects of the parsing. Not ready for prime-time yet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3604 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 06:31:05 +00:00
bthomas	300a18b85f	Updating the way reference data is processed, so GATK creates the .fasta.fai and .dict files automatically. If either (or both) don't exist, GATK will create them in the same folder as the fasta file. If it can't write the file, GATK will fail with a message to create them manually. Note that this functionality will only work if the directory with the fasta is writeable. GATK will fail if the directory is read-only and either the .fasta.fai or .dict file doesn't exist. In the future, we could have these references be created in memory, but we decided against it this time. Locking was also added to ReferenceDataSource so no issues come up while running multiple GATKs on the same reference: we don't want one process to be half-finished and another try to read it. So, you could see error messages related to locking. See ReferenceDataSource.java for explanation of the locking strategy. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3601 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 21:42:42 +00:00
hanna	c806ffba5f	Switching over DownsamplingLocusIteratorByState -> LocusIteratorByState. Some operations will not be as fast as they could be because the workflow is currently merge sam records (sharding) -> split sam records (LocusIteratorByState) -> merge records (LocusIteratorByState) -> split records (StratifiedAlignmentContext), but this will be fixed when StratifiedAlignmentContext is updated to take advantage of the new functionality in ReadBackedPileup. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3599 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 02:11:42 +00:00
depristo	57a13805da	GATK now uses an optimized indexing scheme in Tribble. 5x or more performance gain on files with many genotypes. Updated integration test that was failing and was clearly wrong. DB=; isn't a valid annotation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3596 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-19 21:36:41 +00:00
kiran	8ff93f77e6	Added evaluation module to count functional classes (missense, nonsense, etc.). At the moment, it only understands Cancer's MAF annotations. Added integration test for the functional class counting. Added better description for VariantEval. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3595 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 21:51:40 +00:00


810 Commits (8683087756550dbfb9aa1ced04d6f0b2e7dbdb35)