Commit Graph

368 Commits (b21f90aee0678e550746e101c6d0629ca962bce6)

Author SHA1 Message Date
hanna f13d52e427 Attempt to determine whether underlying filesystem supports file locking and
disable on-the-fly dict and fai generation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3938 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 19:28:27 +00:00
delangel ba1a330293 Corrected location and made more explicit the error message thrown if someone tries to read a VCF 3.3 file with indels, which is not supported.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3901 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-29 20:02:47 +00:00
aaron b3fd145161 fix for a bug deep in the tribble indexing: if you had a single record in the first contig, the second contig's index blocks would point to the wrong file seek location, and you'd see no
features in that contig. Thanks to Mark for finding this.  I'm not rev'ing the index version (which would cause all indexes to be rebuilt), since this seems like a pretty rare edge case.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3865 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 18:39:55 +00:00
aaron 9579aace1f updates to code dependent on Tribble, as well as the following Tribble changes:
- makes writing to disk optional for indexes using the indexCreator classes (allow the user to specify the index file, if null don't write it)
- removed some system.out debugging code
- fixed version checking in interval tree 
- made indexes store and return a LinkedHashSet for sequence names (to ensure they've preserved the ordering in the file)
- index creators now read the file before creating the index
- changed the Index.write() method to take a LEDataStream instead of a file
- removed the sequence dictionary code on the header
- added utils for getting LEDataStreams
- added a base Tribble exception




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3857 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 01:56:10 +00:00
ebanks c5325b03be 1) Removed hard-coded strings. Please let's use the fields defined in VCFConstants.
2) General code cleanup.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3856 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 01:49:47 +00:00
delangel 473ec91633 a) Bug fix in VCFHeader parsing - Info fields were not being parsed properly, with the result that the Count field was not being properly displayed in records (e.g. if Count=0 for a particular field, the INFO tag was still being displayed as ...;Field=x;... instead of ...;Field;...
b) Bug fixes and update to how we represent indels and other complex events in a VariantContext object. Convention is now that all events are left aligned, with the first variant context location marking the common base before an event occurs. However, alleles in a VC don't have the common base in all VC's. Two new functions are now part of VariantContextUtils: CreateVariantContextWithPaddedAlleles and CreateVariantContextWithTrimmedAlleles. Both take a VC as an input and create a VC as an output.
Main flow is that a VCF reader would create a VC with trimmed alleles, all walkers would ideally work with these trimmed alleles, and then the VCF writer would pad back the alleles before writing. However, there are special cases where we need to pad alleles like for example when merging/combining VC's.

Pending issues:
- PED and DBSNP RODs have to be updated to create VC's for indels following the convention above. Changes will go in after Tribble location is moved and things are tested.
- Need to verify Indel genotyper and other modules that create VC's with indels.- Wiki page describing convention above and how walkers should interpret indel VC's still needs updating/detailing.
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3850 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 02:36:45 +00:00
aaron 1cba81c16f updates to tribble with fixes for some bugs I've found in some new indexing code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3842 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 22:08:04 +00:00
ebanks c6ad26e04f 1) When quals/GQs are really integers (x.00), strip off the floating points.
2) Keep track of whether vcf records are unfiltered vs. pass filters in the variant context so we can regenerate the records on output.
3) No more "ID" hard-coded all over the code to set the VariantContext ID.  Use a static variable instead.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3840 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 18:01:45 +00:00
aaron 0108517b98 updating the Tribble track loading code to use the new shared locks, updated lots of new tests, add infrastructure for the TreeInterval, and removed the old locking class.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3837 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 07:08:10 +00:00
ebanks f742980864 1. Refactoring of GenoypeWriters so that parallelization now works again with VCF4.0. We now have just a single reference to the old VCF classes, and that one will be purged soon.
2. Moved Jared's VCFTool code into archive so that everything would compile.
3. Added the vcf reference base (needed for indels) as an attribute to the VariantContext from the reader.
4. TribbleRMDTrackBuilderUnitTest was complaining that a validation file didn'r exist, so I commented it out.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3835 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 06:16:45 +00:00
aaron f4cfb0f990 The first step in integrating Jim's tree based index scheme:
- changed to a better method for getting headers from Codecs
- some removal of old commented out code in the GATKAgrumentCollection
- changes for the rename of FeatureReader to FeatureSource
- removed the old Beagle ROD
- cleaned up some of the code in SampleUtils

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3826 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 04:49:27 +00:00
ebanks 982947d328 update to deal with partial indels (I/D with no bases) in the HM records
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3820 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 02:56:37 +00:00
depristo d40299840c indenting clean up
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3816 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-17 22:28:28 +00:00
delangel 55b756f1cc First step in major cleanup/redo of VCF functionality. Specifically, now:
a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read header and parse automatically the version. 
b) Old VCF codec is deprecated. Reader goes now direct from parsing VCF lines into producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will be eventually deleted.
c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format.
d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved.
e) Several VCF 4.0 writer bugs are now solved.
f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data mostly uses now VCF4.0.
g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension.

Pending issues:
a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes.
b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 22:49:16 +00:00
asivache 1dd8a28a5d Added new query: isMNP(feature); returns true if dbsnp feature is multi-nucleotide polymorfism (e.g. a di-nuc TA ->CC)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3806 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 15:32:10 +00:00
depristo b29eda83bb Parallelized CountCovarites! percent_ref_called_var now a standard genotype concordance module (for validation!). Really much smarter merging of headers for combineVariants. VCF codecs now actually look at the file version and blow up if they are the wrong versions. setHeaderVersion() in VCFHeaderLine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3802 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 14:10:18 +00:00
ebanks 460283f6d2 No more manually converting VariantContexts to VCFRecords. You should be utilizing VCs and not VCFRecords.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3787 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 05:21:28 +00:00
ebanks 6b5c88d4d6 The GATK no longer writes vcf3.3; welcome to the world of vcf4.0. Needed to fix a few output bugs to get this to work, but it's looking great. Much more still to come. Guillermo: hopefully this doesn't break your local build too badly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3786 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 04:56:58 +00:00
ebanks 52c534a8f2 Updating to VCF 4.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3770 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 20:18:30 +00:00
ebanks fb717fe128 First pass needed to remove old VCF code: moving all VCF-related constants into a single unified class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3759 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-11 07:19:16 +00:00
depristo 179067e3f4 Support for . values in qual field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3752 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 01:47:02 +00:00
aaron 12fecc8d8f remove the picard DbSNP ROD.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3743 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 17:46:00 +00:00
ebanks 6e6ad36523 reallow MNP events through
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3740 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 06:26:52 +00:00
ebanks ada8c9931f We were never clipping the VCF-provided ref base off the left end of the alleles for insertions, so the reference allele was never null (and downstream walkers would fail). Didn't this get tested with insertions at some point?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3738 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 05:24:27 +00:00
ebanks b7a3d1e61f Bug fix: if the FORMAT field consisted of just GT, we were exceptioning out. How did we not catch this until now?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3735 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 04:41:40 +00:00
depristo b934cc7554 Updates to fix some bugs in merger. Now able to merge into project wide indel VCF files. Integration teests coming tomorrow
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3727 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 03:16:33 +00:00
aaron 3347d1ca7c part one of combining format and info header lines code into a single abstract class for Mark; plus some 'm' removals from access methods for Eric. Adding fixes for CombineVariants next.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3719 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-05 05:57:58 +00:00
weisburd f7593435eb Implemented decodeLoc(..)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3713 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-02 21:01:36 +00:00
depristo cd2e4b0a1e merging now very close to working. Bug todo in writer and vcf infrastructure. Can almost create merged snp and indel files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3712 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-02 20:09:25 +00:00
aaron 43ca595d15 VCF headers now can be set to a particular VCF version after creation, which converts the header lines to the appropriate encoding on output. Plus some clean-up of the code.
Also commented out the Tribble index out-of-date tests, the timing seems to be troublesome from the farm.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3702 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 05:32:14 +00:00
depristo b8d6a95e7a Preliminary commit of new VCFCombine, soon to be called CombineVariants (next commit) that support merging any number of VCF files via a general VC merge routine that support prioritization and merging of samples! It's now possible to merge the pilot1/2/3 call sets into a single (monster) VCF taking genotypes from pilot2, then pilot3, then pilot1 as needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3690 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 20:13:03 +00:00
ebanks ae33d8a2f2 I just wanted one more vote. It's settled: we die.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3681 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 14:00:56 +00:00
ebanks 8fb37f5f7a For Kiran: warn the user when the actual and vcf ref bases differ so that if an exception is generated later, he knows why. All: should we generate the actual exception here? Is there any reason to allow cases where the vcf record has a different ref base than the actual reference? I'd vote that we die here. Thoughts?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3680 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 13:56:16 +00:00
ebanks 12c0de6170 Added ability to clean using only known indels. Added integration test for it. Fixed vcf->vc conversion for indels which was busted.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3678 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 01:20:56 +00:00
aaron 62d22ff1aa adding the original allele list to a variant context (as the annotation ORIGINAL_ALLELE_LIST), in the case where the set alleles are the result of clipping. Added tests for both cases.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3658 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 17:23:46 +00:00
depristo d6cbe4d0ad Bug fixes to support haploid genotypes, optimization for indexing, now tracks the line of the VCF and catches errors to tell you the line no and line when a parsing error occurred.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3646 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 21:08:41 +00:00
aaron 5f8a3f95ef The GT field once again reigns supreme (it must be the first genotype field). Thanks for the catch Eric.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3645 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 21:03:05 +00:00
aaron b3edb7dc08 two fixes for the VCF 4 parser:
- Allow the "GT" field in genotypes at any point in the genotype string (before we required they be the first key-value pair).
- Fix a bug with the phasing value put into the VariantContext, thanks for the catch Guillermo!

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3638 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 18:01:23 +00:00
aaron 682f9b46c6 Two fixes together:
1) Some improvements to the VCF4 parsing, including disabling validation.
2) Reimplemented RefSeq in the new Tribble-style rod system.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3630 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-24 22:17:03 +00:00
aaron 8a9b2f4256 removing the GLF ROD.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3624 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 22:51:45 +00:00
aaron 611d834092 a couple of VCF 4 improvements:
-Validation of INFO and FORMAT fields.
-Conversion to the the correct type for info fields (i.e. allele frequency is now stored as a float instead of a string).
-Checks for CNV style alternate allele encodings( i.e. <INS:ME:L1>), right now we exception out.  Maybe we should just warn the user?
-Tests for the multiple-base polymorphism allele case.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3622 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 20:21:43 +00:00
aaron 0cafd3d642 clip VCF alleles for indels: only a single left base, and as many right bases as align before converting to variant context.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3614 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 22:42:38 +00:00
aaron 9872b65803 clip to the null allele on the reference string in VCF 4, instead of stopping to perserve one reference base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3613 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 20:52:19 +00:00
aaron a6d3e4bd47 Add code to allow reference alleles with 'N' in VariantContext, but not in the alternate allele(s). Also more updates to the VCF 4 code (fixed parsing for files without genotypes).
This check-in will temperarly break the build (I need to see if Bamboo is correctly returning the log file for the failed builds).  

Will be fixed once Bamboo starts building.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3609 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 18:26:37 +00:00
aaron 32f324a009 incremental changes to the VCF4 codec, including allele clipping down to the minimum reference allele; adding unit testing for certain aspects of the parsing. Not ready for prime-time yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3604 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 06:31:05 +00:00
depristo 57a13805da GATK now uses a optimized indexing scheme in Tribble. 5x or more performance gain on files with many genotypes. Updated integrationtest that was failing and was clearly wrong. DB=; isn't a valid annotation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3596 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-19 21:36:41 +00:00
ebanks b75ded61b8 Removing obsolete rod; no longer needed given previous addition to SampleUtils.
JIRA GSA-318


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3572 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 20:03:14 +00:00
delangel 8cb16a1d45 a) Cleanup, remove -input argument from BeagleOutputToVCFWalker since it's not needed.
b) Added back old Beagle ROD to maintain backward compatibility (does anyone even use this???)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3563 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 02:13:08 +00:00
delangel d319a28be7 Complete rewrite of the Beagle functionality to read from Beagle output files and produce VCF with modified genotypes. Now, a new ROD system using Tribble is in place. Beagle inputs are set using -B beagleType,Beagle,pathToBeagleFile, where beagleType can be either beagleR2, beagleLike, beaglePhased or beagleR2 (BeagleOutputToVCFWalker requires all of the above). Only pending items: -input argument is now unused and can be removed, will be cleaned later. Wiki will be updated with new usage shortly.
We can now run with a reduced memory footprint, and output VCF is exactly identical to previous version. Drawback is increased runtime because Tribble has to create an index for all the Beagle files when starting if the idx files are missing.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3562 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 02:01:35 +00:00
aaron d265397bf6 removing a reference to a unused internal Sun class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3560 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 15:27:57 +00:00