gatk-3.8

Commit Graph

Author	SHA1	Message	Date
kshakir	30cf78fdc0	Refactoring for a first version of scatter gather api with basic shell script implementations. Modified build script so that queue is cleaned during "ant clean". git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3611 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 18:39:20 +00:00
aaron	a6d3e4bd47	Add code to allow reference alleles with 'N' in VariantContext, but not in the alternate allele(s). Also more updates to the VCF 4 code (fixed parsing for files without genotypes). This check-in will temporarily break the build (I need to see if Bamboo is correctly returning the log file for the failed builds). Will be fixed once Bamboo starts building. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3609 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 18:26:37 +00:00
ebanks	824c2bbac0	Finishing previous checkin git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3608 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 17:21:38 +00:00
ebanks	4727bcda24	Removing Beagle output from UG. Use ProduceBeagleInput walker instead (since it can be run post-filtration and respects the FILTER column). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3607 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 16:56:37 +00:00
aaron	32f324a009	incremental changes to the VCF4 codec, including allele clipping down to the minimum reference allele; adding unit testing for certain aspects of the parsing. Not ready for prime-time yet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3604 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 06:31:05 +00:00
bthomas	de9f1f575f	Fixing command line parsing to accept negative number arguments. Command line definitions must now start with a letter or underscore; previously, they could start with a digit. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3603 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 21:54:31 +00:00
bthomas	9d6a341d15	Fixing the error messages thrown with bad interval arguments. I simplified the exception handling and made the messages more verbose. Note: the -L argument takes both interval strings and filenames. If you specify an interval string that is also a file, an error will be thrown to move the file: ie. if you have a file "chr1" in the parent directory, GATK will ask you to move/delete it. But, this only happens with interval string arguments, NOT with intervals that are contained in files, which is a majority of the use case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3602 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 21:49:41 +00:00
bthomas	300a18b85f	Updating the way reference data is processed, so GATK creates the .fasta.fai and .dict files automatically. If either (or both) don't exist, GATK will create them in the same folder as the fasta file. If it can't write the file, GATK will fail with a message to create them manually. Note that this functionality will only work if the directory with the fasta is writeable. GATK will fail if the directory is read-only and either the .fasta.fai or .dict files don't exist. In the future, we could have these references be created in memory, but we decided against it this time. Locking was also added to ReferenceDataSource so no issues come up while running multiple GATKs on the same reference: we don't want one process to be half-finished and another try to read it. So, you could see error messages related to locking. See ReferenceDataSource.java for explanation of the locking strategy. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3601 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 21:42:42 +00:00
ebanks	df1cadc4c9	Fix NullPointerException when priority list is left out git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3600 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 13:46:54 +00:00
hanna	c806ffba5f	Switching over DownsamplingLocusIteratorByState -> LocusIteratorByState. Some operations will not be as fast as they could be because the workflow is currently merge sam records (sharding) -> split sam records (LocusIteratorByState) -> merge records (LocusIteratorByState) -> split records (StratifiedAlignmentContext), but this will be fixed when StratifiedAlignmentContext is updated to take advantage of the new functionality in ReadBackedPileup. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3599 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 02:11:42 +00:00
hanna	1d50fc7087	Misc bug fixes: fix tracking of nInsertions with sample-split pileup constructor. Fix performance issue building up pileups from pileups of individual sample data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3598 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-20 20:32:27 +00:00
hanna	f18ac069e2	A refactoring / unification of ReadBackedPileup and ReadBackedExtendedEventPileup. Provides a cleaner interface with extended events inheriting all of the basic RBP functionality. Implementation is still slightly messy, but should allow users to provide separate implementations of methods for sample split pileups and unsplit pileups for efficiency's sake. Methods not covered by unit/integration tests have not been sufficiently tested yet. Unit tests will follow this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3597 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-20 04:42:26 +00:00
depristo	57a13805da	GATK now uses an optimized indexing scheme in Tribble. 5x or more performance gain on files with many genotypes. Updated integrationtest that was failing and was clearly wrong. DB=; isn't a valid annotation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3596 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-19 21:36:41 +00:00
kiran	8ff93f77e6	Added evaluation module to count functional classes (missense, nonsense, etc.). At the moment, it only understands Cancer's MAF annotations. Added integration test for the functional class counting. Added better description for VariantEval. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3595 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 21:51:40 +00:00
chartl	f44d8b150f	Mendelian Violation Classifier now filters violations on the fly via command line arguments; and closes unterminated homozygous regions at the end of a chromosome (so we see arms falling off in the file, rather than in the log) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3592 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 19:32:24 +00:00
ebanks	aa1852575e	Add -noVerbose flag to stop output of INFO data. Cuts runtime by 30% and output from 65Mb to 1Kb. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3591 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 18:53:35 +00:00
rpoplin	724affc3cc	Major bug fixes for the Variant Recalibrator. Covariance matrix values are now allowed to be negative. When probabilities are multiplied together the calculation is done in log space, normalized, then converted back to real valued probabilities. Clustering weights have been changed to only use HapMap and by-1000genomes sites. The -nI argument was removed and now clustering simply runs until convergence. Test cases seem to work best when using just two annotations (QD and SB). More changes are in the works and are being evaluated. Misc fixes to walkers that use RScript due to CentOS changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3590 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 17:37:11 +00:00
hanna	52477bd9e6	Add some missing methods to the pileup architecture. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3588 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 15:03:08 +00:00
hanna	5050b19457	We're unable to make the naive deduper more worldly, so we're killing it instead. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3587 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 13:54:27 +00:00
aaron	b978d5946b	adding changes for VCF 4, mostly in the way we handle VCF headers. The header fields are now aware of the differences between different VCF formats. There was also a bunch of clean-up of out-of-spec VCF used in the tests (mismatched VCF file format fields, etc), and updates to the associated integration tests. Also some logging statements for BTI. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3584 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 08:23:23 +00:00
hanna	48cbc5ce37	Merging the sharding-specific inherited classes down into the base. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3581 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 22:36:13 +00:00
hanna	612c3fdd9d	First pass at eliminating the old sharding system. Classes required for the original sharding system are gone where I could identify them, but hierarchies that split to support two sharding systems have not yet been taken apart. @Eric: ~4k lines. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 20:17:31 +00:00
delangel	b694ca9633	Cleanup: Don't require likelihood ROD in Beagle parameters when generating output VCF. Likelihoods file is only an input to Beagle but the Walker that generates a VCF doesn't need it, so it's silly to ask for it and it's error-prone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3579 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 17:45:48 +00:00
hanna	c1595a383a	More bugfixes for cases where no sample name is present. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3578 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 16:46:02 +00:00
aaron	3d049204ed	some refactoring for the variant eval output system git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3576 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 05:34:31 +00:00
hanna	db1383d0b2	Rev the latest version of Picard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3575 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 23:55:07 +00:00
hanna	5972ad1199	Fixes to mrl integration. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3573 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 20:40:10 +00:00
ebanks	b75ded61b8	Removing obsolete rod; no longer needed given previous addition to SampleUtils. JIRA GSA-318 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3572 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 20:03:14 +00:00
kshakir	c671864228	Re-allowing blacklist by read group id. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3571 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:45:44 +00:00
ebanks	f003703912	Allow specification of particular rods for pulling out sample names. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3570 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:37:09 +00:00
ebanks	01ffa307c2	When going NWay out in the cleaner, use the new merged header (instead of the original one) for each bam file so that it matches the new uniquified read group ids in the reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3569 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:36:36 +00:00
kshakir	05c2f96bb4	Small update to the command line docs for read_group_black_list. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3568 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:23:34 +00:00
ebanks	d7f3102c3f	Fixed read group blacklist filter to look only at readgroups (and not the reads themselves). Otherwise, it fails when attribute tags with different meanings show up in both places (e.g. SM). Added performance improvement. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3567 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:14:37 +00:00
hanna	e77f76f8e1	Reenabled downsampling by sample after basic sanity testing and fixes of the new implementation. Hard testing and performance enhancements are still pending. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3566 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 17:23:27 +00:00
kshakir	c44fd05aa1	Fix for a reflection issue with generic types. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3565 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 15:58:38 +00:00
ebanks	7a91dbd490	Renamed some of the column names in Ti/Tv and Concordance modules so that they are clearer. Removed ValidationRate module (it was busted). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3564 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 15:53:06 +00:00
delangel	8cb16a1d45	a) Cleanup, remove -input argument from BeagleOutputToVCFWalker since it's not needed. b) Added back old Beagle ROD to maintain backward compatibility (does anyone even use this???) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3563 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 02:13:08 +00:00
delangel	d319a28be7	Complete rewrite of the Beagle functionality to read from Beagle output files and produce VCF with modified genotypes. Now, a new ROD system using Tribble is in place. Beagle inputs are set using -B beagleType,Beagle,pathToBeagleFile, where beagleType can be either beagleR2, beagleLike, beaglePhased or beagleR2 (BeagleOutputToVCFWalker requires all of the above). Only pending items: -input argument is now unused and can be removed, will be cleaned later. Wiki will be updated with new usage shortly. We can now run with a reduced memory footprint, and output VCF is exactly identical to previous version. Drawback is increased runtime because Tribble has to create an index for all the Beagle files when starting if the idx files are missing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3562 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 02:01:35 +00:00
aaron	d265397bf6	removing a reference to an unused internal Sun class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3560 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 15:27:57 +00:00
asivache	42b8a8f295	slight change in output format git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3559 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 14:52:04 +00:00
kshakir	32fc221ffe	Replaced pattern matched pipeline spec with annotated objects. Old version is no longer available. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3558 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 04:43:46 +00:00
sjia	b99a5e06f3	Added option to only consider alleles of > specific allele frequency. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3557 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 02:09:35 +00:00
hanna	8a895f481f	Proper exception chaining for troubleshooting Sendu's issue. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3556 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 01:38:36 +00:00
sjia	8defb30796	Documentation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3555 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 21:31:01 +00:00
weisburd	c1046653a2	Fixed handling of records where gene-names are identical (eg. as in refseq NR_030638 in chr20) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3554 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 20:00:49 +00:00
weisburd	1e42984a16	Improved buffer-size arg handling git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3553 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 19:59:15 +00:00
sjia	b3c3023c3c	Allows callers to handle HLA reference files as input (rather than hard-coded paths) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3552 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 18:56:08 +00:00
asivache	9666d47d17	ooops, debug print now removed git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3550 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 18:07:12 +00:00
sjia	abdc8521ea	Added debug options for FindClosestHLAWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3549 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 17:52:03 +00:00
sjia	c38390eabb	Added option for min number of matches between reads and alleles required to consider reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3548 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 16:08:49 +00:00
asivache	4ab1f440c3	A new argument: --targetIntervalsSorted (boolean flag). If specified, the interval file is assumed to be sorted (duh!) and it is NOT slurped into the memory but instead traversed directly on disk as needed. If the file turns out to be unsorted, an exception will be thrown at the point where inconsistency occurs (can be late into the processing!). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3547 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 16:00:22 +00:00
asivache	671ac00748	A simple utility class that implements a merging Iterator<GenomeLoc> built over an interval or bed file (this is NOT a rod, but rather a direct line-by-line file reader that converts strings to genome locs on the fly and merges overlapping intervals) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3546 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:54:37 +00:00
asivache	f137bf8f85	now adaptor silently skips empty lines in the underlying string iterator git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3545 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:35:07 +00:00
sjia	d8c963c91c	Remove PhaselikelihoodsWalker.java git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3544 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:21:43 +00:00
sjia	5704294f9d	HLA caller updated - now searches all (common and rare) alleles, more efficient read filtering and allele comparison runs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3543 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:14:40 +00:00
asivache	d51e6c45a7	a utility class; turns string iterator into GenomeLoc iterator git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3542 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 14:07:44 +00:00
asivache	7b7d3341f0	trivial refactoring: isFile renamed to isIntervalFile and made public git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3541 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 14:02:23 +00:00
hanna	c3b68cc58d	Rethinking DownsamplingLocusIteratorByState with a flattened read structure. Samples are kept independent while processing, and only merged back in a priority queue if necessary in a special variant of the ReadBackedPileup. This code is not live yet except in the case of naive deduping. Downsampling by sample temporarily disabled, and the ReadBackedPileup variant is sketchy and not well integrated with StratifiedAlignmentContext or the walkers. Cleanup to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3540 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-13 01:47:02 +00:00
kiran	804facb0cc	Removing these utilities as part of a hostage negotiation with Matt. Can I have my journal club paper now?! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3539 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 21:41:29 +00:00
asivache	e6d8faf293	making 'parseLocation' public static - as simple as the logic is, it's better kept in one place and I need it! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3537 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 18:19:59 +00:00
ebanks	8c28be5933	Fixing a VCF bug for Sendu: we weren't emitting flags (booleans) correctly in VCF3.3 (rev'ed tribble for this). Updated dbsnp/hapmap membership info fields to be flags now instead of ints. While I was there, I added the change in the Annotator for Jan to force reads to be from a specific sample. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3536 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 16:42:06 +00:00
ebanks	22620ba95c	Adding "abi_solid" to the list of known platforms. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3534 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 13:37:19 +00:00
ebanks	63ad71cca6	Fix busted code. Note for all: String.valueOf(byte[]) doesn't work. You must use new String(byte[]). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3533 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 05:01:48 +00:00
weisburd	338bb9adf4	CommandLineProgram for measuring java I/O speeds for large plain-text or gzipped files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3532 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 21:34:37 +00:00
weisburd	06fc5eecf8	Implemented TreeReducible - if num threads > 1, the output will be accumulated in memory and written to a vcf file at the end - in onTraversalDone(..). If num threads == 1, things will work as before - where vcf records are written to disk as soon as they are computed with map(..). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3530 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 20:57:23 +00:00
weisburd	3b375cb237	Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..) - attempt 2 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3529 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 20:54:36 +00:00
bthomas	99b684ea89	Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 20:10:23 +00:00
hanna	f55f32d4ee	Bug fix. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3526 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 01:53:26 +00:00
ebanks	ca4eab1d23	Now annotations that require reads return null if there's no alignment context, so that running without reads adds annotations only for the appropriate fields. Added an integration test for the read-less case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3525 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 20:36:46 +00:00
aaron	6941c81bfa	reverting revision 3522 to the old code until we fix the tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3524 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 19:25:02 +00:00
hanna	dbee21a50f	Bugfixes for the case when no read groups / no samples are available. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3523 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 18:47:05 +00:00
weisburd	adc4c4e577	Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3522 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 18:11:43 +00:00
chartl	20167fd411	Final changes to MVC -- associates variants with regions of homozygosity in child and parents, corrects for genotype errors, and prints out a separate file with information for each region of homozygosity. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3521 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 18:05:37 +00:00
weisburd	fdded73861	Improved error reporting git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3520 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:52:48 +00:00
aaron	4f00e265a8	quick update for a change I implemented for Ryan git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3519 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:23:31 +00:00
aaron	ad98512f6c	adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:21:12 +00:00
weisburd	c1b7bcc786	Fixed handling of mitochondrial genes - added special cases such as ATT being a start codon in mitochondria. Added warning if a gene doesn't start with Met or end in a stop codon git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3517 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:15:47 +00:00
weisburd	4f1181974b	Added toString() method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3516 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:12:57 +00:00
ebanks	9b2fcc4711	Refactoring of the annotation system: 1. VA is now a ROD walker so it no longer requires reads (needs a little more testing) 2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs) 3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong. Fixed the headers too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:05:51 +00:00
hanna	84563b37e5	Partial flattening of the hanger data structure. Hanger data structure is not currently as flat as it could / should be, but it's already comparable to the speed of the reference implementation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3512 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 16:28:49 +00:00
chartl	8f9e3e8ad7	Commit for Kiran; but this is now working, barring little exceptions that I've yet to run across... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3511 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 14:21:19 +00:00
hanna	c2858c8988	Minor performance enhancement. Checkpoint commit before major performance overhaul. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3504 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 21:39:39 +00:00
chartl	5ed2818ffb	Forgot to commit code I relied upon git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3503 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 21:01:35 +00:00
chartl	736098b58d	A quick commit before running home. This is a re-factored version of the OppositeHomozygoteClassifier which will work with deNovo violations as well. Some code still needs to be migrated from OHC which is why that walker isn't yet deleted. This'll be up and running tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3502 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 20:47:01 +00:00
delangel	de134c226d	Removed ability of users to specify annotations to recompute, cleanups. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3501 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 19:17:59 +00:00
ebanks	4d1a6b3d99	quick changes for G git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3500 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 16:33:27 +00:00
delangel	907931c902	a) Update annotations when creating new vcf with Beagle's imputed data. Since genotypes may (will) change based on imputation, several annotations need to be updated. By default, AC, AF, AN and AB will be updated. User can force extra annotations to be updated with -A <annotation> argument. b) Several cleanups and beautifications. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3499 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 15:12:04 +00:00
chartl	933133ee28	Initial commit of the opposite homozygote classifier. Currently does the following, given a trio vcf: + Identifies opposite homozygote sites + Identifies the parent from whom it is expected that a null allele was inherited (or whether it was a putative genotype error; e.g. mom=homref, dad=homref, child=homvar) + Labels each opposite homozygote with its homozygous region in the child (e.g. region 1, region 2) + Labels each opposite homozygote with the size of the homozygous region in which it was found, the number of child homozygotes in the region, and the number of opposite homozygote violations within that region To come: + Classification of sites as likely tri-allelic Note that this is very experimental git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3498 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 03:56:07 +00:00
hanna	199e4208cd	Bug fixes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3497 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 00:30:33 +00:00
hanna	52ab9f2417	Feature parity between LocusIteratorByState, DownsamplingLocusIteratorByState, including pushing mrl / the LocusOverflowTracker into LocusIteratorByState. Note that the 'Matt Hanna exception', is still enabled because I haven't yet validated the performance of the DownsamplingLocusIteratorByState when running without downsampling. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3496 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-07 22:58:21 +00:00
hanna	5c4d070566	Push Mark's changes in LocusIteratorByState into DownsamplingLocusIteratorByState in preparation for merging the two into one. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3495 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-07 17:29:30 +00:00
depristo	6eeb1693ca	JEXL2 upgrade. Improvements to JEXL processing including dynamically resolving variable -> value bindings instead of up front adding them to a map. Performance improvements and code cleanup throughout. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3494 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-07 00:33:02 +00:00
hanna	c1ecf75dd5	Update to the latest rev of the picard sharding patch. Includes updates reflecting the imminent move of IlluminaUtil into picard public. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3493 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-06 20:33:21 +00:00
delangel	c503f01dcf	More cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3492 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-06 17:41:38 +00:00
delangel	d4c66d6191	a) Small cleanup b) Fix major issue with Beagle likelihood converter: if likelihood triplets from UG end up being too low, then Beagle input file will be produced with 0.00,0.00,0.00 triplet. If all samples at a marker have this issue, Beagle will effectively produce junk. To fix, likelihoods are renormalized before converting to linear space. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3491 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-06 17:31:59 +00:00
depristo	cfa18f6743	Fixing missed update with new Allele in it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3490 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 23:56:34 +00:00
depristo	3ea506fe52	No more new Allele() -- must use create. Allelel simple alleles are now cached for efficiency reasons. VCF4 codec optimizations -- 4x performance in general. Now working in general but hooked up to the ROD system now as VCF4. WARNING -- does not actually work with indels, genotype filters, etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3489 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 23:03:55 +00:00
delangel	ef47a69c50	a) First fully functional (sort of) version of walker that parses Beagle imputation output files and produce a vcf with imputed genotypes. More doc/info to follow shortly. Issues still to be solved: a) Walker changes all genotypes based on Beagle data, but annotations on the original VCF are unchanged. They should in theory be recomputed based on new genotypes. b) Current implementation is ugly, dirty unwieldy and will necessitate a refactoring soon so I can keep my pride. Most aesthetically affronting issue right now is that we read the full Beagle files at initialization and keep them in memory, but a more delicate implementation would just read from files on a marker by marker basis. Issue that currently prevents this is that BufferedReader() instances don't seem to play nice when called from the map() function. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3488 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 20:37:25 +00:00
depristo	b811e61ae1	Optimized, nearly complete VCF4 reader 2-4x faster than the previous implementation, along with a VCF4 reader performance testing walker that can read 3/4 files, useful for benchmarking git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3487 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 18:11:38 +00:00
aaron	6482b87741	adding the super experimental, half-broken, generally crippled, awkwardly commented, header ignoring vcf4 code. Don't use this, unless you're a developer for VCF4. If so, remove the exception from the constructor so that it won't always exception out. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3486 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 07:38:46 +00:00
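
The note in commit 63ad71cca6 (String.valueOf(byte[]) versus new String(byte[])) can be illustrated with a small sketch. The class and method names below are hypothetical and are not taken from the GATK source; the explicit US_ASCII charset is also an assumption, chosen here because sequence bases are plain ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of GATK.
public class ByteArrayToString {

    /** Decodes raw bytes (e.g. read bases) into text using an explicit charset. */
    static String decode(byte[] bases) {
        return new String(bases, StandardCharsets.US_ASCII);
    }

    /**
     * What the commit warns against: String has no valueOf(byte[]) overload,
     * so the call binds to valueOf(Object) and returns the array's toString(),
     * which looks like "[B@" followed by a hash code.
     */
    static String wrongWay(byte[] bases) {
        return String.valueOf(bases);
    }

    public static void main(String[] args) {
        byte[] bases = {'A', 'C', 'G', 'T'};
        System.out.println(wrongWay(bases)); // identity string, not the bases
        System.out.println(decode(bases));   // the decoded bases
    }
}
```

String.valueOf(char[]) does exist and copies the characters, which is probably why the byte[] call looks plausible at first glance.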


2945 Commits (a46e22ed13eccbd142fac40e3c5aa2db06a3ea97)
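
Commits 724affc3cc and d4c66d6191 both describe normalizing in log space before converting back to real-valued probabilities, so that very small likelihoods do not all collapse to 0.00. The following is a minimal sketch of that general technique, not the code from either commit: the class name, the method name, and the use of log10 values are assumptions made for illustration.

```java
// Hypothetical demo class, not part of GATK.
public class LogSpaceNormalize {

    /**
     * Converts log10-scaled likelihoods to probabilities that sum to 1.
     * The maximum is subtracted before exponentiating, so the largest term
     * becomes 10^0 = 1 and the sum cannot underflow to zero.
     * Assumes a non-empty input with at least one finite value.
     */
    static double[] normalizeFromLog10(double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            max = Math.max(max, v);
        }
        double[] out = new double[log10Values.length];
        double sum = 0.0;
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.pow(10.0, log10Values[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Converted directly, each of these is 10^-400 or smaller and
        // underflows to 0.0 in double precision.
        double[] log10Likelihoods = {-400.0, -400.3, -401.0};
        for (double p : normalizeFromLog10(log10Likelihoods)) {
            System.out.println(p);
        }
    }
}
```

Without the subtraction step, all three values would convert to 0.0, which matches the 0.00,0.00,0.00 triplet problem described in d4c66d6191.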