gatk-3.8

Commit Graph

Author	SHA1	Message	Date
bthomas	300a18b85f	Updating the way reference data is processed, so GATK creates the .fasta.fai and .dict files automatically. If either (or both) don't exist, GATK will create them in the same folder as the fasta file. If it can't write the file, GATK will fail with a message to create them manually. Note that this functionality will only work if the directory with the fasta is writeable. GATK will fail if the directory is read-only and either the .fasta.fai or .dict files don't exist. In the future, we could have these references be created in memory, but we decided against it this time. Locking was also added to ReferenceDataSource so no issues come up while running multiple GATKs on the same reference: we don't want one process to be half-finished and another try to read it. So, you could see error messages related to locking. See ReferenceDataSource.java for explanation of the locking strategy. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3601 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 21:42:42 +00:00
ebanks	df1cadc4c9	Fix NullPointerException when priority list is left out git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3600 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 13:46:54 +00:00
hanna	c806ffba5f	Switching over DownsamplingLocusIteratorByState -> LocusIteratorByState. Some operations will not be as fast as they could be because the workflow is currently merge sam records (sharding) -> split sam records (LocusIteratorByState) -> merge records (LocusIteratorByState) -> split records (StratifiedAlignmentContext), but this will be fixed when StratifiedAlignmentContext is updated to take advantage of the new functionality in ReadBackedPileup. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3599 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 02:11:42 +00:00
hanna	1d50fc7087	Misc bug fixes: fix tracking of nInsertions with sample-split pileup constructor. Fix performance issue building up pileups from pileups of individual sample data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3598 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-20 20:32:27 +00:00
hanna	f18ac069e2	A refactoring / unification of ReadBackedPileup and ReadBackedExtendedEventPileup. Provides a cleaner interface with extended events inheriting all of the basic RBP functionality. Implementation is still slightly messy, but should allow users to provide separate implementations of methods for sample split pileups and unsplit pileups for efficiency's sake. Methods not covered by unit/integration tests have not been sufficiently tested yet. Unit tests will follow this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3597 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-20 04:42:26 +00:00
depristo	57a13805da	GATK now uses an optimized indexing scheme in Tribble. 5x or more performance gain on files with many genotypes. Updated integration test that was failing and was clearly wrong. DB=; isn't a valid annotation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3596 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-19 21:36:41 +00:00
kiran	8ff93f77e6	Added evaluation module to count functional classes (missense, nonsense, etc.). At the moment, it only understands Cancer's MAF annotations. Added integration test for the functional class counting. Added better description for VariantEval. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3595 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 21:51:40 +00:00
rpoplin	724affc3cc	Major bug fixes for the Variant Recalibrator. Covariance matrix values are now allowed to be negative. When probabilities are multiplied together the calculation is done in log space, normalized, then converted back to real valued probabilities. Clustering weights have been changed to only use HapMap and by-1000genomes sites. The -nI argument was removed and now clustering simply runs until convergence. Test cases seem to work best when using just two annotations (QD and SB). More changes are in the works and are being evaluated. Misc fixes to walkers that use RScript due to CentOS changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3590 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 17:37:11 +00:00
hanna	5050b19457	We're unable to make the naive deduper more worldly, so we're killing it instead. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3587 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 13:54:27 +00:00
aaron	b978d5946b	adding changes for VCF 4, mostly in the way we handle VCF headers. The header fields are now aware of the differences between different VCF formats. There was also a bunch of clean-up of out-of-spec VCF used in the tests (mismatched VCF file format fields, etc), and updates to the associated integration tests. Also some logging statements for BTI. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3584 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 08:23:23 +00:00
hanna	48cbc5ce37	Merging the sharding-specific inherited classes down into the base. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3581 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 22:36:13 +00:00
hanna	612c3fdd9d	First pass at eliminating the old sharding system. Classes required for the original sharding system are gone where I could identify them, but hierarchies that split to support two sharding systems have not yet been taken apart. @Eric: ~4k lines. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 20:17:31 +00:00
hanna	c1595a383a	More bugfixes for cases where no sample name is present. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3578 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 16:46:02 +00:00
hanna	5972ad1199	Fixes to mrl integration. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3573 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 20:40:10 +00:00
ebanks	b75ded61b8	Removing obsolete rod; no longer needed given previous addition to SampleUtils. JIRA GSA-318 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3572 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 20:03:14 +00:00
kshakir	c671864228	Re-allowing blacklist by read group id. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3571 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:45:44 +00:00
ebanks	01ffa307c2	When going NWay out in the cleaner, use the new merged header (instead of the original one) for each bam file so that it matches the new uniquified read group ids in the reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3569 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:36:36 +00:00
kshakir	05c2f96bb4	Small update to the command line docs for read_group_black_list. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3568 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:23:34 +00:00
ebanks	d7f3102c3f	Fixed read group blacklist filter to look only at read groups (and not the reads themselves). Otherwise, it fails when attribute tags with different meanings show up in both places (e.g. SM). Added performance improvement. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3567 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:14:37 +00:00
hanna	e77f76f8e1	Reenabled downsampling by sample after basic sanity testing and fixes of the new implementation. Hard testing and performance enhancements are still pending. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3566 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 17:23:27 +00:00
ebanks	7a91dbd490	Renamed some of the column names in Ti/Tv and Concordance modules so that they are clearer. Removed ValidationRate module (it was busted). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3564 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 15:53:06 +00:00
delangel	8cb16a1d45	a) Cleanup, remove -input argument from BeagleOutputToVCFWalker since it's not needed. b) Added back old Beagle ROD to maintain backward compatibility (does anyone even use this???) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3563 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 02:13:08 +00:00
delangel	d319a28be7	Complete rewrite of the Beagle functionality to read from Beagle output files and produce VCF with modified genotypes. Now, a new ROD system using Tribble is in place. Beagle inputs are set using -B beagleType,Beagle,pathToBeagleFile, where beagleType can be either beagleR2, beagleLike, beaglePhased or beagleR2 (BeagleOutputToVCFWalker requires all of the above). Only pending items: -input argument is now unused and can be removed, will be cleaned later. Wiki will be updated with new usage shortly. We can now run with a reduced memory footprint, and output VCF is exactly identical to previous version. Drawback is increased runtime because Tribble has to create an index for all the Beagle files when starting if the idx files are missing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3562 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 02:01:35 +00:00
aaron	d265397bf6	removing a reference to a unused internal Sun class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3560 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 15:27:57 +00:00
asivache	42b8a8f295	slight change in output format git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3559 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 14:52:04 +00:00
hanna	8a895f481f	Proper exception chaining for troubleshooting Sendu's issue. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3556 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 01:38:36 +00:00
asivache	9666d47d17	ooops, debug print now removed git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3550 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 18:07:12 +00:00
asivache	4ab1f440c3	A new argument: --targetIntervalsSorted (boolean flag). If specified, the interval file is assumed to be sorted (duh!) and it is NOT slurped into the memory but instead traversed directly on disk as needed. If the file turns out to be unsorted, an exception will be thrown at the point where inconsistency occurs (can be late into the processing!). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3547 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 16:00:22 +00:00
asivache	f137bf8f85	now adaptor silently skips empty lines in the underlying string iterator git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3545 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:35:07 +00:00
asivache	d51e6c45a7	a utility class; turns string iterator into GenomeLoc iterator git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3542 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 14:07:44 +00:00
hanna	c3b68cc58d	Rethinking DownsamplingLocusIteratorByState with a flattened read structure. Samples are kept independent while processing, and only merged back in a priority queue if necessary in a special variant of the ReadBackedPileup. This code is not live yet except in the case of naive deduping. Downsampling by sample temporarily disabled, and the ReadBackedPileup variant is sketchy and not well integrated with StratifiedAlignmentContext or the walkers. Cleanup to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3540 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-13 01:47:02 +00:00
ebanks	8c28be5933	Fixing a VCF bug for Sendu: we weren't emitting flags (booleans) correctly in VCF3.3 (rev'ed tribble for this). Updated dbsnp/hapmap membership info fields to be flags now instead of ints. While I was there, I added the change in the Annotator for Jan to force reads to be from a specific sample. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3536 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 16:42:06 +00:00
ebanks	22620ba95c	Adding "abi_solid" to the list of known platforms. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3534 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 13:37:19 +00:00
ebanks	63ad71cca6	Fix busted code. Note for all: String.valueOf(byte[]) doesn't work. You must use new String(byte[]). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3533 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 05:01:48 +00:00
bthomas	99b684ea89	Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 20:10:23 +00:00
hanna	f55f32d4ee	Bug fix. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3526 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 01:53:26 +00:00
ebanks	ca4eab1d23	Now annotations that require reads return null if there's no alignment context, so that running without reads adds annotations only for the appropriate fields. Added an integration test for the read-less case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3525 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 20:36:46 +00:00
hanna	dbee21a50f	Bugfixes for the case when no read groups / no samples are available. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3523 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 18:47:05 +00:00
aaron	4f00e265a8	quick update for a change I implemented for Ryan git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3519 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:23:31 +00:00
aaron	ad98512f6c	adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:21:12 +00:00
ebanks	9b2fcc4711	Refactoring of the annotation system: 1. VA is now a ROD walker so it no longer requires reads (needs a little more testing) 2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs) 3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong. Fixed the headers too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:05:51 +00:00
hanna	84563b37e5	Partial flattening of the hanger data structure. Hanger data structure is not currently as flat as it could / should be, but it's already comparable to the speed of the reference implementation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3512 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 16:28:49 +00:00
hanna	c2858c8988	Minor performance enhancement. Checkpoint commit before major performance overhaul. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3504 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 21:39:39 +00:00
chartl	5ed2818ffb	Forgot to commit code I relied upon git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3503 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 21:01:35 +00:00
hanna	199e4208cd	Bug fixes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3497 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 00:30:33 +00:00
hanna	52ab9f2417	Feature parity between LocusIteratorByState, DownsamplingLocusIteratorByState, including pushing mrl / the LocusOverflowTracker into LocusIteratorByState. Note that the 'Matt Hanna exception', is still enabled because I haven't yet validated the performance of the DownsamplingLocusIteratorByState when running without downsampling. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3496 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-07 22:58:21 +00:00
hanna	5c4d070566	Push Mark's changes in LocusIteratorByState into DownsamplingLocusIteratorByState in preparation for merging the two into one. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3495 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-07 17:29:30 +00:00
depristo	6eeb1693ca	JEXL2 upgrade. Improvements to JEXL processing including dynamically resolving variable -> value bindings instead of up front adding them to a map. Performance improvements and code cleanup throughout. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3494 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-07 00:33:02 +00:00
hanna	c1ecf75dd5	Update to the latest rev of the picard sharding patch. Includes updates reflecting the imminent move of IlluminaUtil into picard public. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3493 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-06 20:33:21 +00:00
depristo	3ea506fe52	No more new Allele() -- must use create. Allelel simple alleles are now cached for efficiency reasons. VCF4 codec optimizations -- 4x performance in general. Now working in general but hooked up to the ROD system now as VCF4. WARNING -- does not actually work with indels, genotype filters, etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3489 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 23:03:55 +00:00
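
The note in commit 63ad71cca6 ("String.valueOf(byte[]) doesn't work. You must use new String(byte[]).") can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the GATK source. The explicit US-ASCII charset is also an assumption made for the example; the commit message only mentions new String(byte[]), which uses the platform default charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of GATK.
public class ByteArrayToString {

    // Decodes the bytes into text. The charset is passed explicitly here
    // (an assumption for this sketch) so the result does not depend on the platform default.
    static String decode(byte[] bases) {
        return new String(bases, StandardCharsets.US_ASCII);
    }

    // There is no String.valueOf(byte[]) overload, so this call resolves to
    // String.valueOf(Object) and returns the array's identity string
    // (type tag plus hash code), not the decoded characters.
    static String viaValueOf(byte[] bases) {
        return String.valueOf(bases);
    }

    public static void main(String[] args) {
        byte[] bases = {'A', 'C', 'G', 'T'};
        System.out.println(decode(bases));     // the four bases as text
        System.out.println(viaValueOf(bases)); // an identity string beginning with "[B@"
    }
}
```

The pitfall is easy to miss because String.valueOf(char[]) does exist and returns the characters, so the byte[] call compiles without warning and only misbehaves at run time.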

1507 Commits (de9f1f575f2a322effc938dce7c9f339496976df)