Commit Graph

108 Commits (7880863eb7d0d79b2df34df05103c604a85e7124)

Author SHA1 Message Date
depristo 7880863eb7 Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 15:07:38 +00:00
depristo 7ad8fbdd5a Moved GATKException to exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:47:19 +00:00
depristo 595907e98e Moving StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4262 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:34:15 +00:00
depristo 40e6179911 Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:02:43 +00:00
depristo 8f1a32acae All exceptions thrown by the GATK have been reviewed and UserErrors replaced where appropriate. Shazam. Another check-in will remove the GATKException and restore the StingException.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4252 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 15:25:30 +00:00
depristo 1de713f354 Massive review of maybe 50% of the exceptions in the GATK. GATKException is a tmp. tracker so that I can tell which StingExceptions I've reviewed. Please don't use it. If you are working on new code and are considering throwing exceptions, it's either UserError or StingException, please
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4246 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 23:21:17 +00:00
depristo 6a30617a60 Initial implementation of UserError exceptions and error message overhaul. UserErrors and their subclasses UserError.MalFormedBam for example should be used when the GATK detects errors on part of the user. The output for errors is now much clearer and hopefully will reduce GS posts. Please start using UserError and its subclasses in your code. I've replace some, but not all, of the StingExceptions in the GATK with UserError where appropriate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4239 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 11:32:20 +00:00
hanna d773b3264b Eliminated -mrl option.
Eliminated -fmq0 option.
Eliminated read group hallucination.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4133 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-26 21:38:03 +00:00
hanna 3dc78855fd Command-line argument tagging is in, and the ROD system is hacked slightly to support the new syntax
(-B:name,type file) as well as the old syntax.  Also, a bonus feature: BAMs can now be tagged at the
command-line, which should allow us to get rid of some of the hackier calls in GenomeAnalysisEngine.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4105 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-25 03:47:57 +00:00
hanna 41d57b7139 Massive cleanup of read filtering.
- Eliminate reduncancy of filter application.
- Track filter metrics per-shard to facitate per merging.
- Flatten counting iterator hierarchy for easier debugging.
- Rename Reads class to ReadProperties and track it outside of the Sting iterators.
Note: because shards are currently tied so closely to reads and not the merged triplet of <reads,ref,RODs>, the metrics
classes are managed by the SAMDataSource when they should be managed by something more general.  For now, we're hacking
the reads data source to manage the metrics; in the future, something more general should manage the metrics classes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4015 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 20:17:11 +00:00
hanna 6b4a1e3b9f Reenabling code that was commented out after it was confirmed to work by many participating in this thread:
http://getsatisfaction.com/gsa/topics/error_thrown_when_reading_reference_file


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3981 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 00:12:09 +00:00
depristo f275041b1c -minimalVCF for CombineVariants. Work around for broken locking code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3960 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 16:10:59 +00:00
hanna f13d52e427 Attempt to determine whether underlying filesystem supports file locking and
disable on-the-fly dict and fai generation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3938 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 19:28:27 +00:00
aaron 9667942e52 fix for Ryan's issue: we also need to sync when we store a resource.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3881 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-26 22:17:47 +00:00
aaron f4cfb0f990 The first step in integrating Jim's tree based index scheme:
- changed to a better method for getting headers from Codecs
- some removal of old commented out code in the GATKAgrumentCollection
- changes for the rename of FeatureReader to FeatureSource
- removed the old Beagle ROD
- cleaned up some of the code in SampleUtils

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3826 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 04:49:27 +00:00
bthomas f65cba6b9a Adding support for shared file locking via a new class for file locking, FSLockWithShared. This will eventually take over for FSLock, the current file locking class - I'll work with Aaron to merge the tribble code that uses FSLock right now.
FYI: creating an exclusive lock on a file that does not exist will create that file as an empty file, and will NOT delete that file after the program terminates. So watch out if it's possible that the file you're locking does not exist - could end up leaving extra files that confuse users.  



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3795 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 20:45:51 +00:00
hanna 9fc05ac2ae eagerDecode is now false.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3733 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 22:51:48 +00:00
hanna cab8394103 The sharding system now buffers reads, with a size determined by command-line argument. Will investigate whether/how this
impacts performance on low-pass data and, if it works well, will create a more automatic version of the tool.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3709 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 22:28:55 +00:00
aaron 4903d1fb4f fix for a parallelization issue: moving the creation of iterators outside of the sync block so we don't wait for RMD tracks to seek to the correct location. Thanks to Ben for providing the test case!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3703 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 16:37:02 +00:00
hanna 4995950d04 IndexedFastaSequenceFile is now in Picard; transitioning to that implementation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3701 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 04:40:31 +00:00
bthomas 300a18b85f Updating the way reference data is processed, so GATK creates the .fasta.fai and .dict files automatically. If either (or both) don't exist, GATK will create them in the same folder as the fasta file. If it can't write the file, GATK will fail with a message to create them manually.
Note that this functionality will only work if the directory with the fasta is writeable. GATK will fail if directory is read only and and either the .fasta.fai or .dict files don't exist. In the future, we could have these references be created in memory, but we decided against it this time. 

Locking was also added to ReferenceDataSource so no issues come up while running multiple GATKs on the same reference: we don't want one process to be half-finished and another try to read it. So, you could see error messages related to locking. See ReferenceDataSource.java for explanation of the locking strategy. 



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3601 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 21:42:42 +00:00
hanna 48cbc5ce37 Merging the sharding-specific inherited classes down into the base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3581 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 22:36:13 +00:00
hanna 612c3fdd9d First pass at eliminating the old sharding system. Classes required for the original sharding system
are gone where I could identify them, but hierarchies that split to support two sharding systems have
not yet been taken apart.
@Eric: ~4k lines.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 20:17:31 +00:00
hanna 8a895f481f Proper exception chaining for troubleshooting Sendu's issue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3556 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 01:38:36 +00:00
bthomas 99b684ea89 Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:10:23 +00:00
aaron 0b03e28b60 updating the tribble library to include the reference dictionary reading / writing. We now check the dictionaries of any tracks that have them against the reference (all new tribble tracks and out-of-date tracks will have this). Also renamed some classes to be more reflective of their function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3485 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 06:34:26 +00:00
depristo dfc36c1e95 Restructuring of the mandatory read filters for traversals. Now everything uses ReadFilters, even for the required filters like being mapped for LocusWalkers. Statistics now tracked for each read filter used during the traversal and info emitted in INFO at the end.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3445 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 22:12:25 +00:00
hanna 017ab6b690 Experimental versions of downsampler and Ryan's deduper are now available either
as walker attributes or from the command-line.  Not ready yet!  Downsampling/deduping 
works in a general sense, but this approach has not been completely optimized or validated.
Use with caution.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3392 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 05:40:05 +00:00
aaron 7cfb9ff3dc updates for Tribble 82, fixes for Ryans case where multiple processes would attempt to read/write to the same index, and a couple other Tribble-centric bug fixes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3382 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 19:34:45 +00:00
aaron 2c55ac1374 fixes for parallel processing problems with Tribble, a small bug in the resource pool, and some more documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3349 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-12 06:13:26 +00:00
hanna 76efa757f0 Switched over to reviewed version of Picard patch. In process, did some optimization to the IntervalSharder
which improved startup time 5-10x when dynamically merging many BAMs.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3331 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-08 14:12:22 +00:00
aaron f497213933 DbSNP moved over to tribble
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3288 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-03 06:02:35 +00:00
aaron cbed0b1ade Adding GeliText tribble track as the first enabled Tribble track. This mean 'Variants' is no longer valid for a ROD type, use GeliText instead. I've updated all the references in the codebase.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3271 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-29 22:50:17 +00:00
aaron 4d75b26b7a Removing the code that made the ROD system case insensitive. Anyone using specific ROD names in their classes should take care in naming required tracks; All lowercase is the best practice.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3184 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 06:17:31 +00:00
aaron 3d3d19a6a7 the last-mile commit for Tribble integration. The system is now ready for Tribble to be turned on, as soon as we've removed any dependencies in the ROD code on interfaces that aren't in the Tribble library (i.e. the Variation or Genotype interface on RODs). All of the walkers should be up to date.
a caveat: for anyone asking for all of the ROD's back from the RefMetaDataTracker (if your not using the facilities to get the track by name), you'll now be getting back a collection of GATKFeature objects.  This object will contain the track name, and a method for getting the underlying object (getUnderlyingObject()), which will be the traditional RodVCF, rodDbSNP, etc.  This layer is needed so we can integrate Tribble tracks (which don't natively have names).  Calls that ask for RODs by name will still get back the traditional reference ordered data objects (RodVCF, rodDbSNP, etc).

Sorry for the inconvenience!  More changes to come, but this is by far the largest (as has the greatest effect on end users).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3104 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 22:39:56 +00:00
hanna 4fcee248f9 For Kristian: functions which, given a read, can uniquely identify the BAM file storing that read.
Introducing this into the pile of code which peeks under the covers of the SAMDataSource in the hopes
that this function can help to replace the others and provide a single path for crosstalk.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3103 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 20:46:44 +00:00
hanna b60197ae10 Another round of cleanup and simplification in Picard -- Picard's unit tests
are now passing for my branch.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3100 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 01:02:59 +00:00
hanna 400684542c Revisions to take into account finalization of Picard patch: naming changes, better definition
of public interfaces.  This won't be the last Picard patch, but it should be the last big one.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3096 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-30 19:28:14 +00:00
hanna 85037ab13f Fix for Kiran's sharding issue (Invalid GZIP header). General cleanup of
Picard patch, including move of some of the Picard private classes we use to Picard public.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3087 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 03:21:27 +00:00
hanna 884a577013 Phase 2 of Picard patch refactoring: kill off SAMFileReader2/BAMFileReader2, merging the changes back into the base classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3065 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-23 16:48:11 +00:00
hanna b4b4e8d672 For Sarah Calvo: initial implementation of read pair traversal, for BAM files
sorted by read name.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3052 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-21 23:22:25 +00:00
hanna c0eb5c27ea Lower memory support for merged sharding. Merged sharding is still not available.
WARNING: If you update frequently, you might have to rm -rf ~/.ant/cache -- this is an unfortunate side effect of the way we
	 distribute picard-private.jar.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3050 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 22:03:47 +00:00
hanna 849bd1f451 Set the eagerDecode flag in such a way that the binary data block in the BAM will always be considered dirty.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3014 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 22:01:23 +00:00
hanna 59045ccb28 Filter,merge performs much better than merge,filter. Many thanks to Eric for checking in an integration test that so compellingly demonstrates this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3011 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 16:23:37 +00:00
hanna 6dd5f192e7 Performance improvements for RODs in conjunction with new sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3010 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 14:54:12 +00:00
hanna a7ba88e649 Rework the way the MicroScheduler handles locus shards to handle intervals that span shards
with less memory consumption.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2981 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-11 18:40:31 +00:00
aaron dde9fd8a15 some rods-for-reads cleaning and performance improvements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2979 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 22:54:58 +00:00
hanna eb165ca844 Celebrate the fact that the new sharding system works with integration tests
by removing the scary debug line.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2950 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 23:40:56 +00:00
hanna 9e107513d0 In the new sharding system, if no read group is present, hallucinate one. Added
for test compatibility, but not sure whether we still need this feature.  TODO: Poll the group about this feature.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2949 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 23:01:34 +00:00
hanna a7fe07c404 A few stopgap fixes to get the GATK to the point where the old sharding
infrastructure can be torn down:
1) New sharding system emulates old MonolithicSharding mechanism.
2) Better awareness of differences between fasta and BAM files when creating
   shards.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2948 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 21:01:25 +00:00