gatk-3.8

Commit Graph

Author	SHA1	Message	Date
hanna	af31d02a2d	Fix concurrency issue that periodically kills VariantEvalIntegrationTest -- a member field of RMDTrackBuilder was getting rebuilt every time it was called, creating concurrency issues. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5001 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-14 18:52:21 +00:00
hanna	6d855041ec	Oops...forgot to commit the changes that allow primitive VCF streaming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4979 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-12 21:54:51 +00:00
aaron	7916ab0ed5	remove the index each run git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4976 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-12 17:38:22 +00:00
aaron	cba436fa2f	small fix for the table codec; if you see a header line, you know you've finished parsing the header. Also some changes to return the ref ordered data pool test to using MappedStreamSegment instead of EntireStream git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4942 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 21:20:26 +00:00
hanna	0982d35f5b	Bug fixes in streaming in Tribble data via /dev/stdin. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4935 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 02:43:04 +00:00
hanna	3fc9862964	Unit test fixed - Tribble codecs aren't designed to be stateless, but I was using one as though it was. Fixed, and debug code reverted. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4917 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-31 17:47:52 +00:00
hanna	b9cb57f4b9	A unit test is failing on bamboo in a way I can't reproduce (or even explain). Checking in some debugging info. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4916 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-31 16:35:04 +00:00
hanna	cba18116e4	A significant refactoring of the ROD system, done largely to simplify the process of streaming/piping VCFs into the GATK. Notable changes: - Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build RMDTracks and look up codecs. - RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class are the walkers that feel the need to access the ROD system directly. (We need to stamp out this access pattern.) A few minor warts were introduced as part of this process, labeled with TODOs. These'll be fixed as part of the VCF streaming project. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-31 04:52:22 +00:00
hanna	e313eeede8	Push command-line expansions, such as BAM list unpacking and -B tag parsing, out into the CommandLine* classes. This makes it easier for external functionality (such as the VCF streamer) to use GenomeAnalysisEngine directly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4897 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-22 19:00:17 +00:00
depristo	44feb4a362	Improved BAQ implementation. Now supports adding BAQ tags to reads on the fly with ADD_TAG_ONLY option. Caching fasta reader implementation, and changes throughout the system to enable this. Many performance improvements throughout the system due to better reference access patterns. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4792 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-05 18:29:39 +00:00
bthomas	374c0deba2	Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-19 19:59:05 +00:00
hanna	90711d445c	Change the interface for RMDTrackBuilder, therefore always mandating the specification of a sequence dictionary and related info. This will hopefully eliminate the cases in which the refseq track depends on a sequence dictionary / contig parser that hasn't been specified. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4700 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-17 19:00:17 +00:00
hanna	5b83942cee	- Fix DepthOfCoverage so that, when it abuses the ROD system by instantiating a track in onTraversalDone, it also supplies the correct sequence dictionary and parser. - Changed RMDTrackBuilder to use SequenceDictionaryUtils.validateDictionaries for ref <-> ROD sequence dictionary validation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4683 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-15 20:34:04 +00:00
hanna	8e36a07bea	Convert GenomeLocParser into an instance variable. This change is required for anything that needs to be simultaneously aware of multiple references, eg Queue's interval sharding code, liftover support, distributed GATK etc. GenomeLocParser instances must now be used to create/parse GenomeLocs. GenomeLocParser instances are available in walkers by calling either -getToolkit().getGenomeLocParser() or -refContext.getGenomeLocParser() This is an intermediate change; GenomeLocParser will eventually be merged with the reference, but we're not clear exactly how to do that yet. This will become clearer when contig aliasing is implemented. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-10 17:59:50 +00:00
hanna	861ee3e37a	Changing testing framework from junit -> testng, for its enhanced configurability. Initial test to see how Bamboo will respond. More detailed email to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4609 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-01 21:31:44 +00:00
bthomas	96cccafb0d	Adding a few helper methods for accessing sample metadata, and associated unit tests. These are motivated by discussion with Ryan about how he'll use sample metadata in VariantEvalwalker - hopefully will make it easier for him. Methods are: -- getToolkit().subContextFromSampleProperty(): filters a VariantContext to genotypes that come from samples that have a given property value -- getToolkit().getSamplesWithProperty(): gets all samples with a given property -- getToolkit().getSamplesFromVariantContext(): sample objects that are referenced by name in a VariantContext git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4361 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-28 02:16:25 +00:00
kshakir	edaa278edd	Removed cases where various toolkit functions were accessing GenomeAnalysisEngine.instance. This will allow other programs like Queue to reuse the functionality. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-25 02:49:30 +00:00
kshakir	4ed9f437e9	Sliced the GAE in half like a gordian knot to avoid the constant merge conflicts. The GAE half has all the walker specific code. The new "Abstract" GAE has the rest of the logic. More refactoring to come, with the end goal of having a tool that other java analysis programs (Queue, etc.) can use to read in genomic data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4339 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-23 23:28:55 +00:00
hanna	8f75d88519	Fix for GATK run report ids: mOVsxGfDiiSMxVs2PPTVjzYTVbizlD6e f9kUHUADFsZ0LiTGxRL5zPmq9kZcA4cQ 8eGHWJFAlBVmgxwPi3sMd1RmiN2PwHOf iLhvHWveypKb2F8vKS5irHylc3pYvlOb HDttXKUMEVoPrvVeWrH7E0htxYyNydMx plus a bit of cleanup of custom exceptions in the sharding system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4330 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-22 19:49:25 +00:00
bthomas	c6c6d32b46	Quickly adding a new convenience method for retrieving a group of samples. The method is getSamples(Collection<String>) and returns a set of sample objects. There's also a test there. Ryan is using this to modify VCF code today... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4303 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-17 15:55:17 +00:00
bthomas	f66ef4626e	Fixing two minor issues: 1) adding a new error message if the user adds a fasta file in a directory that doesn't exist; 2) renaming my sample unit tests so they actually run. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4299 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-16 20:45:51 +00:00
aaron	782e0018e4	removal of most of the old GATK ROD system; also a fix for -Dsingle so we can again run just a single unit or integration test (single tests in tribble can be run with the -DsingleTest option now). More to come. * Three integration tests had to change: * RecalibarationWalkersIntegrationTest: One of the tests was using the interval as the snp track, and wasn't supplying a DbSNP track (for CountCovariates) SequenomValidationConverterIntegrationTest: relies on Plink ROD which we've removed. PileupWalkerIntegrationTest: we no longer have implicit interval tracks, so there isn't a rod name over the specified region. Otherwise the same result. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4292 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-15 22:54:49 +00:00
bthomas	e5f81d25d4	Adding the --sample-metadata (-SM) command line argument and associated functionality. This is something Matt and I have been working on for a while. Basically, it allows you to integrate sample metadata into an analysis, by including a sample file. More detailed documentation is on the wiki: http://www.broadinstitute.org/gsa/wiki/index.php/Adding_Sample_data_to_an_analysis This commit adds two important classes: Sample, which contains data about one sample; and SampleDataSource, which manages sample data a la ReferenceDataSource and ReadsDataSource. This code should be stable, but it has not been integrated with existing walkers yet. That's the next commit. In the meantime, feel free to experiment with the code - there are two basic example walkers in the playground.sample package. And PLEASE let me know if you see any errors/inconsistencies. Note that this also adds a new dependency on SnakeYaml, a YAML parser. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4285 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-15 11:50:22 +00:00
depristo	7880863eb7	Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 15:07:38 +00:00
depristo	7ad8fbdd5a	Moved GATKException to exceptions git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:47:19 +00:00
depristo	595907e98e	Moving StingException git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4262 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:34:15 +00:00
depristo	40e6179911	Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:02:43 +00:00
hanna	3dc78855fd	Command-line argument tagging is in, and the ROD system is hacked slightly to support the new syntax (-B:name,type file) as well as the old syntax. Also, a bonus feature: BAMs can now be tagged at the command-line, which should allow us to get rid of some of the hackier calls in GenomeAnalysisEngine. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4105 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-25 03:47:57 +00:00
hanna	41d57b7139	Massive cleanup of read filtering. - Eliminate redundancy of filter application. - Track filter metrics per-shard to facilitate per merging. - Flatten counting iterator hierarchy for easier debugging. - Rename Reads class to ReadProperties and track it outside of the Sting iterators. Note: because shards are currently tied so closely to reads and not the merged triplet of <reads,ref,RODs>, the metrics classes are managed by the SAMDataSource when they should be managed by something more general. For now, we're hacking the reads data source to manage the metrics; in the future, something more general should manage the metrics classes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4015 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-11 20:17:11 +00:00
ebanks	340bd0e2c1	Removed hard-coded pointers to references git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3934 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-04 17:59:37 +00:00
hanna	4995950d04	IndexedFastaSequenceFile is now in Picard; transitioning to that implementation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3701 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-01 04:40:31 +00:00
hanna	48cbc5ce37	Merging the sharding-specific inherited classes down into the base. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3581 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 22:36:13 +00:00
hanna	612c3fdd9d	First pass at eliminating the old sharding system. Classes required for the original sharding system are gone where I could identify them, but hierarchies that split to support two sharding systems have not yet been taken apart. @Eric: ~4k lines. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 20:17:31 +00:00
hanna	db1383d0b2	Rev the latest version of Picard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3575 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 23:55:07 +00:00
depristo	e2b41082af	GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integrationtest data. Note that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-02 22:26:32 +00:00
depristo	2b02324587	Support for detecting and automatically excluding reads reading into the adaptor sequence and, if desired, also only showing the first pair when two reads overlap in the fragment. Not enabled, an intermediate check in before updating and verifying the impact on locus walkers everywhere. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3465 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-30 18:00:12 +00:00
aaron	a4d834cc01	fixing the test I broke git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3454 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-28 02:06:20 +00:00
depristo	6ce3835622	Removing unused methods in QualityUtils; ReferenceContext now converting all bases to upper case, but can be disabled with static boolean git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3399 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 12:38:06 +00:00
depristo	8a725b6c93	Restructuring of ReferenceContext and ReadWalkers to accept a ReferenceContext. Now ReferenceContext is byte[] backed not char[]. Please no more chars for the reference. All of the tests pass now. Coming check-ins are going to clean up the char / byte problems in the GATK git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3397 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-19 23:27:55 +00:00
aaron	2c55ac1374	fixes for parallel processing problems with Tribble, a small bug in the resource pool, and some more documentation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3349 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-12 06:13:26 +00:00
aaron	4014a8a674	A long overdue correction; all unit tests now end in 'UnitTest'. This was something we wanted to do for a while, and now with the performance tests coming, it was a good time to clean-up. Please label any new test appropriately: UnitTest and IntegrationTest are the two valid file name patterns for tests. Thanks! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3135 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-08 06:14:15 +00:00
aaron	3d3d19a6a7	the last-mile commit for Tribble integration. The system is now ready for Tribble to be turned on, as soon as we've removed any dependencies in the ROD code on interfaces that aren't in the Tribble library (i.e. the Variation or Genotype interface on RODs). All of the walkers should be up to date. a caveat: for anyone asking for all of the RODs back from the RefMetaDataTracker (if you're not using the facilities to get the track by name), you'll now be getting back a collection of GATKFeature objects. This object will contain the track name, and a method for getting the underlying object (getUnderlyingObject()), which will be the traditional RodVCF, rodDbSNP, etc. This layer is needed so we can integrate Tribble tracks (which don't natively have names). Calls that ask for RODs by name will still get back the traditional reference ordered data objects (RodVCF, rodDbSNP, etc). Sorry for the inconvenience! More changes to come, but this is by far the largest (and has the greatest effect on end users). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3104 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-31 22:39:56 +00:00
aaron	5079f35e40	better method names for read based reference ordered data access. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3069 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-24 16:13:31 +00:00
asivache	ee1dc6092f	Test updated. Now we do not throw an exception when locus interval is out of bounds, we just return silently a reference context trimmed to the current shard boundaries. New test checks for trimming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3058 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-22 17:37:52 +00:00
hanna	6dd5f192e7	Performance improvements for RODs in conjunction with new sharding system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3010 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-16 14:54:12 +00:00
aaron	88a48821ea	removed the dependence on removeRegion() in GenomeLocSortedSet git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2993 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 22:35:49 +00:00
aaron	661a043cef	adding methods to get RODs by name or type in read traversals, performance improvements to RODs for Reads in general, and some more Tribble infrastructure. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2984 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-11 21:13:39 +00:00
hanna	a7ba88e649	Rework the way the MicroScheduler handles locus shards to handle intervals that span shards with less memory consumption. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2981 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-11 18:40:31 +00:00
aaron	dde9fd8a15	some rods-for-reads cleaning and performance improvements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2979 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-10 22:54:58 +00:00
hanna	a7fe07c404	A few stopgap fixes to get the GATK to the point where the old sharding infrastructure can be torn down: 1) New sharding system emulates old MonolithicSharding mechanism. 2) Better awareness of differences between fasta and BAM files when creating shards. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2948 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-07 21:01:25 +00:00

85 Commits (2b895ffb7ffafc7fdf02dcbb7048daeefc89544c)