gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	1c03ebc82d	Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP -- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure -- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map -- Cleanup misc. code as it came up -- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC	2013-06-10 16:20:31 -04:00
David Roazen	a7cb599945	Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage -Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200, since low dcov values can result in problematic downsampling artifacts for locus-based traversals. -Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals controls the number of reads per alignment start position, and even a dcov value of 1 might be safe/desirable in some circumstances. -Also reorganize the global downsampling defaults so that they are specified as annotations to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the DownsamplingMethod class. -The default downsampling settings have not been changed: they are still -dcov 1000 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.	2013-05-29 12:07:12 -04:00
Mark DePristo	39e4396de0	New ActiveRegionShardBalancer allows efficient NanoScheduling -- Previously we used the LocusShardBalancer for the haplotype caller, which meant that TraverseActiveRegions saw its shards grouped in chunks of 16kb bits on the genome. These locus shards are useful when you want to use the HierarchicalMicroScheduler, as they provide fine-grained access to the underlying BAM, but they have two major drawbacks (1) we have to fairly frequently reset our state in TAR to handle moving between shard boundaries and (2) with the nano scheduled TAR we end up blocking at the end of each shard while our threads all finish processing. -- This commit changes the system over to using an ActiveRegionShardBalancer that combines all of the shard data for a single contig into a single combined shard. This ensures that TAR, and by extension the HaplotypeCaller, gets all of the data on a single contig together so that the NanoScheduler runs efficiently instead of blocking over and over at shard boundaries. This simple change allows us to scale efficiently to around 8 threads in the nano scheduler: -- See https://www.dropbox.com/s/k7f280pd2zt0lyh/hc_nano_linear_scale.pdf -- See https://www.dropbox.com/s/fflpnan802m2906/hc_nano_log_scale.pdf -- Misc. changes throughout the codebase so we use the ActiveRegionShardBalancer where appropriate. -- Added unit tests for ActiveRegionShardBalancer to confirm it does the merging as expected. -- Fix bad toString in FilePointer	2013-05-13 11:09:02 -04:00
David Roazen	4d56142163	Detect stuck lock-acquisition calls, and disable file locking for tests -Acquire file locks in a background thread with a timeout of 30 seconds, and throw a UserException if a lock acquisition call times out * should solve the locking issue for most people provided they RETRY failed farm jobs * since we use NON-BLOCKING lock acquisition calls, any call that takes longer than a second or two indicates a problem with the underlying OS file lock support * use daemon threads so that stuck lock acquisition tasks don't prevent the JVM from exiting -Disable both auto-index creation and file locking for integration tests via a hidden GATK argument --disable_auto_index_creation_and_locking_when_reading_rods * argument not safe for general use, since it allows reading from an index file without first acquiring a lock * this is fine for the test suite, since all index files already exist for test files (or if they don't, they should!) -Added missing indices for files in private/testdata -Had to delete most of RMDTrackBuilderUnitTest, since it mostly tested auto-index creation, which we can't test with locking disabled, but I replaced the deleted tests with some tests of my own. -Unit test for FSLockWithShared to test the timeout feature	2013-04-24 22:49:02 -04:00
David Roazen	2eac97a76c	Remove auto-creation of fai/dict files for fasta references -A UserException is now thrown if either the fai or dict file for the reference does not exist, with pointers to instructions for creating these files. -Gets rid of problematic file locking that was causing intermittent errors on our farm. -Integration tests to verify that correct exceptions are thrown in the case of a missing fai / dict file. GSA-866 #resolve	2013-04-02 18:34:08 -04:00
Eric Banks	ebd5404124	Fixed the add functionality of GenomeLocSortedSet. * Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown. * Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases. * Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs. * Added more unit tests for add functionality of GLSS. * Resolves GSA-775.	2013-02-28 23:31:00 -05:00
Yossi Farjoun	3a7c8c13e2	Re-enabled fastBAMindexing by replacing the FileChannel with a SeekableBufferedStream This helps a lot since FileChannel is very low-level and traversing the BAMIndex involves lots of short reads. - Fixed a deterioration in BAMIndex due to rev'ed picard (see below) - Added unit tests for SeekableBufferedStream - Added integrationTests for GATKBAMIndex (in PileupWalkerIntegrationTest) - Added a runtime-test to verify that the amount read equals the amount requested. - Added failing tests with expectedExceptions - Used a DataProvider to make code nicer	2013-02-14 17:51:15 -05:00
David Roazen	3744d1a596	Collapse the downsampling fork in the GATK engine With LegacyLocusIteratorByState deleted, the legacy downsampling implementation was already non-functional. This commit removes all remaining code in the engine belonging to the legacy implementation.	2013-01-28 01:50:30 -05:00
Mark DePristo	94cb50d3d6	Retire LegacyLocusIteratorByState -- Left in the remaining infrastructure for David to remove, but the legacy downsampler is no longer a functional option in the GATK	2013-01-11 15:17:18 -05:00
Mark DePristo	cc0c1b752a	Delete old LocusIteratorByState, leaving only new LIBS and legacy	2013-01-11 15:17:18 -05:00
Mark DePristo	0ac4352614	LIBS can now (optionally) track the unique reads it uses from the underlying read iterator -- This capability is essential to provide an ordered set of used reads to downstream users of LIBS, such as ART, who want an efficient way to get the reads used in LIBS -- Vastly expanded the multi-read, multi-sample LIBS unit tests to make sure this capability is working -- Added createReadStream to ArtificialSAMUtils that makes it relatively easy to create multi-read, multi-sample read streams for testing	2013-01-11 15:17:16 -05:00
Mark DePristo	b3ecfbfce8	Refactor LIBS into component parts, expand unit tests, some code cleanup -- Split out all of the inner classes of LIBS into separate independent classes -- Split / add unit tests for many of these components. -- Radically expand unit tests for SAMRecordAlignmentState (the lowest level piece of code) making sure at least some of it works -- No need to change unit tests or integration tests. No change in functionality. -- Added (currently disabled) code to track all submitted reads to LIBS, but this isn't accessible or tested	2013-01-11 15:17:16 -05:00
Mark DePristo	b2990497e2	Refactor LIBS into utils.locusiterator before refactoring	2013-01-11 15:17:16 -05:00
Mauricio Carneiro	2a4ccfe6fd	Updated all JAVA file licenses accordingly GSATDG-5	2013-01-10 17:06:41 -05:00
David Roazen	46edab6d6a	Use the new downsampling implementation by default -Switch back to the old implementation, if needed, with --use_legacy_downsampler -LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and the original LocusIteratorByState becomes LegacyLocusIteratorByState -Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer, with the old one renamed to LegacyReadShardBalancer -Performance improvements: locus traversals used to be 20% slower in the new downsampling implementation, now they are roughly the same speed. -Tests show a very high level of concordance with UG calls from the previous implementation, with some new calls and edge cases that still require more examination. -With the new implementation, can now use -dcov with ReadWalkers to set a limit on the max # of reads per alignment start position per sample. Appropriate value for ReadWalker dcov may be in the single digits for some tools, but this too requires more investigation.	2012-12-10 09:44:50 -05:00
David Roazen	f6a22e5f50	ExperimentalReadShardBalancerUnitTest was being skipped; fixed TestNG skips tests when an exception occurs in a data provider, which is what was happening here. This was due to an AWFUL AWFUL use of a non-final static for ReadShard.MAX_READS. This is fine if you assume only one instance of SAMDataSource, but with multiple tests creating multiple SAMDataSources, and each one overwriting ReadShard.MAX_READS, you have a recipe for problems. As a result of this the test ran fine individually, but not as part of the unit test suite. Quick fix for now to get the tests running -- this "mutable static" interface should really be refactored away though, when I have time.	2012-09-22 01:56:39 -04:00
David Roazen	133085469f	Experimental, downsampler-friendly read shard balancer -Only used when experimental downsampling is enabled -Persists read iterators across shards, creating a new set only when we've exhausted the current BAM file region(s). This prevents the engine from revisiting regions discarded by the downsamplers / filters, as could happen in the old implementation. -SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip out all related code when the engine fork is collapsed. -Defensive implementation that assumes BAM file regions coming out of the BAM Schedule can overlap; should be able to improve performance if we can prove they cannot possibly overlap. -Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back once confidence in the code is gained	2012-09-21 22:17:58 -04:00
David Roazen	cb84a6473f	Downsampling: experimental engine integration -Off by default; engine fork isolates new code paths from old code paths, so no integration tests change yet -Experimental implementation is currently BROKEN due to a serious issue involving file spans. No one can/should use the experimental features until I've patched this issue. -There are temporarily two independent versions of LocusIteratorByState. Anyone changing one version should port the change to the other (if possible), and anyone adding unit tests for one version should add the same unit tests for the other (again, if possible). This situation will hopefully be extremely temporary, and last only until the experimental implementation is proven.	2012-09-06 15:03:27 -04:00
Mark DePristo	cf91d894e4	Fix build problems with tests	2012-08-31 13:42:41 -04:00
Mark DePristo	1212dfd2ef	Reduce the number of test combinations in ReadBasedReferenceOrderedView	2012-08-30 19:41:49 -04:00
Mark DePristo	21dd70ed36	Test to ensure that ReadBasedReferenceOrderedView produces stateless objects -- Stateless objects are required for nano-scheduling. This means you can take the RefMetaDataTracker provided by ReadBasedReferenceOrderedView, store it away, get another from the same view, and the original one behaves the same.	2012-08-30 10:15:11 -04:00
Mark DePristo	1200848bbf	Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers -- Deleted ReadMetaDataTracker -- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view	2012-08-30 10:15:10 -04:00
Mark DePristo	972be8b4a4	Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers -- ReadMetaDataTracker is dead! Long live the RefMetaDataTracker. Read walkers will soon just take RefMetaDataTracker objects. In this commit they take a class that trivially extends them -- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class. -- This new implementation produces thread-safe objects (i.e., holds no pointers to shared state). Suitable for use (to be tested) with nano scheduling -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely. -- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView -- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference. Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface. If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself. -- This commit breaks the new read walker BQSR, but Ryan knows this is coming -- Subsequent commit will be retiring / fixing ValidateRODForReads	2012-08-30 10:15:10 -04:00
Mark DePristo	8fc6a0a68b	Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker	2012-08-30 10:13:06 -04:00
Eric Banks	eca9613356	Adding support of X and = CIGAR operators to the GATK	2012-08-10 14:54:07 -04:00
Joel Thibault	2b25df3d53	Add removeProgramRecords argument * Add unit test for the removeProgramRecords	2012-08-01 15:33:05 -04:00
Mauricio Carneiro	116885a450	Removed the "Walker" suffix from all walkers that had it. * Did not touch archived walkers... those can be named whatever. * Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...) * Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.	2012-07-20 17:27:11 -04:00
Mark DePristo	567dba0f76	Cleanup of VCF header lines and constants, BCF2 bugfixes -- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller -- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers -- VCF parsers now automatically repair standard VCF header lines when reading the header -- Updating integration tests to reflect header changes -- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test -- SelectHeaders now always updates the header to include the contig lines -- SelectVariants add UG header lines when in regenotype mode -- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY -- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs) -- Throw error when VCF has unbounded non-flag values that don't have = value bindings -- By default we no longer allow writing of BCF2 files without contig lines in the header	2012-06-21 15:16:31 -04:00
Mark DePristo	fba7dafa0e	Finalizing BCF2 mark III commit -- Moved GENOTYPE_KEY vcf header line to VCFConstants. This general migration and cleanup is on Eric's plate now -- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header. Still doesn't work... -- Updating integration test files. Moved many more files into public/testdata. Updated their headers to all work correctly with new strict VCF header checking. -- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt -- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine. DB = 0 is never seen in the output VCFs now -- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status -- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be -- VariantsToVCF now properly writes out the GT FORMAT field -- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code. Eric said he and Ami will clean up this whole piece of infrastructure -- Fixed bug in BCF2Codec that wasn't setting the phase field correctly. UnitTested now -- PASS string now added at the end of the BCF2 dictionary after discussion with Heng -- Fixed bug where I was writing out all field values as BigEndian. Now everything is LittleEndian. -- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException -- Cleaned up unused code -- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding -- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples -- We always write the number of genotype samples into the BCF2 nSamples header. How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names... -- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally -- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele -- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values. Genotype objects are all PASS implicitly, or explicitly filtered. We only write out the FT values if at least one sample is filtered. Removed interface functions and cleaned up code -- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works. In general, ** NEVER COPY CODE ** if you need to share functionality make a function, that's why they were invented! -- Increased the default number of records to read for DiffObjects to 1M	2012-06-21 15:16:27 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Khalid Shakir	e57cd78bba	Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each. This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource. Ex: public Wrapper getNewWrapper(File path) { FileStream myStream = new FileStream(path); // This stream must be eventually closed. return new Wrapper(myStream); } public void close(Wrapper wrapper) { wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream. }	2012-05-21 15:41:56 -04:00
Khalid Shakir	9801dd114f	Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag() Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.	2012-04-27 09:58:38 -04:00
Eric Banks	99d27ddcc4	Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now.	2012-04-02 14:27:36 -04:00
David Roazen	32dee7ed9b	Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be thrown when a BAM index file was truncated or corrupt. Now, a UserException is thrown in this situation instructing the user to re-index the BAM. Added a unit test for this case as well.	2012-03-08 15:30:44 -05:00
Ryan Poplin	41ffd08d53	On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live.	2012-02-13 12:35:09 -05:00
Matt Hanna	4001c22a11	Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup.	2011-12-06 10:10:38 -05:00
Matt Hanna	8bb4d4dca3	First pass of the asynchronous block loader. Block loads are only triggered on queue empty at this point. Disabled by default (enable with nt:io=?).	2011-11-18 15:02:59 -05:00
Mauricio Carneiro	e89ff063fc	GATKSAMRecord refactor The GATK engine will now provide a GATKSAMRecord to all tools which incorporates the functionality used by the GATK to the bam file (ReadGroups, Reduced Reads, ...). * No tools should create SAMRecord anymore, use GATKSAMRecord instead *	2011-11-03 15:43:26 -04:00
Mark DePristo	89ac50e86e	SampleDataSource -> SampleDB	2011-10-03 09:33:30 -07:00
Mark DePristo	84160bd83f	Reorganization of Sample -- Moved Gender and Afflication to separate public enums -- PedReader 90% implemented -- Improve interface cleanup to XReadLines and UserException	2011-09-30 15:50:54 -04:00
Mark DePristo	56f10b40a8	Fixing test bugs for WindowMaker that required empty sample list	2011-09-30 14:18:27 -04:00
Mark DePristo	9458f01409	Test cleanup of Sample object	2011-09-29 15:13:05 -04:00
Mark DePristo	505416b6c0	Merge branch 'master' into ped	2011-09-29 12:22:39 -04:00
Mark DePristo	5c9227cf5e	Further cleanup of Sample database -- Removing more and more unnecessary code -- Partial removal of type safe Sample usage. On the road to SampleDB only	2011-09-29 11:50:05 -04:00
Mark DePristo	e76f381628	Moved sample package from DataSources to gatk, and renamed it samples -- All associated changes to the codebase are just header updates	2011-09-29 09:57:15 -04:00
Matt Hanna	0acaf2df65	Fix an embarrassing issue where a specific configuration of minimal coverage over small intervals could cause reads to be dropped from the pileup. Nothing to see here...	2011-09-28 21:23:01 -04:00
Mark DePristo	c2287c93d7	Cleanup of codec locations. No more dbSNPHelper -- refdata/features now in utils/codecs with the other codecs -- Deleted dbsnpHelper. rsID function now in VCFutils. Remaining code either deleted or put into VariantContextAdaptors -- Many associated import updates due to code move	2011-08-18 10:02:46 -04:00
Mark DePristo	8f696c7731	Continuing progress towards RodBinding 1.0 -- Cleaning up old interface to RMDT, docs and contracts added -- Proper type checking for RodBinding for cases where the Tribble type isn't found or is the wrong type	2011-08-03 17:19:28 -04:00
Mark DePristo	b25140db83	Contracts and documentation for some of RefMetaDataTracker Continuing to fix integration tests that don't pass / run	2011-08-03 13:34:20 -04:00
Mark DePristo	39b4e76fde	Continuing refactoring of RefMetaDataTracker. On the path towards converging getVariantContext() and getValues() in tracker so that we can have a single approach to get values from RODs with the new RodBinding() types	2011-07-28 17:48:28 -04:00
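
The resource-leak pattern described in commit e57cd78bba above (an adapter whose close() ignores the resource it wraps) can be illustrated with a minimal sketch. Every class name here (WrapperCloseSketch, TrackedResource, LeakyWrapper, ClosingWrapper) is hypothetical and chosen only for illustration; none of them is taken from the GATK codebase, and the commit's actual fix may look different.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WrapperCloseSketch {
    /** Hypothetical stand-in for a file stream; it only records whether close() ran. */
    static class TrackedResource implements Closeable {
        private boolean closed = false;
        @Override public void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    /** The bug: the adapter holds the only reference to the resource but never closes it. */
    static class LeakyWrapper implements Closeable {
        private final Closeable wrapped;
        LeakyWrapper(Closeable wrapped) { this.wrapped = wrapped; }
        @Override public void close() { /* ignores the request, so 'wrapped' stays open */ }
    }

    /** One possible fix: forward close() to the wrapped resource. */
    static class ClosingWrapper implements Closeable {
        private final Closeable wrapped;
        ClosingWrapper(Closeable wrapped) { this.wrapped = wrapped; }
        @Override public void close() {
            try {
                wrapped.close();
            } catch (IOException e) {
                // Rethrown unchecked only to keep this sketch short.
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        TrackedResource leaked = new TrackedResource();
        new LeakyWrapper(leaked).close();
        System.out.println("leaky wrapper closed its resource: " + leaked.isClosed());

        TrackedResource released = new TrackedResource();
        new ClosingWrapper(released).close();
        System.out.println("closing wrapper closed its resource: " + released.isClosed());
    }
}
```

A unit test for such an adapter, in the spirit of the tests the commit mentions, would assert that the wrapped resource reports closed after the wrapper's close() returns.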

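Commit 4d56142163 above describes acquiring file locks in a background daemon thread with a timeout and raising an error when the call does not return in time. A rough sketch of that general technique follows. It is not the GATK implementation: the class and method names are hypothetical, IllegalStateException stands in for GATK's UserException, and the task passed in is an arbitrary Callable where real code would presumably make a non-blocking lock call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedAcquireSketch {
    /**
     * Runs the task on a daemon thread and waits at most timeoutMillis for its result.
     * A daemon thread is used so that a stuck task cannot keep the JVM from exiting.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "lock-acquirer");
            thread.setDaemon(true);
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck task if it responds to interruption
            throw new IllegalStateException(
                    "Timed out after " + timeoutMillis + " ms; retry the job", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for the task", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("Task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that returns promptly.
        System.out.println(callWithTimeout(() -> "locked", 1000L));
        // A task that takes too long and is reported as a timeout.
        try {
            callWithTimeout(() -> { Thread.sleep(5_000); return "late"; }, 100L);
        } catch (IllegalStateException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

The commit uses a 30-second timeout; the short values in main() are only there so the sketch finishes quickly.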

53 Commits (fdfe4e41d5d8c92fad74f56e654992f3a97ab602)