Commit Graph

111 Commits (75a0d0b193b78c645765135fd31c90ec5411952e)

Author SHA1 Message Date
Joel Thibault 01738e70c3 Archive the experimental Active Region Traversals 2013-01-04 17:05:31 -05:00
Joel Thibault 319d651e4a Initial updates for ActiveRegionShard 2013-01-03 17:00:13 -05:00
Joel Thibault e7553545ef Initial updates for ReadShard 2013-01-03 17:00:13 -05:00
Joel Thibault 4cc372f53b LocusShardDataProvider doesn't need its own GenomeLocParser 2013-01-03 17:00:13 -05:00
Mark DePristo b92f563d06 NanoScheduler optimization for TraverseReadsNano
-- Pre-read MapData into a list, which is actually faster than dealing with future lock contention issues with lots of map threads
-- Increase the ReadShard default size to 100K reads by default
2012-12-24 13:35:56 -05:00
Mark DePristo f6d5499582 The GATK engine now ensures that incoming GATKSAMRecords have GATKSAMReadGroupRecord objects in their header
-- Update SAMDataSource so that the merged header contains GATKSAMReadGroupRecord
-- Now getting the NGSPlatform for a GATKSAMRecord is actually efficient, instead of computing the NGS platform over and over from the PL string
-- Updated a few places in the code where the input argument is actually a GATKSAMRecord, not a SAMRecord for type safety
2012-12-24 13:35:09 -05:00
Yossi Farjoun 6ed9eb3da9 GATKBAMIndex now passes unit test! Problem was that SeekableBufferedStream seems to have a bug: it will read beyond the end of a file if asked to. 2012-12-18 17:32:26 -05:00
Yossi Farjoun ea704d688f chose smaller buffer size for the bufferedStream 2012-12-15 13:01:38 -05:00
Yossi Farjoun 6da2338ea7 removed comments and uneeded imports 2012-12-15 12:31:37 -05:00
Yossi Farjoun 19dd2d628a some changes.
some changes.
2012-12-14 17:21:32 -05:00
Yossi Farjoun 5e66109268 Replaced a useless getInt with a skipInt to remove 1/4 of the initial seek time in the BAM Index. 2012-12-12 17:08:11 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Mark DePristo 8020ba14db Minor cleanup of SAMDataSource as part of my system review
-- Changed a few function from public to protected, as they are only used by the package contents, to simplify the SAMDataSource interface
2012-11-30 15:04:41 -05:00
Eric Banks 81532a0529 Missing file are user errors. 2012-10-12 09:48:12 -04:00
Eric Banks ad60300bee Catch malformed BAM files at the source since this is the largest class of errors in Tableau. 2012-10-12 09:07:57 -04:00
David Roazen 3861212dab Fix inefficiency in FilePointer GenomeLoc validation
Validation of GenomeLocs in the FilePointer class was extremely inefficient
when the GenomeLocs were added one at a time rather than all at once.

Appears to mostly fix GSA-604
2012-10-11 19:55:14 -04:00
David Roazen 118e974731 GATK Engine: special-case "monolithic" FilePointers, and allow them to represent multiple contigs
Sometimes the GATK engine creates a single monolithic FilePointer representing all regions
in all BAM files. In such cases, the monolithic FilePointer is the only FilePointer emitted
by the BAMScheduler, and it's safe to allow it to contain regions and intervals from multiple
contigs.

This fixes support for reading unindexed BAM files (since an unindexed BAM is one case
in which the engine creates a monolithic FilePointer).
2012-10-02 15:30:03 -04:00
David Roazen a96ed385df ReadShard.getReadsSpan(): handle case where shard contains only unmapped mates
Nasty, nasty bug -- if we were extremely unlucky with shard boundaries, we might
end up with a shard containing only unmapped mates of mapped reads. In this case,
ReadShard.getReadsSpan() would not behave correctly, since the shard as a whole would
be marked "mapped" (since it refers to mapped intervals) yet consist only of unmapped
mates of mapped reads located within those intervals.
2012-10-02 13:50:00 -04:00
David Roazen e740977994 GATK Engine: do not merge FilePointers that span multiple contigs
This affects both the non-experimental and experimental engine paths, and so
may break tests, but this is a necessary change.
2012-09-27 18:02:25 -04:00
David Roazen e82946e5c9 ExperimentalReadShardBalancer: create one monolithic FilePointer per contig
Merge all FilePointers for each contig into a single, merged, optimized FilePointer
representing all regions to visit in all BAM files for a given contig.

This helps us in several ways:

-It allows us to create a single, persistent set of iterators for each contig,
 finally and definitively eliminating all Shard/FilePointer boundary issues for
 the new experimental ReadWalker downsampling

-We no longer need to track low-level file positions in the sharding system (which
 was no longer possible anyway given the new experimental downsampling system)

-We no longer revisit BAM file chunks that we've visited in the past -- all BAM
 file access is purely sequential

-We no longer need to constantly recreate our full chain of read iterators

There are also potential dangers:

-We hold more BAM index data in memory at once. Given that we merge and optimize
 the index data during the merge, and only hold one contig's worth of data at a
 time, this does not appear to be a major issue. TODO: confirm this!

-With a huge number of samples and intervals, the FilePointer merge operation
 might become expensive. With the latest implementation, this does not
 appear to be an issue even with a huge number of intervals (for one sample, at least),
 but if it turns out to be a problem for > 1 sample there are things we can do.

Still TODO: unit tests for the new FilePointer.union() method
2012-09-27 14:47:54 -04:00
David Roazen 0b488cce66 ExperimentalReadShardBalancer: close() exhausted iterators
Fixes a truly awful SAMReaders resource leak reported by Eric -- thanks Eric!
2012-09-24 14:52:59 -04:00
David Roazen f6a22e5f50 ExperimentalReadShardBalancerUnitTest was being skipped; fixed
TestNG skips tests when an exception occurs in a data provider,
which is what was happening here.

This was due to an AWFUL AWFUL use of a non-final static for
ReadShard.MAX_READS. This is fine if you assume only one instance
of SAMDataSource, but with multiple tests creating multiple SAMDataSources,
and each one overwriting ReadShard.MAX_READS, you have a recipe for
problems. As a result of this the test ran fine individually, but not as
part of the unit test suite.

Quick fix for now to get the tests running -- this "mutable static"
interface should really be refactored away though, when I have time.
2012-09-22 01:56:39 -04:00
David Roazen 133085469f Experimental, downsampler-friendly read shard balancer
-Only used when experimental downsampling is enabled

-Persists read iterators across shards, creating a new set only when we've exhausted
the current BAM file region(s). This prevents the engine from revisiting regions discarded
by the downsamplers / filters, as could happen in the old implementation.

-SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip
out all related code when the engine fork is collapsed.

-Defensive implementation that assumes BAM file regions coming out of the BAM Schedule
can overlap; should be able to improve performance if we can prove they cannot possibly
overlap.

-Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back
once confidence in the code is gained
2012-09-21 22:17:58 -04:00
Mark DePristo d6e42d839c Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly. 2012-09-10 20:14:14 -04:00
David Roazen d2f3d6d22f Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)"
This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.
2012-09-10 15:52:39 -04:00
Menachem Fromer 0b717e2e2e Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:32:41 -04:00
David Roazen cb84a6473f Downsampling: experimental engine integration
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet

-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.

-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
2012-09-06 15:03:27 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, of handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR is excepting in parallel, which subsequent commit with fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Mark DePristo 5a9610d875 ReadShards now default to 10K (up from 1K) reads per samFile up to 250K
-- This should help make the inputs for parallel read walkers a little meater, and avoid spinning the shard creation infrastructure so often
2012-08-30 19:41:49 -04:00
Mark DePristo ce3d1f89ea ReadShard are no longer allowed to span multiple contigs
-- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads.  The old implementation simply failed in this case.  The new code handles this correctly by forcing shards to have all of their data on a single contig.
-- Added a PrintReads integration test to ensure this behavior is correct
-- Adding test BAMs that have < 200 reads and span across contig boundaries
2012-08-30 10:15:11 -04:00
Mark DePristo 53376b9423 Part III of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- shardSpan is only calculated when there some ROD is live in the GATK.  No sense in paying the cost per read when you don't need it
-- Update contract to allow null span or unmapped span (good catch unittests!)
2012-08-30 10:15:10 -04:00
Mark DePristo 1200848bbf Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
2012-08-30 10:15:10 -04:00
Mark DePristo 972be8b4a4 Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- ReadMetaDataTracker is dead!  Long live the RefMetaDataTracker.  Read walkers will soon just take RefMetaDataTracker objects.  In this commit they take a class that trivially extends them
-- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class.
    -- This new implementation produces thread-safe objects (i.e., holds no points to shared state).  Suitable for use (to be tested) with nano scheduling
    -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely.
-- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView
-- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference.  Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface.  If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself.
    -- This commit breaks the new read walker BQSR, but Ryan knows this is coming
-- Subsequent commit will be retiring / fixing ValidateRODForReads
2012-08-30 10:15:10 -04:00
Mark DePristo 8fc6a0a68b Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker 2012-08-30 10:13:06 -04:00
Mark DePristo 63a9ae817a Ensure thread-safety of CachingIndexedFastaSequenceFile
-- Cosmetic cleanup of ReadReferenceView
-- TraverseReadsNano provides the reference context, since it's thread-safe
-- Cleanup CachingIndexedFastaSequenceFile.  Add docs, remove unnecessary setters
-- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.
2012-08-27 12:11:54 -04:00
Mark DePristo fde9824765 Optimizations for parallel read walkers
-- TraversalReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone
-- Nicer debugging output in NanoScheduler
-- ReadShard has a getBufferSize() method now
2012-08-25 17:21:12 -04:00
Mark DePristo af540888f1 Limited version of parallel read walkers
-- Currently doesn't support accessing reference or ROD data
-- Parallel versions of PrintReads and CountReads
2012-08-25 17:21:12 -04:00
Mark DePristo be0f8beebb Fixed GSA-434: GATK should generate error when gzipped FASTA is passed in.
-- The GATK sort of handles this now, but only if you have the exactly correct sequence dictionary and FAI files associated with the reference.  If you do, the file can be .gz.  If not, the GATK will fail on creating the FAI and DICT files.  Added an error message that handles this case and clearly says what to do.
2012-08-17 11:49:02 -04:00
Eric Banks ded0e11b45 Killing off some FindBugs 'Realiability' issues 2012-08-16 14:00:48 -04:00
Joel Thibault 2b25df3d53 Add removeProgramRecords argument
* Add unit test for the removeProgramRecords
2012-08-01 15:33:05 -04:00
Eric Banks ae08d35138 Catch 'too many open files' errors that show up when trying to read the bam index. All that needs to be done is to flesh out the original error message (because it will get caught later and rethrown correctly). 2012-07-18 12:57:34 -04:00
Eric Banks dd571d9aa0 Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags. 2012-07-04 01:22:20 -04:00
Eric Banks cac72bce91 Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit. 2012-07-02 14:33:33 -04:00
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
Mauricio Carneiro f51a1d0d61 Better error message to the BAMScheduler
In the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.
2012-05-02 16:10:00 -04:00
Khalid Shakir 9801dd114f Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped
The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag()
Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.
2012-04-27 09:58:38 -04:00
Mark DePristo b43d21056b Merged bug fix from Stable into Unstable 2012-04-10 09:42:09 -04:00
Mark DePristo 6885e2d065 UserException fixes for GATK_logs recent errors
-- SamFileReader.java:525
-- BlockCompressedInputStream:376

These were both instances were we weren't catching and rethrowing picard exceptions as UserExceptions.
2012-04-10 07:37:42 -04:00
Eric Banks 99d27ddcc4 Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now. 2012-04-02 14:27:36 -04:00
David Roazen 91d10431d3 BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.

Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
2012-03-09 15:20:16 -05:00