Commit Graph

3122 Commits (2b1db519d7775a49eb775e9182879b5e6dee5892)

Author SHA1 Message Date
Joel Thibault 2b1db519d7 Add reads which overstep a boundary by a single base 2012-12-18 18:47:43 -05:00
Joel Thibault 9828b2990f Reads off the end of a contig fail SAM validation when using actual BAMs 2012-12-18 18:47:43 -05:00
Joel Thibault 72e2394b26 Create actual BAM 2012-12-18 18:47:43 -05:00
Joel Thibault d69d1f8988 Fun with varargs 2012-12-18 18:47:42 -05:00
Joel Thibault 1158c1529f Refactor region/read comparisons 2012-12-18 18:47:42 -05:00
Yossi Farjoun 6ed9eb3da9 GATKBAMIndex now passes unit test! Problem was that SeekableBufferedStream seems to have a bug: it will read beyond the end of a file if asked to. 2012-12-18 17:32:26 -05:00
eitanbanks 002ce9c1d5 Merge pull request #8 from yfarjoun/master
Huge speedup in initial traversal of BAM index files (x20 speed!)
2012-12-18 10:16:53 -08:00
Mark DePristo 16eb1c5436 Optimization to TraverseReadsNano
-- Don't just read all inputs into a list, and then provide an iterator to that list, actually make a real iterator so NanoScheduler input thread can contribute meaningfully to the work load
-- Use NanoScheduler progress function, instead of home-grown updater
2012-12-18 10:14:47 -05:00
Mark DePristo b33f804cdc Inline increment function in RecalDatum to avoid minor duplication of work and multiple synchronized method calls 2012-12-17 16:47:27 -05:00
Mark DePristo 66d32f646b Minor cleanup of BAQ calculation (final variables, etc) 2012-12-17 16:47:27 -05:00
Mark DePristo 67fe81391c ProgressMeter optimization: don't do genome loc formatting, but instead create an object that only formats when printing is actually needed 2012-12-17 16:47:27 -05:00
Mark DePristo 1de2f527b9 Optimization of recalibrateRead
-- Refactor calculation so that upfront constant values are pre-computed, and cached, and their values just looked up during application
-- Trivial comment on how we might use BAQ better in BaseRecalibrator
2012-12-17 16:47:27 -05:00
Mark DePristo bd6cda7542 Trivial optimization of TraverseReadsNano -- don't format the shard toString if logger isn't debug enabled 2012-12-17 16:47:27 -05:00
Mark DePristo a481d006f0 Optimizations for applying BQSR table with PrintReads
-- Cleaned up code in updateDataForRead so that constant values where not computed in inner loops
-- BaseRecalibrator doesn't create it's own fasta index reader, it just piggy backs on the GATK one
-- ReadCovariates <init> now uses a thread local cache for it's int[][][] keys member variable.  This stops us from recreating an expensive array over and over.  In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values its skipping because of low quality base clipping.  Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.
2012-12-17 16:47:27 -05:00
Mark DePristo 5ec25797b3 Optimizations for BaseRecalibrator
-- No longer computes at each update the overall read group table.  Now computes this derived table only at the end of the computation, using the ByQual table as input.  Reduces BQSR runtime by 1/3 in my test
2012-12-17 16:47:27 -05:00
Eric Banks e6f468b647 Refactored the quasi-useful IndelType annotation into the more useful VariantType.
The indels are still annotated as before, but now all other variant types are annotated too.
I'm doing this because of requests on the forum but am not making it standard.  If we find it to be useful we can turn it on by default later.
2012-12-17 11:54:47 -05:00
Eric Banks 762f184262 Bug fix for strict validation: rsID checking wasn't working if there were multiple IDs 2012-12-17 10:32:41 -05:00
Yossi Farjoun ea704d688f chose smaller buffer size for the bufferedStream 2012-12-15 13:01:38 -05:00
Yossi Farjoun 6da2338ea7 removed comments and uneeded imports 2012-12-15 12:31:37 -05:00
Yossi Farjoun 19dd2d628a some changes.
some changes.
2012-12-14 17:21:32 -05:00
Mauricio Carneiro 74344a3871 Bringing in the changes from the CMI repo 2012-12-13 21:59:37 -05:00
Eric Banks 696bf95fba Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer). 2012-12-13 23:28:30 +00:00
Mark DePristo aeab932c63 Actual working version of unflushing VCFWriter
-- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
2012-12-13 16:15:08 -05:00
Yossi Farjoun 5e66109268 Replaced a useless getInt with a skipInt to remove 1/4 of the initial seek time in the BAM Index. 2012-12-12 17:08:11 -05:00
Eric Banks 62eaffdf0a Fix docs for ReadBackedPhasing 2012-12-12 20:28:04 +00:00
Eric Banks bba63a3b0e Fix for GSA-615: UnifiedGenotyperEngine.getGLModelsToUse takes 5% of the runtime of UG, should be optimized away. 2012-12-12 20:25:45 +00:00
Mauricio Carneiro a52e3c7e15 Revert "Bug fix for RR: don't let the softclip start position be less than 1"
this introduced a bug in reduce reads by de-activating it's hard clipping of the out of bounds soft-clips (specially in the MT).
DEV-322 #resolve #time 4m

This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.
2012-12-12 13:09:39 -05:00
Mark DePristo 5632c13bf2 Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush().
-- Now compressed output VCFs are properly blocked compressed (i.e., they are actually smaller than the uncompressed VCF)
2012-12-12 10:27:07 -05:00
Mark DePristo dd52a70d45 Fix AFCalcResult unit test
-- I was simply passing in the wrong values into the function.  Fixed the calls, and expanded the docs on what needs to be passed in.
2012-12-11 10:40:12 -05:00
Ami Levy-Moonshine 6bf31065e3 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-11 10:34:50 -05:00
Ami Levy-Moonshine 2f99569dda change the md5 in one of the CV intergration tests, since it wasn't use the priority list when printing the origin of the annotation (the setValue field) 2012-12-10 22:48:15 -05:00
Ami Levy-Moonshine 2e3284f306 Continue to fix the case where PRIORITIZE is used but no priority list is given. While fixing that case I also removed unnecessary sorting, when the prioeity list is not provied. When the priority list is not provided, it will continue to be null. Thus, the number of original Variant Contexts should be given as a new parameter to simpleMerge (since priority might be null). This new parameter is used for checking if there are filtered VC, when annotationOrigin is true. 2012-12-10 22:23:58 -05:00
Mauricio Carneiro 8a115edbaf ReduceReads is now scattered by contig
It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.
2012-12-10 15:25:27 -05:00
Ami Levy-Moonshine 573ace4403 restore the right version of VariantContextUtils.java in my unstable dir 2012-12-10 10:28:56 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Ami Levy-Moonshine 5460c96137 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-09 23:43:57 -05:00
Ami Levy-Moonshine 3a420d163e (1) changes in catVariants (work still under development) (2) changes to CV to throw an error when GenotypeMergeType is PRIORITIZE but no priority (rod_priority_list) is not given. Reported by TechnicalVault on the forum on Nov 14 2012 2012-12-09 23:40:03 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Mark DePristo dbf721968d PrintReads large-scale test to protect against another major low-level performance issue 2012-12-05 21:36:27 -05:00
Ami Levy-Moonshine 5d78a61f7a Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 15:07:12 -05:00
Mark DePristo 465694078e Major performance improvement to the GATK engine
-- The NanoSchedule timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers.  Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests.
-- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x.  For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement).
-- Took this opportunity to improve the GATK ProgressMeter.  NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress.  This removes all inner loop calls to the GATK timers.
-- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402
2012-12-05 14:49:22 -05:00
Mark DePristo 2b601571e7 Better error handling in NanoScheduler
-- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown.  Errors, like out of memory, would cause the whole system to die.  This bugfix resolves that issue
2012-12-05 14:49:22 -05:00
Eric Banks 0c925856cb Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 02:00:39 -05:00
Eric Banks ef87b18e09 In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then). 2012-12-05 02:00:35 -05:00
Mauricio Carneiro 30f013aeb0 Added a copy() method for ReadBackedPileups
necessary to create new alignment contexts with hard-copies of the pileup.
2012-12-05 01:32:18 -05:00
Mauricio Carneiro 6feda540a4 Better error message for SimpleGATKReports 2012-12-05 01:32:18 -05:00
Randal Moore 8d2d0253a2 introduce a level of indirection for the forum URLs - this new function will allow me a place to morph the URL into something that is supported by Confluence
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-12-03 22:33:02 -05:00
Eric Banks 67932b357d Bug fix for RR: don't let the softclip start position be less than 1 2012-12-03 15:59:14 -05:00
Ryan Poplin a47da9bb2f Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 14:30:14 -05:00
Eric Banks 5fed9df295 Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh). 2012-12-03 12:18:20 -05:00