Commit Graph

162 Commits (73acfa654a3eb3d7e21e91945739fb8d1ab0452e)

Author SHA1 Message Date
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
depristo 5d2c2bd280 Just refactoring into utils/baq directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4795 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 17:43:43 +00:00
depristo a5b3aac864 Engine-level BAQ calculation now available in the GATK [totally experimental right now]. -baq argument to disable (NONE), to only use the tags in the BAM (USE_TAG_ONLY), use the tag when present but calculate on the fly as necessary (CALCULATE_AS_NECESSARY), and to always recalculate (RECALCULATE_ALWAYS). BAQ.java contains the complete implementation, for those interested. ValidateBAQWalker is a useful QC tool for verifying the BAQ is correct. BAQSamIterator applies BAQ to reads, as needed, in the engine. Let me know if you encounter any problems. Before prime-time, needs a caching implementation of IndexedFastaReader to avoid loading *lots* of reference data all of the time
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4787 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:23:06 +00:00
bthomas 374c0deba2 Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-19 19:59:05 +00:00
hanna 8e36a07bea Convert GenomeLocParser into an instance variable. This change is required
for anything that needs to be simultaneously aware of multiple references, eg
Queue's interval sharding code, liftover support, distributed GATK etc.  

GenomeLocParser instances must now be used to create/parse GenomeLocs.
GenomeLocParser instances are available in walkers by calling either

-getToolkit().getGenomeLocParser()
or
-refContext.getGenomeLocParser()

This is an intermediate change; GenomeLocParser will eventually be merged
with the reference, but we're not clear exactly how to do that yet.  This
will become clearer when contig aliasing is implemented.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 17:59:50 +00:00
hanna eee134baf2 Chris found a bug in the downsampler where, if the number of reads entering
the pileup at the next alignment start is large, we don't add as many of those
incoming reads as we should.  No integration tests were affected.

Thanks, Chris!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4378 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 11:18:12 +00:00
kshakir edaa278edd Removed cases where various toolkit functions were accessing GenomeAnalysisEngine.instance.
This will allow other programs like Queue to reuse the functionality.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-25 02:49:30 +00:00
depristo 7880863eb7 Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 15:07:38 +00:00
depristo 7ad8fbdd5a Moved GATKException to exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:47:19 +00:00
depristo 595907e98e Moving StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4262 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:34:15 +00:00
depristo 40e6179911 Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:02:43 +00:00
depristo 1de713f354 Massive review of maybe 50% of the exceptions in the GATK. GATKException is a tmp. tracker so that I can tell which StingExceptions I've reviewed. Please don't use it. If you are working on new code and are considering throwing exceptions, it's either UserError or StingException, please
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4246 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 23:21:17 +00:00
depristo 6a30617a60 Initial implementation of UserError exceptions and error message overhaul. UserErrors and their subclasses UserError.MalFormedBam for example should be used when the GATK detects errors on part of the user. The output for errors is now much clearer and hopefully will reduce GS posts. Please start using UserError and its subclasses in your code. I've replace some, but not all, of the StingExceptions in the GATK with UserError where appropriate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4239 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 11:32:20 +00:00
hanna fb177c4fee If only dcov is specified, assume that selected downsample type is BY_SAMPLE.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4147 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-27 17:35:41 +00:00
hanna de5ccfb0b1 Moved hasPileupBeenDownsampled() based on Eric's request. Also eliminated
@Deprecated constructors from AlignmentContext.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4142 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-27 16:12:05 +00:00
hanna d773b3264b Eliminated -mrl option.
Eliminated -fmq0 option.
Eliminated read group hallucination.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4133 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-26 21:38:03 +00:00
hanna 41d57b7139 Massive cleanup of read filtering.
- Eliminate reduncancy of filter application.
- Track filter metrics per-shard to facitate per merging.
- Flatten counting iterator hierarchy for easier debugging.
- Rename Reads class to ReadProperties and track it outside of the Sting iterators.
Note: because shards are currently tied so closely to reads and not the merged triplet of <reads,ref,RODs>, the metrics
classes are managed by the SAMDataSource when they should be managed by something more general.  For now, we're hacking
the reads data source to manage the metrics; in the future, something more general should manage the metrics classes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4015 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 20:17:11 +00:00
kshakir 4f51a02dea Changed logging level to default at INFO instead of WARN.
Changes to StingUtils command line for use in Queue, replacing Queue's use of property files.
Updates to walkers used in existing QScripts to add @Input/@Output.
RMD used in @Required/@Allows now has a new default equal to "any" type.
New QueueGATKExtensions.jar generator for auto wrapping walkers as Queue CommandLineFunctions.
Added hooks to modify the functions that perform the Scattering and Gathering (setting their jar files, other arguments, etc.)
Removed dependency on BroadCore by porting LSF job submitter to scala.
Ivy now pulls down module dependencies from maven.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3984 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 16:42:48 +00:00
depristo 7c42e6994f FindBugs fixes throughout the code base
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3823 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 16:29:59 +00:00
hanna 96034aee0e Cleanup for Steve Hershman's issue. In the midst of doing this, I discovered
that the semantics for which reads are in an extended event pileup are not
clear at this point.  Eric and I have planned a future clarification for this
and the two of us will discuss who will implement this clarification and when
it'll happen.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3809 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 18:57:58 +00:00
hanna cab8394103 The sharding system now buffers reads, with a size determined by command-line argument. Will investigate whether/how this
impacts performance on low-pass data and, if it works well, will create a more automatic version of the tool.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3709 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 22:28:55 +00:00
hanna c9d5345150 Redo StratifiedAlignmentContext to use ReadBackedPileup's stratification options.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3699 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 02:46:05 +00:00
hanna 3a9d426ca8 Added hasPileupBeenDownsampled() boolean to ReadBackedPileup, so that a pileup can report whether or not (but not how much) it's been downsampled.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3649 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 04:56:33 +00:00
hanna c806ffba5f Switching over DownsamplingLocusIteratorByState -> LocusIteratorByState. Some operations
will not be as fast as they could be because the workflow is currently merge sam records (sharding)
-> split sam records (LocusIteratorByState) -> merge records (LocusIteraotorByState) -> split
records (StratifiedAlignmentContext), but this will be fixed when StratifiedAlignmentContext
is updated to take advantage of the new functionality in ReadBackedPileup.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3599 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 02:11:42 +00:00
hanna 1d50fc7087 Misc bug fixes: fix tracking of nInsertions with sample-split pileup constructor. Fix performance
issue building up pileups from pileups of individual sample data.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3598 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-20 20:32:27 +00:00
hanna f18ac069e2 A refactoring / unification of ReadBackedPileup and ReadBackedExtendedEventPileup.
Provides a cleaner interface with extended events inheriting all of the basic RBP
functionality.  Implementation is still slightly messy, but should allow users to
provide separate implementations of methods for sample split pileups and unsplit
pileups for efficiency's sake.
Methods not covered by unit/integration tests have not been sufficiently tested yet.
Unit tests will follow this week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3597 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-20 04:42:26 +00:00
hanna 5050b19457 We're unable to make the naive deduper more worldly, so we're killing it instead.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3587 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 13:54:27 +00:00
hanna 612c3fdd9d First pass at eliminating the old sharding system. Classes required for the original sharding system
are gone where I could identify them, but hierarchies that split to support two sharding systems have
not yet been taken apart.
@Eric: ~4k lines.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 20:17:31 +00:00
hanna c1595a383a More bugfixes for cases where no sample name is present.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3578 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 16:46:02 +00:00
hanna 5972ad1199 Fixes to mrl integration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3573 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 20:40:10 +00:00
hanna e77f76f8e1 Reenabled downsampling by sample after basic sanity testing and fixes of the
new implementation.  Hard testing and performance enhancements are still
pending.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3566 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 17:23:27 +00:00
hanna c3b68cc58d Rethinking DownsamplingLocusIteratorByState with a flattened read structure. Samples are kept
independent while processing, and only merged back in a priority queue if necessary in a special
variant of the ReadBackedPileup.  This code is not live yet except in the case of naive deduping.
Downsampling by sample temporarily disabled, and the ReadBackedPileup variant is sketchy and
not well integrated with StratifiedAlignmentContext or the walkers.  Cleanup to follow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3540 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-13 01:47:02 +00:00
hanna f55f32d4ee Bug fix.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3526 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 01:53:26 +00:00
hanna dbee21a50f Bugfixes for the case when no read groups / no samples are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3523 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 18:47:05 +00:00
hanna 84563b37e5 Partial flattening of the hanger data structure. Hanger data structure is
not currently as flat as it could / should be, but it's already comparable
to the speed of the reference implementation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3512 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 16:28:49 +00:00
hanna c2858c8988 Minor performance enhancement. Checkpoint commit before major performance
overhaul.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3504 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 21:39:39 +00:00
hanna 199e4208cd Bug fixes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3497 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 00:30:33 +00:00
hanna 52ab9f2417 Feature parity between LocusIteratorByState, DownsamplingLocusIteratorByState, including pushing mrl /
the LocusOverflowTracker into LocusIteratorByState.  Note that the 'Matt Hanna exception', is still enabled
because I haven't yet validated the performance of the DownsamplingLocusIteratorByState when running
without downsampling.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3496 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 22:58:21 +00:00
hanna 5c4d070566 Push Mark's changes in LocusIteratorByState into DownsamplingLocusIteratorByState
in preparation for merging the two into one.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3495 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 17:29:30 +00:00
depristo e2b41082af GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integrationtest data. Not that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 22:26:32 +00:00
depristo 2b02324587 Support for detecting and automatically excluding reads reading into the adaptor sequence and, if desired, also only showing the first pair when two reads overlap in the fragment. Not enabled, an intermediate check in before updating and verifying the impact on locus walkers everywhere.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3465 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-30 18:00:12 +00:00
hanna b10950c691 Simple performance optimization -- cache the number of reads in the locus hanger.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3417 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 19:26:16 +00:00
hanna 388dd8d64d Fixing bugs in downsampler introduced when I added Ryan's dup eliminator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3407 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 02:53:12 +00:00
hanna 017ab6b690 Experimental versions of downsampler and Ryan's deduper are now available either
as walker attributes or from the command-line.  Not ready yet!  Downsampling/deduping 
works in a general sense, but this approach has not been completely optimized or validated.
Use with caution.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3392 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 05:40:05 +00:00
hanna 0791beab8f Checking in downsampling iterator alongside LocusIteratorByState, and removing
the reference implementation.  Also implemented a heap size monitor that can
be used to programmatically report the current heap size.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3367 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-17 21:00:44 +00:00
aaron 2c55ac1374 fixes for parallel processing problems with Tribble, a small bug in the resource pool, and some more documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3349 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-12 06:13:26 +00:00
hanna 6868ce988f Fix hanging bug reported by Susanne Pfeifer (tiffy @ get satisfaction) where, if the last read(s) in a shard all have an
indel in roughly the same location and that indel isn't covered by any other reads, LocusIteratorByState goes into an infinite
loop.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3348 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-11 17:31:19 +00:00
hanna 76efa757f0 Switched over to reviewed version of Picard patch. In process, did some optimization to the IntervalSharder
which improved startup time 5-10x when dynamically merging many BAMs.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3331 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-08 14:12:22 +00:00
hanna 8bb15ef812 Checking in the reference implementation of the downsampler for back comparison.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3278 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 15:41:13 +00:00
hanna 9e107513d0 In the new sharding system, if no read group is present, hallucinate one. Added
for test compatibility, but not sure whether we still need this feature.  TODO: Poll the group about this feature.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2949 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 23:01:34 +00:00