Commit Graph

116 Commits (b6bdd6128310a6c0d12a466e6b3e0a496bf6be3e)

Author SHA1 Message Date
hanna cab8394103 The sharding system now buffers reads, with a size determined by command-line argument. Will investigate whether/how this
impacts performance on low-pass data and, if it works well, will create a more automatic version of the tool.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3709 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 22:28:55 +00:00
rpoplin 724affc3cc Major bug fixes for the Variant Recalibrator. Covariance matrix values are now allowed to be negative. When probabilities are multiplied together the calculation is done in log space, normalized, then converted back to real valued probabilities. Clustering weights have been changed to only use HapMap and by-1000genomes sites. The -nI argument was removed and now clustering simply runs until convergence. Test cases seem to work best when using just two annotations (QD and SB). More changes are in the works and are being evaluated. Misc fixes to walkers that use RScript due to CentOS changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3590 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 17:37:11 +00:00
aaron b978d5946b adding changes for VCF 4, mostly in the way we handle VCF headers. The header fields are now aware of the differences between different VCF formats. There was also a bunch of clean-up of out-of-spec VCF used in the tests (mismatched VCF file format fields, etc), and updates to the associated integration tests. Also some logging statements for BTI.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3584 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 08:23:23 +00:00
hanna 48cbc5ce37 Merging the sharding-specific inherited classes down into the base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3581 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 22:36:13 +00:00
hanna 612c3fdd9d First pass at eliminating the old sharding system. Classes required for the original sharding system
are gone where I could identify them, but hierarchies that split to support two sharding systems have
not yet been taken apart.
@Eric: ~4k lines.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 20:17:31 +00:00
bthomas 99b684ea89 Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:10:23 +00:00
aaron 0b03e28b60 updating the tribble library to include the reference dictionary reading / writing. We now check the dictionaries of any tracks that have them against the reference (all new tribble tracks and out-of-date tracks will have this). Also renamed some classes to be more reflective of their function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3485 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 06:34:26 +00:00
aaron 871cf0f4f6 Call out ROD types by there record type, instead of the codec type (which was clumsy). So instead of:
@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFCodec.class))

you'd say:

@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFRecord.class))

Which is more in-line with what was done before.  All instances in the existing codebase should be switched over.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3457 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:52:44 +00:00
hanna 7389077b3b A few misc usability fixes:
- Clarify the message emitted when -XL is supplied so I don't spend another half day chasing a bug that doesn't exist.  
- Crash with a helpful message when running -nt with non-TreeReducible walkers.
- Crash with a helpful message when running -nt with reduceByInterval walkers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3405 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 19:02:02 +00:00
hanna 017ab6b690 Experimental versions of downsampler and Ryan's deduper are now available either
as walker attributes or from the command-line.  Not ready yet!  Downsampling/deduping 
works in a general sense, but this approach has not been completely optimized or validated.
Use with caution.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3392 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 05:40:05 +00:00
hanna 73e2e32837 Fix typo.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3369 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-17 21:04:00 +00:00
hanna 0791beab8f Checking in downsampling iterator alongside LocusIteratorByState, and removing
the reference implementation.  Also implemented a heap size monitor that can
be used to programmatically report the current heap size.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3367 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-17 21:00:44 +00:00
hanna 4e0019b04f Repair code that sorts and merges intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3317 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 22:37:25 +00:00
aaron f497213933 DbSNP moved over to tribble
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3288 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-03 06:02:35 +00:00
hanna c1e53d407d The copyright tag that I copied/pasted from a LaTeX document into IntelliJ had
unicode quote characters embedded in it.  These characters were invisible inside
IntelliJ but cause compile warnings for Ryan and Aaron, who for whatever reason
have a different default charset.  Fixed.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3203 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 15:26:32 +00:00
hanna 1bc26f69e9 An attempt to cleanup the Utils directory. Email to follow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3198 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 23:00:08 +00:00
aaron 4d75b26b7a Removing the code that made the ROD system case insensitive. Anyone using specific ROD names in their classes should take care in naming required tracks; All lowercase is the best practice.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3184 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 06:17:31 +00:00
ebanks 3330e254a9 Standardize the dbsnp track name in preparation for case-sensitivity
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3176 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 19:41:57 +00:00
aaron e682460c1f add a fix so that XL arguments won't cancel out -BTI arguments, fixed a bug for Ben where the ROD -> interval list conversion was throwing an exception, and some old code removal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3174 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 16:31:43 +00:00
hanna 8573b0bc6f Refactoring intervals, separating the process of parsing interval lists,
sorting and merging interval lists, and creating RODs from intervals.  This
gives Doug the ability to keep using our interval list parsing code when
sorting intervals on our behalf.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3159 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 15:50:38 +00:00
hanna 14b8101d45 Error message fail. Failed to supply one of the valid interval file types.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3153 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-12 01:19:01 +00:00
hanna 60d54e69f3 Hackish fix to present a better error message if the file does not have the proper extension. Will work with Brett to come up with a better solution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3152 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-12 01:11:27 +00:00
aaron e148a3ac61 added the ability to create interval lists directly from a ROD, using the command line arg '-BTI' (long name '--rodToIntervalTrackName'). The parameter to this arg is the name of the ROD track, which must be a track name specified in the -B option.
Using this feature, sites covered by the target ROD will be iterated over.  This list of intevals generated is merged with any intervals from the -L and -XL args, and the Walker is run over the resulting merged list.

WARNING: for very large ROD's this can be costly.  Consider this experimental for now.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3134 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 05:14:41 +00:00
bthomas b4f6f54502 Reorganizing the way interval arguments are processed
Most of the changes occur in GenomeAnalysisEngine.java and GenomeLocParser.java: 
-- parseIntervalRegion and parseGenomeLocs combined into parseIntervalArguments
-- initializeIntervals modified
-- some helper functions deprecated for cleanliness
Includes new set of unit tests, GenomeAnalysisEngineTest.java

New restrictions: 
-- all interval arguments are now checked to be on the reference contig
-- all interval files must have one of the following extensions: .picard, .bed, .list, .intervals, .interval_list



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3106 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 12:47:48 +00:00
aaron 3d3d19a6a7 the last-mile commit for Tribble integration. The system is now ready for Tribble to be turned on, as soon as we've removed any dependencies in the ROD code on interfaces that aren't in the Tribble library (i.e. the Variation or Genotype interface on RODs). All of the walkers should be up to date.
a caveat: for anyone asking for all of the ROD's back from the RefMetaDataTracker (if your not using the facilities to get the track by name), you'll now be getting back a collection of GATKFeature objects.  This object will contain the track name, and a method for getting the underlying object (getUnderlyingObject()), which will be the traditional RodVCF, rodDbSNP, etc.  This layer is needed so we can integrate Tribble tracks (which don't natively have names).  Calls that ask for RODs by name will still get back the traditional reference ordered data objects (RodVCF, rodDbSNP, etc).

Sorry for the inconvenience!  More changes to come, but this is by far the largest (as has the greatest effect on end users).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3104 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 22:39:56 +00:00
hanna 4fcee248f9 For Kristian: functions which, given a read, can uniquely identify the BAM file storing that read.
Introducing this into the pile of code which peeks under the covers of the SAMDataSource in the hopes
that this function can help to replace the others and provide a single path for crosstalk.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3103 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 20:46:44 +00:00
ebanks 1e8b3ca6ba Fare thee well, oh LocusWindowTraversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3089 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 13:17:26 +00:00
kshakir 20e3ba15ca Added an optional argument -rgbl --read_group_black_list to filter read groups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3079 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-26 19:38:57 +00:00
hanna 78af6d5a40 New sharding system is going live again for on-the-fly merging.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3076 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-25 18:39:04 +00:00
hanna 6cd97b78ab An additional safety check to ensure that we only walk over coordinate-sorted
data when doing locus traversals.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3053 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-21 23:31:45 +00:00
hanna b4b4e8d672 For Sarah Calvo: initial implementation of read pair traversal, for BAM files
sorted by read name.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3052 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-21 23:22:25 +00:00
hanna 9b61d95d9c Khalid found an out-of-memory condition with the new sharding system when
merging lots of BAMs, and the fix is taking longer than I thought.  Disable
experimental sharding when merging until the fix is ready.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3036 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 02:43:46 +00:00
bthomas 5b34bb9ab0 Adding three minor new features:
+ -L all now walks over all intervals

+ if a -L argument is passed with a .list extension, and file does not exist, returns a \
File Not Found error instead of "bad interval" error. We plan to soon revisit interval \
lists and generate a concrete list of filenames, so this is likely temporary.

+ Error is thrown if the start position on an interval is higher number than the end position.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3021 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-17 16:24:10 +00:00
hanna 2cc040aa1c New sharding system is live. Disable with -ds.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3016 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-17 03:32:45 +00:00
depristo 4dd7c5972c Unit tests for -XL arguments; expt. annotation calculating the GC content within 100 bp of the current SNP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2997 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-14 21:08:14 +00:00
depristo e7eae9b61d High performance, correct implementation of -XL exclusion lists. Enjoy.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2994 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 22:39:20 +00:00
depristo b39b5edca8 Bug fix in variant eval 2. Preliminary (slow and buggy) support for -XL exclude lists.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2991 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 19:23:12 +00:00
aaron dde9fd8a15 some rods-for-reads cleaning and performance improvements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2979 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 22:54:58 +00:00
hanna a7fe07c404 A few stopgap fixes to get the GATK to the point where the old sharding
infrastructure can be torn down:
1) New sharding system emulates old MonolithicSharding mechanism.
2) Better awareness of differences between fasta and BAM files when creating
   shards.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2948 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 21:01:25 +00:00
hanna 023654696e First pass at handling SAMFileReaders using a SAMReaderID. This allows us to firewall
GATK users from the readers, which they could abuse in ways that could destabilize the GATK.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2923 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 00:59:32 +00:00
hanna 6133d73bf0 Locus (non-intervalled) traversal with new sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2903 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-01 01:58:44 +00:00
hanna 199b43fcf2 Reduce by interval alterations to interface with new sharding system. This checkin with be followed by a
simplification of some of the locus traversal code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2886 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 00:16:50 +00:00
ebanks 32d14d988e Overload parseIntervalRegion() to allow for the interval merging rule to be passed in (so one is not required to use the value from the GATK arg collection).
Now the IndelRealigner can use this functionality without being forced to merge  abutting intervals (which was actually causing a problem with the cleaning).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2862 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-22 04:13:54 +00:00
ebanks ca1917507f Various improvements and fixes:
In indel cleaner:

1. allow the user to specify that he wants to use Picard’s SAMFileWriter sorting on disk instead of having us sort in memory; this is useful if the input consists of long reads.

2. for N-way-out mode: output bams now use the original headers from the corresponding input bams - as opposed to the merged header.  This entailed some reworking of the datasources code.

3. intermediate check-in of code that allows user to input known indels to be used as alternate consenses.  Not done yet.

In UG: fix bug in beagle output for Jared.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2805 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-07 04:21:04 +00:00
ebanks 83b9d63d59 1. Added functionality to the data sources to allow engine to get mapping from input files to (merged) read group ids from those files.
2. Used said mapping to implement N-way-in,N-way-out functionality in the new indel cleaner.  Still needs more testing (to be done after vacation but preliminary tests look good).
3. Fixes to VCF validator: ignore case when testing VCF reference base against true reference base and allow quals of -1 (as per spec).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2773 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-04 04:12:49 +00:00
hanna 3d922a019f Basic support for very simple index-driven locus traversals. Interface has been changed to
support batched intervals in a single shard, but intervals are not yet compressed into a single
shard.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2730 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 03:14:26 +00:00
hanna b19bb19f3d First successful test of new sharding system prototype. Can traverse over reads from a single
BAM file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2587 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 03:35:55 +00:00
aaron db9570ae29 Looks bigger than it is:
* Moved GATKArgumentCollection into gatk.arguments folder to clean up the main folder, also added some associated argument classes (most of the changes).
* Added code the argument parsing system for default enums, we needed this so we could preserve the current unsafe flag, and at the same time allow finer grained control of unsafe operations.  You can now specify:

"-U" (for all unsafe operations), "-U ALLOW_UNINDEXED_BAM" (only allow unindexed BAMs), "-U NO_READ_ORDER_VERIFICATION", etc.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2586 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 00:14:35 +00:00
aaron 16777e3875 more fixes for the empty interval list problem; you can now run LocusWindow traversals with an empty interval list, but the GATK will give you a warning (unless you're running in unsafe mode).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2563 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:47:43 +00:00
aaron 3c5f5177b1 check to see if the parsed interval list is empty, since we now allow interval files that are empty. If so, make sure we default to a non-interval based traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2559 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 17:52:27 +00:00