4041 Commits (bfbf75fe3e1fbd4429beebe84c3be4e331bb2568)

Author SHA1 Message Date
kshakir 542d394e09 Cleaning up Queue debugging output.
-l DEBUG with local programs now prints out the stdout/stderr of the programs as they are run.
More documentation in the examples with a new even simpler CountReads example.
Took out unused option to build Queue GATK extensions separately.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4025 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 15:54:08 +00:00
chartl 49a3db9dfe A brief implementation of a QD calculation that is not quite so bimodal for known variants (multiplicatively penalizes QD by (n variant samples)/(n variant alleles) ). Not sure how helpful this will be (which is why it is in oneoffs). Seems nice on MCKD1, but I'm still playing with the optimization.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4024 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 15:42:37 +00:00
chartl c6a8fba922 Occasionally if a JEXL expression results in no variants being captured (like "QD > 20.0" on filtered variants) the per-sample mapping from samples to eval objects can be empty. This semi-hacky fix prevents null pointer exceptions in setting up the resulting empty table (by jumping straight to it in this case)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4023 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 15:37:45 +00:00
ebanks f874e548aa Shame on us. FlagStat used ints instead of longs, so we ended up getting negative read counts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4022 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 03:00:57 +00:00
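
The int-versus-long problem described in the FlagStat commit above is easy to reproduce in isolation. The following is a minimal sketch only; the class, method, and variable names are hypothetical and are not taken from the FlagStat source.

```java
// Hypothetical illustration of counter overflow; not the actual FlagStat code.
final class CounterOverflowDemo {
    /** Increments an int counter that is already at Integer.MAX_VALUE. */
    static int incrementIntAtMax() {
        int readCount = Integer.MAX_VALUE; // 2,147,483,647
        readCount++;                       // silently wraps to a negative value
        return readCount;
    }

    /** Performs the same increment with a long, which does not wrap here. */
    static long incrementLongAtIntMax() {
        long readCount = Integer.MAX_VALUE;
        readCount++;
        return readCount;
    }

    public static void main(String[] args) {
        System.out.println(incrementIntAtMax());     // -2147483648
        System.out.println(incrementLongAtIntMax()); // 2147483648
    }
}
```

Any file with more than about 2.1 billion reads would hit this wrap-around with an int counter, which is consistent with the negative counts the commit mentions.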
ebanks 71c4d3f33d Moving pointer to b36 reference from /broad/1KG to /humgen/1kg
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4021 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 00:54:34 +00:00
kshakir f39dce1082 Exposed CommandLineFunction defaults to the Queue.jar command line (see -help).
Added ability to skip up-to-date jobs, i.e. jobs where the outputs are newer than the inputs.
Changed -T CountDuplicates --quiet to --quietLocus so that Queue GATK extensions can use both short and full argument names.
Short names can be used to set values on Queue GATK extensions, for example: vf.XL :+= myFile
Moved Hidden from the GATK to StingUtils.
Updated ivy from 2.0.0 to 2.2.0-rc1 to fix sha1 issue: http://bit.ly/aX72w7
Added Queue to javadoc and testing build targets.
Added first Queue unit test.
Another pass at avoiding cycles in the DAG thanks to all function I/O being files.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4017 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 21:58:26 +00:00
chartl 8c08f47923 1) Make sure that the table size is set correctly in finalize()
2) Make sure variants are biallelic before asking for isTransversion()



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4016 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 20:32:22 +00:00
hanna 41d57b7139 Massive cleanup of read filtering.
- Eliminate redundancy of filter application.
- Track filter metrics per-shard to facilitate merging.
- Flatten counting iterator hierarchy for easier debugging.
- Rename Reads class to ReadProperties and track it outside of the Sting iterators.
Note: because shards are currently tied so closely to reads and not the merged triplet of <reads,ref,RODs>, the metrics
classes are managed by the SAMDataSource when they should be managed by something more general.  For now, we're hacking
the reads data source to manage the metrics; in the future, something more general should manage the metrics classes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4015 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 20:17:11 +00:00
ebanks 7385cce494 Useful tool for calculating the percentage of misaligned reads at homozygous non-ref indel sites
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4013 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 17:57:44 +00:00
ebanks cc9e6b4ad9 Moved into Tribble to be with VC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4012 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 17:14:32 +00:00
aaron 14e492fa80 fix for a problem in readNextRecord() of BFS, where we'd go looking for the next record far into the next contig because (f.getEnd() >= start) was never true once we cycled to a new contig. Added a check for contig identity. Also, removed duplicate HW calculation classes in the GATK and Tribble.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4011 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 17:01:38 +00:00
flannick cd4cd6db81 Added option to print out discordant sites in GenotypeConcordance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4006 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 19:55:19 +00:00
flannick 18fc5c8c3e Initial implementation of annotator to compute allele balance for each sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4005 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 19:40:17 +00:00
flannick 1dc373b9d0 Initial implementation of evaluator to compute popgen theta statistics
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4004 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 19:36:34 +00:00
aaron 0a8ebcb4f9 moving tests over from the GATK to Tribble, and added a speed-up to the readNextRecord() that Mark suggested. Also removed the contained flag from the queries to Tribble in the GATK.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4003 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 17:54:59 +00:00
ebanks 3ff6e3404e Alleles are now returned in a consistent order, so we can deal with tri-allelic sites
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4002 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 15:21:10 +00:00
aaron d514c424fd adding tests for BTI in the ROD validation tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3997 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 06:05:40 +00:00
ebanks ca5b274f16 Unit, integration, and performance tests are all busted, so this is a good time to make a big commit...
Major cleanup of the genotype writer code from the calling end.  UG no longer supports making calls in anything but VCF, and that allows us to use the VCFWriter more generically now.  Putting the ball in Matt's court to finish collapsing everything.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3996 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 04:18:29 +00:00
aaron 0f78f70ed4 fix for feature source in Tribble; we need to check that the record coming back isn't null. Also in the GATK added code to set the default logging level in integration tests to WARN; with the default level change they were spewing a bunch of text.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3995 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 02:57:23 +00:00
ebanks 419a36f74c Starting the clean up of the sting.utils.genotype code which is all either moving to Tribble, moving to sting.utils.vcf, or being removed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3994 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 02:16:05 +00:00
depristo 2a4a4b0aab VariantRecalibrator now calls plot_Tranches directly so it works on the farm
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3993 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 23:17:16 +00:00
depristo c2c0c1f57c Removing unused --enable_overlap_filters argument; Eric assures me this won't break the currently broken tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3992 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 22:27:13 +00:00
aaron 0f29f2ae3f fixes for the Tree index, and some small clean-up in the GATK.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3991 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 20:41:50 +00:00
rpoplin 3eee3183fd Checking in the tiger team changes. LOD calculation modified. -qScale is back in case people need it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3990 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 20:41:03 +00:00
ebanks 0eeb659aa3 Useful utility function to print out the Allele as a String since toString prints out * for refs. It was annoying to keep seeing new String(Allele.getBases()).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3989 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 20:35:56 +00:00
chartl d0ecb8875a Added - a class to count functional annotations by sample (currently for the MAF annotation strings, soon to be migrated to genomic annotator once it is up and running)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3988 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 20:09:13 +00:00
aaron 5b0b9e79ba protect against nulls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3987 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 19:21:39 +00:00
depristo 8944800f60 Minor refactoring for Ryan
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3986 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 18:05:23 +00:00
kshakir 4f51a02dea Changed logging level to default at INFO instead of WARN.
Changes to StingUtils command line for use in Queue, replacing Queue's use of property files.
Updates to walkers used in existing QScripts to add @Input/@Output.
RMD used in @Required/@Allows now has a new default equal to "any" type.
New QueueGATKExtensions.jar generator for auto wrapping walkers as Queue CommandLineFunctions.
Added hooks to modify the functions that perform the Scattering and Gathering (setting their jar files, other arguments, etc.)
Removed dependency on BroadCore by porting LSF job submitter to scala.
Ivy now pulls down module dependencies from maven.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3984 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 16:42:48 +00:00
aaron 30178c05c5 providing a way to specify how you'd like -BTI combined with your -L options; set BTIMR to either UNION (default) or INTERSECTION.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3983 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 14:00:52 +00:00
hanna 6b4a1e3b9f Reenabling code that was commented out after it was confirmed to work by many participating in this thread:
http://getsatisfaction.com/gsa/topics/error_thrown_when_reading_reference_file


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3981 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 00:12:09 +00:00
kiran 48e311a5ea Added copyright notice.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3980 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 07:11:51 +00:00
kiran 9aa70d9c7c Replaced by SelectVariants
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3979 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 07:07:42 +00:00
kiran 758ab428f5 Better logging info for the samples being selected and the sample expressions being ignored.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3978 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 07:03:37 +00:00
kiran e242a8f143 Put single quotes around the regex. This isn't strictly necessary through the integration test machinery, but *is* necessary at the console, and it's convenient to be able to cut and paste this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3977 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 05:56:57 +00:00
kiran 13f29660bb Integration test for SelectVariants. Tests a complex case with an explicit sample selection, sample selection by regex, exclusion of non-variant and filtered loci, and JEXL selection on low allele-frequency variants
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3976 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 05:49:47 +00:00
ebanks 637a1e5055 Updating to use the new VA interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3975 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 05:31:01 +00:00
ebanks bd6d5a8d51 Adding command-line header to VA and VF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3974 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 05:21:15 +00:00
kiran 64446f0ddf Avoid NaNs in the final output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3973 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 05:16:52 +00:00
ebanks 3f6e44dc71 Updated recalibrator and cleaner to output full command-lines in the bam header
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3972 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 04:39:18 +00:00
kiran 0da0dfa1da Cosmetic change - lower-case for all command-line arguments' short names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3971 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 04:12:01 +00:00
kiran eb1bb94d1c Moved the evaluation of the JEXL expressions to a point *after* the samples are subset and the INFO-field annotations are updated. I think this makes more sense than having the evaluations happen beforehand, since it seems jarring to have the JEXL expressions operate on the annotations before they're updated, and have the file contain the annotations after they're updated. Now, selecting on something like allele frequency will actually apply to the annotations that actually end up in the file, while selection on other annotations (which are carried over without modification) will act exactly the same regardless.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3970 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 04:09:02 +00:00
ebanks 594b7912f1 Added a generic method for returning the complete command-line used when calling a walker, to be used in the bam/vcf headers. As requested, every possible engine/walker argument is included. I've added it to the Unified Genotyper output, so people can try it out and let me know what they think. Something that needs to be discussed in group meeting: what happens when we merge VCFs? Do we keep all of the command-lines?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3969 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 03:53:07 +00:00
kiran 6e389059cf An improved version of VariantSubset and VariantSelect, meant to replace those walkers. Takes in a VCF and creates a subsetted VCF by sample(s), JEXL expressions, or both.
When subsetting by sample, the -SN argument is treated as a literal sample name and, if no match is found, as a regular expression.  This allows for a large number of samples to be selected at once (useful when, for instance, cases are given one sample name prefix and controls are given another).

After the subsetting procedure, the INFO-field annotations AC, AN, AF, and DP are all recalculated to properly reflect the new contents of the VCF.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3968 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-08 02:57:06 +00:00
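
The -SN matching rule described in the commit above (literal sample name first, regular expression as a fallback) could be sketched roughly as follows. This is an illustration under stated assumptions, not the SelectVariants code: the class and method names are hypothetical, and whole-name matching via `Matcher.matches()` is an assumption about how the regular expression is applied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and matching semantics are assumptions.
final class SampleSelector {
    /**
     * Returns the samples selected by one -SN style expression:
     * an exact sample name if one exists, otherwise every sample
     * whose full name matches the expression as a regex.
     */
    static List<String> select(List<String> vcfSamples, String expression) {
        // Step 1: treat the expression as a literal sample name.
        if (vcfSamples.contains(expression)) {
            return Collections.singletonList(expression);
        }
        // Step 2: no literal match, so fall back to a regular expression.
        Pattern pattern = Pattern.compile(expression);
        List<String> matched = new ArrayList<>();
        for (String sample : vcfSamples) {
            if (pattern.matcher(sample).matches()) {
                matched.add(sample);
            }
        }
        return matched;
    }
}
```

With samples `case_01`, `case_02`, and `control_01`, the expression `case_01` selects one sample literally, while `case_.*` selects both cases through the regex fallback.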
ebanks ac4699a650 Re-enabling this test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3962 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 20:20:37 +00:00
depristo f275041b1c -minimalVCF for CombineVariants. Work around for broken locking code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3960 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 16:10:59 +00:00
aaron 9076c0b28b removing unused code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3958 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 14:24:39 +00:00
ebanks 341e752c6c 1) AlleleBalance is no longer a standard annotation, but the Allelic Depth (AD) is for each sample.
2) Small fixes in the VCFWriter:
a) Trailing missing values weren't being removed if their count was > 1 (e.g. ".,.")
b) We were handling key values that were Lists, but not Arrays.  We now handle both.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3956 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 12:05:14 +00:00
aaron c68625f055 Fixes from Mark for the MutableContexts; this fixes the clearGenotypes() and the clearFilters() methods, and adds a method to clear the attributes. Also added is a method for creating a variant context where the attribute list is pruned to a specific subset, which can be null.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3955 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 22:39:51 +00:00
aaron 72ae81c6de VariantContext has now moved over to Tribble, and the VCF4 parser is now the only VCF parser in town. Other changes include:
- Tribble is included directly in the GATK repo; those who have access to commit to Tribble can now directly commit from the GATK directory from Intellij; command line users can commit from inside the tribble directory.
- Hapmap ROD now in Tribble; all mentions have been switched over.
- VariantContext does not know about GenomeLoc; use VariantContextUtils.getLocation(VariantContext vc) to get a genome loc.
- VariantContext.getSNPSubstitutionType is now in VariantContextUtils.
- This does not include the checked-in project files for Intellij; still running into issues with changes to the iml files being marked as changes by SVN

I'll send out an email to GSAMembers with some more details.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3954 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 18:47:53 +00:00
fromer b21f90aee0 Added preliminary framework for performing short-range phasing (ReadBackedPhasingWalker.java)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3953 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 14:56:34 +00:00
rpoplin a8d37da10b Checking in everyone's changes to the variant recalibrator. We now calculate the variant quality score as a LOD score between the true and false hypotheses. Allele Count prior is changed to be (1 - 0.5^ac). Known prior breaks out HapMap sites.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3952 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 14:12:19 +00:00
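
The allele count prior quoted in the commit above, (1 - 0.5^ac), can be worked through directly. The sketch below only evaluates that one formula; the class and method names are hypothetical, and nothing here reflects how the recalibrator combines the prior with the LOD score.

```java
// Hypothetical helper that evaluates the prior formula from the commit message.
final class AlleleCountPrior {
    /** Returns 1 - 0.5^ac for a non-negative allele count. */
    static double prior(int alleleCount) {
        return 1.0 - Math.pow(0.5, alleleCount);
    }

    public static void main(String[] args) {
        System.out.println(prior(1)); // 0.5
        System.out.println(prior(2)); // 0.75
        System.out.println(prior(3)); // 0.875
    }
}
```

The prior rises toward 1 as the allele count grows, so singletons (ac = 1) start at 0.5 and sites seen in many chromosomes are treated as increasingly likely to be real.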
ebanks 07addf1187 Fix for Kiran: since the Variant Annotator will re-annotate on top of existing annotations it makes sense to remove old headers if they conflict with the definitions being added by VA.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3951 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 06:44:39 +00:00
ebanks 1539791a04 Fix for Kiran: when using VCFs for the comp tracks in the Annotator(s), don't put the headers from them into the output VCF.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3950 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 04:45:47 +00:00
ebanks 227c4b10f0 Bug fix for Chris: convert comp tracks to VC so that we can respect the filter field. Added an integration test to cover this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3949 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 04:13:16 +00:00
ebanks 84ca2f27bb Bug fix for Chris: added method createPotentiallyInvalidGenomeLoc() to the GenomeLocParser that doesn't check that the contig exists in the sequence dictionary. This is crucial for lifting over from one reference to another, as sometimes contigs names change in the liftover (e.g. chrM to MT).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3948 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 03:19:02 +00:00
ebanks f247cbf68e I want to be the first to use the new super-cool Hidden annotation! No more telling people not to use the cleaner debugging options.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3947 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 02:44:37 +00:00
hanna 78bfe6ac48 Added @Hidden annotation, a way to deliberately exclude experimental fields and
walkers from the help system.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3946 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 02:26:46 +00:00
chartl 82d6c5073b A simple read strand filter for potluri on get satisfaction
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3945 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 23:23:50 +00:00
asivache d53d5ffbf6 A utility class that computes running average and standard deviation for a stream of numbers it is being fed with. Updates mean/stddev on the fly and does not cache the observations, so it uses no memory and also should be stable against overflow/loss of precision. Simple unit test is also provided (does *not* stress-test the engine with millions of numbers though).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3944 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 21:39:02 +00:00
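
A running mean and standard deviation that updates on the fly without caching observations, as the commit above describes, is commonly done with Welford's online algorithm. The commit does not say which method the utility class uses, so the sketch below is an assumption, and all names in it are hypothetical.

```java
// Hypothetical sketch using Welford's online algorithm (an assumption;
// the commit does not name the method used by the actual utility class).
final class RunningStats {
    private long count = 0;
    private double mean = 0.0;
    private double sumSquaredDiffs = 0.0; // sum of squared deviations from the mean

    /** Folds one observation into the running statistics in O(1) time and space. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        sumSquaredDiffs += delta * (x - mean); // uses the updated mean
    }

    long count() { return count; }

    double mean() { return mean; }

    /** Sample standard deviation (n - 1 denominator); 0.0 for fewer than two values. */
    double sampleStdDev() {
        return count > 1 ? Math.sqrt(sumSquaredDiffs / (count - 1)) : 0.0;
    }
}
```

Because each update works with deviations from the current mean rather than a raw sum of squares, this form avoids the loss of precision that the naive sum-of-squares formula suffers on long streams.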
ebanks 8d8acc9fae Moving G's MyHapScore to replace the old HapScore
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3943 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 21:00:54 +00:00
ebanks 7858ffec32 Spit out the error in the warning message so that Sendu can tell me what his problem is
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3942 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 20:40:28 +00:00
delangel 86211b74e8 Bug fix: when padding alleles in creating a Variant context from an indel, leave no-call alleles as no-call alleles.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3940 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 19:51:10 +00:00
chartl 38e65f6e1b Added: A VariantEval module that gives simple metrics by sample, and an abstract class that makes per-sample modules easy to write (but a little bit clunky since a class needs to be defined for each data point -- see SimpleMetricsBySample as an example). AnalysisModuleScanner needed a slight update to pull in data points from parent classes for this to work (thanks Khalid for showing me how to do this). After a code review with Aaron (thanks) and ensuring integration tests pass, I am committing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3939 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 19:37:39 +00:00
hanna f13d52e427 Attempt to determine whether underlying filesystem supports file locking and
disable on-the-fly dict and fai generation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3938 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 19:28:27 +00:00
ebanks 340bd0e2c1 Removed hard-coded pointers to references
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3934 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 17:59:37 +00:00
asivache a47824d680 A couple of type specific implementations of a single extend() method: takes an array (byte[] or short[] currently) and "extends" it to the left or to the right by the specified number of elements. Returns newly allocated array, with the content of original array copied in (if we extend by n elements to the left, then the returned array will have n default-filled elements *followed* by the content of the old array).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3932 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 15:30:48 +00:00
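
The extend() behaviour described in the commit above can be illustrated for byte[]. This is a sketch only: the commit describes a single extend() method, whereas the two method names below are hypothetical, and "default-filled" is assumed to mean Java's zero initialization.

```java
import java.util.Arrays;

// Hypothetical sketch; the real method signature is not shown in the commit.
final class ArrayExtender {
    /** Returns a new array with n zero-filled elements followed by the original content. */
    static byte[] extendLeft(byte[] original, int n) {
        byte[] result = new byte[original.length + n]; // new arrays are zero-filled
        System.arraycopy(original, 0, result, n, original.length);
        return result;
    }

    /** Returns a new array with the original content followed by n zero-filled elements. */
    static byte[] extendRight(byte[] original, int n) {
        return Arrays.copyOf(original, original.length + n); // pads with zeros
    }

    public static void main(String[] args) {
        byte[] bases = {1, 2};
        System.out.println(Arrays.toString(extendLeft(bases, 2)));  // [0, 0, 1, 2]
        System.out.println(Arrays.toString(extendRight(bases, 2))); // [1, 2, 0, 0]
    }
}
```

In both cases the input array is left untouched and a newly allocated array is returned, matching the commit's description.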
asivache 012a7cf0a5 mismatchCount now has a version that counts mismatches only along a part of the read (takes additional args start_on_read and length_on_read to specify the read's subsequence to be interrogated);
isMateUnmapped() convenience shortcut method added.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3931 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 15:27:35 +00:00
delangel e6e8a20a1e 1) Fix MyHaplotypeScore to ignore 454 reads, since all those pathological non-existing indels make some sites' score blow up. If a site is only covered by 454 reads, we (hopefully) detect this gracefully and just emit a score of 0.0 for the site.
2) New annotation SByDepth = log10(-StrandBias/Depth) (non-standard annotation, key name = "SBD"). If StrandBias/Depth happens to be positive (very rare but can happen), annotation gets value=-1000. 
3) Abstracted out new class AnnotationByDepth so that QD and SBD can share code.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3930 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 15:23:08 +00:00
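
The SByDepth formula from item 2 of the commit above can be worked through as follows. This is a sketch of the stated formula only: the class and method names are hypothetical, and the handling of a zero ratio or zero depth is not specified in the commit, so those cases are left to plain floating-point behaviour here.

```java
// Hypothetical sketch of SBD = log10(-StrandBias/Depth) with the -1000 fallback.
final class SByDepthSketch {
    static final double POSITIVE_RATIO_VALUE = -1000.0; // value stated in the commit

    static double sByDepth(double strandBias, int depth) {
        double ratio = strandBias / depth;
        if (ratio > 0) {
            // log10 of a negative number is undefined, so the commit uses a sentinel.
            return POSITIVE_RATIO_VALUE;
        }
        // Assumption: a ratio of exactly zero is not special-cased (yields -Infinity).
        return Math.log10(-ratio);
    }

    public static void main(String[] args) {
        System.out.println(sByDepth(-100.0, 10)); // 1.0
        System.out.println(sByDepth(5.0, 10));    // -1000.0
    }
}
```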
ebanks bf60ed0b25 Needed it here too: warn user instead of dying if the R script cannot be executed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3929 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 13:11:27 +00:00
ebanks 40ffe34686 Warn user instead of dying if the R script cannot be executed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3928 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 13:08:15 +00:00
ebanks 17d5e89734 Now --list annotates which modules are Standard
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3927 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-03 21:00:37 +00:00
ebanks 72875cf717 Removing annoying printouts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3926 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-03 19:55:00 +00:00
ebanks 2307bed742 VariantEval now uses the "standard" modules only by default. You can add other modules with the -E argument and not use all of the standard ones with -noStandard (they can be added back individually with -E).
Generalized some of the packaging code from VariantAnnotator.  Matt might want to take a look to make this nicer...?



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3925 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-03 16:51:10 +00:00
ebanks a7ff9caf54 Added sanity check against bad people and/or crazy big indels at edges of ref context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3918 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-03 05:37:17 +00:00
hanna 5f1b67c1de Copping out and forcing the entire GATK (and associated JVM) to use US English
locale.  Method to force JVM into proper locale exists in CommandLineProgram
and is disabled by default, but implementers of CommandLineProgram can opt in
to the forced US locale by calling a static method.

Question for the VCF developers: I removed the code to explicitly output doubles
in US locale.  Do you / how do you want to handle this in applications that use
Tribble outside the GATK?


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3917 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-03 03:48:26 +00:00
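
The locale issue behind the commit above shows up whenever doubles are formatted: some locales use a comma as the decimal separator, which corrupts numeric fields in formats such as VCF. The sketch below uses only standard java.util.Locale calls; the class and method names are hypothetical, and it does not reproduce the CommandLineProgram method the commit refers to.

```java
import java.util.Locale;

// Hypothetical demo of locale-dependent number formatting.
final class LocaleDemo {
    /** Formats a value with two decimals using an explicit locale. */
    static String formatWith(Locale locale, double value) {
        return String.format(locale, "%.2f", value);
    }

    public static void main(String[] args) {
        System.out.println(formatWith(Locale.GERMANY, 3.14159)); // 3,14
        System.out.println(formatWith(Locale.US, 3.14159));      // 3.14

        // Forcing the JVM-wide default, which is one way to get the effect
        // the commit describes for the whole program:
        Locale.setDefault(Locale.US);
        System.out.println(String.format("%.2f", 3.14159));      // 3.14
    }
}
```

Setting the default once at startup is simpler than passing a locale to every format call, at the cost of affecting every library running in the same JVM, which is the trade-off the commit's question to the VCF developers raises.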
chartl 2bc69572cb Make transcript2info capable of handling b37/hg19 contigs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3915 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-02 17:32:08 +00:00
depristo c203e0fb02 Added JEXL support for hetCount, homRefCount, and homVarCount in VCs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3914 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-02 12:24:11 +00:00
depristo 7fab5c0a8f support for -singleton_fp_rate arguments to variant recalibrator instead of the pop.gen. AF prior. Worth experimenting with Ryan.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3913 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-31 21:17:47 +00:00
ebanks 6d91cd587e Be explicitly clear about which options are for debugging purposes only and shouldn't be used if your username is not ebanks@broad. If only we had a @hidden annotation option for args...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3909 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-30 14:18:31 +00:00
depristo ac8048f17b Support for automated selects for tranches in variant eval -- use -tf to make tranche-specific ve outputs. ApplyVariantCuts with tranche reading functions for general use, along with todo for ryan. CombineVariants now has --filteredAreUncalled and will treat filtered snps in input VCFs as uncalled, and so won't emit -filteredInOther set features
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3908 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-30 14:16:43 +00:00
chartl 9231d13252 Minor modification: adding an argument to make slightly more general.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3907 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-30 05:20:20 +00:00
chartl db54d63fc7 Hahaha yes, ownage. This now works.
BTW, Eric, thanks for forwarding the DepthOfCoverage thread to gsamembers. I'd forgotten about reduce by interval. Mighty helpful in this case!




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3906 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-30 04:23:02 +00:00
chartl 3e3f8c7692 Simple count intervals walker, as per my recent email to GSAMembers. Never use this. It doesn't behave the way you think it does.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3905 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-30 03:39:23 +00:00
delangel ba1a330293 Corrected location and made more explicit the error message thrown if someone tries to read a VCF 3.3 file with indels, which is not supported.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3901 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-29 20:02:47 +00:00
delangel 5af986e0c1 Add an integration test for Beagle (one for ProduceBeagleInput and one for BeagleOutputToVCFWalker)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3897 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-29 18:49:22 +00:00
delangel e1a34685fd Add back MyHaplotypeScore as a new implementation for HaplotypeScore, this time as a non-standard annotation. The implementation is also better: it computes better consensus haplotypes and ranks them by sum of quality score.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3890 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 21:23:19 +00:00
hanna 6c93b13428 A Java sizeof, implemented using the Java instrumentation API. Can get the memory consumed either only by a single
object or by a single object and all the references it contains.  Requires a command-line change to add a Java agent to
the command-line; see the Sizeof.java javadoc for details.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3889 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 18:44:15 +00:00
rpoplin f5566a6593 Knocking out some quick findBugs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3887 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 14:10:59 +00:00
delangel 894623858d OK, bad idea to add new temporary annotation - revert to keep integration tests happy.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3886 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 12:07:13 +00:00
delangel 71bfb1ee35 First redesign of HaplotypeScore - now, a different approach is taken to build possible haplotypes at a site: first, all possible haplotypes consistent with reads are formed (reference is not used). After this list has been formed, it is ranked according to the number of reads that are consistent with it and the two most popular haplotypes are chosen.
This reduces to the old method in typical cases, but it builds haplotypes correctly if there are two variants close by within a context window.

Annotation is temporarily named MyHaplotypeScore so it can be run in parallel with old one, soon it will be renamed after some more testing.
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3885 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 10:54:56 +00:00
delangel cffebcc867 Small utility walker used for production of the Beagle data processing paper section. Walker will print out to output file, for every site common to a reference vcf and an eval vcf, a given sample's depth, hapmap AC and AF and pre/post Beagle genotype as well as corresponding reference (e.g. Hapmap) genotype.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3884 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 03:00:17 +00:00
ebanks 1d9ed1e214 Cleanup of old VCFRecord code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3883 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 02:56:47 +00:00
ebanks 7dd55fbf13 Archiving
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3882 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-27 02:47:18 +00:00
aaron 9667942e52 fix for Ryan's issue: we also need to sync when we store a resource.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3881 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-26 22:17:47 +00:00
hanna 8b072b59e2 Returning index dumping functionality in BAMFileStat to a useable state.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3880 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-26 20:03:50 +00:00
depristo 19ad44d332 Minor improvements to CombineVariants to handle the complex case from Chris. IntegrationTest of complex case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3876 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-25 13:46:11 +00:00
ebanks 7c5a3836db Trivial changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3875 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-25 04:00:47 +00:00
ebanks 56de475f11 Based on feedback from non-GSA users, who claim that our exceptions are 'scary and overwhelming,' I've cleaned up the error message to first describe the error and what users should do and then ask them to copy the subsequent stack trace into their GetSatisfaction posting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3874 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-25 03:57:44 +00:00
ebanks 9bd8a2685b Because the performance tests were busted on LSF, no one caught this error until now: when Matt changed over the contract for the AlignmentContext, this line needed to get updated too. All is well now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3873 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-25 02:53:01 +00:00
depristo b551eaf8fd Actually commit the code that makes variant eval run in a reasonable amount of time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3872 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-24 17:32:03 +00:00
depristo b0b37c3476 Now handles (I believe) reference-only VCs correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3871 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 23:09:23 +00:00
depristo e21376219d Updates to CombineVariants for Tim. -setKey can be null. Integration tests for -setKey foo and -setKey null.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3870 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 22:35:52 +00:00
delangel 26bb1cd9ce Fix broken test correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3869 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 20:47:41 +00:00
delangel 4fc1db7aaf Change interface to VCFWriter add() method to take only 1 byte from reference (since that's the only thing it needs), to prevent bugs like having people call it with ref.addBases() which is wrong (since it provides bases starting from the left of reference context window).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3868 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 20:24:03 +00:00
aaron b3fd145161 fix for a bug deep in the tribble indexing: if you had a single record in the first contig, the second contig's index blocks would point to the wrong file seek location, and you'd see no
features in that contig. Thanks to Mark for finding this.  I'm not rev'ing the index version (which would cause all indexes to be rebuilt), since this seems like a pretty rare edge case.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3865 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 18:39:55 +00:00
depristo 33090629ea VariantEval can now see the EvaluationContext group objects, so they can decide if/when to print interesting sites. GenotypeConcordance has a hard-coded option to print FNs that is on the way to being generally useful. VCFWriter now uses the US locale for formatting floating point numbers; I believe this fixes a long-standing annoyance. Italian guys will check on this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3864 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 17:16:50 +00:00
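Note on the VCFWriter locale change in 33090629ea: the sketch below shows why formatting floating point numbers with the default locale can produce a decimal comma, and how passing an explicit US locale avoids it. The class and method names are hypothetical and are not taken from the VCFWriter source.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not the actual VCFWriter code.
final class LocaleFormatSketch {
    // Formats with an explicit US locale so the decimal separator is always '.'.
    static String formatUs(double value) {
        return String.format(Locale.US, "%.2f", value);
    }

    // Formats with a caller-supplied locale to show how the separator can change.
    static String formatIn(Locale locale, double value) {
        return String.format(locale, "%.2f", value);
    }

    public static void main(String[] args) {
        System.out.println(formatUs(0.5));               // 0.50
        System.out.println(formatIn(Locale.ITALY, 0.5)); // 0,50
    }
}
```

Pinning the locale at the formatting call keeps the output independent of the JVM's default locale setting.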
delangel 5eef15cfdf a) Bad bug fix to CombineVariants: when indels were being merged, the reference base provided was wrong - ref.getBases()[0] was being used, but this returns the base at the start of the window. Instead, the reference at the current locus should be used.
b) Cosmetic change to Beagle annotation description.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3861 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 15:13:47 +00:00
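Note on the reference base fix in 5eef15cfdf: a minimal sketch of the indexing issue, assuming the reference context exposes an array of bases that starts at the left edge of a window rather than at the current locus. The class, method, and parameter names here are hypothetical.

```java
// Hypothetical illustration of window-relative indexing; not the GATK reference context API.
final class ReferenceWindowSketch {
    // bases covers positions [windowStart, windowStart + bases.length - 1].
    static byte baseAtLocus(byte[] bases, long windowStart, long locus) {
        long offset = locus - windowStart;
        if (offset < 0 || offset >= bases.length) {
            throw new IllegalArgumentException("locus " + locus + " is outside the window");
        }
        // bases[0] is the base at windowStart, which is only correct when offset == 0.
        return bases[(int) offset];
    }

    public static void main(String[] args) {
        byte[] window = "ACGT".getBytes();
        System.out.println((char) window[0]);                      // A (start of window)
        System.out.println((char) baseAtLocus(window, 100, 102));  // G (base at the locus)
    }
}
```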
ebanks 4ff8b8fc0e 1. Fixing a bug that Mark found where indel-containing clipped reads would get an original cigar tag even when they didn't actually get modified.
2. Added some useful logging messages.
3. Added a oneoffs walker to calculate the number of realigned reads and intervals containing them.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3860 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 14:24:01 +00:00
chartl 973934f769 Depth of coverage now uses longs rather than ints. We can now successfully run on the Lepidosiren paradoxa genome. (about 80 GB)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3859 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 14:14:12 +00:00
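Note on the int-to-long change in 973934f769: a small sketch of the overflow that motivates using long counters on very large inputs. The class and method names are hypothetical.

```java
// Hypothetical counters for illustration; not the DepthOfCoverage code.
final class CounterOverflowSketch {
    // 32-bit addition silently wraps around past Integer.MAX_VALUE.
    static int addInt(int count, int delta) {
        return count + delta;
    }

    // 64-bit addition has enough range for counts in the tens of billions.
    static long addLong(long count, long delta) {
        return count + delta;
    }

    public static void main(String[] args) {
        System.out.println(addInt(Integer.MAX_VALUE, 1));  // -2147483648
        System.out.println(addLong(Integer.MAX_VALUE, 1)); // 2147483648
    }
}
```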
depristo 536399eaa0 Improvements to variant combine. Now calculates AC/AN/AF correctly by calling into the VariantAnnotator engine. Automatically removes annotations that are inconsistent across incoming VCs (in simpleMerge). TODO bug fix for Guillermo/Eric.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3858 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 13:33:11 +00:00
aaron 9579aace1f updates to code dependent on Tribble, as well as the following Tribble changes:
- makes writing to disk optional for indexes using the indexCreator classes (allow the user to specify the index file, if null don't write it)
- removed some system.out debugging code
- fixed version checking in interval tree 
- made indexes store and return a LinkedHashSet for sequence names (to ensure they've preserved the ordering in the file)
- index creators now read the file before creating the index
- changed the Index.write() method to take a LEDataStream instead of a file
- removed the sequence dictionary code on the header
- added utils for getting LEDataStreams
- added a base Tribble exception




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3857 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 01:56:10 +00:00
ebanks c5325b03be 1) Removed hard-coded strings. Please let's use the fields defined in VCFConstants.
2) General code cleanup.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3856 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 01:49:47 +00:00
hanna e9d243babb More improvements to exception handling during multithreaded runs based on
a bug reported by Ryan.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3855 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 22:13:01 +00:00
hanna 83798225ac Repackaged datasource-specific command-line tools into their own package. Added a tag renamer tool.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3854 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 19:50:34 +00:00
delangel 98caedb5f0 Forgot to update VCF4 unit test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3853 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 16:25:51 +00:00
asivache 485023ba8e this.intersect(that) method added to GenomeLoc (returns intersection of two intervals or dies if the locations do not overlap)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3852 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 16:00:30 +00:00
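Note on the intersect method in 485023ba8e: a minimal sketch of interval intersection that fails when the two locations do not overlap. IntervalSketch is a hypothetical stand-in for GenomeLoc, and throwing IllegalArgumentException is an assumption about what "dies" means here.

```java
// Hypothetical stand-in for GenomeLoc; closed intervals [start, stop] on a named contig.
final class IntervalSketch {
    final String contig;
    final long start;
    final long stop;

    IntervalSketch(String contig, long start, long stop) {
        this.contig = contig;
        this.start = start;
        this.stop = stop;
    }

    // Two intervals overlap when they share a contig and at least one position.
    boolean overlaps(IntervalSketch that) {
        return contig.equals(that.contig) && start <= that.stop && that.start <= stop;
    }

    // Returns the shared region, or throws if there is none.
    IntervalSketch intersect(IntervalSketch that) {
        if (!overlaps(that)) {
            throw new IllegalArgumentException("intervals do not overlap");
        }
        return new IntervalSketch(contig, Math.max(start, that.start), Math.min(stop, that.stop));
    }
}
```

For example, intersecting chr1:10-20 with chr1:15-30 under this sketch gives chr1:15-20.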
asivache 3308d956f4 Added utility shortcut method: getOriginalQualsInCycleOrder(read)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3851 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 15:44:25 +00:00
delangel 473ec91633 a) Bug fix in VCFHeader parsing - Info fields were not being parsed properly, with the result that the Count field was not being properly displayed in records (e.g. if Count=0 for a particular field, the INFO tag was still being displayed as ...;Field=x;... instead of ...;Field;...).
b) Bug fixes and update to how we represent indels and other complex events in a VariantContext object. Convention is now that all events are left aligned, with the first variant context location marking the common base before an event occurs. However, alleles in a VC don't have the common base in all VC's. Two new functions are now part of VariantContextUtils: CreateVariantContextWithPaddedAlleles and CreateVariantContextWithTrimmedAlleles. Both take a VC as an input and create a VC as an output.
Main flow is that a VCF reader would create a VC with trimmed alleles, all walkers would ideally work with these trimmed alleles, and then the VCF writer would pad back the alleles before writing. However, there are special cases where we need to pad alleles, for example when merging/combining VC's.

Pending issues:
- PED and DBSNP RODs have to be updated to create VC's for indels following the convention above. Changes will go in after Tribble location is moved and things are tested.
- Need to verify Indel genotyper and other modules that create VC's with indels.
- Wiki page describing convention above and how walkers should interpret indel VC's still needs updating/detailing.
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3850 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 02:36:45 +00:00
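Note on the padded/trimmed allele convention in 473ec91633: a rough sketch of the two directions, assuming "padded" means each allele carries the common base that precedes the event and "trimmed" means that base has been removed. The class and methods are hypothetical and do not reproduce the VariantContextUtils functions named in the commit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of padding and trimming allele strings.
final class AlleleTrimSketch {
    // Removes the shared leading base from every allele (reader side).
    static List<String> trim(List<String> padded) {
        List<String> out = new ArrayList<>();
        char common = padded.get(0).charAt(0);
        for (String allele : padded) {
            if (allele.isEmpty() || allele.charAt(0) != common) {
                throw new IllegalArgumentException("alleles must share their first base");
            }
            out.add(allele.substring(1));
        }
        return out;
    }

    // Puts the shared leading base back on every allele (writer side).
    static List<String> pad(char commonBase, List<String> trimmed) {
        List<String> out = new ArrayList<>();
        for (String allele : trimmed) {
            out.add(commonBase + allele);
        }
        return out;
    }
}
```

Under this sketch, a deletion written as AT/A in the file is held internally as T and the empty allele, and is padded back to AT/A on output.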
chartl b696c3ea98 No more traversal reduce results.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3849 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-21 18:34:54 +00:00
chartl 365b42390d Support for generating (very basic) wiggle files for use with IGV (see UCSC for wiggle spec); and a walker to take in a variant track and create a transition transversion rate track for the whole genome (due to the wiggle spec, this has to be done by chromosome). It's interesting to see the effect of genes!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3848 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-21 18:04:30 +00:00
depristo f7957bc7f2 Fixed memory leak in VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3845 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-21 12:35:46 +00:00
aaron 1cba81c16f updates to tribble with fixes for some bugs I've found in some new indexing code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3842 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 22:08:04 +00:00
ebanks ff6748d1cd oops - missed one
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3841 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 18:55:19 +00:00
ebanks c6ad26e04f 1) When quals/GQs are really integers (x.00), strip off the floating points.
2) Keep track of whether vcf records are unfiltered vs. pass filters in the variant context so we can regenerate the records on output.
3) No more "ID" hard-coded all over the code to set the VariantContext ID.  Use a static variable instead.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3840 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 18:01:45 +00:00
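Note on item 1 of c6ad26e04f: a minimal sketch of dropping the fractional part when a quality value formats as x.00. The two-decimal format and the names below are assumptions for illustration, not the writer's actual code.

```java
import java.util.Locale;

// Hypothetical formatter for illustration only.
final class QualFormatSketch {
    // Formats to two decimals, then strips a trailing ".00" so whole values print as integers.
    static String formatQual(double qual) {
        String s = String.format(Locale.US, "%.2f", qual);
        return s.endsWith(".00") ? s.substring(0, s.length() - 3) : s;
    }

    public static void main(String[] args) {
        System.out.println(formatQual(30.0));  // 30
        System.out.println(formatQual(30.25)); // 30.25
    }
}
```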
ebanks 0db7fab1a9 Fixing genotype filtering for VF and adding integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3839 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 07:30:21 +00:00
aaron 2a6c2d3098 re-enable test; I was moving the input file in prep for my last commit around on Eric, so he rightfully removed the test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3838 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 07:14:59 +00:00
aaron 0108517b98 updating the Tribble track loading code to use the new shared locks, updated lots of new tests, add infrastructure for the TreeInterval, and removed the old locking class.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3837 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 07:08:10 +00:00
ebanks f742980864 1. Refactoring of GenotypeWriters so that parallelization now works again with VCF4.0. We now have just a single reference to the old VCF classes, and that one will be purged soon.
2. Moved Jared's VCFTool code into archive so that everything would compile.
3. Added the vcf reference base (needed for indels) as an attribute to the VariantContext from the reader.
4. TribbleRMDTrackBuilderUnitTest was complaining that a validation file didn't exist, so I commented it out.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3835 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 06:16:45 +00:00
depristo 70b07206a2 CombineVariants tests for Guillermo and Eric to explore the correctness of the in/out reader, writer behavior of the system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3834 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 22:41:48 +00:00
depristo c47a5ff5ab Official parallel CountCovariates, passes all integration tests. Now poster-child example of parallelism in GATK (Matt H). Apparent general performance improvements throughout too.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3833 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 22:13:18 +00:00
rpoplin 0b56003d1a Remove stray commented out line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3832 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 19:14:39 +00:00
rpoplin 8e31c01680 Solid processing in base quality recalibrator now has several options for how to handle no calls in the color space. --ignore_nocall_colorspace is removed and replaced by --solid_nocall_strategy. Fixed some of the @Deprecated tags in BaseUtils. LocusWalkers now filter out FailsVendorQualityCheck reads. HLA caller integration test bam file had bad vendor reads so its integration test changed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3831 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 19:10:29 +00:00
aaron 18b0114e25 remove FixBAMSortOrder walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3830 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 17:27:23 +00:00
aaron f4cfb0f990 The first step in integrating Jim's tree based index scheme:
- changed to a better method for getting headers from Codecs
- some removal of old commented out code in the GATKArgumentCollection
- changes for the rename of FeatureReader to FeatureSource
- removed the old Beagle ROD
- cleaned up some of the code in SampleUtils

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3826 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 04:49:27 +00:00
hanna 40a963541d Uniquify the registered MXBean by adding an instanceNumber=... tag to the
ObjectName.  In the Queue-enabled future, we might want to come up with GUIDs
(or at least semi-unique IDs) so that we could use JMX to track runtime
attributes for multiple jobs running simultaneously.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3825 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 00:58:54 +00:00
ebanks 5a1a3fc79a Fix bad VariantContext creation in unit test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3824 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 20:21:01 +00:00
depristo 7c42e6994f FindBugs fixes throughout the code base
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3823 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 16:29:59 +00:00
ebanks 693672a461 Refactoring the VCF writer code; now no longer uses VCFRecord or any of its related classes, instead writing directly to the writer. Integration tests pass, but some are actually broken and will be fixed this week.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3822 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 13:19:56 +00:00
ebanks 379584f1bf Re-enable (most of) these tests. Guillermo will re-enable the other one when the VCF->VC conversion is done for indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3821 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 03:24:28 +00:00
ebanks 982947d328 update to deal with partial indels (I/D with no bases) in the HM records
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3820 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 02:56:37 +00:00
depristo 414ec6f20a Removing version argument constructors that shouldn't be used. Temporarily allow -- with global variant to indicate this should be removed -- header records without description fields. Real error checking in the headers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3818 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-17 22:30:08 +00:00
depristo 14b21e487b always 4.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3817 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-17 22:28:48 +00:00
depristo d40299840c indenting clean up
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3816 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-17 22:28:28 +00:00
hanna 9207c58b8f A fix for the integration test I broke on Friday on my way out the door --
some workflows using AlignmentContext were working with it in a way I didn't
expect and wound up treating extended pileups as base pileups.  I'll work to
make sure the AlignmentContext interface is crystal clear.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3815 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-17 22:22:44 +00:00
delangel 55b756f1cc First step in major cleanup/redo of VCF functionality. Specifically, now:
a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read the header and automatically parse the version. 
b) Old VCF codec is deprecated. The reader now goes directly from parsing VCF lines to producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will eventually be deleted.
c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format.
d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved.
e) Several VCF 4.0 writer bugs are now solved.
f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data now mostly uses VCF4.0.
g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension.

Pending issues:
a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes.
b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 22:49:16 +00:00
chartl 75bea4881a Modified SampleFilter to allow for multiple samples to be given. AminoAcidTransition now turns on when you give VariantEval the right commands.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3812 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 21:27:32 +00:00
aaron 36ac73cf9a comment out broken test until it can be fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3810 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 20:04:40 +00:00
hanna 96034aee0e Cleanup for Steve Hershman's issue. In the midst of doing this, I discovered
that the semantics for which reads are in an extended event pileup are not
clear at this point.  Eric and I have planned a future clarification for this
and the two of us will discuss who will implement this clarification and when
it'll happen.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3809 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 18:57:58 +00:00
asivache 6aedede7f3 Added Type.MNP to allowed variant context types; this does not break the tests (yet)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3808 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 15:50:25 +00:00
asivache 1dd8a28a5d Added new query: isMNP(feature); returns true if dbsnp feature is multi-nucleotide polymorphism (e.g. a di-nuc TA ->CC)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3806 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 15:32:10 +00:00
aaron ec94cfdf05 remove unit test for VCF writer, it's not applicable now that we produce only VCF4. Guillermo, it's up to you if you want to adapt this or remove it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3803 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 14:33:25 +00:00
depristo b29eda83bb Parallelized CountCovariates! percent_ref_called_var now a standard genotype concordance module (for validation!). Really much smarter merging of headers for combineVariants. VCF codecs now actually look at the file version and blow up if they are the wrong versions. setHeaderVersion() in VCFHeaderLine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3802 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 14:10:18 +00:00
ebanks f293eb7de1 Fix for Kim: for some ungodly reason, I was initializing the bins that were maintaining counts to 1 instead of 0.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3801 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 03:40:29 +00:00
ebanks e7e58d7129 The SAM spec has now officially reserved my new tags for original cigar and original alignment start... except that OS has been named OP ('original POS')
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3800 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 00:09:36 +00:00
ebanks ab84ed8c68 Fix for Mark: get rid of old program tags whose IDs clash with the recalibrator/realigner tag (including if the id has a .1 at the end, etc.). Keeping them around is dangerous because we don't know which one refers to the latest run of the tool on the bam.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3798 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-15 19:13:50 +00:00
hanna dfddf8fd75 - Bring the PaperGenotyper up to code.
- Remove some old debugging cruft regarding handling of threaded engine exceptions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3796 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 22:31:21 +00:00
bthomas f65cba6b9a Adding support for shared file locking via a new class for file locking, FSLockWithShared. This will eventually take over for FSLock, the current file locking class - I'll work with Aaron to merge the tribble code that uses FSLock right now.
FYI: creating an exclusive lock on a file that does not exist will create that file as an empty file, and will NOT delete that file after the program terminates. So watch out if it's possible that the file you're locking does not exist - could end up leaving extra files that confuse users.  



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3795 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 20:45:51 +00:00
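Note on the shared locking described in f65cba6b9a: a minimal sketch using the standard java.nio file locking API, written to avoid the pitfall mentioned in the commit by never creating the file it is asked to lock. The class and method names are hypothetical and this is not the FSLockWithShared implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of acquiring a shared (read) lock without creating missing files.
final class SharedLockSketch {
    // Returns true if a shared lock was held; returns false without side effects if the file is missing.
    static boolean readWithSharedLock(Path file) {
        if (!Files.exists(file)) {
            return false; // opening read-only below would fail; nothing is created either way
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
             FileLock lock = channel.lock(0L, Long.MAX_VALUE, true)) { // true = shared
            return lock.isValid();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-check: lock a temporary file, then clean it up.
    static boolean demo() {
        try {
            Path tmp = Files.createTempFile("shared-lock-sketch", ".tmp");
            try {
                return readWithSharedLock(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Opening the channel with READ only means a missing file is never created as a side effect; an exclusive lock would need a writable channel, which is where the empty-file issue can arise if the file is opened with a create option.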
hanna a8caa20378 Previously the hierarchical microscheduler defensively coded around and reported exceptions of
the walker itself, but didn't do a great job of catching framework exceptions.  This became extremely
unfortunate in the case where walkers caused exceptions that manifested themselves in the framework,
such as when the walker opens more files than file handles are available.

Reworked the exception handling so that framework errors are treated like walker errors and the resulting
exception bubbles out of the walker.  Stack traces for threaded walkers are still convoluted and nasty.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3794 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 20:34:43 +00:00
ebanks bf384f48e1 Reverting previous change because it won't always work. More investigation needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3793 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 19:13:17 +00:00
ebanks e4bfb06888 Check header type instead of rod type, since rod type will now be VC and not VCF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3792 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 19:10:09 +00:00
ebanks 0226412b11 Add GQ to list of genotype attributes for reg exp
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3791 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 19:01:11 +00:00
ebanks 78a4d8ec3d Removing more references to VCFRecord
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3790 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 16:34:15 +00:00
ebanks af23762778 Removing more references to VCFRecord
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3789 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 11:54:23 +00:00
ebanks a4f8d70d8d oops, forgot to update this integration test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3788 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 11:38:33 +00:00
ebanks 460283f6d2 No more manually converting VariantContexts to VCFRecords. You should be utilizing VCs and not VCFRecords.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3787 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 05:21:28 +00:00
ebanks 6b5c88d4d6 The GATK no longer writes vcf3.3; welcome to the world of vcf4.0. Needed to fix a few output bugs to get this to work, but it's looking great. Much more still to come. Guillermo: hopefully this doesn't break your local build too badly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3786 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 04:56:58 +00:00
chartl 9d2a485532 Update to AminoAcidTransition eval module
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3783 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 17:12:03 +00:00
rpoplin 3db7fbb5e9 Fix for added EOF in csv file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3781 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 16:09:48 +00:00
ebanks 9a05e8143d Move to 4.0 and away from VCFRecord.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3780 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 15:54:54 +00:00
ebanks 6442dabf94 Deleting/archiving as instructed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3779 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 15:23:50 +00:00
ebanks 7e7da75d27 Moving over to 4.0 and away from VCFRecord
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3778 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 14:07:10 +00:00
ebanks d896d03554 Moving VF to vcf 4.0. Still need to fix genotype filters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3777 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 11:39:51 +00:00
ebanks 76b3b39720 Technically, Mark broke this with his commit earlier. But since I had an outstanding broken test, I lose and have to fix this one too...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3776 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 03:58:38 +00:00
ebanks 1bef7dd170
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3775 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-13 00:56:12 +00:00
depristo de969f7cc7 logger != null check
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3774 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 23:07:14 +00:00
depristo 2e445262f2 Promotion to . for variable numbers of arguments
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3773 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 22:53:53 +00:00
delangel 297f15a60c Protect ProduceBeagleInputWalker against evil users who feed it VCF's with indels, no-variation sites or other interesting markers: Write to Beagle input only at biallelic SNP sites since that's the only thing Beagle can do.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3772 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 20:54:42 +00:00
ebanks 52c534a8f2 Updating to VCF 4.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3770 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 20:18:30 +00:00
delangel 5992b79159 a) Simplify normalization code in ProduceBeagleInputWalker so that it always normalizes, and use MathUtils.normalizeFromLog10 to do this.
b) Several improvements to BeagleOutputToVCFWalker:
1. If a Hapmap input track is provided (e.g. -B comp,VCF,file), Hapmap sites will be annotated with Hapmap Allele count and allele frequency (key ACH, AFH).
2. If probability of correct genotype is lower than ncthr (optional argument provided by user, default = 0.0), walker will keep original calls instead of using Beagle calls.
3. Instead of annotating just whether Beagle had modified a site, annotate instead HOW MANY genotypes in a site were actually changed by Beagle.

All three improvements are mostly for debugging and analysis only.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3769 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 19:54:58 +00:00
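Note on the normalization step in 5992b79159: a minimal sketch of turning log10 likelihoods into probabilities that sum to one. This is a generic version written for illustration; it is an assumption that MathUtils.normalizeFromLog10 behaves similarly, and the names below are hypothetical.

```java
// Hypothetical illustration of normalizing log10-scaled values; not the MathUtils source.
final class Log10NormalizeSketch {
    static double[] normalizeLog10Likelihoods(double[] log10Values) {
        // Subtract the maximum first so that 10^x does not underflow for very negative inputs.
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            max = Math.max(max, v);
        }
        double[] out = new double[log10Values.length];
        double sum = 0.0;
        for (int i = 0; i < log10Values.length; i++) {
            out[i] = Math.pow(10.0, log10Values[i] - max);
            sum += out[i];
        }
        // Divide by the total so the results sum to 1.
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }
}
```

For three genotype likelihoods such as {0, -1, -2} in log10 space, the unnormalized values are 1, 0.1 and 0.01, so the first genotype receives 1/1.11 of the probability mass.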
ebanks e50627a49e 1. Updated tests and added integration test for liftover code.
2. Updated liftover code (and scripts) to emit vcf 4.0 and no longer depend on VCFRecord.
3. Beagle walker now also emits vcf 4.0.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3767 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 17:58:18 +00:00
ebanks 2a7112302a More archiving
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3766 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 17:04:41 +00:00
ebanks 221e01fb27 deleting/archiving as instructed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3765 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 16:59:45 +00:00
ebanks 8086ab1f75 Pulled sample/header merging routines out of CombineVariants and into util classes. Added more generalized methods for retrieving samples. Updated the Beagle walkers to use these methods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3764 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 16:51:54 +00:00
ebanks 0c4a32843c No longer uses VCFRecord
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3763 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 13:57:39 +00:00
ebanks f130d29318 No longer uses VCFRecord.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3762 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 13:34:10 +00:00
ebanks e75b3e13bd updating unit test for previous fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3761 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 03:23:53 +00:00
ebanks 0427f3554b Bug fix: valid fields were being stripped off the FORMAT for samples because String.matches was used instead of String.equals. Also, please use VCFConstants from now on instead of hard-coding e.g. missing values into the code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3760 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-12 03:06:51 +00:00
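The FORMAT-stripping bug in commit 0427f3554b is the familiar regex-versus-equality pitfall. The sketch below shows the analogous failure in Python rather than the project's Java. It assumes the missing-value marker is `.` (as in VCF) and that trailing missing values are the ones being stripped; the commit message states neither outright. `MISSING_VALUE`, `is_missing_buggy`, `is_missing_fixed` and `strip_trailing_missing` are hypothetical names, not identifiers from the codebase.

```python
import re

# Assumed missing-value marker; a hypothetical stand-in for a shared constant.
MISSING_VALUE = "."


def is_missing_buggy(value):
    # Treats the marker as a regular expression: "." matches ANY single character.
    return re.fullmatch(MISSING_VALUE, value) is not None


def is_missing_fixed(value):
    # Plain string equality: only the literal marker counts as missing.
    return value == MISSING_VALUE


def strip_trailing_missing(fields, is_missing):
    """Drop trailing missing values from one sample's FORMAT values."""
    kept = list(fields)
    while kept and is_missing(kept[-1]):
        kept.pop()
    return kept
```

With the regex-based check, a valid single-character value such as a depth of `7` is stripped along with genuinely missing entries; the equality check removes only the literal marker.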
ebanks fb717fe128 First pass needed to remove old VCF code: moving all VCF-related constants into a single unified class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3759 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-11 07:19:16 +00:00
ebanks 6b960bd9c5 Fix for Steve: genotype filters still want to see the values from the VC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3758 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-11 04:30:15 +00:00
depristo c3c66e853c Improvements for Jason
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3756 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 20:18:37 +00:00
ebanks 405be230d0 Various code improvements based on FindBugs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3755 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 15:04:48 +00:00
ebanks abaec13e38 Bug fix: if there are samples in the VCF but all of them are no-calls, we still need to emit GT for the FORMAT field to be on spec. Note that this is a holdover from 3.3 writing but can't easily be fixed there. Fortunately, that code is all going away soon...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3754 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 14:08:25 +00:00
chartl ea8fd506bf Update to PickSequenomProbes: Option to ignore mask sites within X bp of a variant (very useful for indels where dbSNP entries near the indel are almost always false SNP calls). Also fixed an integration test where the variant site itself, being in dbSNP, was represented as [N/C] rather than [A/C]. Added integration test for 1bp no-mask window.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3753 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 04:03:19 +00:00
depristo 179067e3f4 Support for . values in qual field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3752 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 01:47:02 +00:00
depristo 45fb614296 Fixes to VE for obscure bug, as well as disabled integration test for CombineVariants
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3749 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 00:13:07 +00:00
rpoplin 67f1589652 --fdr_filter_level isn't mandatory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3748 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 22:48:30 +00:00
rpoplin 5d39cd5db8 Added --fdr_filter_level to ApplyVariantCuts so that you can create beautiful tranche plots and also decide which tranche level to filter at. The previous version always filtered at the smallest tranche. The tranche filter names are appropriately added to the VCF header.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3747 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 22:44:10 +00:00
depristo 760aaeda88 Update to CombineVariants. Now splits merge options into variant and genotype options separately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3746 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 20:09:48 +00:00
ebanks bd2ba3eb37 deal with very large known indels that fall off our ref context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3745 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 20:05:16 +00:00
aaron 12fecc8d8f remove the picard DbSNP ROD.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3743 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 17:46:00 +00:00
depristo 56a0c7ee6f All headers are now converted to VCF4 by default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3741 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 14:14:17 +00:00
ebanks 6e6ad36523 reallow MNP events through
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3740 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 06:26:52 +00:00
ebanks ed0d0d78fa corresponding fix for dealing with insertions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3739 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 05:25:03 +00:00
ebanks ada8c9931f We were never clipping the VCF-provided ref base off the left end of the alleles for insertions, so the reference allele was never null (and downstream walkers would fail). Didn't this get tested with insertions at some point?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3738 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 05:24:27 +00:00
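As a rough illustration of the clipping that commit ada8c9931f describes, the Python sketch below removes the shared leading reference base from indel alleles, so that an insertion ends up with an empty (null) reference allele. It is a simplified model with a hypothetical function name, not the project's Java implementation, and it assumes clipping applies only when every allele starts with the same base and at least one ALT differs in length from REF.

```python
def clip_leading_ref_base(ref, alts):
    """Remove the shared leading padding base from REF and every ALT.

    Returns (ref, alts) unchanged for SNPs or when the alleles do not
    all start with the same base.
    """
    is_indel = any(len(alt) != len(ref) for alt in alts)
    shares_first = bool(ref) and all(alt and alt[0] == ref[0] for alt in alts)
    if is_indel and shares_first:
        return ref[1:], [alt[1:] for alt in alts]
    return ref, list(alts)
```

For example, REF `A` with ALT `ATG` becomes an empty reference allele and an inserted `TG`, while a SNP such as `A` to `C` is left untouched.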
ebanks 9a81f1d7ef Fixed this tool for chartl so that it now properly handles deletions. Added deletion case to integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3737 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 04:45:59 +00:00
ebanks 47a42b1507 trivial cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3736 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 04:42:32 +00:00
ebanks b7a3d1e61f Bug fix: if the FORMAT field consisted of just GT, we were exceptioning out. How did we not catch this until now?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3735 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 04:41:40 +00:00
ebanks 1c146aebe8 Fix logic bug
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3734 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-08 04:32:46 +00:00
hanna 9fc05ac2ae eagerDecode is now false.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3733 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 22:51:48 +00:00
ebanks 4bc3ad2194 Shame on me: UG was emitting negative QUALs (-0) in all_bases mode. Thanks, Matt.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3732 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 20:30:22 +00:00
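One common way a `-0` QUAL arises is Phred-scaling an error probability of exactly 1, which yields IEEE-754 negative zero. The commit message for 4bc3ad2194 does not say whether that was the cause here, so the Python sketch below is only an illustration of the effect and of one possible guard; `phred_qual` and `format_qual` are hypothetical names.

```python
import math


def phred_qual(p_error):
    """Phred-scale an error probability: -10 * log10(p)."""
    return -10.0 * math.log10(p_error)


def format_qual(qual):
    # Round first so tiny negative values collapse to -0.0, then add 0.0,
    # which turns negative zero into positive zero and leaves other
    # values unchanged.
    return f"{round(qual, 2) + 0.0:.2f}"
```

Formatting `phred_qual(1.0)` directly prints `-0.00`, whereas `format_qual` prints `0.00`.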
ebanks 30714ec8d9 As per quick chat with Richard Durbin, don't increase the mapping quality of realigned reads too much; for now, arbitrarily increase the MQ by 10. We need to figure out a better solution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3731 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 20:12:59 +00:00
ebanks 8ff1a4b929 Don't try to clean reads that fail the PF, in preparation for Ryan
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3730 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 19:49:36 +00:00
depristo b934cc7554 Updates to fix some bugs in merger. Now able to merge into project wide indel VCF files. Integration tests coming tomorrow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3727 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 03:16:33 +00:00
kshakir 7be8c35eb2 Workaround for scala trait erasing parameterized types:
- Requiring explicit @ClassType on parameterized fields in traits.
- Scatter / Gather functions are now abstract classes since @ClassType can't be used on parameterized fields with type parameters.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3726 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 03:15:10 +00:00
hanna 120f90da5b Interval support for ref walkers while streaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3725 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 03:14:59 +00:00
hanna 773a72e6ea An initial fix for performance issues when filtering UG with new StratifiedAlignmentContext.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3724 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 01:07:46 +00:00
delangel be75b087ec a) Add input argument (-ncrate) to BeagleOutputToVCFWalker. If the genotype posterior error probability is higher than this threshold, we declare No-call at this genotype.
b) Add "OG" annotation to genotypes. If Beagle changes genotypes, this annotation gets the original genotype call, to ease performance comparisons. If not, this annotation gets an empty value.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3723 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-06 18:33:28 +00:00
hanna 4213e05aeb Fix for sharding ref walkers via monolithic sharding. Introduces the potential bug (for
monolithic sharding only) that when traversing by read, map() function will not be called for loci
off the end of the reference.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3722 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-06 04:34:38 +00:00
aaron 86031f4034 part two: todo's in combine variants, fixes for InferredGeneticContext, and some other tests and clean-up.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3721 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-05 21:07:53 +00:00
ebanks 36edc60ccc Connected UG to the new comp track annotation system in VA. Also, when emit confidence is lower than call confidence (so that we emit records filtered with LowQual), add a corresponding FILTER header field to the VCF so that the validator doesn't complain.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3720 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-05 13:04:24 +00:00
aaron 3347d1ca7c part one of combining format and info header lines code into a single abstract class for Mark; plus some 'm' removals from access methods for Eric. Adding fixes for CombineVariants next.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3719 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-05 05:57:58 +00:00
ebanks e7220bc885 Variant Context simple merging routine should keep ID if one of the VCs has it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3718 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-05 01:10:15 +00:00
delangel 3016e1cf80 Fixes to increase robustness in vcf4 writer. We assume that at most 1 base was clipped from the beginning of the allele encoding by the reader, and improve the way we find if bases were clipped. We still can't deal with some corner cases, and duplicate records may follow, for example if a snp location is followed at the next base by an indel. Also, if we are reading from a 3.3 vcf and the reference is null (i.e. we have an insertion), the reference base is not computed correctly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3717 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-04 20:22:04 +00:00
ebanks 07945040f8 Set VariantFiltration's JEXL engine to silent for warning messages
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3716 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-04 18:11:19 +00:00
ebanks be8740b00d Another edge case in left alignment for indels: deal with cases when insertions are ambiguously placed at ends of reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3715 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-04 17:26:38 +00:00
weisburd 9ec393bfce Updated md5 - vcf header line change
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3714 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-02 21:02:09 +00:00
weisburd f7593435eb Implemented decodeLoc(..)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3713 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-02 21:01:36 +00:00
depristo cd2e4b0a1e merging now very close to working. Bug todo in writer and vcf infrastructure. Can almost create merged snp and indel files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3712 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-02 20:09:25 +00:00
delangel b6bdd61283 a) Fix bug when multi-base reference is homopolymeric when writing a VCF4.0 variant context: computation of number of trailing bases was incorrect and we ended up with incorrect position.
b) Updated VCF4WriterTestWalker to take either VCF3 or VCF4 as inputs (this walker can also be used to convert from 3.3 to 4.0).
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3711 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-02 15:19:42 +00:00
depristo 61e2b2e39b Nearly finalize merging capabilities for CombineVariants. Support for dealing with inconsistent indel alleles at loci. Improvements to Allele and removal of addAllele to MutableGenotype. We are close to being able to merge all of 1000 genomes -- snps and indels -- into a single combined vcf
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3710 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-02 13:32:33 +00:00
hanna cab8394103 The sharding system now buffers reads, with a size determined by command-line argument. Will investigate whether/how this
impacts performance on low-pass data and, if it works well, will create a more automatic version of the tool.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3709 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 22:28:55 +00:00
aaron f967cae1aa tiny comment change
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3708 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 22:04:25 +00:00
aaron 3093a20a55 fixing VCF header format and info fields so that they properly emit the unbounded count value for vcf4 or vcf3. Eric, we should update the vcf4 spec page to indicate format fields are allowed to use the unbounded count as well (if this is true).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3707 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 22:02:16 +00:00
delangel 61c07c6f90 Fixes for missing key values that can create null pointer exceptions when reading from 3.3-generated variant contexts. Also, chop missing genotype fields correctly from right to left
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3706 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 20:17:03 +00:00
rpoplin 255b036fb5 Variant Recalibrator MLE EM algorithm is moved over to variational Bayes EM in order to eliminate problems with singularities when clustering in higher than two dimensions. Because of this there is no longer a number of Gaussians parameter. Wiki will be updated shortly with new recommended command.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3704 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 18:51:07 +00:00
aaron 4903d1fb4f fix for a parallelization issue: moving the creation of iterators outside of the sync block so we don't wait for RMD tracks to seek to the correct location. Thanks to Ben for providing the test case!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3703 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 16:37:02 +00:00
aaron 43ca595d15 VCF headers now can be set to a particular VCF version after creation, which converts the header lines to the appropriate encoding on output. Plus some clean-up of the code.
Also commented out the Tribble index out-of-date tests, the timing seems to be troublesome from the farm.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3702 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 05:32:14 +00:00
hanna 4995950d04 IndexedFastaSequenceFile is now in Picard; transitioning to that implementation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3701 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 04:40:31 +00:00
hanna c9d5345150 Redo StratifiedAlignmentContext to use ReadBackedPileup's stratification options.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3699 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 02:46:05 +00:00
delangel dc4715c9c6 Permit empty fields in INFO and FORMAT structures - not fully tested yet but at least failing cases before now pass. Also, corrected a bug where in case we were reading 3.3 VCF's, or VCFs with no original allele encodings, we'd always print 2 bases per allele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3698 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 01:56:07 +00:00
depristo 5f2b2d860e Final stage of renaming
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3696 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 21:39:07 +00:00
depristo 6e7927a47d Continuing the renaming nightmare...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3695 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 20:25:01 +00:00
depristo 9d7d5f1747 Continuing the renaming nightmare...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3694 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 20:24:27 +00:00
depristo aa20c52b88 deleting vcf
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3693 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 20:19:15 +00:00
depristo 4195fc5c4e renaming part 2...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3692 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 20:18:11 +00:00
depristo 6c9da5525d renaming starting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3691 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 20:16:51 +00:00
depristo b8d6a95e7a Preliminary commit of new VCFCombine, soon to be called CombineVariants (next commit), that supports merging any number of VCF files via a general VC merge routine that supports prioritization and merging of samples! It's now possible to merge the pilot1/2/3 call sets into a single (monster) VCF taking genotypes from pilot2, then pilot3, then pilot1 as needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3690 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 20:13:03 +00:00
kshakir 178cf64a0c Refactored ArgumentDefinition to absorb functionality from ArgumentDefinition and ArgumentTypeDescriptor.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3688 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 18:37:58 +00:00
chartl 569456850d Mark pointed out there's differentiation in the filter field. Rolling back.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3687 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 17:05:53 +00:00
chartl 52a474b27d Fixed an issue with VCF combine in sites like the following:
Broad: Filtered     BC: No call

These were being treated the same as

Broad: Call         BC: No call

Added some verbosity to separate them.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3686 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 16:49:31 +00:00
ebanks 944dbb94ce Refactored and generalized the database/comp annotations in VariantAnnotator. Now one can provide comp tracks as with VariantEval (e.g. compHapMap, comp1KG_CEU) and the INFO field will be annotated with the track name (without the 'comp') if the variant record overlaps a comp site (e.g. ...;1KG_CEU;...). This means that you can now pass 1kg calls to the Unified Genotyper and automatically have records annotated with their presence in 1kg.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3684 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 16:37:31 +00:00
ebanks 47c4a70ac1 It turns out that it is legitimately possible for there to be reads that won't overlap within a target interval for cleaning. While we don't want to attempt cleaning, we also don't want to fail.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3682 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 15:50:44 +00:00
ebanks ae33d8a2f2 I just wanted one more vote. It's settled: we die.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3681 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 14:00:56 +00:00
ebanks 8fb37f5f7a For Kiran: warn the user when the actual and vcf ref bases differ so that if an exception is generated later, he knows why. All: should we generate the actual exception here? Is there any reason to allow cases where the vcf record has a different ref base than the actual reference? I'd vote that we die here. Thoughts?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3680 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 13:56:16 +00:00
delangel d932322190 More necessary fixes for VCF4.0 - now results look more sensible in realistic, bigger VCF files produced by, say, Dindel and not just the small test VCF:
- Fixed and cleaned code to produce trailing and padding bases in alleles around indels.
- Deal better with missing fields.
Pending:
- Chopping missing fields at end of genotypes.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3679 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 02:59:30 +00:00
ebanks 12c0de6170 Added ability to clean using only known indels. Added integration test for it. Fixed vcf->vc conversion for indels which was busted.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3678 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 01:20:56 +00:00
chartl 610cc7ae2b Cool package trick Kiran showed me. VariantEvaluator no longer public, AAT specifies the core package even though it lives in oneoffs. Disabled so integration tests pass.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3677 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 22:42:04 +00:00
chartl 4c6f4e41c6 Include making VariantEvaluator public within the package so my oneoffs can be seen (not included in previous submit specifically because I didn't want to break the build by changing anything in core...the road to hell is paved with good intentions)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3676 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 22:26:52 +00:00
chartl 9ac13b8f5d Name and body change for this module to reflect local code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3675 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 21:45:26 +00:00
aaron 844cb2ed33 fixing a bug that Eric found with RODs for reads, where some records could be omitted. Sorry Eric!
Also putting more tolerance into the timing on the Tribble index tests (that check to make sure we're deleting out of date indexes, and not deleting perfectly good indexes). It seems that some of the farm nodes aren't great with a stopwatch.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3674 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 21:38:55 +00:00
chartl 101c27294d Comment this guy out so we build again. (Hate it when my repository goes all funky.)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3673 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 21:16:33 +00:00
chartl 3017f82550 Initial commit of items for analyzing amino acid transitions in variant eval. Blew up my subversion by coding locally while I did not have internet. I hope this doesn't bust any integration tests since I changed no existing code but...who knows. Crossing my fingers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3672 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 20:57:18 +00:00
delangel e3fb4d5c70 Intermediate checkin, just to fix null pointer exception that happened when merging implementation with latest VCF4 decoder - field ORIGINAL_ALLELE_LIST in vc shouldn't be written in infoFields structure since this won't be output to file and there is no legal structure under this key.
Base encoding for complex events is still brittle and most probably still has issues, fixes upcoming.
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3671 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 20:57:09 +00:00
ebanks baf9479c35 An addition for Sendu since he can't seem to tell when his CountCovariate jobs die in the middle of writing the CSVs. We now write an EOF marker at the end of the covariates table and look for it when reading in the file in TableRecalibrationWalker. By default, we warn the user if the EOF marker isn't present, but we exception out if the user provides the --fail_with_no_eof_marker option.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3670 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 18:50:07 +00:00
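The EOF-marker scheme from commit baf9479c35 can be sketched as follows. This is a minimal Python illustration, not the project's Java code: the marker text `EOF`, the comma-separated layout, and the function names are assumptions, and the `fail_with_no_eof_marker` parameter merely mirrors the `--fail_with_no_eof_marker` option named in the commit message.

```python
import io
import warnings

EOF_MARKER = "EOF"  # hypothetical marker text; the log does not give the real one


def write_table(out, rows):
    """Write rows as CSV lines, then a final marker line."""
    for row in rows:
        out.write(",".join(row) + "\n")
    out.write(EOF_MARKER + "\n")


def read_table(inp, fail_with_no_eof_marker=False):
    """Read the table back; warn (or raise) if the marker line is absent."""
    lines = [line.rstrip("\n") for line in inp if line.strip()]
    if lines and lines[-1] == EOF_MARKER:
        lines.pop()
    else:
        message = "no EOF marker found: the table may be truncated"
        if fail_with_no_eof_marker:
            raise ValueError(message)
        warnings.warn(message)
    return [line.split(",") for line in lines]


# Round trip through an in-memory file.
buffer = io.StringIO()
write_table(buffer, [["rg1", "30", "A"]])
buffer.seek(0)
rows = read_table(buffer)
```

A writer that dies partway through never emits the marker line, so the reader can tell a truncated table from a complete one.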
delangel 3ca2b7374b Fixes to better deal with the "Type" and "Number" field in the INFO and FORMAT header lines in VCF4.0. We now record these fields and provide appropriate conversions. This is the first version that passes fully the VCF validator.
Also, moved the flag indicating VCF4.0 to the VCFWriter constructor.

 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3669 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 16:43:00 +00:00
ebanks 801b47c6e9 For Sendu: a similar addition to the Indel Genotyper allowing it to emit a metrics file (which for now consists only of # of normal/tumor calls made)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3668 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 13:19:17 +00:00
ebanks ddf87e61c2 For Sendu: optionally emit a metrics file with callability info (including number of actual calls made) from UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3667 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 12:57:28 +00:00
ebanks 929e5b9276 Fix possible null pointer exception
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3666 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 09:01:18 +00:00
hanna 2953c9f069 Efficiency improvement requested by the Picard team in IndexedFastaSequenceFile: improve the memory efficiency
(and loading time) of long reference sequences by better controlling the input buffer size.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3665 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-29 07:22:07 +00:00
delangel ed71e53dd4 1) Initial complete version of VCF4 writer. There are still issues (see below) but at least this version is fully functional. It incorporates getting rid of intermediate VCFRecord so we now operate from VariantContext objects directly to VCF 4.0 output.
See VCF4WriterTestWalker for usage example: it just amounts to adding
vcfWriter.add(vc,ref.getBases()) in walker.

add() method in VCFWriter is polymorphic and can also take a VCFRecord, although eventually this should be obsolete.
addRecord is still supported so all backward compatibility is maintained.

Resulting VCF4.0 are still not perfect, so additional changes are in progress. Specifically:
a) INFO codes of length 0 (e.g. HM, DB) are not emitted correctly (they should emit just "HM" but now they emit "HM=1").
b) Genotype values that are specified as Integer in header are ignored in type and are printed out as Doubles.

Both issues should be corrected with better header parsing.

2) Check in ability of Beagle to mask an additional percentage of genotype likelihoods (0 by default), for testing purposes.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3664 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 23:54:38 +00:00
ebanks 4a451949ba add parallel option to target creator for masking out reads with bad mates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3663 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 22:13:25 +00:00
ebanks 6a23edd911 Fix performance tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3662 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 21:51:48 +00:00
chartl 20f5fdbcf7 Changes to MVC to make the header of its output VCF compliant with spec (give expected # of values for info field annotations)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3660 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 18:33:23 +00:00
aaron 62d22ff1aa adding the original allele list to a variant context (as the annotation ORIGINAL_ALLELE_LIST), in the case where the set alleles are the result of clipping. Added tests for both cases.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3658 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 17:23:46 +00:00
ebanks 1292c96e29 The cleaner now adds the OC (original cigar) and OS (original alignment start) tags as appropriate to reads that get realigned; this feature can be turned off. Also, improved integration tests (sorry, Kiran!).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3657 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 16:46:47 +00:00
asivache cc8d8eaedb Now that we always reserve space for two read ends when collecting stats stratified by libraries, we need to check that the second end was indeed present; otherwise the pointer is null and this was causing an exception
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3656 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 16:40:16 +00:00
ebanks 9a24598a98 By default, don't clean reads with mates mapped to other chromosomes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3654 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 15:14:20 +00:00
ebanks bf5cbad04c Make the target creator a rod walker (that allows reads) so that we can easily trigger the cleaner on only known indel sites. Adding an integration test to cover this case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3651 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 13:28:37 +00:00
ebanks 464ac63a22 Allowing N's in ALT field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3650 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 11:41:32 +00:00
hanna 3a9d426ca8 Added hasPileupBeenDownsampled() boolean to ReadBackedPileup, so that a pileup can report whether or not (but not how much) it's been downsampled.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3649 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 04:56:33 +00:00
ebanks 8e848ccd84 SAMFileWriters can now write to /dev/null without throwing exceptions, so we can remove the try/catch blocks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3648 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-27 03:59:10 +00:00
aaron 09ccdf83b2 fixing a broken test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3647 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 21:59:00 +00:00
depristo d6cbe4d0ad Bug fixes to support haploid genotypes, optimization for indexing, now tracks the line of the VCF and catches errors to tell you the line no and line when a parsing error occurred.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3646 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 21:08:41 +00:00
aaron 5f8a3f95ef The GT field once again reigns supreme (it must be the first genotype field). Thanks for the catch Eric.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3645 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 21:03:05 +00:00
kshakir 75c98c42b8 Started path of deprecation of Sting's @Argument by splitting the annotation into @Output and @Input. Anything that's not an @Output should be an @Input.
Checked in example qscripts that are basically todo integration tests.
Replaced use of queue @Input/@Output with Sting's new @Input/@Output.  This means you'll now have to doc-ument the annotations.
More work on dependency resolution cycles being created in the graph during scatter/gather.
Filtering nulls to avoid NPE exceptions in scala's 'Collection'.hashCode.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3643 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 20:51:13 +00:00
weisburd 147ba68441 Fixed bug with mrnaCoord field - made it count exon positions only, rather than introns & exons
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3642 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 19:53:32 +00:00
aaron d3848745ab moving VCF 3.3 back into the GATK so Guillermo can make changes for VCF 4 output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3639 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 18:20:06 +00:00
aaron b3edb7dc08 two fixes for the VCF 4 parser:
- Allow the "GT" field in genotypes at any point in the genotype string (before we required they be the first key-value pair).
- Fix a bug with the phasing value put into the VariantContext, thanks for the catch Guillermo!

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3638 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 18:01:23 +00:00
weisburd e15fe6858e Disabling test - Will need to update big-tables soon.. will re-enable after updating md5
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3637 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 15:43:41 +00:00
aaron f9c7803d4e this got left off my last commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3635 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 02:42:44 +00:00
aaron 682f9b46c6 Two fixes together:
1) Some improvements to the VCF4 parsing, including disabling validation.
2) Reimplemented RefSeq in the new Tribble-style rod system.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3630 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-24 22:17:03 +00:00
aaron 62bc7651a8 fix for PSPW with DbSNP mask. Added an integration test for this case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3628 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-24 19:31:32 +00:00
aaron 8a9b2f4256 removing the GLF ROD.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3624 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 22:51:45 +00:00
aaron 611d834092 a couple of VCF 4 improvements:
-Validation of INFO and FORMAT fields.
-Conversion to the correct type for info fields (i.e. allele frequency is now stored as a float instead of a string).
-Checks for CNV style alternate allele encodings (i.e. <INS:ME:L1>); right now we exception out.  Maybe we should just warn the user?
-Tests for the multiple-base polymorphism allele case.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3622 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 20:21:43 +00:00
ebanks f0fc34bb8e Bug fix: N's are allowed in the ref so don't fail when e.g. dbsnp has an N!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3620 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 17:49:14 +00:00
ebanks b6bceb39b0 Fixing up output for performance tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3619 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 17:00:17 +00:00
chartl 75d4736600 Committing changes to comp overlap for indels. Passes all integration tests; minor changes to MVC walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3618 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 15:49:13 +00:00
ebanks 9b8775180e Turn on the memory improvement by default (assume the target interval list is sorted, since it is 99.9% of the time). Make the user throw a flag when it's specifically not sorted.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3617 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 15:44:55 +00:00
hanna 003dd4de3e Rev Picard with performance enhancements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3615 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 22:54:23 +00:00
aaron 0cafd3d642 clip VCF alleles for indels: only a single left base, and as many right bases as align before converting to variant context.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3614 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 22:42:38 +00:00
aaron 9872b65803 clip to the null allele on the reference string in VCF 4, instead of stopping to preserve one reference base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3613 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 20:52:19 +00:00
ebanks b5df2705c9 -Remove Nway output option
-Remove in-memory sorting
-Default to name-sorting (although we allow coordinate sorting with the --sortInCoordinateOrderEvenThoughItIsHighlyUnsafe flag).

Cleaner, faster code.  Wiki has been updated (including how to use FixMateInformation.jar from Picard).  More changes coming soon.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3612 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 20:31:55 +00:00
kshakir 30cf78fdc0 Refactoring for a first version of scatter gather api with basic shell script implementations.
Modified build script so that queue is cleaned during "ant clean".



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3611 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 18:39:20 +00:00
aaron a6d3e4bd47 Add code to allow reference alleles with 'N' in VariantContext, but not in the alternate allele(s). Also more updates to the VCF 4 code (fixed parsing for files without genotypes).
This check-in will temporarily break the build (I need to see if Bamboo is correctly returning the log file for the failed builds).

Will be fixed once Bamboo starts building.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3609 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 18:26:37 +00:00
ebanks 824c2bbac0 Finishing previous checkin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3608 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 17:21:38 +00:00
ebanks 4727bcda24 Removing Beagle output from UG. Use ProduceBeagleInput walker instead (since it can be run post-filtration and respects the FILTER column).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3607 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 16:56:37 +00:00
aaron 32f324a009 incremental changes to the VCF4 codec, including allele clipping down to the minimum reference allele; adding unit testing for certain aspects of the parsing. Not ready for prime-time yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3604 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 06:31:05 +00:00
bthomas de9f1f575f Fixing command line parsing to accept negative number arguments. Command line definitions must now start with a letter or underscore; previously, they could start with a digit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3603 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 21:54:31 +00:00
bthomas 9d6a341d15 Fixing the error messages thrown with bad interval arguments. I simplified the exception handling and made the messages more verbose.
Note: the -L argument takes both interval strings and filenames. If you specify an interval string that is also a file, an error will be thrown asking you to move the file: i.e. if you have a file "chr1" in the parent directory, GATK will ask you to move/delete it. But this only happens with interval string arguments, NOT with intervals that are contained in files, which is the majority of use cases.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3602 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 21:49:41 +00:00
bthomas 300a18b85f Updating the way reference data is processed, so GATK creates the .fasta.fai and .dict files automatically. If either (or both) don't exist, GATK will create them in the same folder as the fasta file. If it can't write the file, GATK will fail with a message to create them manually.
Note that this functionality will only work if the directory with the fasta is writeable. GATK will fail if the directory is read only and either the .fasta.fai or .dict files don't exist. In the future, we could have these references be created in memory, but we decided against it this time.

Locking was also added to ReferenceDataSource so no issues come up while running multiple GATKs on the same reference: we don't want one process to be half-finished and another try to read it. So, you could see error messages related to locking. See ReferenceDataSource.java for explanation of the locking strategy. 



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3601 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 21:42:42 +00:00
ebanks df1cadc4c9 Fix NullPointerException when priority list is left out
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3600 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 13:46:54 +00:00
hanna c806ffba5f Switching over DownsamplingLocusIteratorByState -> LocusIteratorByState. Some operations
will not be as fast as they could be because the workflow is currently merge sam records (sharding)
-> split sam records (LocusIteratorByState) -> merge records (LocusIteratorByState) -> split
records (StratifiedAlignmentContext), but this will be fixed when StratifiedAlignmentContext
is updated to take advantage of the new functionality in ReadBackedPileup.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3599 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 02:11:42 +00:00
hanna 1d50fc7087 Misc bug fixes: fix tracking of nInsertions with sample-split pileup constructor. Fix performance
issue building up pileups from pileups of individual sample data.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3598 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-20 20:32:27 +00:00
hanna f18ac069e2 A refactoring / unification of ReadBackedPileup and ReadBackedExtendedEventPileup.
Provides a cleaner interface with extended events inheriting all of the basic RBP
functionality.  Implementation is still slightly messy, but should allow users to
provide separate implementations of methods for sample split pileups and unsplit
pileups for efficiency's sake.
Methods not covered by unit/integration tests have not been sufficiently tested yet.
Unit tests will follow this week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3597 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-20 04:42:26 +00:00
depristo 57a13805da GATK now uses an optimized indexing scheme in Tribble. 5x or more performance gain on files with many genotypes. Updated integration test that was failing and was clearly wrong. DB=; isn't a valid annotation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3596 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-19 21:36:41 +00:00
kiran 8ff93f77e6 Added evaluation module to count functional classes (missense, nonsense, etc.). At the moment, it only understands Cancer's MAF annotations. Added integration test for the functional class counting. Added better description for VariantEval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3595 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 21:51:40 +00:00
ebanks 1e06d2bf68 Initial HLA Caller integration tests. Kind of painful, but will improve with code refactoring.
This baby is now officially ours.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3593 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 20:35:27 +00:00
chartl f44d8b150f Mendelian Violation Classifier now filters violations on the fly via command line arguments; and closes unterminated homozygous regions at the end of a chromosome (so we see arms falling off in the file, rather than in the log)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3592 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 19:32:24 +00:00
ebanks aa1852575e Add -noVerbose flag to stop output of INFO data.
Cuts runtime by 30% and output from 65Mb to 1Kb.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3591 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 18:53:35 +00:00
rpoplin 724affc3cc Major bug fixes for the Variant Recalibrator. Covariance matrix values are now allowed to be negative. When probabilities are multiplied together the calculation is done in log space, normalized, then converted back to real valued probabilities. Clustering weights have been changed to only use HapMap and by-1000genomes sites. The -nI argument was removed and now clustering simply runs until convergence. Test cases seem to work best when using just two annotations (QD and SB). More changes are in the works and are being evaluated. Misc fixes to walkers that use RScript due to CentOS changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3590 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 17:37:11 +00:00
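The log-space step described in 724affc3cc (multiply in log space, normalize, convert back) can be sketched as follows. This is only an illustration of the general technique, not the Variant Recalibrator's actual code; the class and method names are hypothetical, and the input is assumed to contain at least one finite log value.

```java
public class LogSpaceNormalize {
    /**
     * Hypothetical helper: takes unnormalized log-probabilities and returns
     * real-valued probabilities that sum to 1. Subtracting the maximum before
     * exponentiating keeps the largest term at exp(0) = 1, so the values do
     * not all underflow to zero.
     */
    static double[] normalizeFromLog(double[] logValues) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logValues) {
            max = Math.max(max, v);
        }
        double[] out = new double[logValues.length];
        double sum = 0.0;
        for (int i = 0; i < logValues.length; i++) {
            out[i] = Math.exp(logValues[i] - max); // shifted, safe to exponentiate
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum; // normalize so the result sums to 1
        }
        return out;
    }

    public static void main(String[] args) {
        // Products of many small probabilities become sums of logs.
        double[] logs = { -1000.0, -1000.0 + Math.log(3.0) };
        double[] probs = normalizeFromLog(logs);
        System.out.println(probs[0] + " " + probs[1]);
    }
}
```

Exponentiating the raw values directly would give 0.0 for both entries in this example, which is why the shift by the maximum comes first.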
aaron c3434493b0 fixed integration test for VCF Header changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3589 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 16:31:48 +00:00
hanna 52477bd9e6 Add some missing methods to the pileup architecture.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3588 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 15:03:08 +00:00
hanna 5050b19457 We're unable to make the naive deduper more worldly, so we're killing it instead.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3587 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 13:54:27 +00:00
aaron 42e7ff4f28 forgot to update a test, the md5sum of the underlying file changed (which is recorded in the ROD tests).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3586 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 13:27:56 +00:00
aaron b978d5946b adding changes for VCF 4, mostly in the way we handle VCF headers. The header fields are now aware of the differences between different VCF formats. There was also a bunch of clean-up of out-of-spec VCF used in the tests (mismatched VCF file format fields, etc), and updates to the associated integration tests. Also some logging statements for BTI.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3584 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 08:23:23 +00:00
weisburd e26a273ef5 Turned the test back on
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3582 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 22:57:42 +00:00
hanna 48cbc5ce37 Merging the sharding-specific inherited classes down into the base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3581 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 22:36:13 +00:00
hanna 612c3fdd9d First pass at eliminating the old sharding system. Classes required for the original sharding system
are gone where I could identify them, but hierarchies that split to support two sharding systems have
not yet been taken apart.
@Eric: ~4k lines.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 20:17:31 +00:00
delangel b694ca9633 Cleanup: Don't require likelihood ROD in Beagle parameters when generating output VCF. Likelihoods file is only an input to Beagle but the Walker that generates a VCF doesn't need it, so it's silly to ask for it and it's error-prone.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3579 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 17:45:48 +00:00
hanna c1595a383a More bugfixes for cases where no sample name is present.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3578 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 16:46:02 +00:00
aaron 3d049204ed some refactoring for the variant eval output system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3576 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 05:34:31 +00:00
hanna db1383d0b2 Rev the latest version of Picard.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3575 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 23:55:07 +00:00
weisburd 5b370ffc62 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3574 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 20:42:58 +00:00
hanna 5972ad1199 Fixes to mrl integration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3573 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 20:40:10 +00:00
ebanks b75ded61b8 Removing obsolete rod; no longer needed given previous addition to SampleUtils.
JIRA GSA-318


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3572 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 20:03:14 +00:00
kshakir c671864228 Re-allowing blacklist by read group id.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3571 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 19:45:44 +00:00
ebanks f003703912 Allow specification of particular rods for pulling out sample names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3570 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 19:37:09 +00:00
ebanks 01ffa307c2 When going NWay out in the cleaner, use the new *merged* header (instead of the original one) for each bam file so that it matches the new uniquified read group ids in the reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3569 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 19:36:36 +00:00
kshakir 05c2f96bb4 Small update to the command line docs for read_group_black_list.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3568 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 19:23:34 +00:00
ebanks d7f3102c3f Fixed read group blacklist filter to look only at readgroups (and not the reads themselves). Otherwise, it fails when attribute tags with different meanings show up in both places (e.g. SM). Added performance improvement.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3567 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 19:14:37 +00:00
hanna e77f76f8e1 Reenabled downsampling by sample after basic sanity testing and fixes of the
new implementation.  Hard testing and performance enhancements are still
pending.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3566 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 17:23:27 +00:00
kshakir c44fd05aa1 Fix for a reflection issue with generic types.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3565 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 15:58:38 +00:00
ebanks 7a91dbd490 Renamed some of the column names in Ti/Tv and Concordance modules so that they are clearer. Removed ValidationRate module (it was busted).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3564 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 15:53:06 +00:00
delangel 8cb16a1d45 a) Cleanup, remove -input argument from BeagleOutputToVCFWalker since it's not needed.
b) Added back old Beagle ROD to maintain backward compatibility (does anyone even use this???)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3563 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 02:13:08 +00:00
delangel d319a28be7 Complete rewrite of the Beagle functionality to read from Beagle output files and produce VCF with modified genotypes. Now, a new ROD system using Tribble is in place. Beagle inputs are set using -B beagleType,Beagle,pathToBeagleFile, where beagleType can be either beagleR2, beagleLike, beaglePhased or beagleR2 (BeagleOutputToVCFWalker requires all of the above). Only pending items: -input argument is now unused and can be removed, will be cleaned later. Wiki will be updated with new usage shortly.
We can now run with a reduced memory footprint, and output VCF is exactly identical to previous version. Drawback is increased runtime because Tribble has to create an index for all the Beagle files when starting if the idx files are missing.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3562 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 02:01:35 +00:00
aaron d265397bf6 removing a reference to a unused internal Sun class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3560 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 15:27:57 +00:00
asivache 42b8a8f295 slight change in output format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3559 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 14:52:04 +00:00
kshakir 32fc221ffe Replaced pattern matched pipeline spec with annotated objects.
Old version is no longer available.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3558 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 04:43:46 +00:00
sjia b99a5e06f3 Added option to only consider alleles of > specific allele frequency.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3557 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 02:09:35 +00:00
hanna 8a895f481f Proper exception chaining for troubleshooting Sendu's issue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3556 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 01:38:36 +00:00
sjia 8defb30796 Documentation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3555 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 21:31:01 +00:00
weisburd c1046653a2 Fixed handling of records where gene-names are identical (eg. as in refseq NR_030638 in chr20)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3554 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 20:00:49 +00:00
weisburd 1e42984a16 Improved buffer-size arg handling
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3553 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 19:59:15 +00:00
sjia b3c3023c3c Allows callers to handle HLA reference files as input (rather than hard-coded paths)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3552 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 18:56:08 +00:00
asivache 9666d47d17 ooops, debug print now removed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3550 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 18:07:12 +00:00
sjia abdc8521ea Added debug options for FindClosestHLAWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3549 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 17:52:03 +00:00
sjia c38390eabb Added option for min number of matches between reads and alleles required to consider reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3548 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 16:08:49 +00:00
asivache 4ab1f440c3 A new argument: --targetIntervalsSorted (boolean flag). If specified, the interval file is assumed to be sorted (duh!) and it is NOT slurped into the memory but instead traversed directly on disk as needed. If the file turns out to be unsorted, an exception will be thrown at the point where inconsistency occurs (can be late into the processing!).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3547 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 16:00:22 +00:00
asivache 671ac00748 A simple utility class that implements a merging Iterator<GenomeLoc> built over an interval or bed file (this is NOT a rod, but rather a direct line-by-line file reader that converts strings to genome locs on the fly and merges overlapping intervals)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3546 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 15:54:37 +00:00
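The merging iterator described in 671ac00748 can be sketched roughly as below. This is a simplified illustration, not the committed class: intervals are plain {start, stop} arrays on a single contig rather than GenomeLoc objects, the input is assumed to be sorted by start, and the class name is hypothetical. Java 8 or later is assumed for the default Iterator.remove().

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class MergingIntervalIterator implements Iterator<long[]> {
    private final Iterator<long[]> source; // yields {start, stop}, sorted by start
    private long[] pending;                // next unmerged interval, or null when done

    MergingIntervalIterator(Iterator<long[]> sortedSource) {
        this.source = sortedSource;
        this.pending = source.hasNext() ? source.next().clone() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public long[] next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        long[] current = pending;
        long lastStart = current[0];
        pending = null;
        while (source.hasNext()) {
            long[] candidate = source.next();
            if (candidate[0] < lastStart) {
                // Sortedness is only checked lazily, as elements are consumed.
                throw new IllegalStateException("input intervals are not sorted");
            }
            lastStart = candidate[0];
            if (candidate[0] <= current[1]) {
                current[1] = Math.max(current[1], candidate[1]); // overlap: extend
            } else {
                pending = candidate.clone(); // gap: keep for the next call
                break;
            }
        }
        return current;
    }
}
```

Because the source is read one element at a time, nothing is held in memory beyond the current and pending intervals, and an unsorted input is only detected at the point where the inconsistency is reached.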
asivache f137bf8f85 now adaptor silently skips empty lines in the underlying string iterator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3545 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 15:35:07 +00:00
sjia d8c963c91c Remove PhaselikelihoodsWalker.java
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3544 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 15:21:43 +00:00
sjia 5704294f9d HLA caller updated - now searches all (common and rare) alleles, more efficient read filtering and allele comparison runs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3543 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 15:14:40 +00:00
asivache d51e6c45a7 a utility class; turns string iterator into GenomeLoc iterator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3542 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 14:07:44 +00:00
asivache 7b7d3341f0 trivial refactoring: isFile renamed to isIntervalFile and made public
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3541 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 14:02:23 +00:00
hanna c3b68cc58d Rethinking DownsamplingLocusIteratorByState with a flattened read structure. Samples are kept
independent while processing, and only merged back in a priority queue if necessary in a special
variant of the ReadBackedPileup.  This code is not live yet except in the case of naive deduping.
Downsampling by sample temporarily disabled, and the ReadBackedPileup variant is sketchy and
not well integrated with StratifiedAlignmentContext or the walkers.  Cleanup to follow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3540 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-13 01:47:02 +00:00
kiran 804facb0cc Removing these utilities as part of a hostage negotiation with Matt. Can I have my journal club paper now?!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3539 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 21:41:29 +00:00
asivache e6d8faf293 making 'parseLocation' public static - as simple as the logic is, it's better kept in one place and I need it!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3537 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 18:19:59 +00:00
ebanks 8c28be5933 Fixing a VCF bug for Sendu: we weren't emitting flags (booleans) correctly in VCF3.3 (rev'ed tribble for this).
Updated dbsnp/hapmap membership info fields to be flags now instead of ints.
While I was there, I added the change in the Annotator for Jan to force reads to be from a specific sample.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3536 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 16:42:06 +00:00
ebanks 22620ba95c Adding "abi_solid" to the list of known platforms.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3534 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 13:37:19 +00:00
ebanks 63ad71cca6 Fix busted code. Note for all:
String.valueOf(byte[]) doesn't work.  You must use new String(byte[]).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3533 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 05:01:48 +00:00
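The pitfall noted in 63ad71cca6 comes from overload resolution: String has no valueOf(byte[]) overload, so a byte[] argument binds to String.valueOf(Object) and yields the array's identity string rather than its contents. A minimal sketch (the class name and the choice of US-ASCII are assumptions for illustration):

```java
import java.nio.charset.StandardCharsets;

public class ByteArrayToString {
    /** Decodes bases stored as bytes; the charset is given explicitly. */
    static String decode(byte[] bases) {
        return new String(bases, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bases = { 'A', 'C', 'G', 'T' };

        // Resolves to String.valueOf(Object): prints something like "[B@1b6d3586".
        System.out.println(String.valueOf(bases));

        // Decodes the actual contents.
        System.out.println(decode(bases));
    }
}
```

Passing a charset avoids depending on the platform default, which new String(byte[]) alone would use.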
weisburd 338bb9adf4 CommandLineProgram for measuring java I/O speeds for large plain-text or gzipped files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3532 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 21:34:37 +00:00
weisburd 06fc5eecf8 Implemented TreeReducible - if num threads > 1, the output will be accumulated in memory and written to a vcf file at the end - in onTraversalDone(..). If num threads == 1, things will work as before - where vcf records are written to disk as soon as they are computed with map(..).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3530 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:57:23 +00:00
weisburd 3b375cb237 Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..) - attempt 2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3529 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:54:36 +00:00
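The idea behind 3b375cb237 and adc4c4e577, replacing a regular expression with String.indexOf(..) when parsing interval strings, might look like the sketch below. It is not the actual parseGenomeLoc(..) code: the class, the Interval holder, and the three accepted forms ("contig", "contig:pos", "contig:start-stop") are assumptions made for illustration.

```java
public class IntervalStringParser {
    /** Hypothetical holder for a parsed interval. */
    static final class Interval {
        final String contig;
        final long start;
        final long stop;

        Interval(String contig, long start, long stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }
    }

    static Interval parse(String s) {
        int colon = s.indexOf(':');
        if (colon < 0) {
            // Bare contig name: treated here as the whole contig.
            return new Interval(s, 1, Long.MAX_VALUE);
        }
        String contig = s.substring(0, colon);
        int dash = s.indexOf('-', colon + 1);
        if (dash < 0) {
            // Single position.
            long pos = Long.parseLong(s.substring(colon + 1));
            return new Interval(contig, pos, pos);
        }
        long start = Long.parseLong(s.substring(colon + 1, dash));
        long stop = Long.parseLong(s.substring(dash + 1));
        return new Interval(contig, start, stop);
    }

    public static void main(String[] args) {
        Interval loc = parse("chr1:100-200");
        System.out.println(loc.contig + " " + loc.start + " " + loc.stop);
    }
}
```

Two indexOf scans and a few substring calls avoid compiling and running a pattern matcher for every record, which matters when the parser is called once per line of a large file.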
bthomas 99b684ea89 Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:10:23 +00:00
hanna f55f32d4ee Bug fix.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3526 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 01:53:26 +00:00
ebanks ca4eab1d23 Now annotations that require reads return null if there's no alignment context, so that running without reads adds annotations only for the appropriate fields.
Added an integration test for the read-less case.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3525 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 20:36:46 +00:00
aaron 6941c81bfa reverting revision 3522 to the old code until we fix the tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3524 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 19:25:02 +00:00
hanna dbee21a50f Bugfixes for the case when no read groups / no samples are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3523 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 18:47:05 +00:00
weisburd adc4c4e577 Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3522 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 18:11:43 +00:00
chartl 20167fd411 Final changes to MVC -- associates variants with regions of homozygosity in child and parents, corrects for genotype errors, and prints out a separate file with information for each region of homozygosity.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3521 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 18:05:37 +00:00
weisburd fdded73861 Improved error reporting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3520 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:52:48 +00:00
aaron 4f00e265a8 quick update for a change I implemented for Ryan
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3519 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:23:31 +00:00
aaron ad98512f6c adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:21:12 +00:00
weisburd c1b7bcc786 Fixed handling of mitochondrial genes - added special cases such as ATT being a start codon in mitochondria. Added warning if a gene doesn't start with Met or end in a stop codon
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3517 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:15:47 +00:00
weisburd 4f1181974b Added toString() method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3516 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:12:57 +00:00
ebanks 9b2fcc4711 Refactoring of the annotation system:
1. VA is now a ROD walker so it no longer requires reads (needs a little more testing)
2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs)
3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong.  Fixed the headers too.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:05:51 +00:00
hanna 84563b37e5 Partial flattening of the hanger data structure. Hanger data structure is
not currently as flat as it could / should be, but it's already comparable
to the speed of the reference implementation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3512 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 16:28:49 +00:00
chartl 8f9e3e8ad7 Commit for Kiran; but this is now working, barring little exceptions that I've yet to run across...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3511 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 14:21:19 +00:00
aaron 6d5556939d updating Tribble with a couple of important Tabix fixes, and updating the variant eval integration tests to run each test with both plain vcf and gzipped tabix (added the tabix version
to the validation directory), using the same md5sum.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3509 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 01:47:04 +00:00
hanna c2858c8988 Minor performance enhancement. Checkpoint commit before major performance
overhaul.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3504 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 21:39:39 +00:00
chartl 5ed2818ffb Forgot to commit code I relied upon
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3503 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 21:01:35 +00:00
chartl 736098b58d A quick commit before running home. This is a re-factored version of the OppositeHomozygoteClassifier which will work with deNovo violations as well. Some code still needs to be migrated from OHC which is why that walker isn't yet deleted. This'll be up and running tonight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3502 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 20:47:01 +00:00
delangel de134c226d Removed ability of users to specify annotations to recompute, cleanups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3501 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 19:17:59 +00:00
ebanks 4d1a6b3d99 quick changes for G
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3500 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 16:33:27 +00:00
delangel 907931c902 a) Update annotations when creating new vcf with Beagle's imputed data. Since genotypes may (will) change based on imputation, several annotations need to be updated. By default, AC, AF, AN and AB will be updated. User can force extra annotations to be updated with -A <annotation> argument.
b) Several cleanups and beautifications.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3499 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 15:12:04 +00:00
chartl 933133ee28 Initial commit of the opposite homozygote classifier. Currently does the following, given a trio vcf:
+ Identifies opposite homozygote sites
 + Identifies the parent from whom it is expected that a null allele was inherited (or whether it was a putative genotype error; e.g. mom=homref, dad=homref, child=homvar)
 + Labels each opposite homozygote with its homozygous region in the child (e.g. region 1, region 2)
 + Labels each opposite homozygote with the size of the homozygous region in which it was found, the number of child homozygotes in the region, and the number of opposite homozygote violations within that region

To come:
 + Classification of sites as likely tri-allelic


Note that this is very experimental



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3498 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 03:56:07 +00:00
hanna 199e4208cd Bug fixes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3497 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 00:30:33 +00:00
hanna 52ab9f2417 Feature parity between LocusIteratorByState, DownsamplingLocusIteratorByState, including pushing mrl /
the LocusOverflowTracker into LocusIteratorByState.  Note that the 'Matt Hanna exception', is still enabled
because I haven't yet validated the performance of the DownsamplingLocusIteratorByState when running
without downsampling.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3496 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 22:58:21 +00:00
hanna 5c4d070566 Push Mark's changes in LocusIteratorByState into DownsamplingLocusIteratorByState
in preparation for merging the two into one.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3495 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 17:29:30 +00:00
depristo 6eeb1693ca JEXL2 upgrade. Improvements to JEXL processing including dynamically resolving variable -> value bindings instead of up front adding them to a map. Performance improvements and code cleanup throughout.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3494 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 00:33:02 +00:00
hanna c1ecf75dd5 Update to the latest rev of the picard sharding patch. Includes updates reflecting
the imminent move of IlluminaUtil into picard public.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3493 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-06 20:33:21 +00:00
delangel c503f01dcf More cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3492 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-06 17:41:38 +00:00
delangel d4c66d6191 a) Small cleanup
b) Fix major issue with Beagle likelihood converter: if likelihood triplets from UG end up being too low, then Beagle input file will be produced with 0.00,0.00,0.00 triplet. If all samples at a marker have this issue, Beagle will effectively produce junk. To fix, likelihoods are renormalized before converting to linear space.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3491 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-06 17:31:59 +00:00
depristo cfa18f6743 Fixing missed update with new Allele in it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3490 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 23:56:34 +00:00
depristo 3ea506fe52 No more new Allele() -- must use create. All simple alleles are now cached for efficiency reasons. VCF4 codec optimizations -- 4x performance in general. Now working in general but hooked up to the ROD system now as VCF4. WARNING -- does not actually work with indels, genotype filters, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3489 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 23:03:55 +00:00
delangel ef47a69c50 a) First fully functional (sort of) version of walker that parses Beagle imputation output files and produces a vcf with imputed genotypes.
More doc/info to follow shortly. Issues still to be solved:
a) Walker changes all genotypes based on Beagle data, but annotations on the original VCF are unchanged. They should in theory be recomputed based on new genotypes.
b) Current implementation is ugly, dirty unwieldy and will necessitate a refactoring soon so I can keep my pride. Most aesthetically affronting issue right now is that we read the full Beagle files at initialization and keep them in memory, but a more delicate implementation would just read from files on a marker by marker basis. Issue that currently prevents this is that BufferedReader() instances don't seem to play nice when called from the map() function.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3488 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 20:37:25 +00:00
depristo b811e61ae1 Optimized, nearly complete VCF4 reader 2-4x faster than the previous implementation, along with a VCF4 reader performance testing walker that can read 3/4 files, useful for benchmarking
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3487 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 18:11:38 +00:00
aaron 6482b87741 adding the super experimental, half-broken, generally crippled, awkwardly commented, header ignoring vcf4 code. Don't use this, unless you're a developer for VCF4. If so, remove the exception from the constructor so that it won't always exception out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3486 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 07:38:46 +00:00
aaron 0b03e28b60 updating the tribble library to include the reference dictionary reading / writing. We now check the dictionaries of any tracks that have them against the reference (all new tribble tracks and out-of-date tracks will have this). Also renamed some classes to be more reflective of their function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3485 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 06:34:26 +00:00
hanna 3d055e3d16 Fail fast if users try to parallelize a read walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3484 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-03 18:14:33 +00:00
hanna 7d79848f40 Better error message when bam file / list file with wrong extension is
supplied.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3483 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-03 17:52:48 +00:00
ebanks 597b3744ab Always use phasing info when converting genotypes to strings
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3482 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-03 17:50:50 +00:00
depristo e2b41082af GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integrationtest data. Note that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 22:26:32 +00:00
aaron 8ec091d6d2 re-enabling regeneration of the tribble index if it's out of date. Also moved the class that can detect text in the log4j stream (useful in testing to make sure appropriate messages are generated).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3480 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 17:45:51 +00:00
asivache f0c379dde8 Inconsequential changes in report formatting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3479 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 17:43:25 +00:00
weisburd 3ab936181c Supports the join feature of GenomicAnnotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3478 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 16:29:57 +00:00
weisburd f5f7217413 Implemented joins
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3477 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 16:28:53 +00:00
weisburd 09c3b15af3 Implemented joins
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3476 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 16:28:06 +00:00
weisburd e14ae471a0 Refactored some of the small utility methods
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3475 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 16:26:00 +00:00
weisburd 898a78e97d Added toString()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3474 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 16:24:25 +00:00
weisburd 12c3e3ecda Added back the check for values.size() != header.size(). Now exception will be thrown if number of columns in a record doesn't equal number of columns in the header
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3473 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 16:23:05 +00:00
rpoplin 290771a8c2 Automatic cutting of recalibrated variant calls using ApplyVariantCuts. VariantRecalibrator produces the tranches plot alongside the optimization curve. Specify the levels using -tranche 1.0 -tranche 5.0 etc
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3472 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 15:03:00 +00:00
ebanks 4a555827aa Removing more toUpperCase sanity checks
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3471 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 14:38:39 +00:00
ebanks 56e504789a trivial change: toUpperCase no longer necessary
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3470 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 14:00:47 +00:00
rpoplin 87fe60fe4f Fix for Sendu. new Process and p.waitFor() don't seem to work on his farm. Throws an IOException. This was a problem way back with AnalyzeCovariates too.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3469 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 11:37:10 +00:00
ebanks 7f0c638653 Fix for the indel cleaner: I forgot to "unclip" the cigar string (even though the clipped bases were removed) before using it as an alternate consensus in a particular instance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3468 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-01 02:07:20 +00:00
depristo 21427211c0 Personal MD5 database system now live. WalkerTest now maintains a database of result files associated with MD5 results in integrationtest/, and provides command lines for diff-ing expected to current md5 results when encountering failed integration tests. The suite currently takes 200Mb to store. Update and run integrationtest to build your very own expectation database for future development work.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3466 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-31 16:06:16 +00:00
depristo 2b02324587 Support for detecting and automatically excluding reads reading into the adaptor sequence and, if desired, also only showing the first pair when two reads overlap in the fragment. Not enabled, an intermediate check in before updating and verifying the impact on locus walkers everywhere.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3465 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-30 18:00:12 +00:00
ebanks eb25e41111 minor update to new tribble name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3462 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 20:23:25 +00:00
ebanks ffeb3fd80d Thanks to Guillermo, I found a bug in the Unified Genotyper output: GL was posteriors instead of likelihoods. Not a huge deal because the
priors were flat, but fixed nonetheless.
Also, needed to update Tribble.
Minor updates to the Beagle input maker.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3461 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 19:28:26 +00:00
rpoplin 4e268ef6ac Removing the Variant Recalibration Performance test because it isn't ready yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3460 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:27:25 +00:00
rpoplin 522dd7a5b2 Adding the variantrecalibration classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3459 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:21:27 +00:00
rpoplin 2014837f8a VariantOptimizer package is moved to core, renamed as VariantRecalibration, and added to the binary release package. VariantOptimizer walker is renamed to GenerateVariantClustersWalker and ApplyVariantClustersWalker renamed to VariantRecalibrator. Integration tests added, performance tests still to be done.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3458 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:20:18 +00:00
aaron 871cf0f4f6 Call out ROD types by their record type, instead of the codec type (which was clumsy). So instead of:
@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFCodec.class))

you'd say:

@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFRecord.class))

Which is more in-line with what was done before.  All instances in the existing codebase should be switched over.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3457 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:52:44 +00:00
depristo cc2bf549c8 Removing my unnecessary optimization. 10 lines later in the code the same optimization was applied. A monumental waste of time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3455 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:10:48 +00:00
aaron a4d834cc01 fixing the test I broke
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3454 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 02:06:20 +00:00
depristo 6485e8383d Trivial change to retrigger broken build that really isn't broken
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3453 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 23:33:46 +00:00
depristo f2e7582cfc Reorganization of SW code for clarity. Total failure at raw optimization. Discovered that ~50% of reads being cleaned were perfect reference matches. New code comes with flag to look at NM field and not clean perfect matches. Can be turned off with command line option (needed for 1KG bams with bad NM fields). Going to rerun cleaning jobs due to accidental rebuilding of stable codebase and loss of 2 days of runtime.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3452 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 23:16:00 +00:00
aaron e1b0aefb29 fix for parallelism bug
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3451 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 22:16:14 +00:00
aaron cded9ec985 adding a command line option, -etd (enable threaded debugging), that uses a custom thread pool class to catch exceptions thrown inside of a thread.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3450 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 21:57:56 +00:00
ebanks e2674671e7 The liftover code needs to *hard filter* records whose reference changes (since they no longer adhere to the VCF spec as they don't match the new reference - and can't be converted to VariantContext).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3448 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 19:22:47 +00:00
chartl ff4a0764df Read error rate is now parallelizable
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3447 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 19:00:09 +00:00
depristo dfc36c1e95 Restructuring of the mandatory read filters for traversals. Now everything uses ReadFilters, even for the required filters like being mapped for LocusWalkers. Statistics now tracked for each read filter used during the traversal and info emitted in INFO at the end.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3445 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 22:12:25 +00:00
delangel 3873dccb35 First fully functional (though preliminary) version of walker that takes an input VCF and outputs a Beagle .bgl file that can be used for missing genotype calls/haplotype imputation. For now, only supported input format is likelihood format for unrelated individuals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3444 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 21:03:23 +00:00
chartl f9efc1248c VariantEvalWalker now takes indels if you throw the -dels flag. IndelLengthHistogram appears to be working properly, it is turned off by default (as it is experimental) but you can turn it on in your own repository.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3443 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 20:03:14 +00:00
ebanks 058441fa39 Trivial renaming of test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3441 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 16:56:42 +00:00
chartl 0265199ce4 First pass at an IndelLengthHistogram module for variant annotator. Off by default. Will be tested shortly (have to commit, so I can check out in another directory, so that compiling won't kill all my jobs running on LSF)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3440 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 15:04:39 +00:00
aaron a2fab07258 fixed the build problem: there were two copies of the AnnotatorInputTable Codec and Feature in two different spots.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3439 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 14:47:15 +00:00
depristo 5928047d8b Optimization of reference window calculation to use bytes not chars and no uppercasing since reference and read bases are always uppercase now. Should remove some ~5% of runtime of UG.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3438 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 14:10:26 +00:00
chartl 88a06ad81f Changes to Depth of Coverage:
- For speedup in large number of samples, base counts are done on a per read group level, then
   merged into counts on larger partitions (samples, libraries, etc)
   + passed all integration tests before next item
- Added additional summary item, a coverage threshold. Set by (possibly multiple) -ct flags,
   the summary outputs will have columns for "%_bases_covered_to_X"; both per sample, and
   per sample per interval summary files are affected (thus md5s changed for these)

NOTE:

This is the last revision that will include the per-gene summary files. Once DesignFileGenerator is sufficiently general, and has integration tests, it will be moved to core and the per-gene summary from Depth of Coverage will be retired.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3437 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 03:39:22 +00:00
ebanks 0607f76a15 commenting out this test until I can figure out what the hell is going on with the codecs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3436 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 01:12:10 +00:00
rpoplin 062b316881 Better Exception message when can't find annotation value in variant recalibrator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3434 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 21:15:50 +00:00
rpoplin bf530d23de Variant Recalibrator now makes use of a prior on known/novel status as well as on allele frequency spectrum. The VariantOptimizer walker now clusters with all variants but gives more weight to knowns / hapmap / 1KG / MQ1 sites. The weights are all optional command line arguments. We no longer assign default values to annotations that are malformed. The walkers will crash with exception so as to not cover up potential issues. We only produce titv-less clusters now, and so the titv argument in VO was removed and the WithoutTiTv string that gets added to the cluster file is removed. The wiki is updated to show new example commands.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3433 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 21:08:31 +00:00
ebanks ae6c014884 Fixed UG parallelization bug. Better integration test to catch this in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3432 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 21:03:45 +00:00
ebanks 434e920da9 Oops, forgot to update integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3431 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 20:37:45 +00:00
ebanks 772f558ae0 Massive change to the indel realigner code. We now properly deal with soft-clipped reads. Also, improved left-alignment code.
Small change for Ryan to get hard-clipped reads working for the recalibrator.

PLEASE DO NOT RELEASE THIS WEEK.  I still have some more testing to do and need Mark to run WG jobs.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3430 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 20:04:33 +00:00
aaron f3e2aae570 add experimental support for tabix files (for any of our Tribble rod types), as long as they end in .gz and can be read by the tabix reader.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3429 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 04:44:46 +00:00
weisburd 8db7c97c4d Moved AnnotatorInputTableFeature and Codec to org.broadinstitute.sting.gatk.refdata.features.annotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3427 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-24 14:38:54 +00:00
weisburd 4aa749c709 Moved AnnotatorInputTableFeature and Codec to org.broadinstitute.sting.gatk.refdata.features.annotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3426 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-24 14:38:07 +00:00
weisburd aca3bcb193 Moved AnnotatorInputTableFeature and Codec to org.broadinstitute.sting.gatk.refdata.features.annotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3425 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-24 14:37:17 +00:00
weisburd 64ed770250 Moved AnnotatorInputTableFeature and Codec to org.broadinstitute.sting.gatk.refdata.features.annotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3424 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-24 14:36:28 +00:00
hanna ee3f2eb1d0 Don't output traversal reduce result in the logger. In many cases, the reduce
result is tangential to the product of the analysis and having the logger always
emit it can confuse the output (such as in the new reduceByInterval 
DepthOfCoverage walker).  If users want to emit it, they can choose not to override
onTraversalDone, or override onTraversalDone and write results to the output
stream / logger / whatever their choice.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3422 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-23 22:41:43 +00:00
hanna a40e64e47b A downsampling validator. Compares the generated pileup passed in from the alignment context to the reads,
passed in as a Tribble SAM text feature.  If the generated pileup contains a valid set of reads according to
the downsampling rules, the test passes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3421 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-23 21:49:54 +00:00
delangel a280a0ff0d a) Made HaplotypeScore default annotation. This changed several integration tests, whose MD5 is now updated.
b) Disabled BaseQualRankSumTest, the returned p-values differ wildly from Matlab/R-provided ones, cause TBD.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3419 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 22:25:17 +00:00
hanna b10950c691 Simple performance optimization -- cache the number of reads in the locus hanger.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3417 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 19:26:16 +00:00
delangel 355396109b Bug fix to avoid build failure (class changed under me??)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3416 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 18:48:56 +00:00
delangel 1753d07b02 Added AnnotationByAlleleFrequencyWalker - walker takes an input vcf, a reference vcf and a list of annotations (with the -A argument). For each site present in both VCF's, it outputs the given annotations into the screen as well as allele frequency. Since HapMap vcf reference doesn't include AF in annotations, it computes it from Chromosome, Het and HomVar counts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3415 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 18:31:34 +00:00
chartl 745d7c582f added integration test for intervals with no coverage due to filtering
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3414 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 16:52:42 +00:00
chartl 7fb3f2d3eb Annotator now buffers indel calls (prevents double-output from double-calls to map)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3413 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 16:34:34 +00:00
chartl 4e834b5e35 VFW now uses a ref window and thus is compatible with indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3412 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 15:59:42 +00:00
chartl 88cb93cc3c Changes to Depth of Coverage (added maximum base and mapping quality flags; with new integration tests -- because they use b36, and the other test uses hg18, it's in a different class (integration test system can't change refs on the fly)). Initial change to VariantAnnotator to allow it to see extended event pileups; you currently have to throw the -dels flag; and it's specified as "very experimental". Yet, all the integration tests pass.
Homopolymer Run now does the "right" thing (e.g. single bases are represented as HRun = 0 rather than HRun = 1) for indels. AlleleBalance now does something close enough to correct.

Added a convenience method to VariantContext that will return the indel length (or lengths if a site is not biallelic).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3409 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 13:02:01 +00:00
depristo 6faf101c6c Minor improvements to Callable Loci for public consumption
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3408 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 12:50:11 +00:00
hanna 388dd8d64d Fixing bugs in downsampler introduced when I added Ryan's dup eliminator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3407 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 02:53:12 +00:00
depristo a10fca0d5c Genotyper now is using bytes not chars. Passes all tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3406 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 21:02:44 +00:00
hanna 7389077b3b A few misc usability fixes:
- Clarify the message emitted when -XL is supplied so I don't spend another half day chasing a bug that doesn't exist.  
- Crash with a helpful message when running -nt with non-TreeReducible walkers.
- Crash with a helpful message when running -nt with reduceByInterval walkers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3405 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 19:02:02 +00:00
aaron b543dd4ac4 more aggressive checks for the locking, and some more documentation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3404 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 16:16:36 +00:00
depristo 1ab00e5895 Retiring multi-sample genotyper
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3401 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 14:10:56 +00:00
depristo 727822adb4 BaseUtils has more clear distinction between byte and char routines. All char routines are @Deprecated now. Please use bytes. Better organization of reverse(), now in Utils not BaseUtils.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3400 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 14:05:13 +00:00
depristo 6ce3835622 Removing unused methods in QualityUtils; ReferenceContext now converting all bases to upper case, but can be disabled with static boolean
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3399 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 12:38:06 +00:00
depristo 5abac5c057 A few more char -> byte cleanups
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3398 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 00:02:06 +00:00
depristo 8a725b6c93 Restructuring of ReferenceContext and ReadWalkers to accept a ReferenceContext. Now ReferenceContext is byte[] backed not char[]. Please no more chars for the reference. All of the tests pass now. Coming check-ins are going to clean up the char / byte problems in the GATK
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3397 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 23:27:55 +00:00
aaron 02cc1afdc8 remove RodBed and all its dependencies.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3396 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 19:12:30 +00:00
chartl ffb1b46166 Added a GCCalculatorWalker for a oneoff analysis for Mark Daly (GC content of agilent 1.1 targets)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3395 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 18:49:51 +00:00
aaron 0036df7b03 adding a convenience method for getting at the RODs that overlap a specific location as GATKFeatures.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3394 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 17:40:20 +00:00
aaron ca386439be only emit a warning if the tribble index is out of date, don't remove and replace it for them. Added a test case where the log4j appender checks the logging messages for the appropriate output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3393 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 15:12:48 +00:00
hanna 017ab6b690 Experimental versions of downsampler and Ryan's deduper are now available either
as walker attributes or from the command-line.  Not ready yet!  Downsampling/deduping 
works in a general sense, but this approach has not been completely optimized or validated.
Use with caution.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3392 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 05:40:05 +00:00
weisburd 46ba88018d Updated to the new readHeader(..) api
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3391 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 04:06:34 +00:00
weisburd 984c51efd3 Updated to use Tribble-based GATKFeature instead of TabularROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3390 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 03:42:12 +00:00
weisburd 42ee16f256 Updated to use Tribble-based GATKFeature instead of TabularROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3389 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 03:41:37 +00:00
weisburd d8469e2fba Updated to use Tribble-based GATKFeature instead of TabularROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3388 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 03:40:47 +00:00
weisburd d65b2d32d1 Removed AnnotatorROD which has been ported to Tribble
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3387 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 03:39:34 +00:00
weisburd b82116f488 Removed AnnotatorROD which has been ported to Tribble
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3386 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 03:39:20 +00:00
weisburd 6b96f025f5 Tribble integration for indexing the AnnotatorInputTable format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3385 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 03:37:54 +00:00
weisburd 2f3933148d Added fast split(str, delimiter) method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3384 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 03:37:26 +00:00
hanna aedb9f6734 Bring SAMPileupCodec into compliance with new interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3383 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 01:23:29 +00:00
aaron 7cfb9ff3dc updates for Tribble 82, fixes for Ryan's case where multiple processes would attempt to read/write to the same index, and a couple other Tribble-centric bug fixes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3382 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 19:34:45 +00:00
chartl 635f61c22d Clone the other guy too
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3381 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 18:56:01 +00:00
rpoplin 9e15299475 Misc cleanup in variant recalibrator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3380 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 17:37:01 +00:00
chartl eb200e4cce Hrumph. Don't just add pointers to the same objects, actually clone the underlying arrays.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3379 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 17:13:44 +00:00
chartl e016491a3d Major refactoring of Depth of Coverage to allow for more extensible partitions of data (now can do read group, sample, and library; in any combination; adding more is fairly easy). Changed the by-gene code to use clones of stats objects, rather than munging the interval DoCs. (Fix for Avinash. Who, hilariously, thinks my name is Carl.) Added sorting methods to ensure static ordering of header and body fields.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3377 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 16:58:13 +00:00
weisburd 3c022e4b0c Improved command-line-arg validation at startup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3374 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 02:46:17 +00:00
weisburd 35b4bba35e Refactored so it could be used for knownGene and CCDS as well as refGene
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3372 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 02:44:10 +00:00
weisburd bb86c0e03a Improved error message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3371 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 02:43:13 +00:00
weisburd 68719615be For multiple matches, shifted counter to be 1-based
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3370 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 02:41:50 +00:00
hanna 73e2e32837 Fix typo.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3369 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-17 21:04:00 +00:00
chartl ebd0fabf86 First pass updates to annotations to work with indels. HomopolymerRun indel behavior is currently turned off by a global boolean until it's ready to go live.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3368 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-17 21:02:13 +00:00
hanna 0791beab8f Checking in downsampling iterator alongside LocusIteratorByState, and removing
the reference implementation.  Also implemented a heap size monitor that can
be used to programmatically report the current heap size.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3367 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-17 21:00:44 +00:00
chartl b7d21627ab Changes to DepthOfCoverage (JIRA items) and added back an integration test to cover it. Alterations to the design file generator to output all transcripts (rather than choosing one at random).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3366 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-17 17:23:00 +00:00
kiran 4235164359 Removed the confusionMatrix column (of *course* this is a confusion matrix... what else would it be?!).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3365 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-14 21:55:37 +00:00
kiran 95b29f608b Specify default values.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3364 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-14 21:42:53 +00:00
rpoplin 6efd05831b Encapsulating annotation decoding function in order to use same fixed random seed in both VariantOptimizer and ApplyVariantClusters
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3363 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-14 20:03:38 +00:00
ebanks 32389dc0a9 Fixed GQ estimate when chosen genotype isn't the most likely according to the GLs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3362 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-14 19:17:46 +00:00
depristo 1538dc0144 optimizer now uses -an arguments instead of exclude and force for clarity. command-line length reduced by 50%
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3361 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-14 15:41:44 +00:00
hanna 88bd7a2045 Reenabling UG parallelization performance tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3360 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-13 16:28:08 +00:00
hanna 0490909285 Fixed epic generic paths fail.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3359 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-13 15:59:57 +00:00
hanna 7ef87e5126 An integration test based on validating pileup to test parallelism in reads, reference, and RODs. This test runs in less
than a minute and fell over instantly in the case of the Tribble parallelism issue.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3358 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-13 15:40:43 +00:00
hanna ceec525420 Got rid of stray unicode characters in copyright message.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3357 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-13 14:47:39 +00:00
hanna 3e9ad4bbd0 Porting SAM pileup ROD to Tribble as a case study.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3356 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-13 00:22:59 +00:00
aaron 6839c194cb although holding on to memories can be fun, it's bound to hurt performance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3355 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-12 19:26:58 +00:00
ebanks c81b910f73 Commenting out the parallelization test which is failing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3354 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-12 18:39:53 +00:00
aaron cac98ba5ef a couple of small documentation fixes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3353 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-12 17:40:27 +00:00
depristo 3f07611187 Added support for -nSamples to varianteval (and getNSamplesForEval function). Allows you to calculate AC based metrics for files without genotypes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3350 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-12 13:36:31 +00:00
aaron 2c55ac1374 fixes for parallel processing problems with Tribble, a small bug in the resource pool, and some more documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3349 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-12 06:13:26 +00:00
hanna 6868ce988f Fix hanging bug reported by Susanne Pfeifer (tiffy @ get satisfaction) where, if the last read(s) in a shard all have an
indel in roughly the same location and that indel isn't covered by any other reads, LocusIteratorByState goes into an infinite
loop.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3348 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-11 17:31:19 +00:00
ebanks 34969f304c Adding dbsnp to all UG performance tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3347 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-11 15:48:05 +00:00
ebanks 140e43b93b Checking in to see whether it fails. If I start getting bombarded with Bamboo error reports, I'm commenting it out...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3346 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-11 15:39:42 +00:00
ebanks 572b383fe2 Make VA annotate dbsnp again
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3345 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-11 14:06:53 +00:00
rpoplin b09e7231d1 A quick implementation of the experimental covariates for the TGen folks to work with.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3344 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-11 01:08:52 +00:00
kiran aec5f7b630 Can now threshold results based on minimum base and/or mapping quality.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3343 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 19:58:07 +00:00
kiran 13fd182b7c For dealing with slightly malformatted BAMs - mark every alignment as primary, or in the case of some BAM files from UWash, supply the sample information for each read group.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3340 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 15:17:05 +00:00
kiran 4a7902bb8e Bases 'A' and 'a' (etc.) no longer considered different.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3339 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 14:53:38 +00:00
kiran ec543b7b62 The Complete Genomics confusion matrix rates.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3338 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 14:52:10 +00:00
kiran b223b04331 Don't list '.' as an alternate allele, dummy!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3337 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 14:51:18 +00:00
kiran 98718d0faa Computes the error rate per cycle
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3336 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 14:50:22 +00:00
kiran 7527f950d1 Computes the quality score distribution per readgroup (one column per readgroup)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3335 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 14:49:38 +00:00
kiran c111c15072 Computes the distribution of insert size per library (for now, one output file per library)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3334 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 14:48:35 +00:00
ebanks a51bd57566 First version of the smart batch merging tool.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3333 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-10 02:18:48 +00:00
rpoplin 33a9549896 Variant Optimizer accepts a dbSNP rod argument to use in determining known/novel status as opposed to using the rsID in the vcf record. VO generates plots of annotation values used in clustering broken out by knowns and novels. Useful for showing which annotations are approximately Gaussian.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3332 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-09 16:48:07 +00:00
hanna 76efa757f0 Switched over to reviewed version of Picard patch. In process, did some optimization to the IntervalSharder
which improved startup time 5-10x when dynamically merging many BAMs.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3331 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-08 14:12:22 +00:00
depristo 504103bd15 Misc. additions to correct utilities
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3329 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 21:34:18 +00:00
depristo 64ccaa4c6a Walkers and integration tests that calculate and compare callable bases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3328 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 21:33:47 +00:00
depristo d070554329 A walker that calculates read lengths, number and size of clipping events
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3327 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 21:32:51 +00:00
chartl 1749a49042 Mapping and base quality thresholds for DoC default to none
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3326 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 18:08:13 +00:00
aaron 7d2df3f511 example windowed ROD walker for Kristian, and updates to Tribble
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3325 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 17:12:50 +00:00
rpoplin 57f254b13a VE integration test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3324 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 13:58:25 +00:00
ebanks 44de92e09d Checking in the liftover script. I am including a post-processing walker to filter out bad records written in under 10 minutes as per my agreement with Mark.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3321 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 12:31:56 +00:00
ebanks 18f1d31a22 Moving to and organizing in core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3320 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 04:05:36 +00:00
aaron 06ea65e60b again for JIRA GSA-320
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3319 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 03:47:58 +00:00
aaron ac9b32db88 a bug fix for Kiran; putting JIRA in for better type determination system for the new Tribble tracks so this doesn't happen again.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3318 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 03:31:43 +00:00
hanna 4e0019b04f Repair code that sorts and merges intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3317 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 22:37:25 +00:00
aaron 72e030a670 require that snps be biallelic before we pass them to the TiTv calculation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3316 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 22:33:00 +00:00
rpoplin 7cecec7d00 Removing zero no-calls restriction in AC stats
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3314 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 18:55:07 +00:00
ebanks 0e58fb7cc0 Moved over to be a walker inside the GATK
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3313 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 18:28:03 +00:00
aaron 78409dca0d turned off the progress output from tribble when making an index, and fixing a case where the index file isn't writable so we instead make the index in memory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3312 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 16:36:58 +00:00
ebanks bacc507a48 Don't worry about sorting anymore in the liftover tool. That will come later.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3311 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 15:00:30 +00:00
ebanks 5df0361bd2 trivial removal of unnecessary comments
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3309 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 03:51:14 +00:00
ebanks 2975e3a4e8 picard Intervals don't sort right - switching to GenomeLocs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3308 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-06 03:50:28 +00:00
ebanks 1a99fb9318 First pass at liftover tool. Passing buck over to Aaron...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3306 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 20:38:19 +00:00
aaron a0d71540df speed-up for VCF, adding code to the VCF reader to automagically make an index if one doesn't already exist, and a change to the VCF writer unit test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3305 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 20:19:42 +00:00
aaron 6bbcc47b5d removing some out-of-date RODs and some unused genotype writer formats
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3304 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 19:07:13 +00:00
aaron c998c48a23 adding code to detect out-of-date index files, which we now remove and regenerate if the target file is newer than the index file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3303 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 17:55:36 +00:00
aaron a68f3b2e9c VCF moved over to tribble.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3302 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 17:28:48 +00:00
aaron ad11201235 adding more ROD pile-up tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3301 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 16:01:11 +00:00
asivache 0338345bee Fixing the issue with reads having an insertion immediately followed by an S/H cigar element causing an out-of-window error.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3300 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 15:42:27 +00:00
ebanks 64640d6b17 Complete the switch statement to deal with all possible cigar operators for Kris.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3299 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 13:41:05 +00:00
aaron f75e54e3f7 fixes for new package names in tribble 74
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3298 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 05:47:04 +00:00
chartl 617542853f Walker that can be used with refGene and a TCGA bed file to annotate intervals in an interval list with the genes and exons they overlap.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3296 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 02:55:01 +00:00
chartl 354262eabe New convenience methods to rodRefSeq for dealing with intervals that may be a superset of multiple exons. Needed for next commit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3295 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-05 02:54:18 +00:00
ebanks 03bea70f3a Fixed edge case bug in cleaner: when no -L argument is used and a target interval abuts the end of the reference genome, we'll NullPointer at the first unmapped read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3293 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-04 16:49:21 +00:00
kiran 510b3efcc2 Fixed an issue where asking for the alternate alleles at hom-ref sites would result in an array out-of-bounds exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3292 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-03 18:46:33 +00:00
sjia 94b51de401 HLA caller updated to examine class II loci, updated pointers to dictionary, allele frequencies.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3290 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-03 14:54:52 +00:00
rpoplin 97fdd92e7b Clean up the code to have a unified approach to calculating p(true) for both with and without ti/tv models
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3289 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-03 13:30:20 +00:00
aaron f497213933 DbSNP moved over to tribble
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3288 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-03 06:02:35 +00:00
rpoplin 9d01670f62 Major update to the Variant Optimizer. It now performs clustering for both the titv and titv-less models simultaneously, outputting the cluster files at every iteration. It makes use of the Jama matrix library to do full inverse and determinant calculation for the covariance matrix where before it was using only approximations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3286 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-02 19:21:23 +00:00
weisburd a318b1871d Removed unused column
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3285 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 21:29:34 +00:00
ebanks 9dff578706 Added PG tag to bam header to let people know it's been cleaned.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3284 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 17:30:30 +00:00
ebanks 0e10359a5e Okay, finished up the ability to cap a base's qual by its read's mapping quality.
This is experimental - I have not tested its performance on SNP calling, or even played around with it.  If you want to test it out, go nuts.  But don't come running to me if your results are not good.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3282 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 16:58:30 +00:00
ebanks 850f36aa61 Changes to the Unified Genotyper's arguments:
1. User can specify 4 confidence thresholds: for calling vs. emitting and at standard vs. 'trigger' sites.
2. User can cap the base quality by the read's mapping quality (not done yet).
3. Default confidence threshold is now Q30.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3281 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 16:44:24 +00:00
weisburd 8b2ce128b5 Optimized the join(..) method.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3280 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 15:55:07 +00:00
hanna 8bb15ef812 Checking in the reference implementation of the downsampler for back comparison.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3278 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 15:41:13 +00:00
ebanks 1714c322c2 Reorg of UG args; checking in first before upcoming changes that will break integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3274 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 14:48:46 +00:00
weisburd ba78d146ec Finished implementing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3273 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 14:14:31 +00:00
weisburd 5d5c7f9d34 Changed short code of stop codon to 'stop'
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3272 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 13:55:52 +00:00
aaron cbed0b1ade Adding GeliText tribble track as the first enabled Tribble track. This means 'Variants' is no longer valid for a ROD type, use GeliText instead. I've updated all the references in the codebase.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3271 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-29 22:50:17 +00:00
aaron 7fbfd34315 adding the GELI ROD validation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3270 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-29 21:43:00 +00:00
chartl 82818a417b Allow header fields to come in any order...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3269 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-29 18:33:10 +00:00
hanna 4617abf1ff Fix bug in the interval sharder in cases where contigs specified in intervals are not present in any supplied BAM file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3268 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-28 20:42:04 +00:00
chartl e2ff4167af Added "#Family ID" as a possible header value for PlinkRod ... since that's in the new sequenom headers for pilot 3 validation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3266 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-28 18:38:33 +00:00
depristo 5dce16a8f1 Better genotype concordance module. Code refactoring for clarity (please see below/after for educational purposes). Now reports variant sensitivity, concordance, and genotype error rate by default. Also aggregates this data across all samples, so you get a per sample and overall stats for each of these in the allSamples row.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3265 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-28 13:10:11 +00:00
aaron 64c5f287c5 fixes for edge-cases when using reflections to find classes outside of the main jar. Will push as a patch to reflections
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3264 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-27 17:46:46 +00:00
aaron c647153b10 Adding Jama for Ryan.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3262 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-27 14:30:36 +00:00
aaron f6468f9143 a fix for a bug we've worked around in the reflections package: previously it didn't find classes that weren't in the main jar. Fixed in this version.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3261 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-27 04:49:49 +00:00
ebanks df31eeff9f minor change
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3259 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-26 06:05:29 +00:00
aaron 68bdac254b a utility walker for validating changes made to the underlying ROD system in the transition to Tribble.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3258 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-26 05:21:24 +00:00
ebanks d9bf441391 Have UG emit calls at sites from one or more 'trigger' tracks when provided
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3257 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-26 05:04:43 +00:00
ebanks 8f2bfac7a6 Bug fix for NullPointerException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3256 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-26 05:02:09 +00:00
ebanks f5a3b128c8 Fixing bug that's not caught by integration tests:
If the first eval seen has one or more no-calls, then that's the 2N chromosome count that gets set as the max for the metrics.  Instead, just check that any eval's no-call count is 0.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3255 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-26 02:40:34 +00:00
depristo 29ab59a7b3 Bug fix for Kiran; insertions now get a null reference allele even if the ref input object is null
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3254 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-24 21:31:03 +00:00
aaron c8d09a29ed some quick changes to the VE output system - more to come.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3253 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 21:55:08 +00:00
depristo 7f4d5d9973 Ti/Tv by AC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3252 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 17:56:29 +00:00
ebanks 42bcca1010 Pulling out the left-alignment code for indels so that other walkers can use it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3251 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 16:12:34 +00:00
weisburd 9e28e4eb42
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3250 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 15:50:09 +00:00
weisburd 10bcd72593 1st attempt to implement extra columns
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3249 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 15:49:37 +00:00
weisburd a72a5a7b1a Data object for representing a single amino acid
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3248 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 15:49:06 +00:00
rpoplin e7c0ded40e Fixed long-standing bug in GenotypeConcordance module of VariantEval which caused incorrect numbers to be displayed in the concordance table. The format of the concordance table has changed. Added a concordance summary table which gives overall genotype concordance summary stats by sample. None of the VE integration tests contained genotype information so I added a comp track with genotypes to one of the tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3247 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 15:48:41 +00:00
ebanks e0b51d0df0 Trigger cleaning of duplicate reads. Also better debug output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3246 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 15:12:28 +00:00
ebanks 3adf7fbf64 bug fix for known-indels used as consensuses
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3245 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 13:52:51 +00:00
aaron f050beada6 make sure we do delete the temp file we create
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3244 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 05:32:49 +00:00
aaron 536f22f3bd adding VC adaptor for GELI, along with unit tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3243 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 05:28:39 +00:00
depristo 3d2c836db6 Bug fix for case sensitivity
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3242 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-23 03:08:58 +00:00
ebanks 8c94df6f00 Bug fix for Chris: deal with sites that have "semi-deletions"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3241 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 18:34:41 +00:00
chartl 121163dd49 interim commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3240 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 13:44:45 +00:00
weisburd f0fe2ea530 A simple codon -> AA lookup table
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3239 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 12:18:00 +00:00
weisburd e643a9e7a5 Takes a refGene table ( -B arg must be: -B refgene,AnnotatorInfoTable,/path/to/refgene_file.txt) and generates the big table of nucleotides containing annotations for each possible variant at each transcript position (eg. 4 variants for each position).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3238 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 12:11:19 +00:00
weisburd 653e08c0b6 Takes a refGene table ( -B arg must be: -B refgene,AnnotatorInfoTable,/path/to/refgene_file.txt) and generates the big table of nucleotides containing annotations for each possible variant at each transcript position (eg. 4 variants for each position).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3237 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 12:11:03 +00:00
weisburd 20379c3f82 Added location-caching optimization, temporary attributes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3236 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 11:35:45 +00:00
ebanks 84ebceb9a6 Fix for Chris: need to use the appropriate conversion method. Added a warning to the adaptor.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3235 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 02:05:10 +00:00
chartl e7334ec11f Checkin for Eric (IndelDBRateWalker is a prelude to a VariantEval module for comparisons for indels)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3234 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-22 00:40:27 +00:00
hanna 32d86cf457 Rev the reservoir downsampler to support partitioning through a functor.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3232 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 19:50:26 +00:00
asivache ef6d900eb8 for now, set log error to -1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3231 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 19:21:06 +00:00
ebanks e9e844fbf5 1. Reverting: dbsnp automatically is a comp
2. Fixing logic for min Qscore calculation


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3230 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 18:51:35 +00:00
asivache 532263ea25 Oooops, forgot to update the test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3229 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 18:38:24 +00:00
asivache 1373fee278 Because of the ugly VCF format, the generic addCall() method of the GenotypeWriter interface acquired an additional parameter: an explicitly specified reference base (in VCF it's the base immediately *before* the event in case of indels, so we have to pass it). All implementing classes are modified to accommodate the change.
VCFGenotypeWriterAdapter now explicitly uses the passed reference base instead of deriving it from VariantContext (in SNP mode as well!); other writers simply ignore that additional argument.

SimpleIndelCalculationModel now WORKS (or rather, it does produce calls :) )

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3228 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 18:19:03 +00:00
hanna ab34397d2e Continuing to stamp out the non-ASCII copyright virus.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3227 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 14:50:45 +00:00
chartl 84f1ccd6ac Two dumb oneoff walkers written to fix & annotate the Baylor indel calls (which came in sans reference, and without coding/intron annotations).
ERIC -- does the IndelAnnotator (the RefSeq lookup code I stole from IndelGenotyperV2) want to be its own Annotation inside VariantAnnotator? Is Andrey already doing this as part of adding indel calling to UG?



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3226 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 14:04:10 +00:00
depristo 2fdc1cf490 Bed ROD track support
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3225 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 13:22:42 +00:00
depristo 51b3998082 deleting unused code from VariationFiltration
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3224 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 13:22:19 +00:00
ebanks 4abd3b0b7b Fixing known/novel calc now that dbsnp isn't a default comp track
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3223 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 05:43:59 +00:00
ebanks 114819d980 Allow user to set min confidence score for comp tracks too
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3222 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 05:09:09 +00:00
ebanks 3db73e0791 Renaming for consistency
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3221 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 03:00:43 +00:00
ebanks 3b5673d967 1. Removed -all; by default all modules are used; use -none for no modules.
2. Don't make dbsnp track be a comp by default (to cut back on output). Please let me know if someone wants this back for some reason.
3. Cleaned up dbsnp module output to print the right numbers.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3220 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-21 02:46:42 +00:00
aaron 4e18c54bb8 fixing a couple of commented out portions of the VCFReader test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3219 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 22:20:35 +00:00
asivache 6fda78f93f Always return deleted bases in upper case
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3218 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 19:17:40 +00:00
asivache 52a570637d Always keep event bases in upper case
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3217 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 19:16:39 +00:00
aaron 80c4f88a72 removing the Variation interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3216 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 18:56:45 +00:00
asivache 7d952a34ae Fixing copyright note
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3215 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 18:28:57 +00:00
asivache cdc175f7e3 Synchronizing version to make sure everything compiles; this model is not operational yet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3214 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 17:41:52 +00:00
asivache 4437456bb5 Pass array of ref bases to callExtendedLocus()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3213 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 17:41:13 +00:00
asivache 5d2fab93f4 Method signature changed: for extended events, pass array of reference bases (to ensure we cover the full length of the indel event), not just reference base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3212 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 17:40:30 +00:00
asivache 01e6492ba9 Updated to work correctly with extended pileups. Clogged and uses some dirty tricks; pileups/extended pileups need to be redesigned someday
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3211 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 17:38:09 +00:00
asivache 4723cad1be New method: getBasesAtLocus(int n); for the windowed reference context, this method extracts n bases starting at the current locus (NOT at the window start, so this method is an extension of getBase())
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3210 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 17:35:09 +00:00
asivache cac125b35c Fixed incorrect symbol printed into the output file (tag had 'R', should have had 'T')
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3209 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 16:37:28 +00:00
rpoplin f4977965b6 Removing debug statements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3208 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 16:22:40 +00:00
rpoplin 124b7a2a58 Moved ApplyVariantClusters over to VariationContext
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3207 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 16:20:25 +00:00
asivache 200d3e2c47 added copyright note
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3205 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 15:44:26 +00:00
asivache 546dfb629e A draft (working) version of a tool that computes per-cycle base qualities averaged across the reads; the computed base qual profiles are stratified by lane/read end and separately by library. Come and shoot me if we already have such a tool somewhere in the repository :)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3204 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 15:38:16 +00:00
hanna c1e53d407d The copyright tag that I copied/pasted from a LaTeX document into IntelliJ had
unicode quote characters embedded in it.  These characters were invisible inside
IntelliJ but cause compile warnings for Ryan and Aaron, who for whatever reason
have a different default charset.  Fixed.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3203 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 15:26:32 +00:00
aaron b5f6f54968 Almost done removing any trace of the old Variation and Genotype interfaces.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3202 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 14:52:15 +00:00
hanna 818a95ea6e Test of new copyright message without unicode characters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3200 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 14:14:54 +00:00
rpoplin 00feb3eee0 Moving over to VariationContext in CountCovariates. Removed references to class Variation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3199 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 13:26:22 +00:00
hanna 1bc26f69e9 An attempt to cleanup the Utils directory. Email to follow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3198 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 23:00:08 +00:00
hanna c08936d6f4 Added a reservoir downsampler which can sample elements in an iterator uniformly
from a stream (see Vitter 1985).  Thanks to Eric and Andrey for the pointer.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3197 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 20:48:14 +00:00
ebanks c44f63c846 Fixing the performance tests: we need to catch the RuntimeException (not samtools' RuntimeIOException). Also, CountCovariates doesn't need the catch.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3196 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 14:28:12 +00:00
ebanks abf48cee05 Moving over to VariantContext from Variation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3195 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 06:56:29 +00:00
ebanks d73c63a99a Redoing the conversion to VariantContext: instead of walkers passing in a ref allele, they pass in the ref context and the adaptors create the allele. This is the right way of doing it.
Also, adding some more useful integration tests.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3194 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 05:47:17 +00:00
aaron 131703d9db more clean-up: moving AlleleBalanceInspector to archive.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3192 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 20:53:33 +00:00
ebanks 534f24177a Move to VariantContext and improve performance (and ease of use) by transitioning to be a RODWalker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3191 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 20:09:48 +00:00
ebanks 8c32bb8f0a Complete the move over to VariantContext so that we can remove dependence on Variation (in the VCF code)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3190 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 19:41:42 +00:00
aaron 821e8b1c5f more cleanup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3189 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 19:16:16 +00:00
aaron e11ca74eb5 removing some outdated ROD classes (PooledEMSNPROD and SangerSNPROD), removing an out-of-date interface (VariantBackedByGenotype), and moving AnalyzeAnnotationWalker over to VariationContext.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3188 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 18:59:29 +00:00
ebanks d5e5589b8f No longer used
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3187 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 17:57:39 +00:00
aaron be7cbf948b adding a catch for the exception thrown by samtools when it attempts to close /dev/null in the performance tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3186 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 17:41:48 +00:00
aaron 4d75b26b7a Removing the code that made the ROD system case insensitive. Anyone using specific ROD names in their classes should take care in naming required tracks; All lowercase is the best practice.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3184 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 06:17:31 +00:00
asivache 6dc1275cfb Utility method added: getQualsInCycleOrder(read) - examines the read and returns its quals in the order the machine read them (i.e. always from cycle 1 to cycle N). Simply inverts quals if the read happens to be rc-aligned :)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3183 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 00:15:57 +00:00
ebanks f4673efd2f Moving to archive as it's no longer supported
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3182 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 22:10:42 +00:00
ebanks 02a6f4c401 Moving over to VariantContext
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3181 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 22:07:28 +00:00
ebanks 7adff5b81a Renaming for consistency
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3180 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 20:36:19 +00:00
ebanks e702bea99f Moving VE2 to core; calling it "VariantEval" (one more checkin coming)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3179 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 20:25:47 +00:00
chartl ac6f6363ce Execs() temporarily disabled after removal of bam file. New tests forthcoming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3178 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 20:11:56 +00:00
ebanks ac9dc0b4b4 Removing VariantEval (v1); everyone should be using VE2 now. Docs coming ASAP.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3177 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 19:53:02 +00:00
ebanks 3330e254a9 Standardize the dbsnp track name in preparation for case-sensitivity
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3176 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 19:41:57 +00:00
ebanks 5f7564bf0a Better naming of output columns
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3175 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 18:08:07 +00:00
aaron e682460c1f add a fix so that XL arguments won't cancel out -BTI arguments, fixed a bug for Ben where the ROD -> interval list conversion was throwing an exception, and some old code removal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3174 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 16:31:43 +00:00
aaron b54031fc86 adding an experimental format to VariantEval2, which when you source() from R, imports all VE2 output as individual tables with appropriate row and column names. More testing and feedback needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3172 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 06:09:27 +00:00
ebanks 04909fa6ad Removing arbitrary selects
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3169 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 17:46:39 +00:00
ebanks f1189bac5a Bug fix: final map call wasn't being triggered (because we returned when ref==null before applying update0)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3168 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 16:58:55 +00:00
weisburd b930dc52a5 Integration test for GenomicAnnotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3167 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 14:43:25 +00:00
weisburd c0f4695902 Improved handling of haplotypeReference and haplotypeAlternate columns. Added haplotypeStrand column. Improved handling of empty fields in data files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3166 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 14:42:19 +00:00
weisburd 74ec72d1ac Added AnnotatorROD - the TabularROD format specific to GenomicAnnotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3164 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 14:39:50 +00:00
weisburd 77a6608784 Changed a variable name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3163 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 14:38:18 +00:00
weisburd 7b8056099c Fixed 'N' reference-base handling, changed some comments, var names
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3162 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 14:37:25 +00:00
ebanks dde092fb61 Added the ability in VE2 to select which eval modules to run, so that you aren't forced to use all of them. You can use --list to list all of the possible modules to run.
Heads up everyone: by default, *no* modules are run.  Please add "-all" to your scripts to maintain the previous behavior.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3161 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 22:15:58 +00:00
ebanks 0b575596f8 Fix for concordance: samples found only in truth no longer kill it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3160 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 21:33:49 +00:00
hanna 8573b0bc6f Refactoring intervals, separating the process of parsing interval lists,
sorting and merging interval lists, and creating RODs from intervals.  This
gives Doug the ability to keep using our interval list parsing code when
sorting intervals on our behalf.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3159 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 15:50:38 +00:00
weisburd d0123956bc Modified comments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3158 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 15:41:59 +00:00
chartl 7b05091c04 DoC now does not require a -o argument. (Change for Matt)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3157 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 13:58:17 +00:00
ebanks e413882302 Generalizing the SequenomValidationConverter to be able to take in any arbitrary rod type (provided it can be converted to VariantContext).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3155 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-12 20:42:18 +00:00
hanna 14b8101d45 Error message fail. Failed to supply one of the valid interval file types.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3153 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-12 01:19:01 +00:00
hanna 60d54e69f3 Hackish fix to present a better error message if the file does not have the proper extension. Will work with Brett to come up with a better solution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3152 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-12 01:11:27 +00:00
ebanks d06c7835d8 Adding performance tests for the indel realigner; should take ~3 hours.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3151 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-11 04:45:22 +00:00
ebanks 3434a61146 Don't trigger when ref=N (which can happen when a dbsnp track is provided)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3150 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-11 02:59:11 +00:00
ebanks 961ca05abc Removed outdated Sequenom rod and renamed HapMapGenotypeROD to HapMapROD.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3149 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-11 01:43:07 +00:00
ebanks fa01876255 UnifiedGenotyper performance tests (WG, WEx); currently takes just over an hour.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3148 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 19:42:29 +00:00
ebanks 0cc6d0fbbb One more quick memory improvement: reuse Alleles in a given context instead of creating new ones for each sample (duh).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3147 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 18:48:36 +00:00
rpoplin c2a37e4b5c Variant Quality Score modules in VariantEval2 no longer create huge lists which hold all of the quality scores encountered and instead cast the quality score to an integer and use hash tables. Bug fix for files in which all the quality scores are set to -1.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3146 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 18:36:06 +00:00
ebanks 71f38a9199 Adding performance tests for the recalibrator (Whole Genome and Whole Exome tests).
Should take ~3 hours to run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3145 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 18:30:59 +00:00
ebanks e73e6a4fb0 Significant memory improvements to plink code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3144 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 16:12:38 +00:00
rpoplin f1b1e70612 Bug fix for multisample calls in ApplyVariantClusterWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3142 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 12:01:15 +00:00
ebanks 3f2455e346 Better error message as suggested by James P
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3141 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 05:52:53 +00:00
ebanks fba48b515a Heads up everyone:
For consistency, these tools should be writing to the walker's output stream and no longer use the -vcf argument.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3140 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 05:37:25 +00:00
ebanks e286623f6f Use byte[] instead of String in an attempt to cut down on memory usage
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3139 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 05:32:54 +00:00
chartl 7025f5b51d Added an auxiliary table to DepthOfCoverage, which is the cumulative equivalent of the locus table (got tired of doing the calculation by hand). Also took care of a trailing tab in the per-locus output table.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3138 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 19:37:17 +00:00
aaron 9f6377f7fb added a performance test build option (for the upcoming performance test suite), and added a sample performance test for VariantEval.
IMPORTANT: it was really redundant that we had -Dsingle and -Dsingleintegration to run single unit tests and integration tests; now you can just use -Dsingle to run a single test for performance, unit, and integration tests.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3136 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 15:37:15 +00:00
aaron 4014a8a674 A long overdue correction; all unit tests now end in 'UnitTest'. This was something we wanted to do for a while, and now with the performance tests coming, it was a good time to clean-up. Please label any new test appropriately: *UnitTest and *IntegrationTest are the two valid file name patterns for tests.
Thanks!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3135 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 06:14:15 +00:00
aaron e148a3ac61 added the ability to create interval lists directly from a ROD, using the command line arg '-BTI' (long name '--rodToIntervalTrackName'). The parameter to this arg is the name of the ROD track, which must be a track name specified in the -B option.
Using this feature, sites covered by the target ROD will be iterated over.  The generated list of intervals is merged with any intervals from the -L and -XL args, and the Walker is run over the resulting merged list.

WARNING: for very large RODs this can be costly.  Consider this experimental for now.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3134 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 05:14:41 +00:00
aaron 20cc2a85a4 removed the hashmap from Genotype Concordance, moved it into a table
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3133 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 21:24:48 +00:00
aaron e55f27b3b1 forgot a file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3132 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 20:51:13 +00:00
aaron 9ca8e345fc bye-bye old junk.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3131 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 20:41:48 +00:00
aaron 8fd59c8823 Modified the report system based on Ryan's feedback: tables are now created independently to avoid the permutation problem when they were all compressed in rows, and removed our dependency on FreeMarker. The Grep format stays the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3130 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 20:39:55 +00:00
depristo 918b746798 More detailed validation output. Fixes for genotyping overflow -- these are temporary and need to be properly resolved
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3129 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 16:38:28 +00:00
ebanks e7dad728df Trivial output changes for consistency
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3128 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 14:47:43 +00:00
depristo 058e7d3d12 Bug fix for Gregory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3127 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 00:21:35 +00:00
rpoplin 7b44e6bd55 ApplyVariantClusters now outputs interesting threshold points based on hitting the target novel TiTv
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3126 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-06 19:47:29 +00:00
rpoplin 60c227d67f Added new VE2 module to create a plot of titv ratio by variant quality score
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3125 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-06 15:19:27 +00:00
asivache 3530ef5a41 Explicit type cast fixed in order to work with new ROD implementation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3124 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-06 15:02:56 +00:00
rpoplin 2d002c56c3 Added histogram of variant quality scores broken out by true positive and false positive calls to the GenotypeConcordance module of VariantEval2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3123 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-06 13:48:31 +00:00
aaron 12e4f88ca7 a little bit more clean-up
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3122 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-05 20:49:06 +00:00
aaron df7e7921ce removing some unused code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3121 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-05 19:30:08 +00:00
ebanks 56eb15f91f Error checking for bad input (thanks, Aaron).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3120 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-04 03:17:01 +00:00
weisburd 705b28e90d First attempt at implementing record filtering based on special 'hap_ref', 'hap_alt' columns in the input files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3118 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-02 21:52:26 +00:00
weisburd d78e7f6c0a Added documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3117 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-02 21:51:28 +00:00
aaron 8017fb123f changed the depth of coverage walkers class name, and added a dependency in the packaging system so that RODs will all get imported.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3116 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-02 20:55:19 +00:00
weisburd 6b7b07f178 First checkin of GenomicAnnotator which annotates an input VCF file by pulling data in a generic way from an arbitrary set of TabularRODs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3114 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-02 17:49:42 +00:00
rpoplin 642c969896 reverting optimizer changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3112 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-02 16:59:13 +00:00
chartl d7880ef7ad Forgot to uncomment the AlignerIntegrationTest before committing. And yes, Matt, commenting it out is, in fact, easier than just setting my classpath.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3110 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 17:17:16 +00:00
chartl f7d1b8f5de CoverageStatistics has now replaced DepthOfCoverage -- old DoC is in the archive.
Also, I can't be bothered to fix the spelling of "oldepthofcoverage" to contain the necessary number of D's. Be content that it does, however, contain the requisite number of O's.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3109 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 16:27:23 +00:00
aaron 585cc880a2 changed jexl expressions to jexl names in the VariantEval2 output, fixed integration test, and fixed a problem where a line was getting dropped in CSV output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3108 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 16:23:14 +00:00
hanna d00bde22db Reverting one of Brett's changes that should not have been committed. Will
address with Brett separately.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3107 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 16:10:46 +00:00
bthomas b4f6f54502 Reorganizing the way interval arguments are processed
Most of the changes occur in GenomeAnalysisEngine.java and GenomeLocParser.java: 
-- parseIntervalRegion and parseGenomeLocs combined into parseIntervalArguments
-- initializeIntervals modified
-- some helper functions deprecated for cleanliness
Includes new set of unit tests, GenomeAnalysisEngineTest.java

New restrictions: 
-- all interval arguments are now checked to be on the reference contig
-- all interval files must have one of the following extensions: .picard, .bed, .list, .intervals, .interval_list



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3106 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 12:47:48 +00:00
aaron c3c6e632d1 support for two new VCF header info field value-types, Flag (for fields that are just boolean truths), and Character (for single character info fields).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3105 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 03:11:32 +00:00
aaron 3d3d19a6a7 the last-mile commit for Tribble integration. The system is now ready for Tribble to be turned on, as soon as we've removed any dependencies in the ROD code on interfaces that aren't in the Tribble library (i.e. the Variation or Genotype interface on RODs). All of the walkers should be up to date.
a caveat: for anyone asking for all of the RODs back from the RefMetaDataTracker (if you're not using the facilities to get the track by name), you'll now be getting back a collection of GATKFeature objects.  This object will contain the track name, and a method for getting the underlying object (getUnderlyingObject()), which will be the traditional RodVCF, rodDbSNP, etc.  This layer is needed so we can integrate Tribble tracks (which don't natively have names).  Calls that ask for RODs by name will still get back the traditional reference ordered data objects (RodVCF, rodDbSNP, etc).

Sorry for the inconvenience!  More changes to come, but this is by far the largest (and has the greatest effect on end users).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3104 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 22:39:56 +00:00
hanna 4fcee248f9 For Kristian: functions which, given a read, can uniquely identify the BAM file storing that read.
Introducing this into the pile of code which peeks under the covers of the SAMDataSource in the hopes
that this function can help to replace the others and provide a single path for crosstalk.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3103 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 20:46:44 +00:00
rpoplin d58fe70708 Correctly ignore filtered calls and indel calls in the truth sets
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3101 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 14:33:01 +00:00
hanna b60197ae10 Another round of cleanup and simplification in Picard -- Picard's unit tests
are now passing for my branch.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3100 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 01:02:59 +00:00
depristo 40f8e7644c Better, multi-haplotype aware haplotype scores. Looking very good now, seems to be vastly better at dealing with incorrect calls in deep and low pass data. Almost ready for use
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3099 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-30 23:57:36 +00:00
depristo f992f51a3b Deleting incorrect sampling genotype likelihoods from the codebase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3098 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-30 23:56:35 +00:00
kiran b9d3fc3fbb Now checks if the i-th element of the FiltrationContext[] is null before trying to access it. This seems to happen occasionally at the very end of a VCF file... the array will be 6 elements long, but the last element will actually be null.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3097 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-30 22:40:17 +00:00
hanna 400684542c Revisions to take into account finalization of Picard patch: naming changes, better definition
of public interfaces.  This won't be the last Picard patch, but it should be the last big one.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3096 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-30 19:28:14 +00:00
aaron b00d2bf2bc fixing an annotation that was breaking the error log output system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3095 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-30 15:34:04 +00:00
aaron a6e8687d71 implementing a clean way to import the template files into the GATK jar (they should not always get bundled). All further resources should be added to the gatk.resources path id in the build script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3094 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-30 04:20:19 +00:00
ebanks babb9fb825 snp cluster filter should ignore ref calls when determining the clusters
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3093 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 17:57:33 +00:00
chartl 24461a2503 Let's *not* import classes that no longer exist. How my own ant test compiled is beyond me.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3091 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 13:59:01 +00:00
chartl dc802aa26f Moved CoverageStatistics to core. This will be (soon) renamed DepthOfCoverage; so please use CoverageStatistics
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3090 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 13:32:00 +00:00
ebanks 1e8b3ca6ba Fare thee well, oh LocusWindowTraversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3089 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 13:17:26 +00:00
depristo 8ea98faf47 Deleting the pooled calculation model -- no longer supported.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3088 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 11:44:27 +00:00
hanna 85037ab13f Fix for Kiran's sharding issue (Invalid GZIP header). General cleanup of
Picard patch, including move of some of the Picard private classes we use to Picard public.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3087 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 03:21:27 +00:00
depristo a45ac220aa Removing unnecessary printing routines
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3086 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-28 22:34:54 +00:00
depristo b8ab74a6dc Minor useful changes to BaseUtils and MathUtils to support a new haplotype score annotation that determines the two most likely haplotypes over an interval and scores variants by their consistency with a diploid model. Appears to be useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3085 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-28 21:45:22 +00:00
kshakir e9e53f68ab Filter lists can now end with .list or .txt.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3084 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-27 17:41:24 +00:00