Commit Graph

4548 Commits (5bade81c6d0c8dbdf0b12ad069f01d0a0b7ef0b7)

Author SHA1 Message Date
rpoplin 5bade81c6d Adding tranche plot generation back to VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5736 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:26:26 +00:00
rpoplin e73720c2db Updating VQSLOD annotation description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5735 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:01:08 +00:00
rpoplin 11052918d9 Better exception text for common error in VQSR.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5734 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:37:25 +00:00
rpoplin 4bbce42861 Renaming ContrastiveRecalibrator --> VariantRecalibrator in preparation for move to core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5733 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:12:47 +00:00
rpoplin 6323fb8673 misc cleanup in VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5732 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:00:22 +00:00
hanna f3bd11a02e Dress up some formatting issues.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5731 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 17:35:18 +00:00
hanna 9c809ed68e A walker to analyze the memory consumption of reference, reads, and RODs at
each base both in bytes and as a percentage of the used heap size.  

May be a bit buggy at this point; there are a lot of metrics around the Java
heap and I'm not completely sure that the metrics I'm outputting are exactly
the ones that I'm looking for. 

Also fixed a documentation bug in my Sizeof class.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5730 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 17:08:15 +00:00
ebanks d4cbd8691c Make the default that we only output SNPs (so that when I make another release we don't get flooded with questions about why the UG is all of a sudden so slow)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5729 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 16:38:55 +00:00
rpoplin 70f8ab6f89 Adding AF bin stratification for VariantEval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5728 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 15:22:50 +00:00
hanna 870e65a685 Fixing a build failure because I want to be completely sure that the code I
checked in immediately following the build breaking code passes integration
tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5727 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 02:09:53 +00:00
hanna 411980a50a Performance enhancements in GATKBAMIndex. Not sure these will assist in a
normal use case, but they cut startup times and memory allocation noise in
the profiler, making my profiling time more productive.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5726 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 20:48:16 +00:00
delangel 422d4ceeea removed useless file - no need for tableRecalibration, right now everything is done in PairHMMIndelErrorModel
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5725 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 20:35:44 +00:00
delangel 2a80ffa2ee Totally experimental, barely usable, not-to-be-used-yet implementation of an "Indel Quality Recalibrator". The idea is that any indel that's not in the input dbsnp is treated as an artifact, and then a csv is built with # of indels and # of observations as a function of each input covariate (initially, only cycle, read group and homopolymer run are useful). Then, when computing likelihoods of indels based on input haplotypes, we compute gap penalties based on the values of the covariates at the read. Feature is disabled by default with hidden arguments. TBD whether the usefulness of the feature is worth the extra time and pain.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5724 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 20:31:43 +00:00
rpoplin 3224bbe750 New visualization output for VQSR. It creates the R script file on the fly and then runs Rscript on it. Adding 1000G Project consensus code. First pass of having VQSR work with missing data by marginalizing over the missing dimension for that data point (thanks Chris and Bob for ideas). Updated math functions to use Apache Commons Math instead of approximations from Wikipedia. New parameters available for the priors based on further reading in Bishop and looking at the new visualizations. Updated integration test to use more modern files. Updated MDCP to use new best practices w.r.t. annotations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5723 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 19:14:42 +00:00
ebanks fcf8cff64a We didn't actually support all of these extensions. Updated to be accurate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5722 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 19:03:46 +00:00
carneiro 34092fd32f minor update...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5716 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 21:29:01 +00:00
carneiro 36ac8beee1 Making the GATK unpredictably random...
through an option! 

set -ndrs if you want the GATK to be really random (non-deterministic). Engine option, available to every walker.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5715 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 19:29:08 +00:00
carneiro f97e7d2fb4 Walker that calculates the percentage of bases that are covered to at least 20x. Very useful! In oneoffs until someone else thinks it's as useful as I think it is ;)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5714 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 19:19:39 +00:00
ebanks deed7c47a1 Continuing the epic fail, some of our existing integration tests were wrong because of the lazy loading failure.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5712 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 17:54:41 +00:00
ebanks ab9ffb1a74 Epic failure on the lazy loading of genotypes: if the input VCF had its samples unsorted and we used a walker that didn't require genotypes, then we would sort the samples but not load genotypes (and therefore the genotypes wouldn't match the samples anymore). Added simple integration test to cover this case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5711 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 16:03:45 +00:00
hanna 96571b55be Disable caching of ReadShards by the GenomeLocProcessingTracker (at least
temporarily).  Unfortunately this does not completely fix the IndelRealigner 
exception that Ryan is seeing, but it helps things quite a bit.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5710 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 13:59:34 +00:00
carneiro a5b96e0e04 I have to remember that this is Java, not C.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5709 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 17:40:14 +00:00
rpoplin b7334dcc1e Rank sum test annotations are the Z-scores from the test instead of the p-value.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5707 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 14:35:00 +00:00
ebanks 45081c32d7 continuing from last night, the integration tests weren't covering the right behavior either
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5706 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 13:30:57 +00:00
ebanks f34e6d5b8c Somewhere along the way someone broke this tool and failed to update the documentation to boot. Fixing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5705 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 03:16:20 +00:00
ebanks ae8f3f2cde Check for bad reference bases before creating simple/'empty' VCs. Updated the code in the indel GL model to be consistent and to use the existing utility in the Allele class.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5704 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 23:55:20 +00:00
depristo 6cce3e00f3 A test walker that does consensus compression of deep read data sets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5702 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 22:00:48 +00:00
rpoplin 3907377f37 When genotyping given alleles, for multiallelic sites we go back to the reads and use the alternate base with the highest sum of quality scores instead of taking the first alternate allele from the vcf file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5701 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 21:31:09 +00:00
droazen 6e9e766a71 The tighter interval validation wasn't interacting well with unmapped
intervals -- altered the validation methods to not throw an error for 
unmapped intervals.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5700 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 20:56:46 +00:00
hanna 6d5e45b5c6 Revbump Picard dependencies at Tim/Kathleen's request. Exclude anonymous
classes from PluginManager.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5699 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 20:38:05 +00:00
droazen d650efd40a Fix for bug GSA-449: Intervals that are not in GATK format are not validated
to the same standard as GATK format intervals. Full validation against contig
bounds is now performed for all intervals, regardless of their source. Also
fixed a few tests for validation exclusions that were backwards.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5698 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 18:12:10 +00:00
kshakir df35a143b2 Removed -debug/--debug_mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5697 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 10:56:39 +00:00
hanna 27495a0c64 Killed quiet mode. Should probably kill debugMode as well, but Queue's using
it.  Will check with Khalid tomorrow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5695 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 04:17:36 +00:00
hanna f3dacd3c40 Use ByteBuffer.allocateDirect() instead of ByteBuffer.allocate().
ByteBuffer.allocateDirect() behaves like Java NIO MappedByteBuffers in that
it consumes address space, which counts against our virtual memory allocation;
but cannot be destroyed or otherwise freed.  This was definitely contributing
to the LSF failures that I was seeing, but I'm not yet convinced that it's the
sole source of these virtual memory 'leaks'.  More tomorrow as the results of
my whole exome tests start to roll in.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5693 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 02:01:11 +00:00
chartl 7afeb1ab17 Removing broken imports (boo)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5692 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:55:25 +00:00
rpoplin 379f837e82 RankSum z-scores are looking quite good, so RIP Wilcoxon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5691 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:34:39 +00:00
chartl bc3fd70b0a Removing the old association walker, switching test to just validate that MannWhitneyU is doing the right thing. Unit tests still pass.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5690 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:05:19 +00:00
kshakir f619dd3ca7 Refactored IntervalUtils used to parse and scatter intervals for Queue.
Scattering non-contig interval lists by number of loci in the intervals instead of just number of intervals.
Queue caches the list of locs and how to split them up instead of reloading them from disk repeatedly.
TODO: general purpose function to divide data evenly.
Skip over comments when parsing picard analysis files.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5687 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 00:06:00 +00:00
hanna 57a4700299 Ported small BAM performance test suite to the Google Caliper microbenchmarking suite. Looks promising,
but I'm still not sure that GC is a good long-term solution.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5683 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 22:09:17 +00:00
chartl a56a2dfdb7 Nothing to see here. Move along.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5681 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 15:01:02 +00:00
delangel 600617a63c Enabled code to deal with hard-clipping adaptor sequence when processing reads in pileup in indel caller. Proven now that changes are minimal (4 fewer calls in NA12878 chr20, quals slightly different), minor changes in vcf fields in integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5679 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 14:10:33 +00:00
chartl 88735a8c9b Adding in a delta to try and better measure effect size -- equivalent to looking at the lower end of the N^th percentile confidence interval. Kind of a hacky way to add it in, the infrastructure is about due for a streamlining rewrite.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5676 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 03:53:33 +00:00
hanna 7428ae338a A fix for Marian Thieme's NPE in the new sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5675 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 19:47:14 +00:00
chartl 5b9a8555cd Queue graph time is currently O(n^m) where n = num jobs, m = num unique base files. This script therefore was running in order 1200^16, which I don't think would finish before the heat death of the universe. For now, push down the number of files to 1 and gather them outside of Queue; once I've fixed up scatter-gather in core, outputs can be uncommented.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5674 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 12:56:25 +00:00
ebanks cbcdfc584d Moving out of core and into playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5671 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 02:30:22 +00:00
depristo cc78027bd3 Two optimizations. First, an even more aggressive printProgress meter optimization to only consider doing work once every 1000 cycles through the engine. Second, GenomeLocParser now uses a single indirection around the contigInfo variable. This class uses a last-used cache to efficiently retrieve contig information instead of always returning to the underlying SAMSequenceDictionary hashmap to make genome locs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5670 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 01:31:26 +00:00
depristo 29857f5ba6 Fix for instability in output of fasta alternative reference maker when snpmask and snp files are provided and have overlapping records. The order of the records changed due to optimization of the refmetadatatracker, and uncovered this non-determinism. Now preferentially includes sites from snps before considering masking out sites in snpmask
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5669 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 21:54:09 +00:00
kshakir 8619f49d20 Added a utility method to retrieve the contig lengths for WG chunking.
Added a rudimentary GATKReportParser for parsing VE3 results.
Re-enabled the FCPTest using VE3, the GATKRP, and the PicardAggregationUtils.
The tag type for .rod files is DBSNP, not ROD.
More explicit return types on implicit methods.
Added null checks for implicit string to/from file conversions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5668 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 19:22:21 +00:00
delangel 59dd79faab One more optimization: don't use Math.round(), but do my own rounding/casting. UG now about 40% faster calling indels, 30-35% faster calling SNPs+indels simultaneously.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5667 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 19:15:58 +00:00
delangel 246d8190b5 Round one of "easy" zero-effort optimizations to UG's indel caller. Mostly inline functions, avoid repeated computation and try to optimize SoftMaxPair() which is by far the biggest runtime hog. More to come...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5666 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 18:57:34 +00:00