Commit Graph

5660 Commits (3907377f3724bdadc580a024e3ea671d15e92e7e)

Author SHA1 Message Date
rpoplin 3907377f37 When genotyping given alleles, for multiallelic sites we go back to the reads and use the alternate base with the highest sum of quality scores instead of taking the first alternate allele from the vcf file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5701 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 21:31:09 +00:00
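As a rough illustration of the selection rule described in commit 3907377f37 above, the sketch below sums base qualities per non-reference base and picks the base with the largest sum. The class name `AltAlleleChooser`, the method `chooseAlt`, and the flat `bases`/`quals` arrays are hypothetical stand-ins for a pileup; they are not taken from the GATK code, and the real implementation may restrict candidates to the alleles listed in the VCF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AltAlleleChooser {
    /**
     * Returns the non-reference base with the largest summed base quality,
     * or 'N' when no read carries a non-reference base.
     * (Hypothetical helper; the pileup is modelled as two parallel arrays.)
     */
    public static char chooseAlt(char ref, char[] bases, int[] quals) {
        Map<Character, Integer> qualSums = new LinkedHashMap<>();
        for (int i = 0; i < bases.length; i++) {
            if (bases[i] != ref) {
                // Accumulate the quality of every read supporting this base.
                qualSums.merge(bases[i], quals[i], Integer::sum);
            }
        }
        char best = 'N';
        int bestSum = -1;
        for (Map.Entry<Character, Integer> e : qualSums.entrySet()) {
            if (e.getValue() > bestSum) {
                best = e.getKey();
                bestSum = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        char[] bases = {'A', 'C', 'T', 'T', 'C', 'A'};
        int[] quals = {30, 35, 20, 25, 5, 30};
        // C sums to 40 and T sums to 45, so T is chosen.
        System.out.println(chooseAlt('A', bases, quals));
    }
}
```

Summing qualities rather than counting reads lets a few confident observations outweigh many low-quality ones.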
droazen 6e9e766a71 The tighter interval validation wasn't interacting well with unmapped
intervals -- altered the validation methods to not throw an error for 
unmapped intervals.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5700 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 20:56:46 +00:00
hanna 6d5e45b5c6 Revbump Picard dependencies at Tim/Kathleen's request. Exclude anonymous
classes from PluginManager.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5699 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 20:38:05 +00:00
droazen d650efd40a Fix for bug GSA-449: Intervals that are not in GATK format are not validated
to the same standard as GATK format intervals. Full validation against contig
bounds is now performed for all intervals, regardless of their source. Also
fixed a few tests for validation exclusions that were backwards.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5698 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 18:12:10 +00:00
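The kind of check that commit d650efd40a describes (validating every interval against contig bounds, whatever its source) might look roughly like the sketch below. `IntervalValidator`, its contig-length map, and the 1-based inclusive coordinate convention are assumptions made for illustration, not the GATK API.

```java
import java.util.HashMap;
import java.util.Map;

public class IntervalValidator {
    private final Map<String, Integer> contigLengths;

    public IntervalValidator(Map<String, Integer> contigLengths) {
        this.contigLengths = contigLengths;
    }

    /** True when the 1-based, inclusive interval lies inside a known contig. */
    public boolean isValid(String contig, int start, int stop) {
        Integer length = contigLengths.get(contig);
        if (length == null) {
            return false; // contig is not in the sequence dictionary
        }
        return start >= 1 && start <= stop && stop <= length;
    }

    public static void main(String[] args) {
        Map<String, Integer> lengths = new HashMap<>();
        lengths.put("chr20", 100); // toy length, for illustration only
        IntervalValidator validator = new IntervalValidator(lengths);
        System.out.println(validator.isValid("chr20", 1, 100));  // inside the contig
        System.out.println(validator.isValid("chr20", 90, 101)); // runs past the end
    }
}
```

Applying the same predicate to every interval, whatever file format it came from, is what keeps the validation standard uniform.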
kshakir df35a143b2 Removed -debug/--debug_mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5697 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 10:56:39 +00:00
kshakir ca817356b6 Quick disabling test to restore build. TODO fix test or complete removal of the MFCP.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5696 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 04:26:11 +00:00
hanna 27495a0c64 Killed quiet mode. Should probably kill debugMode as well, but Queue's using
it.  Will check with Khalid tomorrow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5695 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 04:17:36 +00:00
kshakir 6b1b4931e7 Added FCP VE stratifications for Filter, FunctionalClass, and Stratification as requested by Corin.
Feeding FCP UG the bam list instead of individual bams to cut scatter gather time from O(m^100) as measured by Chris to O(m^1).
Fixed NPE when eval values aren't found in PipelineTests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5694 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 02:29:56 +00:00
hanna f3dacd3c40 Use ByteBuffer.allocateDirect() instead of ByteBuffer.allocate().
ByteBuffer.allocateDirect() behaves like Java NIO MappedByteBuffers in that
it consumes address space, which counts against our virtual memory allocation,
but cannot be destroyed or otherwise freed.  This was definitely contributing
to the LSF failures that I was seeing, but I'm not yet convinced that it's the
sole source of these virtual memory 'leaks'.  More tomorrow as the results of
my whole exome tests start to roll in.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5693 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 02:01:11 +00:00
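For readers unfamiliar with the two allocation calls contrasted in commit f3dacd3c40, the sketch below creates one buffer of each kind and reports which is direct. `BufferKinds` and its two helper methods are hypothetical names; only `ByteBuffer.allocate`, `ByteBuffer.allocateDirect`, and `isDirect` come from the standard `java.nio` API.

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    /** Heap buffer: backed by a byte[] on the Java heap and reclaimed by normal GC. */
    public static ByteBuffer heap(int capacity) {
        return ByteBuffer.allocate(capacity);
    }

    /** Direct buffer: memory is reserved outside the Java heap, in process address space. */
    public static ByteBuffer direct(int capacity) {
        return ByteBuffer.allocateDirect(capacity);
    }

    public static void main(String[] args) {
        System.out.println(heap(1024).isDirect());   // false
        System.out.println(direct(1024).isDirect()); // true
    }
}
```

Direct buffers have no public method for releasing their memory early, which is consistent with the "cannot be destroyed or otherwise freed" remark in the message.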
chartl 7afeb1ab17 Removing broken imports (boo)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5692 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:55:25 +00:00
rpoplin 379f837e82 RankSum z-scores are looking quite good, so RIP Wilcoxon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5691 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:34:39 +00:00
chartl bc3fd70b0a Removing the old association walker, switching test to just validate that MannWhitneyU is doing the right thing. Unit tests still pass.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5690 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:05:19 +00:00
hanna b915520653 Updating to apache commons math v2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5689 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 17:31:49 +00:00
kshakir 58c7b27ccc Missing file from last checkin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5688 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 00:12:41 +00:00
kshakir f619dd3ca7 Refactored IntervalUtils used to parse and scatter intervals for Queue.
Scattering non-contig interval lists by number of loci in the intervals instead of just number of intervals.
Queue caches the list of locs and how to split them up instead of reloading them from disk repeatedly.
TODO: general purpose function to divide data evenly.
Skip over comments when parsing picard analysis files.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5687 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 00:06:00 +00:00
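To make the idea in commit f619dd3ca7 concrete (scattering by number of loci rather than by number of intervals), here is a minimal sketch. `LociScatter`, its nested `Interval` class, and the greedy boundary rule are assumptions for illustration; the actual IntervalUtils code may balance parts differently, and this version never splits a single interval.

```java
import java.util.ArrayList;
import java.util.List;

public class LociScatter {
    /** A closed interval [start, stop] on one contig (hypothetical type). */
    public static final class Interval {
        final String contig;
        final int start;
        final int stop;

        public Interval(String contig, int start, int stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }

        long size() {
            return (long) stop - start + 1;
        }
    }

    /** Splits an ordered interval list into at most {@code parts} groups of similar loci count. */
    public static List<List<Interval>> scatter(List<Interval> intervals, int parts) {
        long total = 0;
        for (Interval iv : intervals) {
            total += iv.size();
        }
        List<List<Interval>> result = new ArrayList<>();
        List<Interval> current = new ArrayList<>();
        long seen = 0;
        for (Interval iv : intervals) {
            current.add(iv);
            seen += iv.size();
            // Close the current part once the running total reaches its share of the loci.
            long boundary = total * (result.size() + 1) / parts;
            if (seen >= boundary && result.size() < parts - 1) {
                result.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Interval> intervals = new ArrayList<>();
        intervals.add(new Interval("chr1", 1, 100));    // 100 loci
        intervals.add(new Interval("chr1", 201, 210));  // 10 loci
        intervals.add(new Interval("chr1", 301, 310));  // 10 loci
        intervals.add(new Interval("chr1", 401, 480));  // 80 loci
        for (List<Interval> part : scatter(intervals, 2)) {
            System.out.println(part.size() + " interval(s)");
        }
    }
}
```

With the toy input, splitting by interval count would give parts of 110 and 90 loci, whereas splitting by loci gives one part holding the single 100-locus interval and another holding the remaining three.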
kshakir 6ca4e3cebf Updating FCPT nCalledLoci due to fixed QD<2.0 filter.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5686 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-25 21:37:04 +00:00
kshakir ed6da6f72d Added JavaMail dependencies to Queue package since bcel wasn't picking them up.
Added the ability to add a file path to a package.
Checking for missing files when packaging.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5685 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-23 20:48:40 +00:00
kshakir 1158c99726 Only running chr20 test on the hour queue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5684 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 22:09:42 +00:00
hanna 57a4700299 Ported small BAM performance test suite to the Google Caliper microbenchmarking suite. Looks promising,
but I'm still not sure that GC is a good long-term solution.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5683 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 22:09:17 +00:00
kshakir 00b57c751b Added missing ".0".
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5682 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 21:50:07 +00:00
chartl a56a2dfdb7 Nothing to see here. Move along.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5681 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 15:01:02 +00:00
ebanks 8bc93046f4 Adding chain files for Mark. Tested by lifting over back and forth between builds. Note that they comprise only the standard contigs so no _randoms or GL000xxx.1s.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5680 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 14:53:15 +00:00
delangel 600617a63c Enabled code to deal with hard-clipping adaptor sequence when processing reads in pileup in indel caller. Proven now that changes are minimal (4 fewer calls in NA12878 chr20, quals slightly different), minor changes in vcf fields in integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5679 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 14:10:33 +00:00
ebanks e050d94df4 Renaming because they actually map to b37, not hg19
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5678 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 13:34:48 +00:00
ebanks 831ad0cd1a Quit immediately with an error message if any of the individual steps fails.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5677 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 13:23:33 +00:00
chartl 88735a8c9b Adding in a delta to try and better measure effect size -- equivalent to looking at the lower end of the N^th percentile confidence interval. Kind of a hacky way to add it in, the infrastructure is about due for a streamlining rewrite.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5676 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 03:53:33 +00:00
hanna 7428ae338a A fix for Marian Thieme's NPE in the new sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5675 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 19:47:14 +00:00
chartl 5b9a8555cd Queue graph time is currently O(n^m), where n = num jobs and m = num unique base files. This script was therefore running in order 1200^16, which I don't think would finish before the heat death of the universe. For now, push down the number of files to 1 and gather them outside of Queue. Once I've fixed up scatter-gather in core, outputs can be uncommented.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5674 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 12:56:25 +00:00
corin 9f006be425 Updates Omni path and removes a typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5673 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 04:17:13 +00:00
ebanks 0007481890 Might as well store these here too even if they aren't used for the resource bundle
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5672 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 04:14:08 +00:00
ebanks cbcdfc584d Moving out of core and into playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5671 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 02:30:22 +00:00
depristo cc78027bd3 Two optimizations. First, an even more aggressive printProgress meter optimization that only considers doing work once every 1000 cycles through the engine. Second, GenomeLocParser now uses a single indirection around the contigInfo variable. This class uses a last-used cache to retrieve contig information efficiently instead of always returning to the underlying SAMSequenceDictionary hashmap to make genome locs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5670 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 01:31:26 +00:00
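The last-used cache mentioned in commit cc78027bd3 can be sketched as a thin wrapper that remembers the most recent lookup and only falls back to the map on a miss. `CachingContigInfo` and its plain `Map<String, Integer>` of contig lengths are hypothetical simplifications; the real class wraps a SAMSequenceDictionary, and this sketch is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

public class CachingContigInfo {
    private final Map<String, Integer> contigLengths;
    private String lastContig = null;
    private int lastLength = -1;
    private int mapLookups = 0; // counts how often the underlying map is consulted

    public CachingContigInfo(Map<String, Integer> contigLengths) {
        this.contigLengths = contigLengths;
    }

    public int getLength(String contig) {
        // Consecutive queries usually hit the same contig, so check the cache first.
        if (contig.equals(lastContig)) {
            return lastLength;
        }
        mapLookups++;
        Integer length = contigLengths.get(contig);
        if (length == null) {
            throw new IllegalArgumentException("Unknown contig: " + contig);
        }
        lastContig = contig;
        lastLength = length;
        return length;
    }

    public int getMapLookups() {
        return mapLookups;
    }

    public static void main(String[] args) {
        Map<String, Integer> lengths = new HashMap<>();
        lengths.put("chr1", 1000); // toy lengths, for illustration only
        lengths.put("chr2", 2000);
        CachingContigInfo info = new CachingContigInfo(lengths);
        info.getLength("chr1");
        info.getLength("chr1");
        info.getLength("chr2");
        System.out.println(info.getMapLookups()); // 2
    }
}
```

The cache pays off because traversal visits loci in coordinate order, so long runs of lookups ask for the same contig.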
depristo 29857f5ba6 Fix for instability in output of fasta alternative reference maker when snpmask and snp files are provided and have overlapping records. The order of the records changed due to optimization of the refmetadatatracker, and uncovered this non-determinism. Now preferentially includes sites from snps before considering masking out sites in snpmask
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5669 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 21:54:09 +00:00
kshakir 8619f49d20 Added a utility method to retrieve the contig lengths for WG chunking.
Added a rudimentary GATKReportParser for parsing VE3 results.
Re-enabled the FCPTest using VE3, the GATKRP, and the PicardAggregationUtils.
The tag type for .rod files is DBSNP, not ROD.
More explicit return types on implicit methods.
Added null checks for implicit string to/from file conversions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5668 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 19:22:21 +00:00
delangel 59dd79faab One more optimization: don't use Math.round(), but do my own rounding/casting. UG now about 40% faster calling indels, 30-35% faster calling SNPs+indels simultaneously.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5667 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 19:15:58 +00:00
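Commit 59dd79faab does not show how the hand-rolled rounding was written. One common form, given here purely as an assumption, adds 0.5 and truncates with a cast. `FastRound` and `roundNonNegative` are hypothetical names, and the shortcut is only appropriate for non-negative inputs well inside the `int` range.

```java
public class FastRound {
    /**
     * Rounds a non-negative value half-up by adding 0.5 and truncating.
     * Not valid for negative inputs, where the cast truncates toward zero.
     */
    public static int roundNonNegative(double x) {
        return (int) (x + 0.5);
    }

    public static void main(String[] args) {
        System.out.println(roundNonNegative(2.4)); // 2
        System.out.println(roundNonNegative(2.5)); // 3
        System.out.println(Math.round(2.5));       // 3, the library result for comparison
    }
}
```

The cast skips the special-case handling that `Math.round` performs, so it can differ from the library result on values a hair below one half, which is a trade-off worth checking before adopting it in numerical code.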
delangel 246d8190b5 Round one of "easy" zero-effort optimizations to UG's indel caller. Mostly inline functions, avoid repeated computation and try to optimize SoftMaxPair() which is by far the biggest runtime hog. More to come...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5666 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 18:57:34 +00:00
depristo d8b8f857f3 V2 -- now working -- of a core walker that creates the standard GATK resource bundle
See https://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle

The bundle lives locally in /humgen/gsa-hpprojects/GATK/bundle/current

Use the following command to create the bundle:

java -Djava.io.tmpdir=/broad/shptmp/depristo/tmp -jar dist/Queue.jar -S scala/qscript/core/GATKResourcesBundle.scala --gatkjarfile dist/GenomeAnalysisTK.jar -bsub -jobQueue gsa -svn 5660 $* 

Annoyingly, it must be run in the trunk directory, and requires an explicit svn version number to create the directory.  It also must be run in two stages manually.  First, the local bundle is created, and then with the -phase2 argument all of the files in the local bundle are compressed and pushed to the FTP server.  I'm likely going to shift most of my processes over to using this location for data file access, especially for b37 data sets.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5665 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 12:48:47 +00:00
depristo a8f8077d7a Simple optimizations for cases where there is no data or RODs at sites, such as with the FastaStats walker. private static immutable Lists and Maps in underlying data structures that have no associated data. Also, avoiding a double map.get() in the low-level genome loc parser. RefMetaDataTracker is now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5664 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 10:52:16 +00:00
hanna 54660a8c25 Fix requested by Lee Lichtenstein: first check to see whether it's time for
a progress message, then aggregate metrics.  Makes the overhead of
printProgress in RealignerTargetCreator go from >20% to ~3%.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5663 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 03:22:48 +00:00
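The reordering described in commit 54660a8c25 (check whether a progress message is due before doing any aggregation) follows a general pattern, sketched below. `ProgressMeter`, `maybePrint`, and the aggregation counter are hypothetical; the clock value is passed in as a parameter so the behaviour can be exercised deterministically.

```java
public class ProgressMeter {
    private final long intervalMillis;
    private long lastPrintMillis;
    private int aggregations = 0; // how many times the expensive step actually ran

    public ProgressMeter(long intervalMillis, long startMillis) {
        this.intervalMillis = intervalMillis;
        this.lastPrintMillis = startMillis;
    }

    /** Returns true when a progress message was produced for this call. */
    public boolean maybePrint(long nowMillis) {
        // Cheap time comparison first: most calls return here without doing any work.
        if (nowMillis - lastPrintMillis < intervalMillis) {
            return false;
        }
        aggregateMetrics(); // the expensive step runs only when a message is due
        lastPrintMillis = nowMillis;
        return true;
    }

    private void aggregateMetrics() {
        aggregations++; // stand-in for summing per-thread or per-shard counters
    }

    public int getAggregations() {
        return aggregations;
    }

    public static void main(String[] args) {
        ProgressMeter meter = new ProgressMeter(1000, 0);
        meter.maybePrint(10);
        meter.maybePrint(999);
        meter.maybePrint(1000);
        System.out.println(meter.getAggregations()); // 1
    }
}
```

Because the early return costs a single subtraction and comparison, calling the meter on every cycle adds little overhead.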
carneiro d35c7d1029 - minor changes to the 'justclean' script to handle the Trio Cleaning.
- fixing a bug in the single-ended BWA option of the data processing pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5662 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-19 16:35:24 +00:00
hanna 49550e257f Fix for JamesP's issue. This issue appeared because of a design flaw in the
interface between SAMDataSource and IntervalSharder that needs to stay around
until the original BAM sharder is retired.  Will add a JIRA to fix design
flaw.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5661 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-19 00:52:13 +00:00
depristo 50e86cfee9 useful chain files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5660 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 19:47:49 +00:00
depristo 541c9109b3 V1 of GATK Resource Bundling system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5659 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 19:23:45 +00:00
ebanks 673772a522 Catch samtools exceptions and make them 'BAM Exceptions' asking the user to run Picard's validator and re-index the file before posting anything to the forum. Let's see whether this helps or not.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5658 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 03:52:43 +00:00
ebanks e97a5ca161 Rename 'verbose' argument to 'debug_file'.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5657 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 03:17:13 +00:00
chartl e28fc21642 Spurious associations can develop from including ambiguous reads in these tests. Perhaps MQ0 reads shouldn't be used for anything except MQ0, but the best way to do that is to restructure the code, so for now I'll put it off.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5656 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-17 23:17:03 +00:00
ebanks 49ea07acce My fixes to Tribble yesterday revealed that some of the test VCFs for integration tests were actually malformed. Also, Guillermo updated the b37 dbSNP VCF and that broke some tests. Should be good for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5655 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-17 03:39:11 +00:00
chartl 23fac043d9 Fix the outputs so the proper files are gathered (not automatic due to multiplexer)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5654 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 23:55:12 +00:00
chartl e5ef8388fc BatchMerge - AlleleVCF --> AllelesVCF, this (combined with Eric's fix) will solve James P.'s forum issue.
After viewing results on real case/control data from RAW -- it's really working quite well. ReadIndels, however, needs to use a T-test rather than a U-test, especially in deep coverage (at indel sites, the reads with indels will have mostly the same number of CIGAR indel elements -- one -- which doesn't really play nicely with the UTest when sample sets are large). Modified ReadsLargeInsertSize to be a two-way test (e.g. ReadsLarge and ReadsSmall). BaseQualityScore also suffers from the same issue as read indels, so switching over to a T-test in that case as well.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5653 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 22:03:16 +00:00
ebanks 1c32deb108 For some reason I wasn't allowing expressions to be used with the -all argument.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5652 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:59:10 +00:00