4228 Commits (e5ef8388fc494d6553e167ec8aec3fecdfffa18f)

Author SHA1 Message Date
chartl e5ef8388fc BatchMerge - AlleleVCF --> AllelesVCF, this (combined with Eric's fix) will solve James P.'s forum issue.
After viewing results on real case/control data from RAW -- it's really working quite well. ReadIndels, however, needs to use a T-test rather than a U-test, especially in deep coverage (at indel sites, the reads with indels will have mostly the same number of CIGAR indel elements -- one -- which doesn't really play nicely with the UTest when sample sets are large). Modified ReadsLargeInsertSize to be a two-way test (e.g. ReadsLarge and ReadsSmall). BaseQualityScore also suffers from the same issue as read indels, so switching over to a T-test in that case as well.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5653 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 22:03:16 +00:00
ebanks 1c32deb108 For some reason I wasn't allowing expressions to be used with the -all argument.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5652 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:59:10 +00:00
corin 2cf6a06503 Throwing an error if INFO fields arguments contain whitespace.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5651 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:52:55 +00:00
corin fce6d25075 Moved the reference ID to a meta data field for validity declaration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5650 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:28:56 +00:00
corin 59215dab48 Now writes results to a minimal vcf with annotations included in the INFO field. Must be run with -NO_HEADER to totally remove header for the most bare bones vcf; otherwise also includes command line meta data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5649 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:14:02 +00:00
ebanks fe26954ac6 Minimal support for reading in VCF4.1 files. Added TODOs that need to be fixed or cleaned up to truly support this version. VCF constants updated. Lower-case bases permitted. Please let's make sure to refactor once we're ready to support it for good.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5648 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 18:59:37 +00:00
ebanks 7e9051ea25 The solution to James's bug was just to clean up the code and simplify it. What happened was that functionality that got put into UGCalcLikelihoods was then generalized into the UG engine but then never removed from UGCalcLikelihoods. This knowingly breaks the batch merger, but Chris said he'll take care of it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5647 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 18:05:10 +00:00
hanna 0d7cca169e Sigh.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5645 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 14:37:24 +00:00
hanna 0965020804 Screwed up the doc string.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5644 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 14:30:20 +00:00
hanna be3bad1f61 Low-memory sharding is now enabled by default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5643 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 14:22:07 +00:00
ebanks 2830dc70b7 UG can still return null in certain nasty cases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5642 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 20:11:17 +00:00
fromer 8e0f5bc5a5 Prevent NullPointerException in cases where SNP is filtered
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5641 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 19:59:59 +00:00
depristo ee94af3539 Oops, left out of earlier commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5640 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 18:21:16 +00:00
depristo 8ed9c0f518 VariantsToTable now blows up by default if you ask for a field that isn't present in a record.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5636 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 14:42:43 +00:00
fromer b3cd14d10a Since GCcontentIntervalWalker no longer uses any ROD, turn it into a LocusWalker that traverses by REFERENCE
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5635 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 03:15:09 +00:00
aaron 2089c3bdef removing; should have gone to the CGA repo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5633 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 22:17:45 +00:00
aaron da6f2d3c9d adding the capseg tools to the new walker repo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5632 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 22:11:08 +00:00
kshakir 4bb573b1f5 Centralizing a bunch of Broad specific utility functions from code scattered in GSA-Firehose, PipelineTest, custom QScripts, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5631 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 21:29:02 +00:00
ebanks 91d308fc6d temporary patch until Picard (hopefully) fixes the NM calculation to deal with reads that align off the end of the contig
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5630 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 19:18:18 +00:00
ebanks fa6468d167 Remove the adaptor sequence clipping read filter because it is dangerous (it breaks LocusIteratorByState). We'll bring it back to life when ReadTransformers are created. Instead, have the utility code return a new clipped SAMRecord (necessary so that we don't break SNP calling in UG when the indel caller tries to hard-clip the reads).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5629 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 18:47:47 +00:00
hanna 5849e112e1 Fix exception in block weighting minus function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5628 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 17:07:04 +00:00
hanna a36adf0c6b Request from the cancer team -- guarantee via javadoc that the returned
read metrics are actually a clone, which they can do with as they wish.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5626 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 15:10:46 +00:00
delangel 06b1497902 Corrected bad merge.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5625 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 15:02:09 +00:00
delangel 9134bf3129 Long-forgotten change I neglected to commit a while back: add ability for SelectVariants to extract either SNPs or indels from a combined VCF file. Not the ideal place to do it, but it's important to at least have something to split VCFs now that we call SNPs and indels together.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5624 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 14:58:44 +00:00
chartl 8e0d191a70 Added a walker to help sort out which samples in a region are giving signal. Lots of reused code that shouldn't be. Will refactor later.
Also fixed an "issue" with InsertSizeDistribution -- apparently for mate pairs, the first mate (karyotypically) will have a POSITIVE insert size, and the second a NEGATIVE insert size -- thus the insert size distribution was being conflated with enrichment/depletion of first-in-pair or second-in-pair reads. Gah.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5623 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 13:53:31 +00:00
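The sign convention behind the InsertSizeDistribution fix can be sketched in Python. The function name and sample values are hypothetical and this is not the GATK code; it assumes the two mates of a pair report the same insert size with opposite signs, so taking the magnitude removes the dependence on which mate was seen.

```python
def insert_size_magnitudes(signed_insert_sizes):
    """Return unsigned insert sizes, dropping zeros (hypothetical helper).

    Assumes the first mate reports +N and the second mate -N, and that
    0 means the insert size is unavailable.
    """
    return [abs(size) for size in signed_insert_sizes if size != 0]


# Two pairs with insert sizes 250 and 310, plus one read without mate info.
signed = [250, -250, 310, -310, 0]
magnitudes = insert_size_magnitudes(signed)
```

Averaging the signed values would mostly measure the balance of first and second mates, whereas the magnitudes describe the fragment lengths themselves.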
chartl efe6c539ac Re-enabling disabled test. Apparently T-tests are very picky about your using an unbiased variance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5622 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 03:05:50 +00:00
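A minimal Python sketch of why the variance estimator matters for a t statistic. The commit does not say which T-test variant is used; Welch's form is assumed here for illustration, and the function name and data are hypothetical.

```python
import math
import statistics


def welch_t_statistic(x, y):
    """Welch's t statistic using the unbiased (n - 1) sample variance."""
    var_x = statistics.variance(x)  # divides by len(x) - 1
    var_y = statistics.variance(y)  # divides by len(y) - 1
    standard_error = math.sqrt(var_x / len(x) + var_y / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / standard_error


cases = [1.0, 2.0, 3.0, 4.0, 5.0]
controls = [2.0, 4.0, 6.0, 8.0, 10.0]
t = welch_t_statistic(cases, controls)

# The biased (population) estimator divides by n and is always smaller,
# which inflates the magnitude of t if it is used by mistake.
biased = statistics.pvariance(cases)
unbiased = statistics.variance(cases)
```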
hanna 22a11e41e1 Rewrite of GATKBAMIndex to avoid mmaps causing false reports of heavy memory
usage.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5620 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-12 23:49:58 +00:00
chartl 36d8f55286 Use the 'standard' arcsine transform
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5619 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-12 23:11:45 +00:00
chartl 8125b8b901 Old changes to the exome VQSR search.
SGA updated to include new proportion-based insert size test.

Major fix for dichotomization test: MathUtils now optionally ignores NaN values for sums, averages, variances. In the future this feature can be pushed back into the AssociationContext object itself (e.g. no data? no entry), but it's kept like this for transparency for now.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5618 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-12 23:00:50 +00:00
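The optional NaN handling described for sums and averages can be sketched in Python as follows. The function names and the `ignore_nan` flag are hypothetical, not the MathUtils API.

```python
import math


def nan_aware_sum(values, ignore_nan=False):
    """Sum values; with ignore_nan=True, NaN entries are skipped."""
    values = list(values)
    if ignore_nan:
        values = [v for v in values if not math.isnan(v)]
    return sum(values)


def nan_aware_mean(values, ignore_nan=False):
    """Average values; returns NaN when nothing is left to average."""
    values = list(values)
    if ignore_nan:
        values = [v for v in values if not math.isnan(v)]
    if not values:
        return math.nan
    return sum(values) / len(values)


samples = [1.0, math.nan, 3.0]
```

Without the flag a single NaN propagates through the whole statistic; with it, the NaN entry is treated as missing data.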
rpoplin 30a19a00fe Fix for when running with EMIT_ALL_SITES but not GENOTYPE_GIVEN_ALLELES. Still want to emit a site even when over the deletion fraction for example.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5617 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-12 20:07:06 +00:00
delangel 488622041d Further trivial cleanup: Renamed DindelGenotypeLikelihoodsCalculationModel to IndelGenotypeLikelihoodsCalculationModel
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5616 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-12 18:00:48 +00:00
delangel 3b424fd74d Enable new indel likelihood model by default, cleanup code, remove dead arguments, still more cleanups to follow. This isn't final version but at least it performs better in all cases than previous Dindel-based version, so no reason to keep old one around.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5615 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-12 17:54:46 +00:00
depristo 9c36b0a39b Refactored read clipping framework into a generic utilities class, independent of ClipReadsWalker, which now uses this framework. Some more cleanup is really needed, as some of the arguments to the classes are really only useful for ClipReads
ReduceReadsWalker -- does consensus-based read compression, v2.  Does all of the consensus calculations within the ConsensusReadCompressor per sample, and multi-sample case is handled by MultiSampleConsensusReadCompressor.  For deeply covered data sets, this projects a significant reduction in the number of mapped reads.  Impact on analysis call quality tbd.  Expected to be relatively minor, as the system automatically detects regions without a strong consensus, and expands a window around these so that +/- 10bp of all reads are shown around the unclear sites.  Not usable yet -- as it does not yet support streaming output, and actually holds all reads in memory at once.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5610 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-10 13:55:05 +00:00
depristo 13c5f3322d Added argument to avoid writing 0 over all uncovered contigs, so you can just plot chrX, for example
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5609 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-10 13:50:21 +00:00
chartl de4eaa455e Squashing some bugs. Current implementation of AlignmentContextUtils.splitContextBySample() eliminates all sample meta data. Per Mark's request I'm working around this rather than fixing it -- the extender now maintains a mapping from sample id to sample object. Addition of a proportion test for large-insert-size reads, and slight refactoring of code to deal with bad window initialization of subclasses (e.g. chris forgot that constructors aren't inherited)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5608 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-09 21:07:52 +00:00
hanna b4b52cc0fe Reduce unnecessary repetitive accesses to the BAM index file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5607 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 19:28:14 +00:00
kshakir 0a58d7aa1a Marked boolean SAMFileWriterATD arguments as flags so scala generator maps them to Boolean instead of Option[Boolean].
Using the VCFWriterATD isCompressed to check if the VCF index will be auto generated.
Tracking BAM and Tribble indexes as @Inputs and @Outputs in generated QFunctions.
Updates to the BamGatherFunction to disable the index during merge when disable_bam_indexing = true.
Made a shortcut for live-running pipelinetest, pipelinetestrun.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5606 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 18:44:32 +00:00
depristo 866f4fd569 Test version of consensus compressing strategy. Cannot be used, and is being rewritten right now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5605 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 18:37:03 +00:00
droazen 80d547ae71 Fix for bug GSA-445: Sequence dictionary validation can be very slow with
large numbers of contigs. SequenceDictionaryUtils.getCommonContigsByName() was
running in O(n^2) time due to poor choice of data structure -- modified it to
run in O(n) time. Also removed an unnecessary O(n log n) step at another stage
in the sequence dictionary validation process. In tests with a 181,813-entry
sequence dictionary, runtime improved from an average of 21.4 minutes to 45.1
seconds.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5604 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 18:33:10 +00:00
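The kind of data-structure change described for GSA-445 can be sketched in Python. The commit does not say which structures were involved; this sketch assumes the slow version tested membership against a list and the fast version uses a hash set. Function names and contig names are hypothetical.

```python
def common_contigs_quadratic(names_a, names_b):
    # `name in names_b` scans a list, so the whole loop is O(n^2).
    return [name for name in names_a if name in names_b]


def common_contigs_linear(names_a, names_b):
    # Set membership is O(1) on average, so the loop is O(n).
    lookup = set(names_b)
    return [name for name in names_a if name in lookup]


dict_a = ["chr1", "chr2", "chrX"]
dict_b = ["chr2", "chrX", "chrY"]
shared = common_contigs_linear(dict_a, dict_b)
```

Both functions return the same contigs in the order of the first dictionary; only the cost of each membership test differs.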
ebanks 4f17004590 Allow walkers to enforce the ordering in which ReadFilters are applied (so that they're now done in the order specified in the walker). Useful if you have a computationally expensive filter (like adaptor clipping) that should only be applied to reads passing all other filters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5600 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 03:34:50 +00:00
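Why filter order matters can be shown with a short Python sketch. The filter functions and the read dictionaries are hypothetical stand-ins, not the ReadFilter interface.

```python
calls = {"expensive": 0}  # counts how often the costly filter runs


def is_mapped(read):
    """Cheap filter: keep only mapped reads."""
    return read["mapped"]


def survives_adaptor_check(read):
    """Stand-in for a computationally expensive filter."""
    calls["expensive"] += 1
    return read["insert_size"] >= read["length"]


def passes_all(read, filters):
    """Apply filters in the given order and stop at the first rejection."""
    return all(keep(read) for keep in filters)


reads = [
    {"mapped": False, "insert_size": 0, "length": 76},
    {"mapped": True, "insert_size": 300, "length": 76},
    {"mapped": True, "insert_size": 40, "length": 76},
]
ordered_filters = [is_mapped, survives_adaptor_check]
kept = [read for read in reads if passes_all(read, ordered_filters)]
```

Because `all` short-circuits, the expensive filter runs only for the two reads that pass the cheap one.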
ebanks 74755cfd1c Adding a ReadFilter to hard-clip out bases from adaptor sequences. This is actually slightly more correct than having it be part of LocusIteratorByState because it allows us to remove reads that are complete garbage (and there are definitely some) based on the insert sizes. However, although conceptually this is great, it doesn't actually work. 'Why?' you may ask. Because when we hard-clip reads it often changes their start positions... which means that reads are no longer passed to LocusIteratorByState in coordinate order... which makes it (understandably) barf all over the place (and makes for some really fascinating SNP calls). This took me forever to find. I'm going to bed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5598 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 03:15:58 +00:00
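The ordering failure described in this commit can be reproduced with a simplified Python model. It assumes each clipped leading base advances the alignment start by one position, which ignores indels; the names and coordinates are hypothetical.

```python
def hard_clip_start(alignment_start, leading_clipped_bases):
    """New start after hard-clipping bases from the front of a read."""
    return alignment_start + leading_clipped_bases


def is_coordinate_sorted(starts):
    """True if every start is at or after the previous one."""
    return all(a <= b for a, b in zip(starts, starts[1:]))


# (alignment start, number of leading adaptor bases to clip)
reads = [(100, 10), (105, 0), (120, 0)]
starts_before = [start for start, _ in reads]
starts_after = [hard_clip_start(start, clip) for start, clip in reads]
```

The first read moves from 100 to 110 and now follows a read starting at 105, so a consumer that expects coordinate order sees an out-of-order stream.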
hanna fece2167b3 Prototype implementation of protoshard merging when protoshard n and protoshard
n+1 completely overlap.  Gives a small but consistent performance increase in 
non-intervaled whole exome traversals (2.79min original, 2.69min revised). 
Needs a more in depth analysis of optimal shard sizing to determine a true
optimum.

Also renamed a variable because Khalid disapproved of my naming choices.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5595 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 02:09:14 +00:00
hanna 32d502c122 Enable BAM OTF index writing by default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5594 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 23:44:25 +00:00
droazen cb3e8aec5e Modified the buildfile and help extractor doclet so that help text is only
extracted from source files that have been modified since the help resource
file was last generated. This significantly speeds up builds where only a few 
source files have been modified, at the expense of making clean builds take 
slightly longer. Here's some performance data gathered by testing the old and 
new versions of extracthelp in isolation and averaging across 10 runs:
old extracthelp, 1 modified source file: 20.1 seconds
new extracthelp, 1 modified source file: 7.2 seconds <-- woohoo! :)
old extracthelp, clean build: 17.8 seconds
new extracthelp, clean build: 20.5 seconds


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5590 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 18:40:53 +00:00
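The incremental selection described in this commit can be sketched in Python. The function name, file names, and timestamps are hypothetical; the real logic lives in the buildfile and doclet.

```python
def sources_needing_extraction(source_mtimes, resource_mtime):
    """Return the sources modified after the help resource was generated.

    source_mtimes maps a path to its modification time; resource_mtime
    is None when the help resource file does not exist yet.
    """
    if resource_mtime is None:
        return sorted(source_mtimes)
    return sorted(
        path for path, mtime in source_mtimes.items() if mtime > resource_mtime
    )


mtimes = {"WalkerA.java": 1000.0, "WalkerB.java": 2500.0, "WalkerC.java": 1999.0}
stale = sources_needing_extraction(mtimes, resource_mtime=2000.0)
clean_build = sources_needing_extraction(mtimes, resource_mtime=None)
```

On a clean build every source is selected, so the timestamp comparison is pure overhead there, consistent with the slightly slower clean-build timing reported in the commit message.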
ebanks af09170167 As I threatened yesterday, I've moved the various and disparate randomization code out of the walkers. Now they all (except VQSRv1, whose days are numbered anyways) use a static generator available in the engine itself. Please use this from now on. The seed is reset before every individual integration test is run. I think there may still be an issue with the IndelRealigner but I need to confirm with the commit to see what testNG does. Integration tests are already broken anyways, so no big deal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5589 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 17:03:48 +00:00
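A single shared generator whose seed is reset before each test can be sketched in Python. The seed value and function names are hypothetical, not the engine's API.

```python
import random

DEFAULT_SEED = 47382911  # hypothetical fixed seed
_generator = random.Random(DEFAULT_SEED)


def get_random_generator():
    """Return the one generator shared by all callers."""
    return _generator


def reset_random_generator():
    """Restore the initial seed, e.g. before each integration test."""
    _generator.seed(DEFAULT_SEED)


reset_random_generator()
first_run = [get_random_generator().random() for _ in range(3)]
reset_random_generator()
second_run = [get_random_generator().random() for _ in range(3)]
```

Resetting the seed makes each test see the same random sequence regardless of what ran before it, which is the property that makes integration test results reproducible.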
kshakir 45ebbf725c Instead of always merging Picard interval files they are optionally merged by Sting Utils.
Disabled the MFCP while the FCP gets an update.
Minor updates to email messages for upcoming scala 2.9.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5588 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 21:12:05 +00:00
carneiro 89bb21d024 typo in the argument description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5587 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:45:32 +00:00
rpoplin 3f3f35dea0 UnifiedGenotyper now BAQs via ADD_TAG to facilitate using BAQed quals for GL calculations but unBAQed quals for annotation calculations. UnifiedGenotyper now produces SNP and indel calls simultaneously. 40 base mismatch intrinsic filter removed from UG to greatly simplify the code. RankSumTests are now standard annotations but the integration tests are commented out pending changes that will allow random annotations to work.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5585 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:06:24 +00:00
ebanks 1aa4083352 Fortunately this code isn't used by anyone right now, but it needs to be fixed before someone unwittingly does: flags were wrong according to the SAM spec.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5584 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 17:16:41 +00:00
hanna b231a40da5 Augment PrintLocusContextWalker with extended event info.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5583 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 13:42:48 +00:00