After viewing results on real case/control data from RAW, it's really working quite well. ReadIndels, however, needs to use a T-test rather than a U-test, especially in deep coverage (at indel sites, the reads with indels will mostly have the same number of CIGAR indel elements -- one -- which doesn't play nicely with the U-test when sample sets are large). Modified ReadsLargeInsertSize to be a two-way test (e.g. ReadsLarge and ReadsSmall). BaseQualityScore suffers from the same issue as read indels, so it is switched over to a T-test as well.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5653 348d0f76-0448-11de-a6fe-93d51630548a
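A minimal sketch of a T statistic on tie-heavy data like the per-read indel counts described above. It is written in Python for brevity rather than in the project's own code, and it assumes Welch's unequal-variance form, which the commit message does not specify. The function name and the sample values are hypothetical.

```python
import math
import statistics


def welch_t_statistic(case, control):
    """Welch's t statistic for two samples (assumed variant, see lead-in).

    Returns NaN when both samples have zero variance, since the
    statistic is undefined in that case.
    """
    mean_diff = statistics.mean(case) - statistics.mean(control)
    # Squared standard error of the difference in means.
    se_squared = (statistics.variance(case) / len(case)
                  + statistics.variance(control) / len(control))
    if se_squared == 0:
        return float("nan")
    return mean_diff / math.sqrt(se_squared)


# Hypothetical per-read indel counts: almost every value is 1,
# so a rank-based test would see mostly ties.
case_counts = [1.0, 1.0, 1.0, 2.0]
control_counts = [1.0, 1.0, 1.0, 1.0]
t_value = welch_t_statistic(case_counts, control_counts)
```

The statistic works on the means and variances directly, so heavily tied values do not need any rank correction.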
Scala type inference for the implicit return types on implicit methods was a little too much for poor IntelliJ IDEA to handle, and it was breaking things like copy/paste, auto-complete, etc.
Also updated the Queue package to include all Sting utils.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5646 348d0f76-0448-11de-a6fe-93d51630548a
+ UG now doesn't care whether it's given SNPs or indels to genotype, it will do the right thing -- so remove the option to specify which GM the user wants
+ Max mismatches argument removed
integration test will follow
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5638 348d0f76-0448-11de-a6fe-93d51630548a
Switched YAML parser to new Broad parser which will additionally update picard cleaned bams to the latest version if the project and sample are specified.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5634 348d0f76-0448-11de-a6fe-93d51630548a
read metrics are actually a clone, which they can do with as they wish.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5626 348d0f76-0448-11de-a6fe-93d51630548a
Also fixed an "issue" with InsertSizeDistribution -- apparently for mate pairs, the first mate (karyotypically) will have a POSITIVE insert size, and the second a NEGATIVE insert size -- thus the insert size distribution was being conflated with enrichment/depletion of first-in-pair or second-in-pair reads. Gah.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5623 348d0f76-0448-11de-a6fe-93d51630548a
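A minimal sketch of the sign issue described above, in Python for brevity. It assumes the fix is to compare insert size magnitudes, which is one plausible approach and not necessarily what this commit does. The helper name and the sample values are hypothetical; the zero check relies on the SAM convention that an unavailable template length is recorded as 0.

```python
def insert_size_magnitudes(insert_sizes):
    """Strip the mate-dependent sign from insert sizes.

    Reads with an insert size of 0 (no usable mate information)
    are dropped so they do not skew the distribution.
    """
    return [abs(size) for size in insert_sizes if size != 0]


# Hypothetical values: each pair contributes +N for the first mate
# and -N for the second, plus one read with no insert size.
signed_sizes = [250, -250, 310, -310, 0]
magnitudes = insert_size_magnitudes(signed_sizes)
```

With the sign removed, both mates of a pair contribute the same value, so the distribution no longer tracks enrichment of first-in-pair or second-in-pair reads.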
SGA updated to include new proportion-based insert size test.
Major fix for dichotomization test: MathUtils now optionally ignores NaN values for sums, averages, variances. In the future this feature can be pushed back into the AssociationContext object itself (e.g. no data? no entry), but it's kept like this for transparency for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5618 348d0f76-0448-11de-a6fe-93d51630548a
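A minimal sketch of optional NaN skipping for sums, averages, and variances, in Python for brevity. The function names and the `ignore_nan` flag are hypothetical and are not the MathUtils signatures; the sample-variance (n - 1) denominator is also an assumption.

```python
import math


def _kept(values, ignore_nan):
    """Return the values to aggregate, dropping NaNs when requested."""
    if not ignore_nan:
        return list(values)
    return [v for v in values if not math.isnan(v)]


def nan_sum(values, ignore_nan=True):
    return sum(_kept(values, ignore_nan))


def nan_average(values, ignore_nan=True):
    kept = _kept(values, ignore_nan)
    # No usable data: report NaN rather than raising.
    return sum(kept) / len(kept) if kept else float("nan")


def nan_variance(values, ignore_nan=True):
    kept = _kept(values, ignore_nan)
    if len(kept) < 2:
        return float("nan")
    mean = sum(kept) / len(kept)
    # Sample variance over the values that were kept.
    return sum((v - mean) ** 2 for v in kept) / (len(kept) - 1)


data = [1.0, float("nan"), 3.0]
average_skipping_nan = nan_average(data)
average_with_nan = nan_average(data, ignore_nan=False)
```

Keeping the flag on each call, rather than filtering upstream, matches the transparency argument in the message: the caller can see where missing data is being dropped.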
Implicit conversions for String to/from File.
Small updates to the example QScripts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5614 348d0f76-0448-11de-a6fe-93d51630548a
ReduceReadsWalker -- does consensus-based read compression, v2. Does all of the consensus calculations within the ConsensusReadCompressor per sample, and multi-sample case is handled by MultiSampleConsensusReadCompressor. For deeply covered data sets, this projects a significant reduction in the number of mapped reads. Impact on analysis call quality tbd. Expected to be relatively minor, as the system automatically detects regions without a strong consensus, and expands a window around these so that +/- 10bp of all reads are shown around the unclear sites. Not usable yet -- as it does not yet support streaming output, and actually holds all reads in memory at once.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5610 348d0f76-0448-11de-a6fe-93d51630548a
Using the VCFWriterATD isCompressed to check if the VCF index will be auto generated.
Tracking BAM and Tribble indexes as @Inputs and @Outputs in generated QFunctions.
Updates to the BamGatherFunction to disable the index during merge when disable_bam_indexing = true.
Made a shortcut for live-running pipelinetest, pipelinetestrun.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5606 348d0f76-0448-11de-a6fe-93d51630548a
large numbers of contigs. SequenceDictionaryUtils.getCommonContigsByName() was
running in O(n^2) time due to poor choice of data structure -- modified it to
run in O(n) time. Also removed an unnecessary O(n log n) step at another stage
in the sequence dictionary validation process. In tests with a 181,813-entry
sequence dictionary, runtime improved from an average of 21.4 minutes to 45.1
seconds.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5604 348d0f76-0448-11de-a6fe-93d51630548a
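A minimal sketch of the complexity change described above, in Python for brevity. The commit does not say which data structure was replaced, so the list-versus-hash-set contrast here is an assumption, and both function names are hypothetical.

```python
def common_contigs_quadratic(names_a, names_b):
    """O(n^2): each membership test scans the whole second list."""
    return {name for name in names_a if name in list(names_b)}


def common_contigs_linear(names_a, names_b):
    """Expected O(n): build a hash set once, then do constant-time lookups."""
    lookup = set(names_b)
    return {name for name in names_a if name in lookup}


# Hypothetical contig names from two sequence dictionaries.
dict_a = ["chr1", "chr2", "chrM"]
dict_b = ["chr2", "chr1", "chrX"]
shared = common_contigs_linear(dict_a, dict_b)
```

Both versions return the same set; only the cost of each lookup differs, which is what matters for a dictionary with hundreds of thousands of entries.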