Commit Graph

969 Commits (b316c9a5909bf29301e9d4643c97d86e80736a2a)

Author SHA1 Message Date
depristo b316c9a590 Renamed StratifyAlignmentContext to AlignmentContextUtils, and StatiefyContextType to ReadOrientation. Also, went through the system and deleted all references to second bases. That ship passed long ago. This was the actual commit; the last was an IntelliJ error.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5564 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-03 15:36:17 +00:00
rpoplin 09e89c8c97 Adding ReadPos rank sum test. Transitioned rank sum tests over to using Chris's implementation in order to harmonize the codebase. There isn't any reason to have competing implementations of rank sum. Thanks to Chris for adding the necessary hypothesis testing options. WilcoxonRankSum.java will be deleted soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5559 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-02 22:26:35 +00:00
asivache df53351b0f Get rid of score cutoff at 0 in the alignment matrix (i.e. score[cell] = max(0, score[from_parent_cells])). Use the computed score as is. Technically, it's pretty much NW now, not SW.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5548 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 00:11:04 +00:00
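Note on df53351b0f: the difference between the clamped and unclamped cell update can be sketched as below. This is a minimal illustration, not the code from the commit. The class name, the scoring constants, the linear gap penalty, and the zero-initialized first row and column are all assumptions made for the example.

```java
public class ScoreClampSketch {
    // Hypothetical scoring values, chosen only for illustration.
    static final int MATCH = 2;
    static final int MISMATCH = -3;
    static final int GAP = -4;

    /** Fills the matrix; clampAtZero = true applies the old max(0, ...) rule. */
    static int[][] fill(String a, String b, boolean clampAtZero) {
        int[][] score = new int[a.length() + 1][b.length() + 1];
        // Assumption: row 0 and column 0 stay at 0 in both modes.
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int best = Math.max(score[i - 1][j - 1] + sub,
                        Math.max(score[i - 1][j] + GAP, score[i][j - 1] + GAP));
                // SW-style: never let a cell go negative. NW-style: keep the score as is.
                score[i][j] = clampAtZero ? Math.max(0, best) : best;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(fill("A", "G", true)[1][1]);  // clamped cell
        System.out.println(fill("A", "G", false)[1][1]); // computed score as is
    }
}
```

With the clamp removed, a mismatching cell keeps its negative score instead of resetting to 0, which is why the message describes the result as closer to NW than SW.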
chartl 328f89f66a Minor changes to MannWhitneyU:
- Comment fixes to better explain why two-sided test wants to use the LOWER (not higher) value for U
 - Much more direct testing of MWU functions
 - Uniform approximation was always using the < cumulant (sometimes the > cumulant should be used instead)
 - Uniform approximation currently not used (regime in which it was being used was not the right one -- not necessarily bad, but not an improvement over normal)
    + this particular approximation is for major imbalances of the form m >> n. Code may be altered in the future to use this method for this particular regime, if the method's not too slow.
 - Hook into one-sided test.

RegionalAssociationRecalibrator: NaNs were being caused by presence of Infinity and -Infinity values out of the walker. Currently I'm just re-setting them to arbitrary post-whitened values, but the walker will be changed to prevent output of these values, and the "fix" will be undone.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5539 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-30 17:03:02 +00:00
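Note on 328f89f66a: the point about the two-sided test using the lower U can be illustrated with a small sketch. The two U values always sum to n*m, so they mirror each other, and the two-sided test conventionally works from the smaller one. The class and method names here are hypothetical, and the sketch assumes there are no ties between the two samples; it is not the MannWhitneyU implementation from the repository.

```java
public class TwoSidedUSketch {
    /** Counts pairs (x_i, y_j) with x_i > y_j. Assumes no ties between samples. */
    static long uOfFirst(double[] x, double[] y) {
        long u = 0;
        for (double xi : x) {
            for (double yj : y) {
                if (xi > yj) {
                    u++;
                }
            }
        }
        return u;
    }

    /** Returns the lower of the two U statistics, as used by a two-sided test. */
    static long twoSidedU(double[] x, double[] y) {
        long uX = uOfFirst(x, y);
        // With no ties, U_x + U_y = n * m, so the second value follows directly.
        long uY = (long) x.length * y.length - uX;
        return Math.min(uX, uY);
    }

    public static void main(String[] args) {
        double[] x = {1, 4};
        double[] y = {2, 3};
        System.out.println(twoSidedU(x, y));
    }
}
```

Swapping the two samples gives the same result, which is the property a two-sided test needs.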
chartl f6dfdc7f3b Single-tailed hypothesis testing in MWU
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5533 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 15:53:40 +00:00
depristo 231d095316 A clean, fast way to compute fragment pileups. Now consumes no CPU time at all. Ready for general use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5524 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 14:26:29 +00:00
depristo 6a1d12cf7b Intermediate commit refactoring FragmentPileup to (1) make it more accessible (now in utils.pileup) as well as (2) improve performance. Passes all integration tests now. Upcoming refactoring will change further how the system can be accessed, and further improve performance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5522 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 12:42:22 +00:00
depristo 27c8fb1e4d Added support for a general GATK option --simplifyBAM to automatically remove and simplify kept reads in an output BAM file. Specifically, duplicate, non-PF, and unmapped reads are removed, and all extended tags in the retained SAM records are removed except the RG:Z tag. This option is very useful when creating temporary BAM files (merged per-population or multi-sample cleaned) for future calling (as in the 1000G processing pipeline). Results in a significant reduction in space of the resulting BAM, faster reading of the BAM, and surprisingly even faster UG performance:
1-10mb of chromosome one, from NA12878 HiSeq 64x data set on hg18:

Full BAM:
Write time: 8.6 m
Size: 866M
CountReads time: 2.9 m
UG time: 11.3 m

Simplified BAM:
Write time: 6.2 m
Size: 458M
CountReads time: 85.7 s
UG time: 10.1 m


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5517 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 01:21:35 +00:00
kshakir fc8acd503e Enabled the parameterize option for debugging PipelineTest MD5s.
Fixed escaping expressions that have more than one space between arguments.
Updated example to match the wiki.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5516 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 00:41:47 +00:00
ebanks 18271aa1f4 It never fails to amaze me that aligners can find so many different ways to place indels off the ends of contigs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5503 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 04:17:23 +00:00
chartl 5a79f16ea4 Fixed an edge case where an exception was thrown if either of the sets was empty for the MWU test. Also altered the output format so U itself is not printed (which though interesting, isn't so useful for recalibration), but rather a value I call V (really the deviation of U from its expectation).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5490 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 16:28:44 +00:00
ebanks 1c95208e26 Finally found the bug that everyone is reporting on GS. Iterators on PriorityQueues aren't guaranteed to return elements in sorted order (a pretty stupid contract) - so we were passing items to the constrained writer out of order. Just do a Collections.sort instead (1 line of code). Happy father's day!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5476 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 21:28:19 +00:00
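Note on 1c95208e26: java.util.PriorityQueue only guarantees ordering through poll()/peek(); its iterator walks the backing heap in no particular order. A minimal sketch of the sort-before-use pattern the message describes follows. The class and method names are hypothetical and this is not the code from the commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class PriorityQueueOrderSketch {
    /** Copies the queue's contents and sorts them, leaving the queue untouched. */
    static List<Integer> sortedSnapshot(PriorityQueue<Integer> queue) {
        // new ArrayList<>(queue) uses the queue's iterator, so the copy is
        // NOT guaranteed to be in sorted order at this point.
        List<Integer> items = new ArrayList<>(queue);
        Collections.sort(items);
        return items;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(Arrays.asList(5, 1, 4, 2, 3));
        System.out.println("iteration order: " + new ArrayList<>(queue));
        System.out.println("sorted snapshot: " + sortedSnapshot(queue));
    }
}
```

An alternative is to drain the queue with repeated poll() calls, which does return elements in priority order but empties the queue.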
depristo 22ff2573d5 Removed MAG entirely
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5474 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:43:23 +00:00
kshakir b2b8a4f19f Re-un-final'ed BAQ.MAG as it was pre r5469.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5472 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:40:31 +00:00
depristo 7857cb5a22 Waiting to go to the hospital -- fixed a bug in the BAQ calculation where the BAQ would NPE if a read had no usable bases (all clipped, for example) but didn't fail the PF filter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5469 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 17:45:21 +00:00
depristo 6281c1db6f A nicer error (UserException now) for malformed genome locs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5465 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 02:58:29 +00:00
depristo c1798a7dbc Whitespace cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5460 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 18:04:08 +00:00
carneiro e2e435d52c GenotypeAndValidate: now looks at annotations in the INFO field instead of filter field. Better output and filters repetitive calls to indel extended events.
IndelUtils: added an isInsideExtendedIndel() method to filter the above-mentioned calls.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5449 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:54:40 +00:00
carneiro 4b9b767eb1 SelectVariants: now keeps the YAML stuff internal... it's there if you wanna use it, but won't be published anymore. Official parameter is the string for now.
VariantEval: now sports the new MendelianViolation utility class.
MendelianViolationClassifier: I noticed I had broken chartl's walker by changing VariantEval, so I took the liberty of modifying it to use the new library too. I kept modifications to a minimum; I could have gone for full integration if this is a useful tool, but since it's in oneoffs, I decided not to go all out.
MendelianViolation: Some getter methods were added for chartl and VariantEval.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5447 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 18:36:55 +00:00
rpoplin 2a2538136d A version of VQSRv2 that does contrastive clustering in two passes. The walkers will be renamed when they are moved to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5443 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:03:56 +00:00
carneiro 33c7593218 YAML integrated mendelian violation utility class, integrated and tested through select variants. Wiki is updated.
ps: I moved it out of tribble. If you think it should reside in a different place, just yell at me.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5436 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 16:43:37 +00:00
depristo b99e27bf9b In the process of optimizing ProduceBeagleInputWalker, discovered that the GenotypeLikelihoods, the UG, and Genotype objects were using old-style GL tags internally, and then converting from Likelihoods -> GL String -> Likelihoods -> PL String throughout the GATK. It was both painful and led to convoluted code throughout the system. Removed everything but GL conversion -> PL in the GenotypeLikelihoods objects, and now all of the code in UG immediately provides GenotypeLikelihoods to the Genotype objects, which is converted straight to PL. Resulted in a 30% speed up in ProduceBeagleLikelihoods, passes integration tests without any modifications, and likely speeds up writing any VCFs with likelihoods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5432 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-13 00:07:51 +00:00
depristo d01d4fdeb5 Optimized version of produce beagle tool, along with experimental (hidden) support for combining likelihoods depending on estimated false positive rate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5430 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-12 02:06:28 +00:00
delangel b03055099a a) Changed the way we classify and log indel events (e.g. in IndelClasses table inside IndelStatistics VE module). Made names clearer, and split logging of event length with number of repetitions of event.
b) Add an experimental annotation to log indel type string inside the INFO field, just for debugging/temp analysis purposes (will consider making it standard if it proves useful). 



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5424 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-11 17:37:41 +00:00
depristo ccc773d175 Refactoring, cleanup, and performance improvements to ProduceBeagleInput. It's really a shame that there's no integration tests...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5418 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-11 13:55:30 +00:00
ebanks 4baeb5979f It turns out that Math.log10() can return 0, which leads to QUALs being set to -0, which is off-spec.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5415 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-10 03:08:56 +00:00
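Note on 4baeb5979f: the negative zero arises because multiplying +0.0 by a negative factor yields -0.0 in IEEE 754 arithmetic. The sketch below assumes a phred-style conversion of the form -10 * log10(p); the actual QUAL computation in the codebase is not shown in this log, and the class and method names are hypothetical.

```java
public class NegativeZeroQualSketch {
    /** Assumed phred-style conversion; yields -0.0 when p == 1.0. */
    static double rawQual(double p) {
        return -10.0 * Math.log10(p);
    }

    /** Same conversion, with negative zero normalized to positive zero. */
    static double normalizedQual(double p) {
        double q = rawQual(p);
        // -0.0 == 0.0 evaluates to true, so this catches both signs of zero.
        return q == 0.0 ? 0.0 : q;
    }

    public static void main(String[] args) {
        System.out.println(String.format("%.2f", rawQual(1.0)));
        System.out.println(String.format("%.2f", normalizedQual(1.0)));
    }
}
```

Comparing with == is enough here because Java treats the two zeros as equal in numeric comparison, even though they print differently.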
ebanks 3596c56602 New attempt at the constrained movement version of the indel realigner (I've kept around the old writer for now). The new contract is that the realigner must ask permission before trying to clean an area; permission will be denied by the CM-Manager if it was required to flush its cache of reads because of too much depth within a distance of maxInsertSizeForMovingReadPairs. Added integration tests to cover different max cache sizes, including an expected exception when too small a value is chosen. The actual logic changes were fairly minor - much of this commit is really just some cleanup. I'd like to throw 1000G Phase I at it, but will respectfully wait for Ryan to hit his deadline before doing so.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5414 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-10 02:48:29 +00:00
rpoplin 509daac9f7 Minor bug fix in k-means implementation. Updating VQSR integration tests in preparation for VQSRv2 by removing some unused features such as VariantDatum.weight and ti/tv cutting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5410 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-09 00:26:28 +00:00
chartl 1b310401fe Due to the approximation not being well-founded in this case (and the non-existence of a pre-computed table at this time), pushing up the cutoff
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5405 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-08 16:24:42 +00:00
chartl 77fe902dbd Testing modules now use wider windows and heftier shift, hopefully this will remove some of the noisiness of the results. Some UStatistics were changed to TStatistics to try and limit noisiness as well. Walker will also additionally write out wiggle files directly (which can be converted into "proper" tdf files via igvtools tile [args] [in].wig [out].tdf [ref]) subject to some restrictions. MWU could get stuck in a long-running recursive regime, it'd be nice to have a table lookup or a good small-n large-m approximation, for now the uniform should work just fine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5403 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-08 15:26:13 +00:00
hanna 85ff983a59 Failed to include some required GenomeLoc utilities in my last commit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5397 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 23:00:17 +00:00
delangel 8c262eb605 Initial commit of new likelihood model to evaluate indel quality. The principle is simple: a plain Pair HMM with affine gap penalties (in log space) that does quasi-local alignment between reads and candidate haplotypes and which in theory should be more solid and more reliable than the older Dindel-based model. It should also be easy to extend in the future if we decide to introduce context-dependent and/or read-dependent gap penalties.
Model is disabled by default and we're still using the old Dindel model until I'm more confident that new model is a definitive improvement, so right now this is enabled by hidden command line arguments, and it's not to be used yet.

In detail:
a) Several refactorings to make softMax() available to other modules, so it's now part of MathUtils.
b) Refactored a couple of read utilities and moved from BAQ to ReadUtils.
c) New PairHMMIndelErrorModel class implementing new likelihood model
d) Several new hidden debug arguments in UAC.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5389 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 15:31:58 +00:00
chartl 60ddc08cdf Added a boatload of new case-control association modules. Switched the U-test to use longs rather than ints (it just so happened that I overflowed and started getting negative U statistics. Not good.) Added the ALL association type for ease of specifying that we want to throw the book at something. Added an svn-commit.tmp~ because I can't get rid of it even with --force. Hopefully I can remove it after.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5386 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 21:58:12 +00:00
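Note on 60ddc08cdf: a U statistic can be as large as n*m, which exceeds Integer.MAX_VALUE once both samples reach tens of thousands of observations. The sketch below shows the wraparound and the long-based fix. The class and method names are hypothetical, and it is an illustration of the overflow rather than the code from the commit.

```java
public class UStatisticOverflowSketch {
    /** Upper bound on U computed in int arithmetic; wraps around silently. */
    static int maxUAsInt(int n, int m) {
        return n * m;
    }

    /** Same bound computed in long arithmetic; the cast must come before the multiply. */
    static long maxUAsLong(int n, int m) {
        return (long) n * m;
    }

    public static void main(String[] args) {
        System.out.println(maxUAsInt(50000, 50000));
        System.out.println(maxUAsLong(50000, 50000));
    }
}
```

Note that (long) (n * m) would not help, since the multiplication would already have overflowed as an int before the widening.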
depristo af71576a07 CalculateChromosomeCounts() now only calculates AC, AF, and AN when there are genotypes. Can now combine variants with headers that differ in only whether a field is an integer or a float. Updated CombineVariants integrationtest, as incorrect AC values were being calculated in the previous GS outputs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5383 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 19:25:52 +00:00
chartl a40a8006b5 Added in unit tests for the statistics calculated by the test runner; and bug-fixes to the calculations; so we have some assurance that the statistics coming out the back-end are correct.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5380 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 16:54:02 +00:00
hanna c40efe1dea Fixed exception for BAMs without filenames (unit tests, BAM input streaming, etc.).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5379 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 13:43:49 +00:00
depristo ad51f30244 A trivial, but useful, sum of a list of integers
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5378 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-05 06:09:05 +00:00
chartl 9ca1dd5d62 Miscellaneous changes:
- RefMetaDataTracker: grabbing variant contexts given a prefix (not sure where else this was implemented, if someone can show me I'll remove it)
 - VCFUtils: grabbing VCF headers given a prefix 
 - MathUtils: Useful functions for calculating statistics on collections of Numbers
 - VariantAnnotator: Made isUniqueHeaderLine a public static method -- maybe this should go into a different class. Not sure.
 - Associations: PluginManager now used to propagate classes, implementations for Z,T,U tests, slight alteration to format to make the objects stored
      in the window optionally different from those returned by whatever statistic is run across the window
Added:
 - MannWhitneyU. Started to fix up WilcoxonRankSum but there are comments in there questioning the validity of some of the code, and I'm sure that
    it's actually doing a U test. This implementation includes the direct calculation of p-values for small sample sizes, and a uniform approximation
    for when one of the sample sets is small, and the other large. Unit tests to follow.
 - BootstrapCallsMerger: takes n VCFs which have been called on the same samples; merges them together while averaging the annotations
 - BootstrapCalls.q: qscript for testing the effectiveness of bootstrap low-pass calling on the exome
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5372 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 22:43:36 +00:00
hanna 7a22f19366 More descriptive error when VerifyingSamIterator hits an inconsistent alignment. Also updated case UserException.MalformedBAM to match case of UserException.MissortedBAM for consistency and ease-of-use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5364 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 03:55:24 +00:00
ebanks 660998065b 'Okay, now I'm absolutely certain that there are no more bugs in the constrained writer.'
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5353 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 03:48:40 +00:00
asivache 570186fa42 Added (deep) clone() and merge() to the RunningAverage utility class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5350 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 00:35:23 +00:00
chartl 0723b0f44c Generalized association is now working. Output is in a horrific format. Implementation of T-testing. Improvements are to look for classes dynamically (a la VariantEval/VariantAnnotator), beautify output, and do optimizations where they exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5341 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 01:23:37 +00:00
delangel d059d89a9d Fixes and cleanups for indel eval module. Also outputs AT/CG ratio in dedicated column in IndelStatistics.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5332 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 12:07:50 +00:00
ebanks 05fac8583d Following up Mark's recent commit: hooking up the --maxPositionalMoveAllowed argument into the indel realigner and through to the SAM writer. We now ensure that no read is realigned more than N bases (200 by default, which is nowhere close to realistically possible). If anyone ever sees a warning message about this with the default value then please let me know because I need to see it for myself.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5331 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 04:40:54 +00:00
depristo 1dedfdb11b Fixes for constrained movement Indel Realigner. Now sorts all of the reads in the interval before handing them to ConstrainedMateFixingSAMFileWriter to maintain correct contract between the two pieces of software
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5329 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 03:52:18 +00:00
ebanks 5d28cbda27 When crossing contigs it's crucial that the queue get flushed or else it will continue to accumulate reads without emitting. This is the last time I trust someone when they tell me that they are 'confident there are no bugs' in a tool.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5315 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 05:18:30 +00:00
rpoplin 1129f1535d Fix for the HaplotypeScore optimization in AlignmentUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5310 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 20:40:18 +00:00
rpoplin 255cc246a2 Change in Methods development pipeline: dbsnp130 can't be used for anything, changed it to dbsnp129. Optimization for HaplotypeScore and the to-be-committed ReadPosRankSumTest in AlignmentUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5301 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 16:09:03 +00:00
ebanks 93888e570b Phase 2: after hours of testing, confirming that constrained mode looks good so moving the integration tests over to use it. Some cleanup. More cleanup coming in Phase 3.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5298 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 06:23:41 +00:00
depristo 1a5d296737 ReplaceReadGroups. Fixes BAM files without read group info. MissingReadGroup points people to this tool now. Please point users on the forum to this tool now. Will migrate to Picard.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5284 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-21 14:02:41 +00:00