1098 Commits (af09170167fa094fc43563848f92fc2fb3c3ae75)

Author SHA1 Message Date
ebanks af09170167 As I threatened yesterday, I've moved the various and disparate randomization code out of the walkers. Now they all (except VQSRv1, whose days are numbered anyways) use a static generator available in the engine itself. Please use this from now on. The seed is reset before every individual integration test is run. I think there may still be an issue with the IndelRealigner but I need to confirm with the commit to see what testNG does. Integration tests are already broken anyways, so no big deal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5589 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 17:03:48 +00:00
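The shared-generator idea in commit af09170167 can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the class name `SharedRandom`, its method names, and the seed value are all hypothetical.

```java
import java.util.Random;

/** Hypothetical stand-in for an engine-level random generator shared by all callers. */
public final class SharedRandom {
    // Arbitrary fixed seed chosen for this sketch; the real seed is not given in the log.
    private static final long SEED = 12345L;
    private static Random generator = new Random(SEED);

    private SharedRandom() {}

    /** Returns the single generator that every caller is expected to use. */
    public static Random get() {
        return generator;
    }

    /** Restores the initial state; a test harness would call this before each test. */
    public static void reset() {
        generator = new Random(SEED);
    }
}
```

Because every test starts from the same seed, a test that draws random numbers should see the same sequence no matter which tests ran before it.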
kshakir 45ebbf725c Instead of always merging Picard interval files they are optionally merged by Sting Utils.
Disabled the MFCP while the FCP gets an update.
Minor updates to email messages for upcoming scala 2.9.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5588 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 21:12:05 +00:00
rpoplin 3f3f35dea0 UnifiedGenotyper now BAQs via ADD_TAG to facilitate using BAQed quals for GL calculations but unBAQed quals for annotation calculations. UnifiedGenotyper now produces SNP and indel calls simultaneously. 40 base mismatch intrinsic filter removed from UG to greatly simplify the code. RankSumTests are now standard annotations but the integration tests are commented out pending changes that will allow random annotations to work.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5585 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:06:24 +00:00
ebanks 4b451314b2 Only store a read in the mate hash if it could possibly be moved. This reduces memory consumption especially when dealing with a case of tons of unmapped reads at the end of the bam; however, it's only mildly helpful for chr1 of the Papuans (there's a truly massive pileup 120Mb into it; more thought needed at a later point). Integration tests changed only because some of the reads in the original bam were busted to begin with (it's an old pilot 1000G bam).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5580 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-05 22:20:09 +00:00
chartl 79b5fa6cc5 Structural refactoring in advance of dichotomization statistics; generalization of statistical test infrastructure.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5579 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-05 18:52:32 +00:00
chartl bb6a30611c Forgot to modify the test too. What a bad commit. Sorry guys.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5575 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-05 02:11:08 +00:00
droazen db9908ec02 Small correction to the unit test code from my last commit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5572 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-04 18:55:38 +00:00
droazen a5acb0b7a6 Fix for bug GSA-314: Detect -XL and -L incompatibility. An ArgumentException is
now thrown if the combination of -L and -XL intervals specified on the command 
line results in an empty interval set after set subtraction. 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5571 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-04 18:41:55 +00:00
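The check described in commit a5acb0b7a6 amounts to subtracting the excluded intervals from the included ones and rejecting an empty result. The sketch below assumes closed integer intervals on a single contig and uses `IllegalArgumentException` as a stand-in for the ArgumentException named in the message; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public final class IntervalSubtraction {
    /** Closed interval [start, end] on one contig (a simplification for this sketch). */
    public static final class Interval {
        public final int start;
        public final int end;
        public Interval(int start, int end) { this.start = start; this.end = end; }
    }

    /** Returns include minus exclude, or throws if nothing is left. */
    public static List<Interval> subtract(List<Interval> include, List<Interval> exclude) {
        List<Interval> current = new ArrayList<>(include);
        for (Interval x : exclude) {
            List<Interval> next = new ArrayList<>();
            for (Interval i : current) {
                if (x.end < i.start || x.start > i.end) {
                    next.add(i); // no overlap: keep the whole interval
                    continue;
                }
                if (x.start > i.start) next.add(new Interval(i.start, x.start - 1)); // left remainder
                if (x.end < i.end) next.add(new Interval(x.end + 1, i.end));         // right remainder
            }
            current = next;
        }
        if (current.isEmpty()) {
            throw new IllegalArgumentException(
                "The -L and -XL intervals leave nothing to process after subtraction");
        }
        return current;
    }
}
```

For example, including 10-20 and excluding 12-15 leaves 10-11 and 16-20, while excluding 5-25 leaves nothing and triggers the exception.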
depristo 095125152b Updated to no longer include 2nd-best base output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5567 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-03 20:13:10 +00:00
droazen 0927b7c297 Fix for bug GSA-441: BAM file list with blank lines gives a confusing error
message. Lines containing only whitespace in .list files are now ignored. 
Also added support for comments in .list files: lines whose first
non-whitespace character is '#' are now also ignored.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5550 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 15:04:35 +00:00
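The filtering rule from commit 0927b7c297 (skip whitespace-only lines and lines whose first non-whitespace character is '#') could look something like this. The class and method names are hypothetical, and trimming the surviving entries is an assumption of this sketch rather than something the message states.

```java
import java.util.ArrayList;
import java.util.List;

public final class ListFileFilter {
    /** Returns the lines of a .list file that name actual inputs. */
    public static List<String> usableEntries(List<String> lines) {
        List<String> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            entries.add(trimmed);
        }
        return entries;
    }
}
```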
droazen 7b452ea2b9 Fix for bug GSA-430: Can't specify same BAM file twice on the command line. An ArgumentException with an appropriate error message and a list of the duplicate BAMs is now thrown in this case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5542 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-30 22:23:24 +00:00
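A duplicate check like the one in commit 7b452ea2b9 can be sketched with two sets, one for paths already seen and one for repeats. The names here are hypothetical, `IllegalArgumentException` stands in for ArgumentException, and comparing raw path strings (rather than resolved files) is a simplifying assumption.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class DuplicateInputCheck {
    /** Throws if any path appears more than once, listing every duplicate. */
    public static void requireUnique(List<String> paths) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>(); // keeps first-repeat order for the message
        for (String path : paths) {
            if (!seen.add(path)) {
                duplicates.add(path);
            }
        }
        if (!duplicates.isEmpty()) {
            throw new IllegalArgumentException(
                "Input files specified more than once: " + duplicates);
        }
    }
}
```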
chartl 328f89f66a Minor changes to MannWhitneyU:
- Comment fixes to better explain why two-sided test wants to use the LOWER (not higher) value for U
 - Much more direct testing of MWU functions
 - Uniform approximation was always using the < cumulant (sometimes the > cumulant should be used instead)
 - Uniform approximation currently not used (regime in which it was being used was not the right one -- not necessarily bad, but not an improvement over normal)
    + this particular approximation is for major imbalances of the form m >> n. Code may be altered in the future to use this method for this particular regime, if the method's not too slow.
 - Hook into one-sided test.

RegionalAssociationRecalibrator: NaNs were being caused by presence of Infinity and -Infinity values out of the walker. Currently I'm just re-setting them to arbitrary post-whitened values, but the walker will be changed to prevent output of these values, and the "fix" will be undone.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5539 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-30 17:03:02 +00:00
rpoplin 5ddc0e464a Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 21:04:09 +00:00
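To make the tag syntax in commit 5ddc0e464a concrete, here is a small sketch that pulls the key=value pairs out of the text following `-B:`. It is not the engine's parser, and `BindingTagParser` and `parseTags` are hypothetical names; the only thing taken from the message is the example binding string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class BindingTagParser {
    /**
     * Extracts key=value tags from a binding such as
     * "hapmap,VCF,known=false,training=true,truth=true,prior=12.0".
     * Fields without '=' (here the name and type) are ignored.
     */
    public static Map<String, String> parseTags(String binding) {
        Map<String, String> tags = new LinkedHashMap<>();
        for (String field : binding.split(",")) {
            int eq = field.indexOf('=');
            if (eq > 0) {
                tags.put(field.substring(0, eq), field.substring(eq + 1));
            }
        }
        return tags;
    }
}
```

Values are kept as strings in this sketch; a caller would convert `prior` to a number or `known` to a boolean as needed.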
chartl f6dfdc7f3b Single-tailed hypothesis testing in MWU
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5533 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 15:53:40 +00:00
depristo cd8321cdc9 Removed the completely unused generic but extremely expensive infrastructure for dynamic LocusIteratorFilters. Now the one, and probably only useful one, is called directly in the LocusIteratorByState itself to filter adaptor bases from reads. This shaves 10% off the runtime of all walkers, apparently. Has the additional benefit of eliminating a lot of complex infrastructure that resulted ultimately in only a single function call.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5525 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 20:48:24 +00:00
depristo 3bcd4c5d75 --simplifyBAM is now in the SAMFileWriterArgumentTypeDescriptor, as suggested by map. PrintReads has an integrationtest now that writes out a 1 MB bit of HiSeq normally, with compress 0, and with simplifyBAM on.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5521 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 14:57:18 +00:00
depristo 7272fcf539 Now uses the NO_HEADER option to avoid breaking MD5s due to changes in GATKArgumentCollection
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5518 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 12:00:37 +00:00
kshakir fc8acd503e Enabled the parameterize option for debugging PipelineTest MD5s.
Fixed escaping expressions that have more than one space between arguments.
Updated example to match the wiki.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5516 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 00:41:47 +00:00
ebanks 69646ff840 ... and the corresponding integration test update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5496 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 01:58:07 +00:00
chartl 5a79f16ea4 Fixed an edge case where an exception was thrown if either of the sets was empty for the MWU test. Also altered the output format so U itself is not printed (which though interesting, isn't so useful for recalibration), but rather a value I call V (really the deviation of U from its expectation).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5490 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 16:28:44 +00:00
ebanks 1c95208e26 Finally found the bug that everyone is reporting on GS. Iterators on PriorityQueues aren't guaranteed to return elements in sorted order (a pretty stupid contract) - so we were passing items to the constrained writer out of order. Just do a Collections.sort instead (1 line of code). Happy father's day!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5476 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 21:28:19 +00:00
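The pitfall in commit 1c95208e26 comes from `java.util.PriorityQueue`: its iterator is documented as not guaranteed to traverse elements in any particular order. A minimal sketch of the workaround, copying the contents into a list and sorting, is below. The class and method names are hypothetical, and the sketch assumes the queue uses natural ordering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public final class SortedDrain {
    /** Returns the queue's elements in ascending natural order without modifying the queue. */
    public static <T extends Comparable<? super T>> List<T> inSortedOrder(PriorityQueue<T> queue) {
        // Copying via the collection constructor uses the queue's iterator,
        // so the copy may be in heap order rather than sorted order.
        List<T> items = new ArrayList<>(queue);
        Collections.sort(items);
        return items;
    }
}
```

If the queue was built with a custom comparator, the sort would need to use the same comparator instead of natural ordering.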
rpoplin d98503ca50 Removing some debug code from VQSRv2. VariantEval can now be stratified by contig with -ST Contig. New hidden option in CombineVariants for overlapping records to take the info fields from the record with the highest AC (while still updating AC/AN/AF correctly) instead of dropping info fields which aren't exactly the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5448 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:28:10 +00:00
rpoplin 2a2538136d A version of VQSRv2 that does contrastive clustering in two passes. The walkers will be renamed when they are moved to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5443 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:03:56 +00:00
depristo 3e3ec85807 Checked for consistency with the previous integration tests, and updated the walker and test to use the new I/O system (always prints 4 digits on floats).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5433 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-13 15:24:22 +00:00
depristo ee8f2871f7 A better output for Genotype Concordance summary. Now does only % comp hom-ref called hom-ref, het called het, and hom-var called hom-var, which are the quantities we typically show in slides. Updated integration tests to reflect this change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5429 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-12 02:03:48 +00:00
rpoplin b3464a6031 Initial commit of VQSRv2 that passes the old integration tests. Not ready to be used yet unless your name rhymes with ... oh wait, that's me.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5419 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-11 15:18:34 +00:00
ebanks 3596c56602 New attempt at the constrained movement version of the indel realigner (I've kept around the old writer for now). The new contract is that the realigner must ask permission before trying to clean an area; permission will be denied by the CM-Manager if it was required to flush its cache of reads because of too much depth within a distance of maxInsertSizeForMovingReadPairs. Added integration tests to cover different max cache sizes, including an expected exception when too small a value is chosen. The actual logic changes were fairly minor - much of this commit is really just some cleanup. I'd like to throw 1000G Phase I at it, but will respectfully wait for Ryan to hit his deadline before doing so.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5414 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-10 02:48:29 +00:00
rpoplin ff7edc4493 Minor bug fix in empiricalMu prior calculation in VQSR.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5412 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-10 00:42:38 +00:00
rpoplin 509daac9f7 Minor bug fix in k-means implementation. Updating VQSR integration tests in preparation for VQSRv2 by removing some unused features such as VariantDatum.weight and ti/tv cutting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5410 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-09 00:26:28 +00:00
delangel 00ac51acc8 Added several integration tests for UG indel caller:
- Basic
- Multiple technology
- Test minIndelCnt parameter

Added also 2 disabled tests:
- Parallelization: issue w/code right now is that if -nt > 1, filter field shows "PASS" instead of ".", cause TBD
- Genotype given alleles mode: code not working yet.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5404 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-08 16:21:21 +00:00
kiran d0598c7a04 Somehow missed this test when I was updating the md5s
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5400 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 23:53:42 +00:00
kiran b6339967f8 Updated GenomicAnnotator integration tests to include the -NO_HEADER argument so that the tests stop yelling about trivial differences
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5398 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 23:07:01 +00:00
kiran 43056d0188 Fixed integration test to reflect changes regarding when comp tracks got subset to fewer samples and whether no-call sites would get pulled in for comp tracks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5393 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 20:25:57 +00:00
kshakir dc33fbed7c Switched the CVUnitTest broken info from an Integer to a String since as of r5383 Integers are no longer broken when converted to Floats.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5390 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 16:33:14 +00:00
chartl 60ddc08cdf Added a boatload of new case-control association modules. Switched the U-test to use longs rather than ints (it just so happened that I overflowed and started getting negative U statistics. Not good.) Added the ALL association type for ease of specifying that we want to throw the book at something. Added an svn-commit.tmp~ because I can't get rid of it even with --force. Hopefully I can remove it after.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5386 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 21:58:12 +00:00
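The negative U statistics mentioned in commit 60ddc08cdf are what 32-bit overflow looks like: U can be as large as the product of the two sample sizes, which exceeds the `int` range once both samples reach tens of thousands of observations. The sketch below only illustrates that arithmetic; the class and method names are hypothetical and it does not compute a full Mann-Whitney test.

```java
public final class UStatisticOverflow {
    /** Upper bound n * m on U, computed in 32-bit arithmetic (wraps on overflow). */
    public static int maxUAsInt(int n, int m) {
        return n * m;
    }

    /** The same bound with the product widened to 64 bits before multiplying. */
    public static long maxUAsLong(int n, int m) {
        return (long) n * m;
    }

    public static void main(String[] args) {
        System.out.println(maxUAsInt(50000, 50000));  // prints -1794967296
        System.out.println(maxUAsLong(50000, 50000)); // prints 2500000000
    }
}
```

The cast has to be applied to an operand, not to the finished product: `(long) (n * m)` would still overflow before widening.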
depristo af71576a07 CalculateChromosomeCounts() now only calculates AC, AF, and AN when there are genotypes. Can now combine variants with headers that differ in only whether a field is an integer or a float. Updated CombineVariants integrationtest, as incorrect AC values were being calculated in the previous GS outputs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5383 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 19:25:52 +00:00
chartl a40a8006b5 Added in unit tests for the statistics calculated by the test runner; and bug-fixes to the calculations; so we have some assurance that the statistics coming out the back-end are correct.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5380 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 16:54:02 +00:00
kiran 1861ca90fc A change to the definition of CpG sites (is now, from 5' to 3' a CG dinucleotide in the reference, and the CpG site is at the C, rather than either at the C or a G).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5373 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-04 15:36:07 +00:00
hanna 7a22f19366 More descriptive error when VerifyingSamIterator hits an inconsistent alignment. Also updated
case UserException.MalformedBAM to match case of UserException.MissortedBAM for consistency and
ease-of-use.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5364 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 03:55:24 +00:00
ebanks bb969cd3a2 EMIT_ALL_SITES now does exactly that - even when there's no coverage or too many deletions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5343 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 05:05:00 +00:00
ebanks 5ac9af472c Adding performance test for case with very high coverage (> 600,000x) over an interval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5336 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:48:56 +00:00
ebanks 05fac8583d Following up Mark's recent commit: hooking up the --maxPositionalMoveAllowed argument into the indel realigner and through to the SAM writer. We now ensure that no read is realigned more than N bases (200 by default, which is nowhere close to realistically possible). If anyone ever sees a warning message about this with the default value then please let me know because I need to see it for myself.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5331 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 04:40:54 +00:00
hanna 600f73cbd6 A checkpoint commit of two BAM reading projects going on simultaneously. These two projects
are works in progress, and this checkin will provide a baseline against which to gauge 
improvements to both projects.

Low-memory BAM protoshards (disabled by default):
- Currently passing ValidatingPileupIntegrationTest.
- Gets progressively slower throughout the traversal, but should run at least as fast as original implementation.
- Uses 10+ file handles per BAM, but should use 3.

BAM performance microbenchmark test system:
- Currently tests performance of BAM reading using SAM-JDK vs. GATK
- Tests do not hit all GATK performance hotspots.
- New tests that require input data in a slightly different form are hard to implement.
- Output of test results is not easily parseable (investigating Google Caliper for possible improvements).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5317 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 17:50:32 +00:00
ebanks cba88a8861 Elegant solution to the determinism problem: force testNG to run tests in the order that I want it to.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5312 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 21:32:35 +00:00
ebanks 15dfac6bf7 Updating integration test to be in sync with previous commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5309 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 20:21:58 +00:00
ebanks 06e3c34e7f Updating performance test to be in sync with previous commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5308 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 20:13:35 +00:00
chartl 97e1a5262e -ct x no longer includes coverage in the previous bin
BatchMerge - additional support for indels (can't just test the alternate allele when it's an extended event, must also specify that you want to use the dindel model when you actually test the allele)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5300 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 15:52:04 +00:00
ebanks ee6f112556 Phase 3: constrained movement is now the only option available in the realigner (so I guess technically it's not really an option). Several command-line options are deprecated. Code cleaned up. Wiki updated. Release coming. One phase left...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5299 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 14:59:48 +00:00
ebanks 93888e570b Phase 2: after hours of testing, confirming that constrained mode looks good so moving the integration tests over to use it. Some cleanup. More cleanup coming in Phase 3.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5298 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 06:23:41 +00:00
ebanks c59c8b9872 Phase I of my promise to Mark: fleshed out integration tests for Indel Realigner
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5297 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 21:05:20 +00:00