Commit Graph

1166 Commits (36db9bdcd55aa9668d7b5c2bed620a2eb96f21bc)

Author SHA1 Message Date
kiran 43056d0188 Fixed integration test to reflect changes regarding when comp tracks got subset to fewer samples and whether no-call sites would get pulled in for comp tracks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5393 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 20:25:57 +00:00
kshakir dc33fbed7c Switched the CVUnitTest broken info from an Integer to a String since as of r5383 Integers are no longer broken when converted to Floats.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5390 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 16:33:14 +00:00
chartl 60ddc08cdf Added a boatload of new case-control association modules. Switched the U-test to use longs rather than ints (it just so happened that I overflowed and started getting negative U statistics. Not good.) Added the ALL association type for ease of specifying that we want to throw the book at something. Added an svn-commit.tmp~ because i can't get rid of it even with --force. Hopefully I can remove it after.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5386 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 21:58:12 +00:00
depristo af71576a07 CalculateChromosomeCounts() now only calculates AC, AF, and AN when there are genotypes. Can now combine variants with headers that differ in only whether a field is a integer or a float. Updated CombineVariants integrationtest, as incorrect AC values where being calculated in the previous GS outputs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5383 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 19:25:52 +00:00
chartl a40a8006b5 Added in unit tests for the statistics calculated by the test runner; and bug-fixes to the calculations; so we have some assurance that the statistics coming out the back-end are correct.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5380 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 16:54:02 +00:00
kiran 1861ca90fc A change to the definition of CpG sites (is now, from 5' to 3' a CG dinucleotide in the reference, and the CpG site is at the C, rather than either at the C or a G).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5373 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-04 15:36:07 +00:00
hanna 7a22f19366 More descriptive error when VerifyingSamIterator hits an inconsistent alignment. Also updated
case UserException.MalformedBAM to match case of UserExceptio.MissortedBAM for consistency and
ease-of-use.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5364 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 03:55:24 +00:00
ebanks bb969cd3a2 EMIT_ALL_SITES now does exactly that - even when there's no coverage or too many deletions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5343 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 05:05:00 +00:00
ebanks 5ac9af472c Adding performance test for case with very high coverage (> 600,000x) over an interval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5336 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:48:56 +00:00
ebanks 05fac8583d Following up Mark's recent commit: hooking up the --maxPositionalMoveAllowed argument into the indel realigner and through to the SAM writer. We now ensure that no read is realigned more than N bases (200 by default, which is nowhere close to realistically possible). If anyone ever sees a warning message about this with the default value then please let me know because I need to see it for myself.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5331 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 04:40:54 +00:00
hanna 600f73cbd6 A checkpoint commit of two BAM reading projects going on simultaneously. These two projects
are works in progress, and this checkin will provide a baseline against which to gauge 
improvements to both projects.

Low-memory BAM protoshards (disabled by default):
- Currently passing ValidatingPileupIntegrationTest.
- Gets progressively slower throughout the traversal, but should run at least as fast as original implementation.
- Uses 10+ file handles per BAM, but should use 3.

BAM performance microbenchmark test system:
- Currently tests performance of BAM reading using SAM-JDK vs. GATK
- Tests do not hit all GATK performance hotspots.
- New tests that require input data in a slightly different form are hard to implement.
- Output of test results is not easily parseable (investigating Google Caliper for possible improvements).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5317 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 17:50:32 +00:00
ebanks cba88a8861 Elegant solution to the determinism problem: force testNG to run tests in the order that I want it to.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5312 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 21:32:35 +00:00
ebanks 15dfac6bf7 Updating integration test to be in sync with previous commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5309 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 20:21:58 +00:00
ebanks 06e3c34e7f Updating performance test to be in sync with previous commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5308 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 20:13:35 +00:00
chartl 97e1a5262e -ct x no longer includes coverage in the previous bin
BatchMerge - additional support for indels (can't just test the alternate allele when it's an extended event, must also specify that you want to use the dindel model when you actually test the allele)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5300 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 15:52:04 +00:00
ebanks ee6f112556 Phase 3: constrained movement is now the only option available in the realigner (so I guess technically it's not really an option). Several command-line options are deprecated. Code cleaned up. Wiki updated. Release coming. One phase left...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5299 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 14:59:48 +00:00
ebanks 93888e570b Phase 2: after hours of testing, confirming that constrained mode looks good so moving the integration tests over to use it. Some cleanup. More cleanup coming in Phase 3.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5298 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 06:23:41 +00:00
ebanks c59c8b9872 Phase I of my promise to Mark: fleshed out integration tests for Indel Realigner
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5297 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 21:05:20 +00:00
ebanks 318035c147 Fixing up the output system of the Unified Genotyper. Deprecating the -all_bases and -genotype arguments. Adding instead the --output_mode (EMIT_VARIANTS_ONLY, EMIT_ALL_CONFIDENT_SITES, EMIT_ALL_SITES) and --genotyping_mode (DISCOVERY, GENOTYPE_GIVEN_ALLELES) arguments. UG now does the correct thing when passed alleles (bound to the 'alleles' rod) to use for genotyping; added several integration tests to cover this case. This commit will break the batched calls merging script, but Chris knows this and is ready for it...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5288 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-22 06:07:18 +00:00
kiran 52f860c9b2 Modified MD5s to account for Andrey's new MNP column in CountVariants.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5274 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 13:13:58 +00:00
kiran cb95e68fc0 CpG is no longer a standard stratification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5273 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 07:17:35 +00:00
kiran 9ddee96f93 When subsetting by sample, need to take extra care that hom-ref sites don't accidentally get treated as variant sites in CompOverlap. Renamed convenience method for creating command-lines in integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5272 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 06:26:38 +00:00
kiran 92c82200c9 Fixed an issue where an eval module with TableType objects would get an extra, empty table in the output, screwing up the parse in R.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5267 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 23:03:46 +00:00
kshakir d185c2961f Added pipeline for calling FCP in batches called MultiFullCallingPipeline.
Bug smashes for the MCFP:
  Synchronized access to LSF library and modifications to the QGraph.
  If values are missing from the graph with -run make sure to exit with a non-zero.
  Refactored QGraph to pre-generate a unique Int for each QNode speeding up getHashCode/equals inside the graph.
  Added jobPriority and removed jobLimitSeconds from QFunction.
  All scatter gather is by default in a single sub directory queueScatterGather.
  Moved some FCPTest into BaseTest/PipelineTest for use by MFCPTest.
  Rev'ed the 1000G bams used for validation from v1 to v2 and added code to look for the bams before running other tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5247 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 18:26:14 +00:00
kiran d3660aa00e Very basic functionality for annotating indels (specifies whether the indel is frameshift, inframe, or non-coding). Does not attempt to recalculate the variant codon, variant amino acid, or whether the site falls within a splice region. Added a convenience method to WalkerTest for building command-line arguments with the proper spacing (so that I stop getting annoyed when I've gotten it wrong and the test system yells at me.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5235 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-13 17:58:20 +00:00
ebanks 9554df1a7c Adding integration test for indels in VF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5227 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 16:58:57 +00:00
hanna b992abb6eb A few more unit tests plus some extra
functionality for BAM index visualization.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5222 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-09 01:51:34 +00:00
kshakir 4d1cca95bb Removed deprecated getDbsnpFile.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5221 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 21:12:15 +00:00
kiran ecbc38aff0 If no comp rod is specified, specify the dummy name none so that we still get counts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5211 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 19:24:52 +00:00
ebanks 698096dc5a Moving VariantsToVCF to the proper directory; removing the oneoffs CG indel converter in preparation for a ligitimate CG variant Feature class in the works.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5207 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 05:21:01 +00:00
kiran 35c688ac67 Updated md5 for testVCFStreamingChain to reflect latest changes to VariantEval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5206 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-06 21:22:05 +00:00
kiran 1085bbf303 Fixed issue where all comp tracks were being treated as known tracks. Fixed issue where multiple JEXL expressions were causing an exception because the underlying object did not implement the Comparable interface. Fixed issue where variants being compared to the known track were not being checked for equality of variation type. Fixed issue where functional annotations were not being iterated over properly. Refactored a lot of helper methods into a separate VariantEvalUtils utility class. Significantly expanded the test suite using a small VCF with SNPs, indels, and non-variant loci which makes it much easier to see what the proper answer should be, and included the appropriate grep and awk commands in the comments to confirm the values.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5204 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-06 19:19:20 +00:00
depristo ce51ffb56e Oops, old local paths committed on accident.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5200 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 23:35:56 +00:00
depristo 29f3ad72f3 SAMFileWriter that allows the user to move reads, but only a bit, in an incoming coordinated sorted BAM files. Does some local reordering and local mate fixing, under specified constrained. These constrains allow us to make a special -- under testing for Eric, who promised to try this out a bit, expand test cases and integration tests -- but soon to be the default and only model of the realigner that only moves reads with ISIZE < 3000 that directly emits a coordinate sorted, mate fixed validating BAM file without needing FixMates externally. Preliminary testing shows this runs in a totally fine amount of memory and produces equivalent results to the previous version.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5199 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 22:27:05 +00:00
hanna 5c3198520c A few minor modifications masquerading as significant changes according to
svn's logs:
- Copied BAM indexing engine from Picard back into the GATK anticipating
  shard merging algorithm.  Tried to leave most of the building blocks in
  Picard.  If this turns into a logistical nightmare, I'll merge the building
  blocks into the GATK as well.
- Reorganized the org.broadinstitute.sting.gatk.datasources package, giving
  better separation of query and management functionality for reads, ref, rmd,
  and samples.  
- Merged Shard building blocks into org.broadinstitute.sting.gatk.datasources.
  reads package, indicating it's current strong relationship with the reads,
  rather than the general unifying element I wish this would be.
- Collapsed BAMFormatAwareShard into Shard.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5184 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 17:59:19 +00:00
kiran 9ddc95c833 NewEvaluationContext needs to be generated in the inner loop. Otherwise, multiple comp tracks end up getting routed to the same row of the output table. Added test to cover multiple comp tracks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5181 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 07:04:53 +00:00
kiran cb6454bf98 Multiple eval tracks should be bound with different names, rather than just 'eval'. Added tests to cover usage with multiple tracks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5177 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 22:33:50 +00:00
kiran 2732c839d4 Restored parallelism and associated tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5170 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 02:04:03 +00:00
kiran fd8dd8fb9b Fixed an issue where a no-call in the eval track would prevent a site from a comparison track from being loaded. Added a new test to cover the use case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5169 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 01:47:53 +00:00
hanna 06b63d8336 Pulled out CpG stratification in test results at Kiran's suggestion.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5165 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 18:36:09 +00:00
hanna 91297c138b Update VCFStreamingIntegrationTest to use new variant eval command-line
arguments, output format.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5162 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 15:40:43 +00:00
hanna 7d89ce820b Got tired of waiting for Kiran to fix the build: updated NewVariantEval ->
VariantEval.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5161 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 15:32:39 +00:00
hanna 96241c6637 More testng fallout: fixing another seemingly 'random' issue arising from an
alternate test ordering.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5160 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 15:25:50 +00:00
kshakir e74f28ad89 If there's an LSF queue maximum time limit set and the user hasn't specified one for this job, pass on the queue defined maximum limit with the job.
Updated LibBatIntegrationTest to use proper networked temp directory accessible by local machines and nodes.
Disabling the FCPTest until the VE3 is incorporated into the fullCallingPipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5151 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 23:13:09 +00:00
kshakir d4f744a4d4 Checking if the interval files exist before using them to calculate the minimum scatter parts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5143 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 18:07:34 +00:00
kiran 401feca90d Updates to VariantEval 3.0 integration test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5140 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 17:45:06 +00:00
ebanks d406d9b3fc There's no reason to special case no-calls if they already have PLs associated with them. Just use the PLs!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5136 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 15:05:45 +00:00
hanna 5e7a5cf924 Quick fix for Danny Lieber: flesh out the additional functionality required
to align to a reference other than what's specified in the header.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5133 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 05:28:37 +00:00
kshakir 2ef66af903 Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high.
Moved the BamListWriter from FCP to ListWriterFunction in the Queue core.
Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s.
Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:33:58 +00:00
asivache 04d66a7d0d Updated integration test's MD5s reflect the fact that assay sequences were previously designed incorrectly for indels, the bug is now fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5120 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:00:22 +00:00
depristo 2182b8c7e2 Better query start / stop function that directly parses the cigar string, unlike the previous version. Now properly handles H (hard-clipped) reads. Added -baq OFF and -baq RECALCULATE integration tests on all three 1KG technologies. Please let me know if this new code somehow fails.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5108 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 15:08:21 +00:00
kiran 9cb1ae384c Constant precision for floating point numbers. Added integration test - carries over tests from VariantEval with the necessary modifications to command-line arguments and md5s. Disabled use of 'synchronized' keyword because I clearly don't get how that keyword is supposed to work yet...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5107 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 05:19:18 +00:00
depristo f29bb0639b Documentation and cleanup of the distributed GATK implementation. Detailed documentation -- given that Matt will be extending the system in the near future -- about how the locking and processing trackers work. Added error trapping to note that distributed, shared-memory parallelism isn't yet implemented, instead of just not working silently. General utility function for the analysis of distributedGATK operation in the analysis directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5106 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:40:09 +00:00
depristo f522eb2848 Previous tests were just too big...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5095 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 13:48:38 +00:00
hanna 4a33cdacde Some basic integration tests detecting breakage in OTF BAM index generation.
Doing it manually for the moment so that there's at least something testing
this capability; will followup eventually with Mark to see whether we can
shape the VCF index generation code in such a way that it supports BAM index
testing as well.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5093 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 23:48:04 +00:00
ebanks dfc5a3d1f3 added integration test for --sites_only option
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5082 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:58:15 +00:00
depristo be697d96f9 An apparently robust implementation of the file locking for distributed computation, using Lucene's file creation locking approach. It is worth trying out for those with large-scale, high-cost data sets. Details and discussion at group meeting on Wednesday. Some cleanup still needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5079 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 13:45:40 +00:00
kshakir 9923e05e0a Moved MD5 utils from WalkerTest to BaseTest for use by PipelineTests.
Moved VariantEval validation from FCPTest to PipelineTest.
Cleaned up some duplicate code for writing temp files during tests.
Moved FCPTest to playground namespace to match move for FCP.q.
Added a basic HelloWorldPipelineTest for the HelloWorld QScript. 
Moved duplicated error handling from JobRunners into the FunctionEdge.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5068 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 04:11:49 +00:00
hanna 9db02059ac Fix for Ryan's issue: reads ending with indel distort the location of the
pileup, resulting a two map() calls for the same locus (and no map call for
the locus immediately following).
Fixed bug and added comprehensive unit tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5067 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 19:49:39 +00:00
depristo c50f39a147 V3 of the distributed GATK. High-efficiency implementation. Support for status tracking for debugging and display. Still not safe for production use due to NFS filelock problem. V4 will use alternative file locking mechanism
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5063 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 16:45:07 +00:00
depristo a51061fd96 Improved distributed processing analytics. Still not 100% ready for prime-time. More improvements incoming. Iterator claim now supports requests to obtain in a single atomic claim (one lock) multiple sequential shards, which radically reduces overhead. However, deadlocking is still possible...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5061 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 16:17:25 +00:00
ebanks 2d4bcb60a1 Don't print out alt alleles for ref calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5060 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 06:33:31 +00:00
ebanks 2bbcc9275a Committing the fragment-based calling code. Results look great in all datasets (will show this at 1000G this week with Ryan). Note that this is an intermediate commit. The code needs to be cleaned up and the fragmentation code needs to be moved up into LocusIteratorByState. This should all happen later this week, but I don't want Ryan to have to keep running from my own personal Sting directory. The current crappy implementation adds ~10% to the runtime, but that should all go away in the next iteration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5058 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 05:04:17 +00:00
depristo 9b1b8d46aa Performance tracking of GenomeLocProcessingTrackers, as well as a marker for where to put tracker in HierarchicalMicroScheduler
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5051 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 22:24:42 +00:00
hanna aea121a9d5 <key>=<value> tagging support for command-line arguments. Unfortunately, still
very hard to validate and still very hard to use (requires core hacking to 
support additional tags).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5038 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 00:22:42 +00:00
kshakir 8855f080c2 For the fullCallingPipeline.q:
- Reading the refseq table from the YAML if not specified on the command line.
 - Removed obsolete -bigMemQueue now that CombineVariants runs in 4g.
 - Added a -mountDir /broad/software option to work around adpr automount issues.
 - Merged the LSF preexec used for automount into the shell script used to execute tasks.
 - Using the LSF C Library to determine when jobs are complete instead of postexec.
 - Updated queue.sh to match the changes above.
 - Updated the FCPTest to match the changes above.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5036 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 22:34:43 +00:00
hanna 8831ec3dce Some refactoring and cleanup around the area of my sleep-deprived integration
test typo, which Khalid already fixed for me.  Sorry, Khalid!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5035 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 15:03:14 +00:00
kshakir 3022f4dfa0 Fixed missing space character in testSimpleVCFStreaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5034 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 14:49:38 +00:00
depristo 85553cf5cb V2 cleaner, easily testing, shared memory and distributed GATK job management. Serious unit testing. Very much cleaner processing. Some code cleanup remains in removing now unused classes but the system is ready for general testing. Confirmed that one can run the UG 100 ways parallel without error, but edge cases may remain.
See documentation at:

http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK#Distributed_Parallelism_.28Experimental.29

for examples on how to run this, or the testing Scala script

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5032 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:58:13 +00:00
depristo 41c8552d0a Added implements HasGenomeLocation to all revelant classes. It's not possible to write generic code for working with objects that support the getLocation() function in HasGenomeLocation. Please, if you have an object that has a location, implement this interface and start using / writing generic functions to sort, compare, etc. these objects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5031 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:54:03 +00:00
hanna 7087c2f422 Very simple integration tests for basic VCF streaming functionality.
Rather than try to fork the integration test process to get a pipe source
and sink, creates a new named pipe by Runtime.exec()ing the 'mkfifo' shell
command.  We'll see whether this proves to be a reliable method for testing
streaming.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5028 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 04:38:54 +00:00
kshakir 2b895ffb7f Updated the HG19 reference from v0 to v1 after the v0 was zeroed out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5023 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 20:30:25 +00:00
depristo f8ba76d87c Incremental commit for distributed computation. Appears to work but has potential deadlock situation not yet debugged. Do not use yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5010 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 21:23:09 +00:00
depristo a88708ebfa Moving GLF code to archive
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5006 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-15 22:42:42 +00:00
hanna af31d02a2d Fix concurrency issue that periodically kills VariantEvalIntegrationTest --
a member field of RMDTrackBuilder was getting rebuilt every time it was
called, creating concurrency issues.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5001 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 18:52:21 +00:00
depristo af1bce3492 Longer wait time for threading test (5 min now) and an assertion to ensure that all jobs finished. Should probably just remove the longer running test so this becoming a non-issue
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4999 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 13:09:31 +00:00
depristo afbea9ce59 SharedMemory and SharedFile implementations of GenomeLocProcessingTracker, along with serious unit tests that both pass. Slightly inefficient implementation but sufficient for further testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4998 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 03:14:24 +00:00
hanna 02dc0f97d1 Remove testWalkerUnitTest; it doesn't actually do anything and just adds
extra cruft to the output.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4993 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 19:02:00 +00:00
rpoplin ce3d226183 Reverting back to the old definition of QD because it works better with large numbers of samples. The new QD is relegated to a new annotation: sumGLbyD. Tweaks to the new HaplotypeScore based on evaluation with better QD calculation. The default qual threshold in GenerateVariantClusters is updated to be in line with the variant quality scores coming from the exact model.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4984 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 14:12:30 +00:00
hanna edebbb5aa0 Fixed long-standing bug reported by Mauricio where @Arguments assigned to
primitive types were properly validated and throw the proper 
MissingArgumentValue UserException.  Before this fix, the error reported
was the infamous DePristo BSOD (Could not create module String because 
an exception of type NullPointerException occurred caused by exception null).

Thanks Mauricio!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4980 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 22:18:24 +00:00
hanna 6d855041ec Oops...forgot to commit the changes that allow primitive VCF streaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4979 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 21:54:51 +00:00
depristo 8fe5641b2e can explicitly set the now required ReferenceDataSource in unit tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4977 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:25:12 +00:00
aaron 7916ab0ed5 remove the index each run
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4976 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:38:22 +00:00
depristo 468ef382b7 vastly improved progress meter that estimates % of work done and time until the job finishes and time remaining. Reordered GATK core initialization order -- intervals are created before the scheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4975 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:32:27 +00:00
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
aaron cba436fa2f small fix for the table codec; if you see a header line, you know you've finished parsing the header. Also also some changes to return the ref ordered data pool test to using MappedStreamSegment instead of EntireStream
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4942 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 21:20:26 +00:00
hanna 0982d35f5b Bug fixes in streaming in Tribble data via /dev/stdin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4935 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 02:43:04 +00:00
rpoplin 23dbc5ccf3 HaplotypeScore is revamped. It now uses reads' Cigar strings when building the haplotype blocks to skip over soft-clipped bases and factor in insertions and deletions. The statistic now uses only the reads from the filtered context to build the haplotypes but it scores all reads against the two best haplotypes. The score is now computed individually for each sample's reads and then averaged together. Bug fixes throughout. The math for the base quality and mapping quality rank sum tests is fixed. The annotations remain as ExperimentalAnnotations pending more investigation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4934 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 00:28:05 +00:00
hanna 8d2c14b29c Update Picard / sam-jdk at Tim's request.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:17:25 +00:00
hanna 3fc9862964 Unit test fixed - Tribble codecs aren't designed to be stateless, but I was
using one as though it was.  Fixed, and debug code reverted.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4917 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 17:47:52 +00:00
hanna b9cb57f4b9 A unit test is failing on bamboo in a way I can't reproduce (or even explain).
Checking in some debugging info.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4916 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 16:35:04 +00:00
hanna cba18116e4 A significant refactoring of the ROD system, done largely to simplify the process of
streaming/piping VCFs into the GATK.  Notable changes:
- Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build 
  RMDTracks and lookup codecs.
- RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now
  manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class
  are the walkers that feel the need to access the ROD system directly.  (We need to stamp out
  this access pattern.
A few minor warts were introduced as part of this process, labeled with TODOs.  These'll be
fixed as part of the VCF streaming project.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 04:52:22 +00:00
ebanks 848977678d No reason to convert the GLs to a String for formatting when they're just going to be converted to PLs later. That was 5% of the UG runtime...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4913 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 22:06:19 +00:00
ebanks 8a0c07b865 Support for indels in hapmap. This was non-trivial because not only does hapmap not tell you whether the allele is an insertion or deletion, but it also has a completely different positioning strategy (rightmost base). I'll send out an email tomorrow when the new HapMap3.3 VCF is ready.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4908 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-27 07:37:46 +00:00
hanna e313eeede8 Push command-line expansions, such as BAM list unpacking and -B tag parsing, out
into the CommandLine* classes.  This makes it easier for external functionality
(such as the VCF streamer) to use GenomeAnalysisEngine directly.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4897 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 19:00:17 +00:00
depristo a3729bd59c Now I call BeforeMethod correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4872 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:45:45 +00:00
depristo b7e4a015c0 static thread cache reset in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4870 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:53:10 +00:00
depristo 3bbc6a0540 Slightly more thread safe CachingIndexedFastaSequenceFile.java. Likely passes parallel testing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4869 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:05:17 +00:00
depristo 5dd0e8388b Fixed a bug in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4867 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 19:44:35 +00:00
depristo 4a54f3f230 ThreadLocal version of CachingIndexedFastaSequenceFile. More efficient support for shared memory BAQ calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4865 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:44:48 +00:00
hanna acfe83920b '-L unmapped': adding integration tests for explicitly including (-L unmapped)
unmapped reads and explicitly excluding (-XL unmapped) unmapped reads, augmenting
the suite of unit tests already put in place.

'-L unmapped' seems safe to use; go for it, but please validate results against
samtools flagstat when the process finishes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4849 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 23:11:46 +00:00
ebanks 5c0b66cb7c 3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 16:24:28 +00:00
chartl 5a27d231fa Rename it so that nobody else falls into the trap laid out (the test is VariantToTable, the walker is Variant[s]ToTable)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4844 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 11:43:00 +00:00
chartl 5e27e9162f Huh? I thought we parsed out comma-separated command line arguments into list automatically...just change the syntax of the integration test, no need to update the md5
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4843 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 11:40:27 +00:00
kshakir 01323447c6 Removed LibBat.SUB2_BSUB_BLOCK since the use of it exits the JVM.
Fixed integration tests to wait on their own for the job to run instead of using SUB2_BSUB_BLOCK.
Updated VariantRecalibrationIntegrationTests MD5s which were knocked out of sync whele SUB2_BSUB_BLOCK was exiting in the middle of integration tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4840 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 19:57:20 +00:00
hanna 526ae92093 Getting back to '-L unmapped':
- basic unit tests for interval sorting and merging with mix of mapped/unmapped.
- validation to ensure that locus walkers (really all non-read walkers) blow up with a user error when -L unmapped is specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4837 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 18:24:18 +00:00
kshakir d550fdfd60 Disabling integration test to see if this restores the full test suite.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4833 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 15:27:02 +00:00
depristo abd6ce1c77 A TiTv-free approach for cutting variants! Apparently much better than previous approach, and will work for indels and SV will truly minor modifications to the code. Will discuss with methods group on Monday.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4822 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 23:08:13 +00:00
depristo 974aaa134d Trival fix to broken build
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4820 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 13:56:03 +00:00
kshakir 895cb39f41 Thanks to Platform Computing tech support, found the magical environment variable BSUB_QUIET.
Minor refactoring to add more of the CLibrary including setenv().

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4819 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 21:27:12 +00:00
depristo 5b46a900b3 Final version of BAQ calculation. default gap open is 1e-4, a good sensitive value. Useful timer class SimpleTimer added. BAQ is now live.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4818 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 19:35:12 +00:00
kshakir 56433ebf6b Switched from LSF command line wrappers to JNA wrappers around the C API. Side effects:
- bsub command line is no longer fully printed out.
- extraBsubArgs hack is now a callback function updateJobRun.
Updated FullCallingPipelineTest to reflect latest changes to fullCallingPipeline.q.
Added a pipeline that tests the UGv2 runtimes at different bam counts and memory limits.
Updated VE packages that live in oneoffs to compile to oneoffs.
Added a hack to replace the deprecated symbol environ in Mac OS X 10.5+ which is needed by LSF7 on Mac.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4816 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 04:36:06 +00:00
depristo db55b2b0c6 Better testing of BAQ. Now really handles soft clipped reads properly by doing an expensive copy operation :-( will need to be transformed to a ByteBuffer in the near future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4807 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:37:00 +00:00
ebanks f1f01610f8 Remove the extra trailing tab at the end of the VCF ## header line. Unfortunately, this meant updating every freaking integration test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4806 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:22:29 +00:00
depristo 16e1bbd380 Hidden command line option to control BAQ gap open penalty for testing by me and eric. ValidateBAQWalker has misc. useful improvements. PrintReads now adds BAQ tags on output, if requested.
BAQ has generally useful improvements.  Refactor code to make it easier for BAQUnitTest to run.  minBaseQuality enforced on output, as well as input now.  Added BAQUnitTest that checks that the BAQ calculation is performing as expected.  Still needs to be expanded significantly.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4804 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 01:01:39 +00:00
depristo c91712bd59 BAQ calculation refactoring in the GATK. Single -baq argument can be NONE, CALCULATE_AS_NECESSARY, and RECALCULATE. Walkers can control bia the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the qualities scores, or by only returning the baq-capped qualities scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.
SAMFileWriterStub now supports BAQ writing as an internal feature.  Several walkers have the @BAQMode applied to this, with parameters that I think are reasonable.  Please look if you own these walkers, though

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4798 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:55:52 +00:00
depristo 5d2c2bd280 Just refactoring into utils/baq directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4795 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 17:43:43 +00:00
depristo 44feb4a362 Improved BAQ implementation. Now supports adding BAQ tags to reads on the fly with ADD_TAG_ONLY option. Caching fasta reader implementation, and changes throughout the system to enable this. Many performance improvements throughout the system due to better reference access patterns.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4792 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:29:39 +00:00
depristo a5b3aac864 Engine-level BAQ calculation now available in the GATK [totally experimental right now]. -baq argument to disable (NONE), to only use the tags in the BAM (USE_TAG_ONLY), use the tag when present but calculate on the fly as necessary (CALCULATE_AS_NECESSARY), and to always recalculate (RECALCULATE_ALWAYS). BAQ.java contains the complete implementation, for those interested. ValidateBAQWalker is a useful QC tool for verifying the BAQ is correct. BAQSamIterator applies BAQ to reads, as needed, in the engine. Let me know if you encounter any problems. Before prime-time, needs a caching implementation of IndexedFastaReader to avoid loading *lots* of reference data all of the time
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4787 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:23:06 +00:00
fromer 92cf7744a6 Set minMQ = max(minMQ, minBQ) for phasing since anyway we cap BQ by MQ; also, lowered MIN_BASE_QUALITY_SCORE for phasing to 17 (was previously 20)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4781 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 16:31:13 +00:00
rpoplin e5282742f9 Bug fix in CountCovariates, skip over indel records as well as SNPs in the dbsnp file. CountCovariates is now called CountCovariatesWalker. I've always hated that the name was swapped.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4774 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 18:43:24 +00:00
rpoplin 0adf505b53 We no longer look at by-hapmap validation status in the VQSR because using the HapMap VCF file is higher quality. As a side effect we now support the dbsnp 132 vcf file. ApplyVariantCuts now requires that the input VCF rod bindings begin with input, matching the other VQSR walkers. Wiki updated with information about how to obtain the hapmap and 1kg truth sets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4772 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 15:38:45 +00:00
fromer b4ef716aaf As per Eric and Mark's suggestions, separated the segregating MNP merger (MergeMNPs) from the more general merger employed for annotation purposes (MergeSegregatingAlternateAlleles). Both use the same core MergePhasedSegregatingAlternateAllelesVCFWriter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4763 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 16:42:08 +00:00
rpoplin af84462f3e The dev team has decided to change the filter that is added to records that are set to monomorphic by Beagle. It no longer lists the reference allele. Added those filters to the header of the output VCF file. Finally, we no longer use R2=NaN values coming from Beagle.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4757 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 17:19:54 +00:00
aaron 7f2ded0706 belated special case fix for Menachem; if the results of a BTI and BTIMR produce an empty interval list, exception out. This would be solved long term with better handling or empty and / or null interval lists. I'll add a JIRA
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4754 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 05:49:20 +00:00
aaron b03ac61e9d consolidating the checking of the RMD sequence dictionary against the reference into a single function, and adding an integration test to test that empty VCFs pass (both the indexing and the seq dictionary validation).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4750 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 00:01:56 +00:00
ebanks d89e17ec8c Fare thee well, UGv1. Here come the days UGv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4747 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 21:51:19 +00:00
ebanks e3e6d176df Looking over the daily error log email made me realize that there were 2 implementations of vc.modifyLocation() - the correct one in VC that didn't require lazy loading the genotype data and the bad one in VCUtils that did. Removing the implementation in VCUtils and updating the code accordingly. Also, removing createPotentiallyInvalidGenomeLoc() since no one uses it anymore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4736 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-26 18:40:34 +00:00
ebanks 6934f83cc7 Two changes to CombineVariants.
1. Fix: VCs were padded before the merge, but they were never unpadded afterwards.  This leaves us with a VC that doesn't meet our spec.
2. Update: instead of running the merged VC through every standard annotation (which seems really wrong, since this isn't the annotator tool), just update the chromosome count annotations (AC,AF,AN) through VCUtils.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4734 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-25 04:52:12 +00:00
rpoplin ed08899abc Overwhelming evidence that maxQ = 50 is now a better default than maxQ = 40 in the base quality score recalibrator, especially when combined with dbsnp build 132. Also, added option in ProduceBeagleInputWalker for Beagle-ing chromosome X calls with male samples which sets the genotype likelihood for the AB allele to zero for those samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4731 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 21:32:26 +00:00
hanna 082073ca3c Stop RBP.getPileupBySample() from throwing a NullPointerException if the
sample doesn't exist -- now returns null.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4719 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 05:17:06 +00:00
kshakir 787e5d85e9 Added the ability to test pipelines in dry or live mode via 'ant pipelinetest' and 'ant pipelinetest -Dpipeline.run=run'.
Added an initial test for genotyping chr20 on ten 1000G bams.
Since tribble needs logging support too, for now setting the logging level and appending the console logger to the root logger, not just to "org.broadinstitute.sting".
Updated IntervalUtilsUnitTest to output to a temp directory and not the SVN controlled testdata directory.
Added refseq tables and dbsnps to validation data in BaseTest.
Now waiting up to two minutes for gather parts to propagate over NFS before attempting to merge the files.
Setting scatter/gather directories relative to the -run directory instead of the current directory that queue is running.
Fixed a bug where escaping test expressions didn't handle delimiters at the beginning or end of the String.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4717 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-22 22:59:42 +00:00
bthomas 374c0deba2 Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-19 19:59:05 +00:00
kshakir 79725f2d9c Excluding the QFunction log files from the set of files to delete on completion.
When a QGraph is empty displaying a warning instead of crashing with an JGraph internal assertion error.
Cleaned up code using the Log4J root logger and explicitly talking to a logger for Sting.
When integration tests are run detecting that the logger has already been setup so that messages aren't logged twice.
Updated from Ivy 2.2.0-rc1 to 2.2.0.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4707 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 20:22:01 +00:00
depristo 721e8cb679 VariantsToTable now supports wildcard captures. -F PREFIX* now captures all fields that begin with PREFIX, output as a comma-separated list of unique values. Added integration test for VariantsToTable since I find it so useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4706 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 18:54:59 +00:00
hanna 90711d445c Change the interface for RMDTrackBuilder, therefore always mandating the specification
of a sequence dictionary and related info.  This will hopefully eliminate the cases in
which the refseq track depends a sequence dictionary / contig parser that hasn't been
specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4700 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 19:00:17 +00:00
depristo d86ab2becb JEXL expressions now generate exceptions, not warnings. Tools should catch the runtime exception to handle correctly. Removed unncessary complexity from the JEXL contexts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4695 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 16:08:16 +00:00
kshakir 01b721ab61 Passing ReviewedStingExceptions through the HMS.
Added a @Hidden experimental argument -validate to VariantEval that allows external JEXL assertions that must evaluate to true will throw an exception.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4692 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-16 21:50:42 +00:00
hanna 24ec35deaf - Reintroduce test dependency so that the tests passing / failing is not
dependent on the contents of the integrationtest directory.  Will figure
  out how to better manage the integrationtest directory at some point in
  the future.
- Up the max heap size for tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4691 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-16 19:55:20 +00:00
hanna 8ff4e4cb25 Cleanup testng listener configuration.
- Add StingTextReporter, which provides a text dump of the errors to the
  console.  Had to create our own reporter (inheriting from the standard
  TestNG TextReporter) to work around a configuration issue with the
  TextReporter.  In an ideal world, I'd report this on the TestNG mailing
  list and help them resolve the issue, but this solution is relatively
  robust at the moment and life is too short.
- Added back the failed test listener, which generates the testng-failed.xml
  file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4686 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 23:43:14 +00:00
depristo ef2f6d90d2 VQSR now operates on LOD scores in the INFO field directly, and doesn't adjust the QUAL field. New format for tranches file uses LOD score. Old file format no longer supported. log10sumlog10() function, a very useful utility in MathUtils. No more ExtendedPileupElement! Robust math calculations in GMM so that no infinities are generated! HaplotypeScore refactored to enable use of filtered context. Not yet enabled... InferredContext getDouble and getInteger arguments now parse values from Strings if necessary
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4684 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 22:19:22 +00:00
hanna 5b83942cee - Fix DepthOfCoverage so that, when it abuses the ROD system by instantiating a track in onTraversalDone, it also supplies the correct sequence dictionary and parser.
- Changed RMDTrackBuilder to use SequenceDictionaryUtils.validateDictionaries for ref <-> ROD sequence dictionary validation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4683 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 20:34:04 +00:00
kshakir 2fd816ac5f Updated ordering of integration tests. GVC > VR > AVC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4669 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-14 06:33:28 +00:00
depristo 44d0cb6cde New version of cutting routines for VQSR. Old code removed. Working unit tests. Best practice with testng integration test (everyone look at it). Walker test now allows you to not specify no. input files, if it can infer input counts from MD5s
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4664 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-13 16:19:56 +00:00
kshakir 62a106ca5a Disabled VariantGaussianMixtureModelUnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4663 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-13 03:53:33 +00:00
kshakir 673fa841a4 Updated PluginManager so that during testing Queue can dynamically compile and load separately multiple class directories into the same class loader.
Removed obsolete usages of PackageUtils with updated PluginManager.
Ported Queue interval utilities written in scala over to Sting's java IntervalUtils.
Added a very basic intergration test to ensure that the fullCallingPipeline.q compiles.
Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test).
While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1".
Upgraded to scala 2.8.1 and updated calls to deprecated functions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 20:14:28 +00:00
depristo 42acc968b1 Unit tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4660 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 20:09:39 +00:00
ebanks b51762c279 When you commit code late at night you tend to make careless mistakes... like forgetting to update integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4658 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 14:41:10 +00:00
depristo 988da428ae Bug fix for old style tranches file. ApplyVariantCuts moved over, and passes integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4657 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 14:38:26 +00:00
depristo c5f8c4dd0d VariantEval test for tranches file, plus cutting over VE to use the generic Tranches framework
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4656 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 13:52:40 +00:00