chartl
60ddc08cdf
Added a boatload of new case-control association modules. Switched the U-test to use longs rather than ints (it just so happened that I overflowed and started getting negative U statistics. Not good.) Added the ALL association type for ease of specifying that we want to throw the book at something. Added an svn-commit.tmp~ because I can't get rid of it even with --force. Hopefully I can remove it after.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5386 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 21:58:12 +00:00
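The int-to-long switch described in 60ddc08cdf can be illustrated with a minimal sketch. It assumes the overflow came from a product on the order of n1 * n2 (the largest value a U statistic can take for sample sizes n1 and n2); the commit does not say exactly where the overflow occurred, and the class and method names here are hypothetical, not taken from the GATK source.

```java
// Hypothetical illustration; not the GATK MannWhitneyU code.
final class UStatisticOverflow {
    // Largest possible U for sample sizes n1 and n2, computed in 32-bit arithmetic.
    static int maxUAsInt(int n1, int n2) {
        return n1 * n2; // silently wraps once the product exceeds Integer.MAX_VALUE
    }

    // Same quantity in 64-bit arithmetic; the cast must happen before the multiply.
    static long maxUAsLong(int n1, int n2) {
        return (long) n1 * n2;
    }

    public static void main(String[] args) {
        int n1 = 50_000, n2 = 50_000;
        System.out.println("int:  " + maxUAsInt(n1, n2));
        System.out.println("long: " + maxUAsLong(n1, n2));
    }
}
```

With two samples of 50,000 observations each, the int version wraps to a negative number while the long version holds 2,500,000,000.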
depristo
af71576a07
CalculateChromosomeCounts() now only calculates AC, AF, and AN when there are genotypes. Can now combine variants with headers that differ only in whether a field is an integer or a float. Updated the CombineVariants integration test, as incorrect AC values were being calculated in the previous GS outputs.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5383 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 19:25:52 +00:00
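As a rough sketch of the behavior described in af71576a07: AN is the number of called alleles, AC the number of called alternate alleles, and AF their ratio, and none of them is produced when there are no genotypes. The genotype encoding (arrays of allele indices, 0 for reference, -1 for a no-call) and all names below are assumptions made for illustration; all alternate alleles are lumped into a single count for brevity.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; not the GATK CalculateChromosomeCounts() code.
final class ChromosomeCountsSketch {
    static final int NO_CALL = -1; // assumed encoding for a missing allele

    final int an;    // AN: number of called alleles
    final int ac;    // AC: number of called alternate alleles (all alts lumped together)
    final double af; // AF: AC / AN

    private ChromosomeCountsSketch(int an, int ac) {
        this.an = an;
        this.ac = ac;
        // Reporting 0.0 when every allele is a no-call is an arbitrary choice for this sketch.
        this.af = an == 0 ? 0.0 : (double) ac / an;
    }

    // Returns an empty Optional when there are no genotypes at all.
    static Optional<ChromosomeCountsSketch> fromGenotypes(List<int[]> genotypes) {
        if (genotypes.isEmpty()) {
            return Optional.empty();
        }
        int an = 0;
        int ac = 0;
        for (int[] alleles : genotypes) {
            for (int allele : alleles) {
                if (allele == NO_CALL) {
                    continue; // no-calls contribute to neither AN nor AC
                }
                an++;
                if (allele > 0) {
                    ac++;
                }
            }
        }
        return Optional.of(new ChromosomeCountsSketch(an, ac));
    }
}
```

For the diploid genotypes 0/1, 1/1, 0/0 and a no-call, this gives AN = 6, AC = 3 and AF = 0.5.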
chartl
a40a8006b5
Added unit tests for the statistics calculated by the test runner, along with bug fixes to the calculations, so we have some assurance that the statistics coming out of the back end are correct.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5380 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 16:54:02 +00:00
hanna
c40efe1dea
Fixed exception for BAMs without filenames (unit tests, BAM input streaming, etc.).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5379 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 13:43:49 +00:00
depristo
ad51f30244
A trivial, but useful, sum of a list of integers
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5378 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-05 06:09:05 +00:00
chartl
9ca1dd5d62
Miscellaneous changes:
...
- RefMetaDataTracker: grabbing variant contexts given a prefix (not sure where else this was implemented, if someone can show me I'll remove it)
- VCFUtils: grabbing VCF headers given a prefix
- MathUtils: Useful functions for calculating statistics on collections of Numbers
- VariantAnnotator: Made isUniqueHeaderLine a public static method -- maybe this should go into a different class. Not sure.
- Associations: PluginManager now used to propagate classes, implementations for Z,T,U tests, slight alteration to format to make the objects stored
in the window optionally different from those returned by whatever statistic is run across the window
Added:
- MannWhitneyU. Started to fix up WilcoxonRankSum but there are comments in there questioning the validity of some of the code, and I'm sure that
it's actually doing a U test. This implementation includes the direct calculation of p-values for small sample sizes, and a uniform approximation
for when one of the sample sets is small, and the other large. Unit tests to follow.
- BootstrapCallsMerger: takes n VCFs which have been called on the same samples; merges them together while averaging the annotations
- BootstrapCalls.q: qscript for testing the effectiveness of bootstrap low-pass calling on the exome
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5372 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 22:43:36 +00:00
hanna
7a22f19366
More descriptive error when VerifyingSamIterator hits an inconsistent alignment. Also updated case UserException.MalformedBAM to match case of UserException.MissortedBAM for consistency and ease-of-use.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5364 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 03:55:24 +00:00
ebanks
660998065b
'Okay, now I'm absolutely certain that there are no more bugs in the constrained writer.'
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5353 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 03:48:40 +00:00
asivache
570186fa42
Added (deep) clone() and merge() to the RunningAverage utility class
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5350 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 00:35:23 +00:00
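A minimal sketch of what clone() and merge() on a running average might look like, for 570186fa42. It tracks only a count and a mean; whether the real RunningAverage class also tracks variance or other moments is not stated in the commit, and every name below is hypothetical.

```java
// Hypothetical sketch; not the GATK RunningAverage class.
final class RunningAverageSketch {
    private long n;      // number of observations seen so far
    private double mean; // running mean of those observations

    // Incremental mean update: no need to store the observations.
    void add(double x) {
        n++;
        mean += (x - mean) / n;
    }

    double mean() {
        return mean;
    }

    long observationCount() {
        return n;
    }

    // Independent copy: later changes to either object do not affect the other.
    RunningAverageSketch copy() {
        RunningAverageSketch c = new RunningAverageSketch();
        c.n = n;
        c.mean = mean;
        return c;
    }

    // Fold another running average into this one, weighting by observation counts.
    void merge(RunningAverageSketch other) {
        if (other.n == 0) {
            return;
        }
        long total = n + other.n;
        mean += (other.mean - mean) * other.n / total;
        n = total;
    }
}
```

Merging an average over {1, 2, 3} with one over {10, 20} yields the same mean (7.2) as feeding all five values to a single instance.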
chartl
0723b0f44c
Generalized association is now working. Output is in a horrific format. Implementation of T-testing. Improvements are to look for classes dynamically (a la VariantEval/VariantAnnotator), beautify output, and do optimizations where they exist.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5341 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 01:23:37 +00:00
delangel
d059d89a9d
Fixes and cleanups for indel eval module. Also outputs AT/CG ratio in dedicated column in IndelStatistics.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5332 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 12:07:50 +00:00
ebanks
05fac8583d
Following up Mark's recent commit: hooking up the --maxPositionalMoveAllowed argument into the indel realigner and through to the SAM writer. We now ensure that no read is realigned more than N bases (200 by default, which is nowhere close to realistically possible). If anyone ever sees a warning message about this with the default value then please let me know because I need to see it for myself.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5331 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 04:40:54 +00:00
depristo
1dedfdb11b
Fixes for constrained movement Indel Realigner. Now sorts all of the reads in the interval before handing them to ConstrainedMateFixingSAMFileWriter to maintain correct contract between the two pieces of software
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5329 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 03:52:18 +00:00
ebanks
5d28cbda27
When crossing contigs it's crucial that the queue get flushed or else it will continue to accumulate reads without emitting. This is the last time I trust someone when they tell me that they are 'confident there are no bugs' in a tool.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5315 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 05:18:30 +00:00
rpoplin
1129f1535d
Fix for the HaplotypeScore optimization in AlignmentUtils
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5310 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 20:40:18 +00:00
rpoplin
255cc246a2
Change in Methods development pipeline: dbsnp130 can't be used for anything, changed it to dbsnp129. Optimization for HaplotypeScore and the to-be-committed ReadPosRankSumTest in AlignmentUtils
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5301 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 16:09:03 +00:00
ebanks
93888e570b
Phase 2: after hours of testing, confirming that constrained mode looks good so moving the integration tests over to use it. Some cleanup. More cleanup coming in Phase 3.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5298 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 06:23:41 +00:00
depristo
1a5d296737
ReplaceReadGroups. Fixes BAM files without read group info. MissingReadGroup points people to this tool now. Please point users on the forum to this tool now. Will migrate to Picard.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5284 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-21 14:02:41 +00:00
depristo
cd7a7091ba
Lexicographic error points users to the ReorderSam wiki entry
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5281 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-19 23:45:37 +00:00
kshakir
290afae047
GSA-423 Better reporting for errors in QScript.script().
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5276 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 22:21:15 +00:00
asivache
52eedaf22d
Subtle but very annoying bug due to incorrect exit condition on backward traversal. Example of incorrect old behavior (found by Martha Borkan, this normally would NOT happen with the combination of match/mismatch/open/extend parameters we have been using; use match=10.0, mismatch= -9.0, open= -15.0, extend= -6.66 in older builds in order to reproduce):
...
let's align two sequences (shown below, good alignment)
AAATTTGGTAAAA-GT
AAATTTGGTAAAAGGT
now let's reverse the very same sequences and align again
TGAAAATGGTTTAAA
TGGAAAATGGTTTAAA
Note how we lost the deletion and got a mismatch instead at the very first letter of the upper sequence. The overall score of any particular alignment does not depend on the direction of the traversal, so the best alignment (with the highest score) should stay the same too.
New version fixes this issue and produces correct alignment of reverse sequences (up to the different choice of redundant position for the deletion):
T-GAAAATGGTTTAAA
TGGAAAATGGTTTAAA
This version also has the main() method reinstated, so the aligner can be run on its own as a little app.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5255 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 00:02:32 +00:00
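The claim in 52eedaf22d that the score of a given alignment does not depend on traversal direction can be checked with a small scoring sketch. It assumes the common affine-gap convention in which the first gap column is charged the open penalty and each following column of the same gap is charged the extend penalty; the commit does not spell out the aligner's exact convention, and the class below is hypothetical rather than the aligner itself.

```java
// Hypothetical sketch: scores an already-gapped pairwise alignment; it does not align anything.
final class AlignmentScoreSketch {
    static double score(String top, String bottom,
                        double match, double mismatch, double open, double extend) {
        if (top.length() != bottom.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        double total = 0.0;
        int gapRow = 0; // 0 = previous column was not a gap, 1 = gap in top row, 2 = gap in bottom row
        for (int i = 0; i < top.length(); i++) {
            char a = top.charAt(i);
            char b = bottom.charAt(i);
            int row = a == '-' ? 1 : (b == '-' ? 2 : 0);
            if (row != 0) {
                // Continuing a gap in the same row costs "extend"; starting one costs "open".
                total += (row == gapRow) ? extend : open;
            } else {
                total += (a == b) ? match : mismatch;
            }
            gapRow = row;
        }
        return total;
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        String top = "AAATTTGGTAAAA-GT";
        String bottom = "AAATTTGGTAAAAGGT";
        System.out.println(score(top, bottom, 10.0, -9.0, -15.0, -6.66));
        System.out.println(score(reverse(top), reverse(bottom), 10.0, -9.0, -15.0, -6.66));
    }
}
```

Under these assumptions the forward alignment, its column-wise reversal, and the corrected reverse alignment from the commit message all have 15 matches and one single-base gap, so they score the same.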
delangel
f3de9ee3e0
Refactoring of indel evaluation code to make it easier for external functions to get access to indel classification, in preparation for IndelMetricsByAC to stratify indel classes by AC (not done yet).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5219 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 17:35:16 +00:00
depristo
29f3ad72f3
SAMFileWriter that allows the user to move reads, but only a bit, in an incoming coordinate-sorted BAM file. Does some local reordering and local mate fixing, under specified constraints. These constraints allow us to make a special model of the realigner (under testing by Eric, who promised to try this out a bit and expand test cases and integration tests, but soon to be the default and only model) that only moves reads with ISIZE < 3000 and directly emits a coordinate-sorted, mate-fixed, validating BAM file without needing FixMates externally. Preliminary testing shows this runs in a totally fine amount of memory and produces equivalent results to the previous version.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5199 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 22:27:05 +00:00
depristo
11ea321b39
Trivial header cleanup
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5198 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 22:23:15 +00:00
depristo
0ad1ea4aa1
Fixed Umapped misspelling
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5196 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 22:21:41 +00:00
hanna
5c3198520c
A few minor modifications masquerading as significant changes according to svn's logs:
...
- Copied BAM indexing engine from Picard back into the GATK anticipating
shard merging algorithm. Tried to leave most of the building blocks in
Picard. If this turns into a logistical nightmare, I'll merge the building
blocks into the GATK as well.
- Reorganized the org.broadinstitute.sting.gatk.datasources package, giving
better separation of query and management functionality for reads, ref, rmd,
and samples.
- Merged Shard building blocks into the org.broadinstitute.sting.gatk.datasources.reads
package, indicating its current strong relationship with the reads, rather than
the general unifying element I wish this would be.
- Collapsed BAMFormatAwareShard into Shard.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5184 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 17:59:19 +00:00
ebanks
43fb11b923
Removing stray non-ASCII character
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5171 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 03:10:08 +00:00
hanna
25f045cac6
Changing locking errors to warnings. This will hopefully allow us to diagnose the mysterious failure in STING_INTEGRATION-3832, the next time it appears.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5164 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 16:29:31 +00:00
kshakir
d4f744a4d4
Checking if the interval files exist before using them to calculate the minimum scatter parts.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5143 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 18:07:34 +00:00
kshakir
2ef66af903
Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high.
...
Moved the BamListWriter from FCP to ListWriterFunction in the Queue core.
Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s.
Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:33:58 +00:00
depristo
2182b8c7e2
Better query start / stop function that directly parses the cigar string, unlike the previous version. Now properly handles H (hard-clipped) reads. Added -baq OFF and -baq RECALCULATE integration tests on all three 1KG technologies. Please let me know if this new code somehow fails.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5108 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 15:08:21 +00:00
depristo
f29bb0639b
Documentation and cleanup of the distributed GATK implementation. Detailed documentation -- given that Matt will be extending the system in the near future -- about how the locking and processing trackers work. Added error trapping to note that distributed, shared-memory parallelism isn't yet implemented, instead of just not working silently. General utility function for the analysis of distributedGATK operation in the analysis directory
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5106 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:40:09 +00:00
depristo
5ed128f839
Slightly more tolerant timing setting. Main() method in GenomeLocProcessingTracker to generate timing data for trackers.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5097 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 15:16:07 +00:00
depristo
be697d96f9
An apparently robust implementation of the file locking for distributed computation, using Lucene's file creation locking approach. It is worth trying out for those with large-scale, high-cost data sets. Details and discussion at group meeting on Wednesday. Some cleanup still needed.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5079 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 13:45:40 +00:00
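The file-creation locking idea mentioned in be697d96f9 can be sketched as follows: a process holds the lock if and only if it was the one that created the lock file. This sketch uses `java.nio.file.Files.createFile`, which performs the existence check and the creation as one atomic step; it is not the GATK or Lucene implementation, the class name is hypothetical, and how well the approach behaves on a particular network file system is not something this example demonstrates.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a lock based on atomic file creation.
final class CreationLockSketch {
    private final Path lockFile;

    CreationLockSketch(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Returns true if this caller created the lock file and therefore holds the lock.
    boolean tryLock() throws IOException {
        try {
            Files.createFile(lockFile); // fails if the file already exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // someone else holds the lock
        }
    }

    // Releases the lock by deleting the lock file; only the holder should call this.
    void unlock() throws IOException {
        Files.deleteIfExists(lockFile);
    }
}
```

A caller that gets false from tryLock() would typically sleep and retry; a crashed holder leaves a stale lock file behind, which a real implementation has to deal with separately.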
hanna
862b299b47
Fix Picard OTF index generation issue.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5077 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 03:42:46 +00:00
depristo
c50f39a147
V3 of the distributed GATK. High-efficiency implementation. Support for status tracking for debugging and display. Still not safe for production use due to NFS filelock problem. V4 will use alternative file locking mechanism
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5063 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 16:45:07 +00:00
depristo
a51061fd96
Improved distributed processing analytics. Still not 100% ready for prime-time. More improvements incoming. Iterator claim now supports requests to obtain in a single atomic claim (one lock) multiple sequential shards, which radically reduces overhead. However, deadlocking is still possible...
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5061 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 16:17:25 +00:00
ebanks
bb6999b032
Better documentation
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5057 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 03:36:09 +00:00
depristo
c52d2d5f79
Bug fix for SimpleTimer that didn't always convert elapsed times from milliseconds to seconds
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5055 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-22 18:50:59 +00:00
depristo
9b1b8d46aa
Performance tracking of GenomeLocProcessingTrackers, as well as a marker for where to put tracker in HierarchicalMicroScheduler
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5051 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 22:24:42 +00:00
kshakir
8855f080c2
For the fullCallingPipeline.q:
...
- Reading the refseq table from the YAML if not specified on the command line.
- Removed obsolete -bigMemQueue now that CombineVariants runs in 4g.
- Added a -mountDir /broad/software option to work around adpr automount issues.
- Merged the LSF preexec used for automount into the shell script used to execute tasks.
- Using the LSF C Library to determine when jobs are complete instead of postexec.
- Updated queue.sh to match the changes above.
- Updated the FCPTest to match the changes above.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5036 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 22:34:43 +00:00
depristo
e4ac1e6171
Removing unused file
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5033 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 13:03:55 +00:00
depristo
85553cf5cb
V2: cleaner, easily testable shared-memory and distributed GATK job management. Serious unit testing. Very much cleaner processing. Some code cleanup remains in removing now-unused classes, but the system is ready for general testing. Confirmed that one can run the UG 100 ways parallel without error, but edge cases may remain.
...
See documentation at:
http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK#Distributed_Parallelism_.28Experimental.29
for examples on how to run this, or the testing Scala script
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5032 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:58:13 +00:00
depristo
41c8552d0a
Added implements HasGenomeLocation to all relevant classes. It's now possible to write generic code for working with objects that support the getLocation() function in HasGenomeLocation. Please, if you have an object that has a location, implement this interface and start using / writing generic functions to sort, compare, etc. these objects.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5031 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:54:03 +00:00
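The kind of generic code that 41c8552d0a asks for might look like the sketch below. Only the interface name HasGenomeLocation and its getLocation() method come from the commit message; SimpleLoc is a simplified stand-in for the real location type, and NamedFeature and LocationUtilsSketch are hypothetical names used purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Simplified stand-in for a genomic interval (assumption, not the GATK type).
final class SimpleLoc {
    final int contigIndex;
    final int start;
    final int stop;

    SimpleLoc(int contigIndex, int start, int stop) {
        this.contigIndex = contigIndex;
        this.start = start;
        this.stop = stop;
    }
}

// Simplified version of the interface named in the commit message.
interface HasGenomeLocation {
    SimpleLoc getLocation();
}

// Hypothetical example of a class that opts in to location-aware utilities.
final class NamedFeature implements HasGenomeLocation {
    final String name;
    private final SimpleLoc loc;

    NamedFeature(String name, SimpleLoc loc) {
        this.name = name;
        this.loc = loc;
    }

    @Override
    public SimpleLoc getLocation() {
        return loc;
    }
}

final class LocationUtilsSketch {
    // Works for any element type that exposes a location; returns a sorted copy.
    static <T extends HasGenomeLocation> List<T> sortedByLocation(Collection<T> items) {
        List<T> copy = new ArrayList<>(items);
        copy.sort(Comparator
                .comparingInt((T t) -> t.getLocation().contigIndex)
                .thenComparingInt(t -> t.getLocation().start)
                .thenComparingInt(t -> t.getLocation().stop));
        return copy;
    }
}
```

Because sortedByLocation is written against the interface, it can sort reads, variants, or any other class that implements it without a per-type comparator.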
depristo
f8ba76d87c
Incremental commit for distributed computation. Appears to work but has potential deadlock situation not yet debugged. Do not use yet.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5010 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 21:23:09 +00:00
depristo
a88708ebfa
Moving GLF code to archive
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5006 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-15 22:42:42 +00:00
depristo
afbea9ce59
SharedMemory and SharedFile implementations of GenomeLocProcessingTracker, along with serious unit tests that both pass. Slightly inefficient implementation but sufficient for further testing.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4998 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 03:14:24 +00:00
hanna
c0031b05ff
Stamp out lazy loading in the PluginManager. This is an attempt to stamp out the non-deterministic VariantEvalIntegrationTest errors we've been seeing.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4995 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 20:58:28 +00:00
fromer
ffae7bf537
Moved phasing-specific utilities to phasing sub-directory
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4987 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 15:38:20 +00:00
depristo
91824f478e
FASTQ directory is gone
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4986 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 15:16:06 +00:00