1014 Commits (f29bb0639b76da656e368317553d103070862d23)

Author SHA1 Message Date
depristo f29bb0639b Documentation and cleanup of the distributed GATK implementation. Detailed documentation -- given that Matt will be extending the system in the near future -- about how the locking and processing trackers work. Added error trapping to note that distributed, shared-memory parallelism isn't yet implemented, instead of failing silently. General utility function for the analysis of distributed GATK operation in the analysis directory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5106 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:40:09 +00:00
depristo f522eb2848 Previous tests were just too big...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5095 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 13:48:38 +00:00
hanna 4a33cdacde Some basic integration tests detecting breakage in OTF BAM index generation.
Doing it manually for the moment so that there's at least something testing
this capability; will follow up eventually with Mark to see whether we can
shape the VCF index generation code in such a way that it supports BAM index
testing as well.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5093 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 23:48:04 +00:00
ebanks dfc5a3d1f3 added integration test for --sites_only option
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5082 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:58:15 +00:00
depristo be697d96f9 An apparently robust implementation of the file locking for distributed computation, using Lucene's file creation locking approach. It is worth trying out for those with large-scale, high-cost data sets. Details and discussion at group meeting on Wednesday. Some cleanup still needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5079 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 13:45:40 +00:00
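
The file-creation locking mentioned in be697d96f9 can be illustrated with a minimal sketch. It assumes the "Lucene approach" means treating the atomic creation of a marker file as lock acquisition; the class name `FileCreationLock` and its methods are hypothetical and are not GATK or Lucene code. Whether creation is truly atomic depends on the underlying filesystem, and a lock file left behind by a crashed process has to be removed by hand.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper for illustration only (not GATK or Lucene code):
// the lock is considered held for as long as the marker file exists.
final class FileCreationLock {
    private final File lockFile;
    private boolean held = false;

    FileCreationLock(File lockFile) {
        this.lockFile = lockFile;
    }

    /** Tries once; returns true only if this call created the marker file. */
    synchronized boolean tryLock() throws IOException {
        // createNewFile() creates the file only if it does not already exist,
        // and reports whether this call was the one that created it.
        if (lockFile.createNewFile()) {
            held = true;
            return true;
        }
        return false;
    }

    /** Polls tryLock() until it succeeds or the timeout elapses. */
    boolean lock(long timeoutMillis, long pollMillis) throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!tryLock()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
        return true;
    }

    /** Releases the lock by deleting the marker file, if this instance holds it. */
    synchronized void unlock() throws IOException {
        if (held) {
            if (!lockFile.delete()) {
                throw new IOException("Could not delete lock file " + lockFile);
            }
            held = false;
        }
    }
}
```

Two processes pointing at the same path would each construct a `FileCreationLock` and call `lock(...)` before touching shared state, then `unlock()` in a `finally` block.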
kshakir 9923e05e0a Moved MD5 utils from WalkerTest to BaseTest for use by PipelineTests.
Moved VariantEval validation from FCPTest to PipelineTest.
Cleaned up some duplicate code for writing temp files during tests.
Moved FCPTest to playground namespace to match move for FCP.q.
Added a basic HelloWorldPipelineTest for the HelloWorld QScript. 
Moved duplicated error handling from JobRunners into the FunctionEdge.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5068 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 04:11:49 +00:00
hanna 9db02059ac Fix for Ryan's issue: reads ending with indel distort the location of the
pileup, resulting in two map() calls for the same locus (and no map call for
the locus immediately following).
Fixed bug and added comprehensive unit tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5067 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 19:49:39 +00:00
depristo c50f39a147 V3 of the distributed GATK. High-efficiency implementation. Support for status tracking for debugging and display. Still not safe for production use due to NFS filelock problem. V4 will use an alternative file locking mechanism.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5063 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 16:45:07 +00:00
depristo a51061fd96 Improved distributed processing analytics. Still not 100% ready for prime-time. More improvements incoming. Iterator claim now supports requests to obtain in a single atomic claim (one lock) multiple sequential shards, which radically reduces overhead. However, deadlocking is still possible...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5061 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 16:17:25 +00:00
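
The multi-shard claim described in a51061fd96 can be sketched as follows. This is an assumption-laden illustration: shards are reduced to integer indices, the lock is an in-process `ReentrantLock` rather than a file lock, and `ShardClaimer` is a hypothetical name. The point it shows is that claiming several sequential shards costs one lock acquisition instead of one per shard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a processing tracker; shards are indices 0..totalShards-1.
final class ShardClaimer {
    private final ReentrantLock lock = new ReentrantLock();
    private final int totalShards;
    private int nextUnclaimed = 0;

    ShardClaimer(int totalShards) {
        this.totalShards = totalShards;
    }

    /** Claims up to maxShards consecutive shards under a single lock acquisition. */
    List<Integer> claimNext(int maxShards) {
        List<Integer> claimed = new ArrayList<>();
        lock.lock();
        try {
            while (claimed.size() < maxShards && nextUnclaimed < totalShards) {
                claimed.add(nextUnclaimed++);
            }
        } finally {
            // Always release, even if something above throws.
            lock.unlock();
        }
        return claimed;
    }
}
```

A worker would loop on `claimNext(n)` and stop when the returned list is empty.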
ebanks 2d4bcb60a1 Don't print out alt alleles for ref calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5060 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 06:33:31 +00:00
ebanks 2bbcc9275a Committing the fragment-based calling code. Results look great in all datasets (will show this at 1000G this week with Ryan). Note that this is an intermediate commit. The code needs to be cleaned up and the fragmentation code needs to be moved up into LocusIteratorByState. This should all happen later this week, but I don't want Ryan to have to keep running from my own personal Sting directory. The current crappy implementation adds ~10% to the runtime, but that should all go away in the next iteration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5058 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 05:04:17 +00:00
depristo 9b1b8d46aa Performance tracking of GenomeLocProcessingTrackers, as well as a marker for where to put tracker in HierarchicalMicroScheduler
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5051 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 22:24:42 +00:00
hanna aea121a9d5 <key>=<value> tagging support for command-line arguments. Unfortunately, still
very hard to validate and still very hard to use (requires core hacking to 
support additional tags).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5038 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 00:22:42 +00:00
kshakir 8855f080c2 For the fullCallingPipeline.q:
 - Reading the refseq table from the YAML if not specified on the command line.
 - Removed obsolete -bigMemQueue now that CombineVariants runs in 4g.
 - Added a -mountDir /broad/software option to work around adpr automount issues.
 - Merged the LSF preexec used for automount into the shell script used to execute tasks.
 - Using the LSF C Library to determine when jobs are complete instead of postexec.
 - Updated queue.sh to match the changes above.
 - Updated the FCPTest to match the changes above.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5036 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 22:34:43 +00:00
hanna 8831ec3dce Some refactoring and cleanup around the area of my sleep-deprived integration
test typo, which Khalid already fixed for me.  Sorry, Khalid!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5035 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 15:03:14 +00:00
kshakir 3022f4dfa0 Fixed missing space character in testSimpleVCFStreaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5034 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 14:49:38 +00:00
depristo 85553cf5cb V2: cleaner, easily tested shared-memory and distributed GATK job management. Serious unit testing. Much cleaner processing. Some code cleanup remains in removing now-unused classes, but the system is ready for general testing. Confirmed that one can run the UG 100 ways parallel without error, but edge cases may remain.
See documentation at:

http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK#Distributed_Parallelism_.28Experimental.29

for examples of how to run this, or see the testing Scala script

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5032 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:58:13 +00:00
depristo 41c8552d0a Added implements HasGenomeLocation to all relevant classes. It's now possible to write generic code for working with objects that support the getLocation() function in HasGenomeLocation. Please, if you have an object that has a location, implement this interface and start using / writing generic functions to sort, compare, etc. these objects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5031 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:54:03 +00:00
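
The kind of generic code that 41c8552d0a asks for might look like the sketch below. `SimpleLoc`, `HasSimpleLocation`, and `LocationUtils` are simplified stand-ins invented for this illustration; only the `getLocation()` method name comes from the commit message, and the real GATK types carry more state.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified stand-in for a genomic position (illustration only).
final class SimpleLoc {
    final int contigIndex;
    final long start;

    SimpleLoc(int contigIndex, long start) {
        this.contigIndex = contigIndex;
        this.start = start;
    }
}

// Stand-in for an interface exposing getLocation().
interface HasSimpleLocation {
    SimpleLoc getLocation();
}

final class LocationUtils {
    private LocationUtils() {}

    /** Orders by contig index first, then by start position. */
    static final Comparator<HasSimpleLocation> BY_LOCATION =
        Comparator.<HasSimpleLocation>comparingInt(o -> o.getLocation().contigIndex)
                  .thenComparingLong(o -> o.getLocation().start);

    /** Returns a sorted copy; works for any type that exposes a location. */
    static <T extends HasSimpleLocation> List<T> sortedByLocation(List<T> items) {
        List<T> copy = new ArrayList<>(items);
        copy.sort(BY_LOCATION);
        return copy;
    }
}
```

Because the method is bounded on the interface rather than on a concrete class, the same sort works for reads, variants, or any other located object.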
hanna 7087c2f422 Very simple integration tests for basic VCF streaming functionality.
Rather than try to fork the integration test process to get a pipe source
and sink, creates a new named pipe by Runtime.exec()ing the 'mkfifo' shell
command.  We'll see whether this proves to be a reliable method for testing
streaming.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5028 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 04:38:54 +00:00
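
A rough sketch of the named-pipe trick from 7087c2f422, assuming a POSIX system with `mkfifo` on the PATH. The class `NamedPipes` and its method names are hypothetical; only the use of `Runtime.exec()` on `mkfifo` is taken from the commit message.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper for illustration; not the actual test code.
final class NamedPipes {
    private NamedPipes() {}

    /** Builds the argument vector for the external 'mkfifo' command. */
    static String[] mkfifoCommand(File pipe) {
        return new String[] { "mkfifo", pipe.getAbsolutePath() };
    }

    /** Runs mkfifo and fails loudly if it exits non-zero. POSIX systems only. */
    static File createNamedPipe(File pipe) throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(mkfifoCommand(pipe));
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("mkfifo exited with status " + exit + " for " + pipe);
        }
        return pipe;
    }
}
```

Passing the command as an array avoids shell quoting problems when the path contains spaces. Note that opening a FIFO for writing blocks until a reader opens it, so a test would typically write from a separate thread.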
kshakir 2b895ffb7f Updated the HG19 reference from v0 to v1 after the v0 was zeroed out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5023 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 20:30:25 +00:00
depristo f8ba76d87c Incremental commit for distributed computation. Appears to work but has potential deadlock situation not yet debugged. Do not use yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5010 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 21:23:09 +00:00
depristo a88708ebfa Moving GLF code to archive
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5006 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-15 22:42:42 +00:00
hanna af31d02a2d Fix concurrency issue that periodically kills VariantEvalIntegrationTest --
a member field of RMDTrackBuilder was getting rebuilt every time it was
called, creating concurrency issues.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5001 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 18:52:21 +00:00
depristo af1bce3492 Longer wait time for threading test (5 min now) and an assertion to ensure that all jobs finished. Should probably just remove the longer running test so this becomes a non-issue
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4999 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 13:09:31 +00:00
depristo afbea9ce59 SharedMemory and SharedFile implementations of GenomeLocProcessingTracker, along with serious unit tests that both pass. Slightly inefficient implementation but sufficient for further testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4998 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 03:14:24 +00:00
hanna 02dc0f97d1 Remove testWalkerUnitTest; it doesn't actually do anything and just adds
extra cruft to the output.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4993 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 19:02:00 +00:00
rpoplin ce3d226183 Reverting back to the old definition of QD because it works better with large numbers of samples. The new QD is relegated to a new annotation: sumGLbyD. Tweaks to the new HaplotypeScore based on evaluation with better QD calculation. The default qual threshold in GenerateVariantClusters is updated to be in line with the variant quality scores coming from the exact model.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4984 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 14:12:30 +00:00
hanna edebbb5aa0 Fixed long-standing bug reported by Mauricio: @Arguments assigned to
primitive types are now properly validated and throw the proper
MissingArgumentValue UserException.  Before this fix, the error reported
was the infamous DePristo BSOD (Could not create module String because 
an exception of type NullPointerException occurred caused by exception null).

Thanks Mauricio!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4980 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 22:18:24 +00:00
hanna 6d855041ec Oops...forgot to commit the changes that allow primitive VCF streaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4979 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 21:54:51 +00:00
depristo 8fe5641b2e can explicitly set the now required ReferenceDataSource in unit tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4977 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:25:12 +00:00
aaron 7916ab0ed5 remove the index each run
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4976 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:38:22 +00:00
depristo 468ef382b7 vastly improved progress meter that estimates % of work done and time remaining until the job finishes. Reordered GATK core initialization order -- intervals are created before the scheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4975 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:32:27 +00:00
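
One simple way to produce the estimates described in 468ef382b7 is linear extrapolation from the fraction of work completed. The sketch below assumes that approach; the commit does not say how the real meter computes its numbers, and `ProgressEstimate` is a hypothetical name.

```java
// Hypothetical illustration of a linear time-remaining estimate.
final class ProgressEstimate {
    private ProgressEstimate() {}

    /** Fraction of work completed, clamped to [0, 1]. */
    static double fractionDone(long unitsDone, long unitsTotal) {
        if (unitsTotal <= 0) {
            throw new IllegalArgumentException("unitsTotal must be positive");
        }
        return Math.min(1.0, Math.max(0.0, (double) unitsDone / unitsTotal));
    }

    /**
     * Estimated milliseconds remaining, assuming work proceeds at the
     * average rate seen so far. Returns -1 when nothing has been done yet,
     * since no rate can be estimated.
     */
    static long remainingMillis(long unitsDone, long unitsTotal, long elapsedMillis) {
        double f = fractionDone(unitsDone, unitsTotal);
        if (f == 0.0) {
            return -1;
        }
        return Math.round(elapsedMillis * (1.0 - f) / f);
    }
}
```

Knowing the total up front is what makes this possible, which may be why the same commit moves interval creation ahead of the scheduler.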
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
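
The -DQS behavior described in 5e9a8f9cb3 can be sketched roughly as below. The class, the `UNSET` sentinel, and the choice to leave reads that already have qualities untouched are assumptions made for this illustration, not the actual GATK implementation.

```java
import java.util.Arrays;

// Hypothetical illustration of filling in a default base quality.
final class DefaultQualities {
    /** Sentinel meaning the user did not supply a default quality. */
    static final int UNSET = -1;

    private DefaultQualities() {}

    /**
     * Returns the read's own qualities when they cover every base.
     * Otherwise fills an array with defaultQual, or throws if no default was given.
     */
    static byte[] resolve(int readLength, byte[] quals, int defaultQual) {
        if (quals != null && quals.length == readLength) {
            return quals;
        }
        if (defaultQual == UNSET) {
            throw new IllegalStateException(
                "Read is missing base qualities and no default quality score was provided");
        }
        byte[] filled = new byte[readLength];
        Arrays.fill(filled, (byte) defaultQual);
        return filled;
    }
}
```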
aaron cba436fa2f small fix for the table codec; if you see a header line, you know you've finished parsing the header. Also some changes to return the ref ordered data pool test to using MappedStreamSegment instead of EntireStream
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4942 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 21:20:26 +00:00
hanna 0982d35f5b Bug fixes in streaming in Tribble data via /dev/stdin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4935 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 02:43:04 +00:00
rpoplin 23dbc5ccf3 HaplotypeScore is revamped. It now uses reads' Cigar strings when building the haplotype blocks to skip over soft-clipped bases and factor in insertions and deletions. The statistic now uses only the reads from the filtered context to build the haplotypes but it scores all reads against the two best haplotypes. The score is now computed individually for each sample's reads and then averaged together. Bug fixes throughout. The math for the base quality and mapping quality rank sum tests is fixed. The annotations remain as ExperimentalAnnotations pending more investigation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4934 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 00:28:05 +00:00
hanna 8d2c14b29c Update Picard / sam-jdk at Tim's request.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:17:25 +00:00
hanna 3fc9862964 Unit test fixed - Tribble codecs aren't designed to be stateless, but I was
using one as though it was.  Fixed, and debug code reverted.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4917 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 17:47:52 +00:00
hanna b9cb57f4b9 A unit test is failing on bamboo in a way I can't reproduce (or even explain).
Checking in some debugging info.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4916 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 16:35:04 +00:00
hanna cba18116e4 A significant refactoring of the ROD system, done largely to simplify the process of
streaming/piping VCFs into the GATK.  Notable changes:
- Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build 
  RMDTracks and lookup codecs.
- RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now
  manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class
  are the walkers that feel the need to access the ROD system directly.  (We need to stamp out
  this access pattern.)
A few minor warts were introduced as part of this process, labeled with TODOs.  These'll be
fixed as part of the VCF streaming project.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 04:52:22 +00:00
ebanks 848977678d No reason to convert the GLs to a String for formatting when they're just going to be converted to PLs later. That was 5% of the UG runtime...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4913 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 22:06:19 +00:00
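
For context on the GL-to-PL conversion mentioned in 848977678d: GLs are log10-scaled genotype likelihoods and PLs are their phred-scaled, integer-rounded counterparts. The sketch below converts directly between numeric arrays with no string step. Normalizing so the most likely genotype gets PL 0, the absence of any upper cap, and the name `Likelihoods.glToPl` are assumptions of this illustration rather than a description of the UG code.

```java
// Hypothetical illustration of a numeric GL -> PL conversion.
final class Likelihoods {
    private Likelihoods() {}

    /**
     * Converts log10 likelihoods to phred-scaled integers,
     * shifted so that the best genotype has PL 0.
     */
    static int[] glToPl(double[] log10Likelihoods) {
        double best = Double.NEGATIVE_INFINITY;
        for (double gl : log10Likelihoods) {
            best = Math.max(best, gl);
        }
        int[] pls = new int[log10Likelihoods.length];
        for (int i = 0; i < pls.length; i++) {
            // Phred scale is -10 * log10(likelihood); round to the nearest integer.
            pls[i] = (int) Math.round(-10.0 * (log10Likelihoods[i] - best));
        }
        return pls;
    }
}
```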
ebanks 8a0c07b865 Support for indels in hapmap. This was non-trivial because not only does hapmap not tell you whether the allele is an insertion or deletion, but it also has a completely different positioning strategy (rightmost base). I'll send out an email tomorrow when the new HapMap3.3 VCF is ready.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4908 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-27 07:37:46 +00:00
hanna e313eeede8 Push command-line expansions, such as BAM list unpacking and -B tag parsing, out
into the CommandLine* classes.  This makes it easier for external functionality
(such as the VCF streamer) to use GenomeAnalysisEngine directly.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4897 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 19:00:17 +00:00
depristo a3729bd59c Now I call BeforeMethod correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4872 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:45:45 +00:00
depristo b7e4a015c0 static thread cache reset in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4870 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:53:10 +00:00
depristo 3bbc6a0540 Slightly more thread safe CachingIndexedFastaSequenceFile.java. Likely passes parallel testing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4869 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:05:17 +00:00
depristo 5dd0e8388b Fixed a bug in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4867 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 19:44:35 +00:00
depristo 4a54f3f230 ThreadLocal version of CachingIndexedFastaSequenceFile. More efficient support for shared memory BAQ calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4865 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:44:48 +00:00
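
The ThreadLocal pattern named in 4a54f3f230 can be sketched generically. `PerThreadReader` is a hypothetical wrapper, not the real CachingIndexedFastaSequenceFile; it only shows how giving each thread its own instance removes the need to synchronize access to a per-reader cache.

```java
import java.util.function.Supplier;

// Hypothetical illustration: each thread lazily receives its own instance,
// so cached state inside that instance is never shared between threads.
final class PerThreadReader<R> {
    private final ThreadLocal<R> local;

    PerThreadReader(Supplier<R> factory) {
        // The factory runs once per thread, on that thread's first get().
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's private instance. */
    R get() {
        return local.get();
    }
}
```

The trade-off is memory: every worker thread holds its own cache, which is usually acceptable when the cached window is small relative to the reference.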
hanna acfe83920b '-L unmapped': adding integration tests for explicitly including (-L unmapped)
unmapped reads and explicitly excluding (-XL unmapped) unmapped reads, augmenting
the suite of unit tests already put in place.

'-L unmapped' seems safe to use; go for it, but please validate results against
samtools flagstat when the process finishes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4849 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 23:11:46 +00:00
ebanks 5c0b66cb7c 3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 16:24:28 +00:00