Commit Graph

220 Commits (a51061fd96480cea53425d337c19fdf120ed141e)

Author SHA1 Message Date
depristo a51061fd96 Improved distributed processing analytics. Still not 100% ready for prime-time. More improvements incoming. Iterator claim now supports requests to obtain in a single atomic claim (one lock) multiple sequential shards, which radically reduces overhead. However, deadlocking is still possible...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5061 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 16:17:25 +00:00
depristo 9b1b8d46aa Performance tracking of GenomeLocProcessingTrackers, as well as a marker for where to put tracker in HierarchicalMicroScheduler
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5051 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 22:24:42 +00:00
depristo 85553cf5cb V2 cleaner, easily testing, shared memory and distributed GATK job management. Serious unit testing. Very much cleaner processing. Some code cleanup remains in removing now unused classes but the system is ready for general testing. Confirmed that one can run the UG 100 ways parallel without error, but edge cases may remain.
See documentation at:

http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK#Distributed_Parallelism_.28Experimental.29

for examples on how to run this, or the testing Scala script

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5032 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:58:13 +00:00
depristo f8ba76d87c Incremental commit for distributed computation. Appears to work but has potential deadlock situation not yet debugged. Do not use yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5010 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 21:23:09 +00:00
depristo a88708ebfa Moving GLF code to archive
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5006 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-15 22:42:42 +00:00
depristo af1bce3492 Longer wait time for threading test (5 min now) and an assertion to ensure that all jobs finished. Should probably just remove the longer running test so this becoming a non-issue
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4999 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 13:09:31 +00:00
depristo afbea9ce59 SharedMemory and SharedFile implementations of GenomeLocProcessingTracker, along with serious unit tests that both pass. Slightly inefficient implementation but sufficient for further testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4998 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 03:14:24 +00:00
depristo 468ef382b7 vastly improved progress meter that estimates % of work done and time until the job finishes and time remaining. Reordered GATK core initialization order -- intervals are created before the scheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4975 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:32:27 +00:00
hanna 8d2c14b29c Update Picard / sam-jdk at Tim's request.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:17:25 +00:00
depristo a3729bd59c Now I call BeforeMethod correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4872 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:45:45 +00:00
depristo b7e4a015c0 static thread cache reset in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4870 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:53:10 +00:00
depristo 3bbc6a0540 Slightly more thread safe CachingIndexedFastaSequenceFile.java. Likely passes parallel testing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4869 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:05:17 +00:00
depristo 4a54f3f230 ThreadLocal version of CachingIndexedFastaSequenceFile. More efficient support for shared memory BAQ calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4865 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:44:48 +00:00
hanna acfe83920b '-L unmapped': adding integration tests for explicitly including (-L unmapped)
unmapped reads and explicitly excluding (-XL unmapped) unmapped reads, augmenting
the suite of unit tests already put in place.

'-L unmapped' seems safe to use; go for it, but please validate results against
samtools flagstat when the process finishes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4849 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 23:11:46 +00:00
hanna 526ae92093 Getting back to '-L unmapped':
- basic unit tests for interval sorting and merging with mix of mapped/unmapped.
- validation to ensure that locus walkers (really all non-read walkers) blow up with a user error when -L unmapped is specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4837 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 18:24:18 +00:00
depristo 974aaa134d Trival fix to broken build
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4820 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 13:56:03 +00:00
depristo db55b2b0c6 Better testing of BAQ. Now really handles soft clipped reads properly by doing an expensive copy operation :-( will need to be transformed to a ByteBuffer in the near future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4807 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:37:00 +00:00
depristo 16e1bbd380 Hidden command line option to control BAQ gap open penalty for testing by me and eric. ValidateBAQWalker has misc. useful improvements. PrintReads now adds BAQ tags on output, if requested.
BAQ has generally useful improvements.  Refactor code to make it easier for BAQUnitTest to run.  minBaseQuality enforced on output, as well as input now.  Added BAQUnitTest that checks that the BAQ calculation is performing as expected.  Still needs to be expanded significantly.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4804 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 01:01:39 +00:00
depristo 44feb4a362 Improved BAQ implementation. Now supports adding BAQ tags to reads on the fly with ADD_TAG_ONLY option. Caching fasta reader implementation, and changes throughout the system to enable this. Many performance improvements throughout the system due to better reference access patterns.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4792 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:29:39 +00:00
aaron 7f2ded0706 belated special case fix for Menachem; if the results of a BTI and BTIMR produce an empty interval list, exception out. This would be solved long term with better handling or empty and / or null interval lists. I'll add a JIRA
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4754 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 05:49:20 +00:00
hanna 082073ca3c Stop RBP.getPileupBySample() from throwing a NullPointerException if the
sample doesn't exist -- now returns null.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4719 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 05:17:06 +00:00
kshakir 787e5d85e9 Added the ability to test pipelines in dry or live mode via 'ant pipelinetest' and 'ant pipelinetest -Dpipeline.run=run'.
Added an initial test for genotyping chr20 on ten 1000G bams.
Since tribble needs logging support too, for now setting the logging level and appending the console logger to the root logger, not just to "org.broadinstitute.sting".
Updated IntervalUtilsUnitTest to output to a temp directory and not the SVN controlled testdata directory.
Added refseq tables and dbsnps to validation data in BaseTest.
Now waiting up to two minutes for gather parts to propagate over NFS before attempting to merge the files.
Setting scatter/gather directories relative to the -run directory instead of the current directory that queue is running.
Fixed a bug where escaping test expressions didn't handle delimiters at the beginning or end of the String.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4717 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-22 22:59:42 +00:00
bthomas 374c0deba2 Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-19 19:59:05 +00:00
kshakir 673fa841a4 Updated PluginManager so that during testing Queue can dynamically compile and load separately multiple class directories into the same class loader.
Removed obsolete usages of PackageUtils with updated PluginManager.
Ported Queue interval utilities written in scala over to Sting's java IntervalUtils.
Added a very basic intergration test to ensure that the fullCallingPipeline.q compiles.
Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test).
While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1".
Upgraded to scala 2.8.1 and updated calls to deprecated functions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 20:14:28 +00:00
hanna 8e36a07bea Convert GenomeLocParser into an instance variable. This change is required
for anything that needs to be simultaneously aware of multiple references, eg
Queue's interval sharding code, liftover support, distributed GATK etc.  

GenomeLocParser instances must now be used to create/parse GenomeLocs.
GenomeLocParser instances are available in walkers by calling either

-getToolkit().getGenomeLocParser()
or
-refContext.getGenomeLocParser()

This is an intermediate change; GenomeLocParser will eventually be merged
with the reference, but we're not clear exactly how to do that yet.  This
will become clearer when contig aliasing is implemented.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 17:59:50 +00:00
hanna 8f9bf82aa7 Bamboo is correctly interpreting test fails. Reverting forced-fail test
code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4617 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-02 19:32:34 +00:00
hanna 1df166b76e Forcing a unit test fail to ensure that Bamboo is picking up on failed tests
as well as successes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4616 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-02 19:03:12 +00:00
hanna 861ee3e37a Changing testing framework from junit -> testng, for its enhanced configurability.
Initial test to see how Bamboo will respond.  More detailed email to follow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4609 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 21:31:44 +00:00
asivache aadd230636 N-Way-Out is back. Now uses SAMReadID to identify each read's source bam, so should be reliable. Interface is sort of ugly fo now: to generate output file names, .bam is stripped from input file names, then the value of -nWayOut argument is pasted on (and all the output files are written into the current dir).
Unrelated change: in the sorted-target mode (when we read sorted target intervals one by on from a file), one can now specify multiple semicolon-separated interval files (all must be sorted). Not hugely useful probably, but makes --targetIntervals always process its values in exactly the same way, so we are consistent  (it has been already taking ;-separated args in unsorted mode)

NwayIntervalMergingIterator: reads in multiple sorted GenomeLoc input streams (iterators) and presents them as a single sorted and merged stream

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4602 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 16:06:51 +00:00
depristo 38a67fed63 High performance version of standard vcf writer. New general static Tribble class for common constants, including general .idx constant and functions to get standard index name for a given file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4471 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 19:53:21 +00:00
hanna 78343be52c At some time in the recent past, we lost our ability to process the '-L all'
argument.  Brought it back, and added an integrationtest to make sure it
stays around.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4390 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 15:58:43 +00:00
hanna 497bcbcbb7 Recent changes to the build system make the build system complain loudly about
pieces of core that depend on playground.  Most of these have been eliminated by
(temporarily) promoting Aaron's report system to core in this checkin.  I'll 
follow up with other changes in separately.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 22:09:12 +00:00
aaron 782e0018e4 removal of most of the old GATK ROD system; also a fix for -Dsingle so we can again run just a single unit or integration test (single tests in tribble can be run with the -DsingleTest option now). More to come.
*** Three integration tests had to change: ***

RecalibarationWalkersIntegrationTest:
One of the tests was using the interval as the snp track, and wasn't supplying a DbSNP track (for CountCovariates)

SequenomValidationConverterIntegrationTest:
relies on Plink ROD which we've removed.  

PileupWalkerIntegrationTest: 
we no longer have implicit interval tracks, so there isn't a rod name over the specified region.  Otherwise the same result.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4292 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 22:54:49 +00:00
hanna 7fa6b2135b Added a back door so that integration tests can reset the sequence dictionary
in the reference.  Reset routine is not accessible to any class outside
GenomeLocParser's package.

We'll have to do something more intelligent with this when the GATK goes
distributed.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4275 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-14 18:58:08 +00:00
depristo 4d0ff336c2 Missed update input
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4269 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 15:46:13 +00:00
depristo 7880863eb7 Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 15:07:38 +00:00
depristo 7ad8fbdd5a Moved GATKException to exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:47:19 +00:00
depristo 1876c9856a Moved stingexception
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4265 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:39:22 +00:00
depristo 595907e98e Moving StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4262 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:34:15 +00:00
depristo 40e6179911 Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:02:43 +00:00
hanna dc5f858d29 Replaced placeholder support for splitting by read group with read support (sorry everyone), and added relatively comprehensive unit tests to ensure that splitting by read group works.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4190 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-01 22:24:50 +00:00
hanna b80cf7d1d9 Modifications to the output system for better interaction with @Output. Multiplexed arguments. More details in the Monday meeting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4077 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-22 14:27:05 +00:00
hanna cb144734c0 Getting rid of GenotypeWriter interface. Of note:
- GATKVCFWriter deleted, to be replaced if absolutely necessary when VCF writing goes into Tribble.
- VCFWriter is now an interface, for easier redirection.
- VCFWriterImpl fleshes out the VCFWriter interface.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4026 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 16:33:22 +00:00
aaron 30178c05c5 providing a way to specify how you'd like -BTI combined with your -L options; set BTIMR to either UNION (default) or INTERSECTION.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3983 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 14:00:52 +00:00
aaron 9076c0b28b removing unused code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3958 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 14:24:39 +00:00
aaron 72ae81c6de VariantContext has now moved over to Tribble, and the VCF4 parser is now the only VCF parser in town. Other changes include:
- Tribble is included directly in the GATK repo; those who have access to commit to Tribble can now directly commit from the GATK directory from Intellij; command line users can commit from 
inside the tribble directory.
- Hapmap ROD now in Tribble; all mentions have been switched over.
- VariantContext does not know about GenomeLoc; use VariantContextUtils.getLocation(VariantContext vc) to get a genome loc.
- VariantContext.getSNPSubstitutionType is now in VariantContextUtils.
- This does not include the checked-in project files for Intellij; still running into issues with changes to the iml files being marked as changes by SVN

I'll send out an email to GSAMembers with some more details.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3954 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 18:47:53 +00:00
asivache d53d5ffbf6 A utility class that computes running average and standard deviation for a stream of numbers it is being fed with. Updates mean/stddev on the fly and does not cache the observations, so it uses no memory and also should be stable against overflow/loss of precision. Simple unit test is also provided (does *not* stress-test the engine with millions of numbers though).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3944 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 21:39:02 +00:00
ebanks 340bd0e2c1 Removed hard-coded pointers to references
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3934 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 17:59:37 +00:00
delangel 26bb1cd9ce Fix broken test correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3869 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 20:47:41 +00:00
delangel 4fc1db7aaf Change interface to VCFWriter add() method to take only 1 byte from reference (since that's the only thing it needs), to prevent bugs like having people call it with ref.addBases() which is wrong (since it provides bases starting from the left of reference context window).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3868 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 20:24:03 +00:00