Commit Graph

2227 Commits (fc73569d62b3695aebc544b753f089caf7aeb482)

Author SHA1 Message Date
depristo 85553cf5cb V2 cleaner, easily testing, shared memory and distributed GATK job management. Serious unit testing. Very much cleaner processing. Some code cleanup remains in removing now unused classes but the system is ready for general testing. Confirmed that one can run the UG 100 ways parallel without error, but edge cases may remain.
See documentation at:

http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK#Distributed_Parallelism_.28Experimental.29

for examples on how to run this, or the testing Scala script

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5032 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:58:13 +00:00
depristo 41c8552d0a Added implements HasGenomeLocation to all revelant classes. It's not possible to write generic code for working with objects that support the getLocation() function in HasGenomeLocation. Please, if you have an object that has a location, implement this interface and start using / writing generic functions to sort, compare, etc. these objects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5031 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:54:03 +00:00
depristo cacdac3914 Major refactoring of shards. No longer uses interfaces but is now an actual object hierarchy with most of the important and common functionality pushed up to base classes. Eliminated a lot of duplicated code, and the shards are much more understandable now. Also now require a GenomeLocParser to work with their own GenomeLocs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5030 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:36:56 +00:00
rpoplin 24bc843ae8 Dynamically change the log message update rate so that short jobs receive frequent updates while longer running jobs receive fewer updates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5016 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 15:09:11 +00:00
rpoplin bd2af33a16 misc clean up in VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5014 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 21:04:31 +00:00
rpoplin 00453919d2 VQSR now only uses the valid polymorphic sites for training and truth sensitivity calculations. Any number of tracks whose ROD binding begins with the name truth can be used as truth sensitivity tracks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5012 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 20:48:19 +00:00
depristo f8ba76d87c Incremental commit for distributed computation. Appears to work but has potential deadlock situation not yet debugged. Do not use yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5010 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 21:23:09 +00:00
ebanks 366c3a0b8f Incompatible chain files are user exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5008 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-16 05:26:47 +00:00
hanna 579e0d59fa Rewrote warning message to discourage use of unsafe mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5003 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 21:32:53 +00:00
hanna af31d02a2d Fix concurrency issue that periodically kills VariantEvalIntegrationTest --
a member field of RMDTrackBuilder was getting rebuilt every time it was
called, creating concurrency issues.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5001 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 18:52:21 +00:00
hanna bfbf75fe3e Fix error in command-line validation: don't ever allow intervaled access to unindexed read stream, no
matter what type of traversal it is.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4997 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 02:49:04 +00:00
delangel 00310c05bb Fix corner condition that happens when there are indels right at the end of a contig and there's not enough reference to build a haplotype.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4996 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 21:08:22 +00:00
fromer b107c97c1a Cannot have "=" sign in reason, so change to ":"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4991 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 17:23:44 +00:00
fromer b4a2112a0d Added the "previous locus" to interesting sites VCF (locus with respect to which the site is phased)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4990 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 17:19:20 +00:00
fromer e8f0ae4b09 Renamed and documented some phasing-specific classes to make their purpose clearer to someone browing through the code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4989 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 16:17:36 +00:00
fromer ffae7bf537 Moved phasing-specific utilities to phasing sub-directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4987 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 15:38:20 +00:00
rpoplin ce3d226183 Reverting back to the old definition of QD because it works better with large numbers of samples. The new QD is relegated to a new annotation: sumGLbyD. Tweaks to the new HaplotypeScore based on evaluation with better QD calculation. The default qual threshold in GenerateVariantClusters is updated to be in line with the variant quality scores coming from the exact model.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4984 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 14:12:30 +00:00
hanna e0092bb160 Experimental feature: change the rate at which log messages appear on-the-fly
and enable/disable performance logs from outside the JVM process.  Making this
available for the moment; we'll see whether it ends up being useful.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4983 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 04:20:53 +00:00
carneiro 9e93091e9a -baqGOP now takes phred scaled scores instead of probabilities in the command line.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4982 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 00:06:38 +00:00
hanna 6d855041ec Oops...forgot to commit the changes that allow primitive VCF streaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4979 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 21:54:51 +00:00
delangel 8a6b126ea8 Several cleanups to IndelMetricsByAC:
- No longer a standard eval module to keep integration tests happy
- Remove class name overlaps with SimpleMetricsByAC so that modules don't overwrite each other's files, and to make it easier to grep results.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4978 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:35:24 +00:00
depristo 8fe5641b2e can explicitly set the now required ReferenceDataSource in unit tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4977 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:25:12 +00:00
depristo 468ef382b7 vastly improved progress meter that estimates % of work done and time until the job finishes and time remaining. Reordered GATK core initialization order -- intervals are created before the scheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4975 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:32:27 +00:00
delangel bdd382198c Necessary changes to enable HaplotypeScore annotation for indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4974 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 01:09:12 +00:00
delangel 23597a2bde Variant Eval module that collects indel statistics (basic counts and event sizes) and partitions by AC (similar to SimpleMetricsByAC in the SNP case)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4973 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 01:08:09 +00:00
fromer 48052907a6 A hom genotype can always be considered phased
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4972 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-11 18:48:48 +00:00
fromer c2dd956888 Moved PrintReferenceVariantsWalker to playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4971 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 22:07:41 +00:00
ebanks ee348ac9d4 Add a hidden mode to the realigner to turn off SW but still use indels other than known ones (i.e. those already in the reads)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4969 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 20:27:04 +00:00
fromer 01c2091cd9 A LocusWalker to print the haploid reference genome as a VCF file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4968 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 16:59:41 +00:00
delangel 9648399630 Boneheaded silly bug in indel caller - posterior probability computation was using priors gotten from SNP heterozygosity, not indel heterozygosity. Added then indel het. argument to command line and hook it up (not a radical change in calls though, just a few dubious calls around the edges fall off)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4967 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 14:56:28 +00:00
aaron b24e1134f9 unfortunately samrecord pileup also uses zero length intervals to indicate deletions; this will have to be a BED specific exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4964 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:32:50 +00:00
kshakir b34e2f733f Removed stochasticity from IndelRealigner by random sampling using and seed based on the read list.
Updated the Queue scatter/gather for read walkers to include -L unmapped on the last scatter job when intervals aren't specified, and to map it correctly when it is explicitly set.
Simplified the build.xml/ivy.xml to fix a bug reported with "ant clean dist test" where the scalac target wasn't found.
Now building all scala code at the same time, just like all java code is compiled at the same time.
Sped up the build for everyone by uncommenting a small bit of classes so that javac/scalac will not constantly launch trying to build .class files that will never compile.
Moved some source files to their expected location so that the .java/.scala -> .class is a one-to-one match, again keeping the compilers from wasting cycles.
Used <uptodate> and <touch> to skip extracting the help text and generating the GATK Queue extensions when the source files haven't been modified.
Fixed a couple errors when the <javadoc> task is run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4963 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:03:36 +00:00
ebanks 60f45a7c49 Stupid me. Forgot to put this check in the last commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4959 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 19:16:41 +00:00
aaron 56b87da8f9 a better error message for the situation where a RMD track generates a negitive length interval; the user will now see a message like "Bad input: A feature produced by the reference metadata track named "bed" at position chr1:10434-10433 has a start greater than the stop; this is an invalid position "
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4958 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 19:06:04 +00:00
ebanks 4272b824d6 unused imports
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4957 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:33:12 +00:00
ebanks 2ac5c52281 Better error message as per Mark
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4953 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:44:02 +00:00
ebanks e0d091b3db Die gracefully if the bam is malformed with quals that are too high
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4952 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:39:08 +00:00
kiran d88fd7212f Changes to allow the primary key of a table to be hidden. Formatting changes to account for when that column is hidden.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4948 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:27:19 +00:00
kiran 307c41c128 Changes to allow the primary key of a table to be hidden. Formatting changes to account for when that column is hidden.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4947 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:26:38 +00:00
kiran e9201b81d1 A more general method for specifying samples to act on from the command-line. Supports samples specified individually on the console, a file of samples, or regular expressions to select multiple samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4945 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 14:54:56 +00:00
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
aaron cba436fa2f small fix for the table codec; if you see a header line, you know you've finished parsing the header. Also also some changes to return the ref ordered data pool test to using MappedStreamSegment instead of EntireStream
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4942 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 21:20:26 +00:00
fromer 4b37710bcd Added validator for phasing using read information, e.g., PacBio: ReadBasedPhasingValidationWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4940 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 20:05:56 +00:00
delangel d203f5e39a Experimental change in how we classify indels - up to now, an indel of say AA was counted as a 2-mer repeat expansion. But in reality, if the event is sounded by A's it's really a multiple monomer expansion. So, we first reduce the indel bases in case they are made of repeated elements before classifying them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4939 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 17:13:18 +00:00
rpoplin 4ac0590744 Fix for NaNs in the rank sum tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4938 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 15:21:30 +00:00
hanna 7cdaffbe5c Create tmpdir if it doesn't exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4936 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 03:07:11 +00:00
hanna 0982d35f5b Bug fixes in streaming in Tribble data via /dev/stdin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4935 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 02:43:04 +00:00
rpoplin 23dbc5ccf3 HaplotypeScore is revamped. It now uses reads' Cigar strings when building the haplotype blocks to skip over soft-clipped bases and factor in insertions and deletions. The statistic now uses only the reads from the filtered context to build the haplotypes but it scores all reads against the two best haplotypes. The score is now computed individually for each sample's reads and then averaged together. Bug fixes throughout. The math for the base quality and mapping quality rank sum tests is fixed. The annotations remain as ExperimentalAnnotations pending more investigation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4934 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 00:28:05 +00:00
ebanks 85714621be Better interface to Genotypelikelihoods class. Now you need to specify the format (GL vs PL) of the output string when calling getAsString(). All likelihoods are represented as GLs internally. QualByDepth no longer does its own conversion.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4933 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 21:48:14 +00:00
ebanks 96729acd0d Optional argument to put the original position into the INFO field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4930 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 19:22:44 +00:00