Commit Graph

4080 Commits (7b92cd5008a3c58f17a7523abd07c1ed764d4e98)

Author SHA1 Message Date
hanna e0092bb160 Experimental feature: change the rate at which log messages appear on-the-fly
and enable/disable performance logs from outside the JVM process.  Making this
available for the moment; we'll see whether it ends up being useful.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4983 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 04:20:53 +00:00
carneiro 9e93091e9a -baqGOP now takes phred scaled scores instead of probabilities in the command line.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4982 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 00:06:38 +00:00
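For reference, a Phred-scaled score Q and a probability p are related by Q = -10 * log10(p). The sketch below shows only that conversion; the function names are illustrative and not taken from the GATK code, and it makes no claim about how -baqGOP consumes the value internally.

```python
import math

def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled score to a probability: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def prob_to_phred(p: float) -> float:
    """Convert a probability in (0, 1] to a Phred-scaled score: Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)
```

Under this relation a probability of 0.001 corresponds to a Phred-scaled score of 30.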
hanna 5736d2e2bb Something I should have done a long time ago: attempt to detect whitespace
after the line continuation backslash and enhance the error message if it
appears.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4981 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 23:15:08 +00:00
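One way such a check could look, as a minimal Python sketch rather than the committed implementation: scan each line for a backslash followed by trailing spaces or tabs. The function name and the regular expression are assumptions made for illustration.

```python
import re
from typing import List

# A backslash followed by one or more spaces/tabs at the very end of a line.
TRAILING_WS_AFTER_BACKSLASH = re.compile(r"\\[ \t]+$")

def find_bad_continuations(text: str) -> List[int]:
    """Return 1-based line numbers where whitespace follows a trailing backslash."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if TRAILING_WS_AFTER_BACKSLASH.search(line)
    ]
```

A caller could use the returned line numbers to point the user at the offending line in the error message.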
hanna edebbb5aa0 Fixed long-standing bug reported by Mauricio where @Arguments assigned to
primitive types were not properly validated and did not throw the proper
MissingArgumentValue UserException.  Before this fix, the error reported
was the infamous DePristo BSOD (Could not create module String because
an exception of type NullPointerException occurred caused by exception null).

Thanks Mauricio!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4980 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 22:18:24 +00:00
hanna 6d855041ec Oops...forgot to commit the changes that allow primitive VCF streaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4979 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 21:54:51 +00:00
delangel 8a6b126ea8 Several cleanups to IndelMetricsByAC:
- No longer a standard eval module to keep integration tests happy
- Remove class name overlaps with SimpleMetricsByAC so that modules don't overwrite each other's files, and to make it easier to grep results.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4978 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:35:24 +00:00
depristo 8fe5641b2e can explicitly set the now required ReferenceDataSource in unit tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4977 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:25:12 +00:00
aaron 7916ab0ed5 remove the index each run
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4976 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:38:22 +00:00
depristo 468ef382b7 vastly improved progress meter that estimates % of work done and time until the job finishes and time remaining. Reordered GATK core initialization order -- intervals are created before the scheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4975 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:32:27 +00:00
delangel bdd382198c Necessary changes to enable HaplotypeScore annotation for indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4974 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 01:09:12 +00:00
delangel 23597a2bde Variant Eval module that collects indel statistics (basic counts and event sizes) and partitions by AC (similar to SimpleMetricsByAC in the SNP case)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4973 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 01:08:09 +00:00
fromer 48052907a6 A hom genotype can always be considered phased
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4972 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-11 18:48:48 +00:00
fromer c2dd956888 Moved PrintReferenceVariantsWalker to playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4971 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 22:07:41 +00:00
kshakir 8ba3a5a43f Command lines for locally run Queue jobs no longer have to be escaped differently than bsub'ed jobs.
GSA-410 Local job runs now can run command lines longer than 4096 on our linux machines.
When determining if the help text and Queue extensions need to be rebuilt, use the .class files not the .java so that GATK oneoffs are picked up correctly.
Added the most basic of all example QScripts for debugging, Hello World.
Minor updates to copy/pasted LSF code to reduce ant javadoc warnings by a third.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4970 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 21:07:29 +00:00
ebanks ee348ac9d4 Add a hidden mode to the realigner to turn off SW but still use indels other than known ones (i.e. those already in the reads)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4969 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 20:27:04 +00:00
fromer 01c2091cd9 A LocusWalker to print the haploid reference genome as a VCF file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4968 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 16:59:41 +00:00
delangel 9648399630 Boneheaded silly bug in indel caller - posterior probability computation was using priors gotten from SNP heterozygosity, not indel heterozygosity. Added the indel het. argument to the command line and hooked it up (not a radical change in calls though, just a few dubious calls around the edges fall off)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4967 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 14:56:28 +00:00
aaron b24e1134f9 unfortunately samrecord pileup also uses zero length intervals to indicate deletions; this will have to be a BED specific exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4964 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:32:50 +00:00
kshakir b34e2f733f Removed stochasticity from IndelRealigner by random sampling using a seed based on the read list.
Updated the Queue scatter/gather for read walkers to include -L unmapped on the last scatter job when intervals aren't specified, and to map it correctly when it is explicitly set.
Simplified the build.xml/ivy.xml to fix a bug reported with "ant clean dist test" where the scalac target wasn't found.
Now building all scala code at the same time, just like all java code is compiled at the same time.
Sped up the build for everyone by uncommenting a small bit of classes so that javac/scalac will not constantly launch trying to build .class files that will never compile.
Moved some source files to their expected location so that the .java/.scala -> .class is a one-to-one match, again keeping the compilers from wasting cycles.
Used <uptodate> and <touch> to skip extracting the help text and generating the GATK Queue extensions when the source files haven't been modified.
Fixed a couple errors when the <javadoc> task is run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4963 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:03:36 +00:00
ebanks 60f45a7c49 Stupid me. Forgot to put this check in the last commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4959 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 19:16:41 +00:00
aaron 56b87da8f9 a better error message for the situation where a RMD track generates a negative length interval; the user will now see a message like "Bad input: A feature produced by the reference metadata track named "bed" at position chr1:10434-10433 has a start greater than the stop; this is an invalid position "
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4958 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 19:06:04 +00:00
ebanks 4272b824d6 unused imports
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4957 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:33:12 +00:00
chartl 3e7802a3e0 Minor changes to a qscript and the GQ constants on PrivatePermutations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4956 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:26:21 +00:00
kiran 79fcff13ff Fixed import statement that was erroneously referring to VE3 rather than VE2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4955 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 03:22:25 +00:00
ebanks f3ca2cc9de Add safety net to BAQ calculation: explicitly cast to byte/int and check for bad values
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4954 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 18:09:12 +00:00
ebanks 2ac5c52281 Better error message as per Mark
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4953 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:44:02 +00:00
ebanks e0d091b3db Die gracefully if the bam is malformed with quals that are too high
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4952 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:39:08 +00:00
kiran 3163970ad5 Updates that slipped from my last commit: fixed some imports and calls to super().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4951 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:34:40 +00:00
kiran d88fd7212f Changes to allow the primary key of a table to be hidden. Formatting changes to account for when that column is hidden.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4948 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:27:19 +00:00
kiran 307c41c128 Changes to allow the primary key of a table to be hidden. Formatting changes to account for when that column is hidden.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4947 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:26:38 +00:00
kiran fdc514ded3 Intermediate commit for VariantEval 3.0. Among the changes:
* Stratifications (by comp rod, by eval rod, novelty, filter status, etc.) have been generalized.  They are very symmetric with evaluators now.  Each stratification can have multiple states (e.g. known, novel, all).  New stratifications can be added and optionally applied.  Some new stratifications include:
  - by sample
  - by functional class
  - by CpG status

* Output is to a single file in GATKReport format, rather than having the options of CSV, R, table, etc.

* Rather than needing to state up front that the allowable variant type is a SNP or an indel, each eval record is inspected and the appropriate record type is fetched from the comp track.  (This will require a bit more testing...)

* Evaluation context (basically a single row in a VariantEval report) generation and retrieval has been overhauled.  Now, every possible configuration of stratification state is generated recursively and stored in a HashMap.  The key of the HashMap is a key that represents that exact state configuration.  When examining a comp track and eval track, this key is computed based on the data, providing easy lookup for the appropriate evaluation context.  When there are only a handful of stratification configurations, this isn't a big deal.  But when operating on a file with hundreds of samples, multiplied by 3 states for novelty, 3 states for filtration, 3 states for CpG status, etc., it becomes a very big deal.

There are still some known issues:
* When the per-sample stratification is turned off, things are getting overcounted (too many variants are showing up when compared to the VariantEval 2.0 code).  It's probably because I break out the VariantContext by sample even when not necessary, and those irrelevant contexts are still being counted.  Or my recursion is overaggressively creating evaluation contexts, and they all get added up in a weird way.  But that's why I'm committing now - so I can track down this issue without losing my work so far.

* The Jexl expressions are sometimes throwing an exception that I don't yet understand (they complain of an incorrect specification on the command-line... *after* the program has made it through a few thousand records).

* The request to have evaluations be smart enough to reject certain stratification states is not implemented yet.

There's still some work to do before I can replace VariantEval 2.0 with VariantEval 3.0, but feel free to take a look.  I'd love comments on the new code.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4946 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:20:24 +00:00
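The evaluation-context lookup described in this commit can be sketched roughly as follows. This is a Python illustration, not the VariantEval code: it uses itertools.product in place of the recursive generation the message mentions, and the state names for the filter and CpG stratifications, as well as the placeholder context, are hypothetical.

```python
from itertools import product

def build_contexts(stratifications):
    """Map every combination of stratification states to a fresh context.

    `stratifications` is a dict of stratifier name -> list of states.
    Each key is a tuple holding one state per stratifier, in dict order.
    """
    names = list(stratifications)
    return {
        states: {"n_variants": 0}  # placeholder evaluation context
        for states in product(*(stratifications[n] for n in names))
    }

strats = {
    "novelty": ["all", "known", "novel"],
    "filter": ["called", "filtered", "raw"],  # hypothetical state names
    "cpg": ["all", "CpG", "non_CpG"],         # hypothetical state names
}
contexts = build_contexts(strats)

# For one record: compute the key from the data, then update that context.
contexts[("novel", "called", "CpG")]["n_variants"] += 1
```

With three stratifications of three states each the table holds 27 contexts; adding a per-sample stratification multiplies that by the number of samples, which is where a constant-time keyed lookup starts to matter.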
kiran e9201b81d1 A more general method for specifying samples to act on from the command-line. Supports samples specified individually on the console, a file of samples, or regular expressions to select multiple samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4945 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 14:54:56 +00:00
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
aaron cba436fa2f small fix for the table codec; if you see a header line, you know you've finished parsing the header. Also some changes to return the ref ordered data pool test to using MappedStreamSegment instead of EntireStream
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4942 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 21:20:26 +00:00
fromer 4b37710bcd Added validator for phasing using read information, e.g., PacBio: ReadBasedPhasingValidationWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4940 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 20:05:56 +00:00
delangel d203f5e39a Experimental change in how we classify indels - up to now, an indel of say AA was counted as a 2-mer repeat expansion. But in reality, if the event is surrounded by A's it's really a multiple monomer expansion. So, we first reduce the indel bases in case they are made of repeated elements before classifying them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4939 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 17:13:18 +00:00
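The reduction step can be illustrated with a small Python sketch that finds the shortest unit whose repetition reproduces the indel bases. The function name is hypothetical and this is not the committed code, only one straightforward way to perform such a reduction.

```python
def reduce_to_repeat_unit(bases: str) -> str:
    """Return the shortest unit that, repeated, reproduces `bases`."""
    n = len(bases)
    for unit_len in range(1, n + 1):
        # The unit length must divide the total length evenly.
        if n % unit_len == 0 and bases[:unit_len] * (n // unit_len) == bases:
            return bases[:unit_len]
    return bases  # only reached for the empty string
```

Under this sketch AA reduces to the monomer A and ACAC to AC, while ACG is left unchanged.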
rpoplin 4ac0590744 Fix for NaNs in the rank sum tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4938 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 15:21:30 +00:00
chartl 445ae06a7a Re-add PrivatePermutations since ACTransitionTable is a little too memory-intensive to generate all the cuts that I need
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4937 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 06:11:18 +00:00
hanna 7cdaffbe5c Create tmpdir if it doesn't exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4936 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 03:07:11 +00:00
hanna 0982d35f5b Bug fixes in streaming in Tribble data via /dev/stdin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4935 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 02:43:04 +00:00
rpoplin 23dbc5ccf3 HaplotypeScore is revamped. It now uses reads' Cigar strings when building the haplotype blocks to skip over soft-clipped bases and factor in insertions and deletions. The statistic now uses only the reads from the filtered context to build the haplotypes but it scores all reads against the two best haplotypes. The score is now computed individually for each sample's reads and then averaged together. Bug fixes throughout. The math for the base quality and mapping quality rank sum tests is fixed. The annotations remain as ExperimentalAnnotations pending more investigation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4934 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 00:28:05 +00:00
ebanks 85714621be Better interface to Genotypelikelihoods class. Now you need to specify the format (GL vs PL) of the output string when calling getAsString(). All likelihoods are represented as GLs internally. QualByDepth no longer does its own conversion.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4933 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 21:48:14 +00:00
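For context, GL values are log10-scaled genotype likelihoods, while PL values are Phred-scaled integers, conventionally normalized so that the most likely genotype has PL 0. A minimal Python sketch of that conversion follows; the function name is illustrative and this is not the getAsString() implementation.

```python
from typing import List, Sequence

def gl_to_pl(gls: Sequence[float]) -> List[int]:
    """Convert log10 genotype likelihoods (GL) to normalized Phred-scaled PLs."""
    best = max(gls)
    # Subtracting the best GL makes the most likely genotype map to PL 0.
    return [int(round(-10.0 * (gl - best))) for gl in gls]
```

No upper cap is applied to the resulting values in this sketch.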
ebanks 96729acd0d Optional argument to put the original position into the INFO field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4930 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 19:22:44 +00:00
delangel caedfed860 Fix bug where indels were being incorrectly classified in VariantEval module
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4929 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 18:01:48 +00:00
hanna 8d2c14b29c Update Picard / sam-jdk at Tim's request.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:17:25 +00:00
depristo d31c658c2e Organized performance monitoring passes unit tests and is more efficient
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4924 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:09:08 +00:00
depristo c51e745bae The engine can be null in a unit test, so check for it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4923 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 01:00:52 +00:00
depristo 75a7d8a76e Trivial formatting error
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4922 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-02 23:44:36 +00:00
depristo 5539c2d9f3 --performanceLog (-PF) X.dat argument now enabled. Writes out a table (R-friendly) of the performance of the GATK over time, exactly as a more detailed version of the INFO progress meter. R script for useful plotting of the performance of the GATK over time. Will be helpful for upcoming scalability testing and debugging of memory leaks and other incremental performance problems
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4921 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-02 23:34:21 +00:00
depristo 4c9746f463 Disabled performance log intermediate commit. Will be refactored and committed to the repository along with documentation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4919 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-02 22:18:12 +00:00
hanna 3fc9862964 Unit test fixed - Tribble codecs aren't designed to be stateless, but I was
using one as though it was.  Fixed, and debug code reverted.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4917 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 17:47:52 +00:00
hanna b9cb57f4b9 A unit test is failing on bamboo in a way I can't reproduce (or even explain).
Checking in some debugging info.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4916 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 16:35:04 +00:00
hanna cba18116e4 A significant refactoring of the ROD system, done largely to simplify the process of
streaming/piping VCFs into the GATK.  Notable changes:
- Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build 
  RMDTracks and lookup codecs.
- RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now
  manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class
  are the walkers that feel the need to access the ROD system directly.  (We need to stamp out
  this access pattern.)
A few minor warts were introduced as part of this process, labeled with TODOs.  These'll be
fixed as part of the VCF streaming project.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 04:52:22 +00:00
ebanks d70483c50a Automatically filter out reads with consecutive indel operators in the CIGAR string
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4914 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 04:42:54 +00:00
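A filter of this kind might be sketched as below. The Python code is an illustration under stated assumptions, not the GATK read filter: it treats any two adjacent I or D operators (for example 2I3D) as "consecutive indel operators", and the function name is hypothetical.

```python
import re

# One CIGAR element: a length followed by a single operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def has_consecutive_indels(cigar: str) -> bool:
    """Return True if an insertion/deletion operator directly follows another."""
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    return any(a in "ID" and b in "ID" for a, b in zip(ops, ops[1:]))
```

Reads for which the function returns True would be the ones dropped by the filter.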
ebanks 848977678d No reason to convert the GLs to a String for formatting when they're just going to be converted to PLs later. That was 5% of the UG runtime...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4913 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 22:06:19 +00:00
aaron 85f2968104 add convenience methods for RODs-for-reads: the ability to get all the RODs covering the read, regardless of their type or position on the read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4912 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 20:46:03 +00:00
depristo d7e74f8be6 Temporary phasing evaluation walker that needs to be incorporated into the newest VariantEval, whenever it is available
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4911 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 20:43:15 +00:00
ebanks a31f6e4e99 Need to check isBiallelic before calling getSNPSubstitutionType for the allele swap warning
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4909 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-27 20:17:14 +00:00
ebanks 8a0c07b865 Support for indels in hapmap. This was non-trivial because not only does hapmap not tell you whether the allele is an insertion or deletion, but it also has a completely different positioning strategy (rightmost base). I'll send out an email tomorrow when the new HapMap3.3 VCF is ready.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4908 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-27 07:37:46 +00:00
chartl 6ebf5b30de Transposing the table, and fixing some null pointer exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4906 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 16:22:57 +00:00
ebanks cebfd01857 Properly output .bed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4905 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 14:49:24 +00:00
depristo 464d0e18e3 Bringing us back to passing integrationtests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4904 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 14:36:11 +00:00
depristo 8c583ea405 RBP now operates correctly at non-variant sites so we can phase hom-ref genotypes with -sampleToPhase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4903 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 13:11:22 +00:00
delangel 376bc563d4 Trivial change to allow GenerateVariantClusters to be run on indels - not that VQSR now works on indels, far from it, but at least it's a first step and it allows us to generate cluster plots to see how well known/novel sites differentiate in their covariates (short answer: no difference/separation :( ).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4902 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 22:39:09 +00:00
hanna e313eeede8 Push command-line expansions, such as BAM list unpacking and -B tag parsing, out
into the CommandLine* classes.  This makes it easier for external functionality
(such as the VCF streamer) to use GenomeAnalysisEngine directly.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4897 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 19:00:17 +00:00
depristo 66cca7de0f renamed genotypesArePhased to isPhased, as the previous name was incorrect for several reasons. Added setPhase() to MutableGenotype. Other classes changed to reflect renaming to isPhased(). CombineVariants now supports an experimental MASTER mode where it consumes -B:master,vcf and -B:xi,vcf for any number i and updates the master with phasing information in xi.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4896 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 17:42:05 +00:00
chartl 2235245af0 PrivatePermutations generalized to compute transition counts and average probabilities (and thus was renamed). Changes in some pipelines to reflect the change. Bugfix in the batch merging pipeline (it would halt because the allele VCF for genotyping batches could become off-spec).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4894 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 15:16:15 +00:00
delangel a1653f0c83 Another major redo for indel genotyper: this time, add ability to do allele and variant discovery, and don't rely necessarily on external vcf's to provide candidate variants and alleles (e.g. by using IndelGenotyperV2). This has two major advantages: speed, and more fine-grained control of discovery process. Code is still under test and analysis but this version should be hopefully stable.
Ability to genotype candidate variants from input vcf is retained and can be turned on by command line argument but is disabled by default. 
Code, by default, will build a consensus of the most common indel event at a pileup. If that consensus allele has a count bigger than N (=5 by default), we proceed to genotype by computing probabilistic realignment, AF distribution etc. and possibly emitting a call.

Needed for this, also added ability to build haplotypes from list of alleles instead of from a variant context.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4893 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 02:38:06 +00:00
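The consensus step described in this commit can be approximated with the Python sketch below. It assumes each read at the pileup contributes either an indel event (represented here as a string such as "+AC" or "-T", a hypothetical encoding) or None; the function name is also illustrative, and the genotyping that follows is not shown.

```python
from collections import Counter

def consensus_indel(events, min_count=5):
    """Return the most common indel event if seen more than min_count times."""
    counts = Counter(e for e in events if e is not None)
    if not counts:
        return None
    event, count = counts.most_common(1)[0]
    # "Bigger than N" in the commit message is read as a strict comparison.
    return event if count > min_count else None
```

A None result means no candidate allele passed the threshold, so the site would be skipped rather than genotyped.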
hanna 09c7ea879d Merging GenomeAnalysisEngine and AbstractGenomeAnalysisEngine back together.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4889 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 02:09:46 +00:00
depristo b3ac47812c No longer emits records at filtered sites, in sub-sampling mode
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4883 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:43:50 +00:00
depristo 60880b925f VC utils prune method now will keep genotype attributes as well as info keys. RBP now emits far reduced (NO INFO, only GT:GQ:PG) records, further reducing size of phasing output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4882 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:33:14 +00:00
depristo 8604335566 Minor improvements to further reduce debugging output. When running in -samplesToPhase mode, now only including the samples to phase in the output VCF, making it very much smaller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4881 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:19:47 +00:00
depristo ff90c24f28 RBP now supports operating on a subset of samples, outputting a much reduced VCF file appropriate for merging later. Also, general optimization to avoid printing enormous amounts of data to logger.debug by using a global static variable DEBUG that conditionally allows writing to the variable. Passes integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4880 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:03:28 +00:00
depristo a3729bd59c Now I call BeforeMethod correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4872 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:45:45 +00:00
depristo b7e4a015c0 static thread cache reset in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4870 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:53:10 +00:00
depristo 3bbc6a0540 Slightly more thread safe CachingIndexedFastaSequenceFile.java. Likely passes parallel testing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4869 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:05:17 +00:00
depristo 5dd0e8388b Fixed a bug in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4867 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 19:44:35 +00:00
depristo 4a54f3f230 ThreadLocal version of CachingIndexedFastaSequenceFile. More efficient support for shared memory BAQ calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4865 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:44:48 +00:00
depristo 32d5397c01 Experimental support for sided annotations. Currently not more/less valuable than two-tailed testing. Future experiments are needed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4864 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:08:31 +00:00
handsake 21dc05138a Bug fixes for the bwa aligner and changes to support compiling against newer releases of the bwa code base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4863 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 14:49:15 +00:00
chartl 2bd2667516 Another privately-owned class to add before re-checking out repository
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4858 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 18:14:51 +00:00
chartl e406eb0f95 Adding a useful accessor method to TableFeature
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4856 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 18:11:51 +00:00
ebanks 8ab4704b4c Adding a command-line argument to allow missing values to evaluate as false instead of true
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4854 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 05:18:12 +00:00
ebanks 9f3e56e487 VariantAnnotator shouldn't die when multiple records occur at the same position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4853 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 04:05:47 +00:00
hanna acfe83920b '-L unmapped': adding integration tests for explicitly including (-L unmapped)
unmapped reads and explicitly excluding (-XL unmapped) unmapped reads, augmenting
the suite of unit tests already put in place.

'-L unmapped' seems safe to use; go for it, but please validate results against
samtools flagstat when the process finishes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4849 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 23:11:46 +00:00
ebanks dabdeb729e Eric broke the build. Eric broke the build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4847 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 17:01:38 +00:00
ebanks 5c0b66cb7c 3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 16:24:28 +00:00
chartl 5a27d231fa Rename it so that nobody else falls into the trap laid out (the test is VariantToTable, the walker is Variant[s]ToTable)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4844 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 11:43:00 +00:00
chartl 5e27e9162f Huh? I thought we parsed out comma-separated command line arguments into list automatically...just change the syntax of the integration test, no need to update the md5
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4843 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 11:40:27 +00:00
chartl 3e75431bc8 Thanks to mark: VCFInfoToTable removed in favor of a more flexible walker. Slight change to the argument structure of the walker to make it play more nicely with Queue: the field list parsing is pushed into the command line system (e.g. the variable is exposed as a List<String> and not a String, so Queue doesn't have to join a list into a string only to have it broken out again). This also allows the user to specify -F field1 -F field2 -F field3 if he/she so desires.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4842 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 03:33:36 +00:00
kshakir 01323447c6 Removed LibBat.SUB2_BSUB_BLOCK since the use of it exits the JVM.
Fixed integration tests to wait on their own for the job to run instead of using SUB2_BSUB_BLOCK.
Updated VariantRecalibrationIntegrationTests MD5s which were knocked out of sync while SUB2_BSUB_BLOCK was exiting in the middle of integration tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4840 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 19:57:20 +00:00
hanna 67c07d1a6a Fixed recently introduced multiplexer issue where DoC couldn't be written
directly to command-line.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4839 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 19:35:15 +00:00
hanna 526ae92093 Getting back to '-L unmapped':
- basic unit tests for interval sorting and merging with mix of mapped/unmapped.
- validation to ensure that locus walkers (really all non-read walkers) blow up with a user error when -L unmapped is specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4837 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 18:24:18 +00:00
ebanks afd4655674 Use @Output instead of @Argument. As a side note, Chris I'm ready for this nightmare to go away...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4835 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 17:13:15 +00:00
ebanks cf7d932a17 Fix for f***ed up BWA alignments that adhere to SAM specs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4834 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 17:12:25 +00:00
kshakir d550fdfd60 Disabling integration test to see if this restores the full test suite.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4833 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 15:27:02 +00:00
delangel a5008faca8 Bug fix: when getting variant contexts at a site, we need to get only variants that start at current location, otherwise we get duplicated records when filtering indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4830 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 19:23:10 +00:00
delangel 17db2e0e24 (forgot I hadn't committed this) - refactored IndelStatistics module and added a new inner class to compute Indel classification along with other statistics. So, we now get an extra table specifying, per sample, counts of whether indels are:
- Repeat Expansions
- Novel sequence
And for indels of size <=2 we get a per-mononuc. or dinuc. breakdown of novels and expansions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4828 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 17:43:43 +00:00
chartl cf75caf653 java changes:
VariantEvalWalker's logger is made public, so that variant eval modules can access it through the parent object.
 DesignFileGenerator comment lists how best to bind things to it, and the feature accessor is better refined to grab the genome loc. (old change)

scala changes:

convenience addAll( List[CommandLineFunction] ) added to QScript class (and thus removed from the fCPV2)
useful command line functions added to a new library package for command line functions (these are fast simple VCF command lines)
bug fixed in ProjectManagement for the case where there's only one batch to be batch-merged (not really part of the use-case, but an edge-condition that came up during pipeline testing)
first draft of a private mutations pipeline which will be elaborated in future



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4823 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-12 05:10:45 +00:00
depristo abd6ce1c77 A TiTv-free approach for cutting variants! Apparently much better than previous approach, and will work for indels and SV with truly minor modifications to the code. Will discuss with methods group on Monday.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4822 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 23:08:13 +00:00