Commit Graph

4348 Commits (5dca1e4d2e44d734b95a89e46cb9e416dfd4bae0)

Author SHA1 Message Date
kiran b0432ee1e2 First part of a two-stage commit. Removing old VariantEval to make room for VariantEval 3.0 in core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5137 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 17:03:41 +00:00
ebanks d406d9b3fc There's no reason to special case no-calls if they already have PLs associated with them. Just use the PLs!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5136 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 15:05:45 +00:00
kiran 83dcca7e82 Added ability to load a GATKReport from disk.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5134 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 05:31:49 +00:00
hanna 5e7a5cf924 Quick fix for Danny Lieber: flesh out the additional functionality required
to align to a reference other than what's specified in the header.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5133 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 05:28:37 +00:00
depristo b5d1aab8dc Scripts to create the GATK IAM user and give him/her rights to PutObject (and only PutObject) into the S3 storage instance. Updated the GATKRunReport to now upload using the GATK user, not mark@depristo.com. Running with -et AWS_S3 sends run reports up to the Amazon S3 cloud now. Going to request a few external users try this option so we can see it running at scale. I'm sure S3 can handle a few hundred thousand 1Kb uploads per day, though
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5132 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 03:48:33 +00:00
kiran e26da9b047 Changed column-key names to not have spaces, as GATKReport gets very upset about this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5131 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 03:31:54 +00:00
depristo 197c91e2fb Working implementation of GATKRunReport POSTing to Amazon Web Services S3 storage. Requires users to explicitly provide the secret key to do the upload. Am investigating options to avoid having to do this in the future. Pretty cool little experiment for those who are interested in S3 interaction (extremely trivial)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5130 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-30 21:23:54 +00:00
depristo 8640ca6278 Trivial bug fix so that we don't bring the start up TraversalEngine banner twice when we only process a single locus
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5129 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-30 21:22:16 +00:00
kshakir 2ef66af903 Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high.
Moved the BamListWriter from FCP to ListWriterFunction in the Queue core.
Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s.
Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:33:58 +00:00
scalvo 5934b9cb82 Augment function isChrM by allowing "CRS" in addition to "chrM" or "MT", as a standard contig name indicating the mitochondrial chromosome. CRS stands for Cambridge Reference Sequence and is the standard in the field.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5119 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 22:45:45 +00:00
asivache 7af0532292 An attempt to have more intelligent sorting of RODs. Tested with maf only so far. Should be able to reference-sort dbsnp, bed and vcf as well, bugs notwithstanding. Very simple, brute-force implementation using SortingCollection. Should I have used tribble indexing machinery instead?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5118 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 22:10:07 +00:00
asivache fa8963522b Ignore header line if it happens to be passed to the codec again, instead of crashing on it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5116 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 21:44:33 +00:00
asivache 8d389e149f Now can deal with input files that contain multiple copies of the same event. Only one assay sequence will be designed for each distinct variant, redundant variants will be discarded. Redundancy is defined as same start, same variant type, same ref and alt alleles (it does not matter, e.g., what the sample was as we do not record sample information anywhere).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5115 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 21:42:29 +00:00
fromer f2de39d661 Calculates phase concordance rates between trio and RBP-phasing tracks, stratified by trio status (Het3, non-Het3)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5114 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 20:50:01 +00:00
fromer ffd5f407a5 Retain only a single walker to perform calculation of haplotype extents
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5110 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 18:33:32 +00:00
depristo 2182b8c7e2 Better query start / stop function that directly parses the cigar string, unlike the previous version. Now properly handles H (hard-clipped) reads. Added -baq OFF and -baq RECALCULATE integration tests on all three 1KG technologies. Please let me know if this new code somehow fails.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5108 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 15:08:21 +00:00
kiran 9cb1ae384c Constant precision for floating point numbers. Added integration test - carries over tests from VariantEval with the necessary modifications to command-line arguments and md5s. Disabled use of 'synchronized' keyword because I clearly don't get how that keyword is supposed to work yet...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5107 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 05:19:18 +00:00
depristo f29bb0639b Documentation and cleanup of the distributed GATK implementation. Detailed documentation -- given that Matt will be extending the system in the near future -- about how the locking and processing trackers work. Added error trapping to note that distributed, shared-memory parallelism isn't yet implemented, instead of just not working silently. General utility function for the analysis of distributedGATK operation in the analysis directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5106 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:40:09 +00:00
asivache f036a178f1 Added support for MAF features. So far works for MAF Lite only, annotated MAF is NOT TESTED yet AT ALL.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5105 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:20:46 +00:00
fromer 91e4bb0285 Added walker to calculate haplotype lengths for ALL fragments produced by stitching together phased sites (actually, stitching together everything BUT unphased het sites)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5104 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:09:20 +00:00
asivache ac3fd567b4 Ugly one-off error fixed in building design sequences for indels: the event position is immediately *before* the event, so the ref base at the current locus is the base immediately *before* [ref/alt] element
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5103 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 02:53:03 +00:00
kiran 3e9f185dad Fixed issue with GenotypeConcordance being initialized incorrectly when the first seen comptrack had no samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5102 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 01:12:27 +00:00
kiran 58f0ecff89 Fixes to support evaluations with TableType elements - each such object now gets a separate entry in the output table. Added codon degeneracy stratification. Handle null elements in reports (useful for debugging).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5101 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 22:09:59 +00:00
hanna a264b16358 Patch from Brett (with minor tweaking by me) to expose all the relationships
of a particular sample in hash format.  Thanks, Brett!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5100 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 21:46:13 +00:00
fromer 9c728979cc In order to calculate haplotype lengths of trio+RBP, I implemented a simple trio phaser as an option to ComparePhasingToTrioPhasingNoRecombination, which already decides if the trio could theoretically phase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5099 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 20:17:48 +00:00
depristo 5ed128f839 Slightly more tolerant timing setting. Main() method in GenomeLocProcessTracker to generate timing data for trackers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5097 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 15:16:07 +00:00
depristo 61c29d550d Fix for NullPointer where a run starts but there's nothing to do (no shards) and reduceInit() wasn't being called correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5096 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 15:15:10 +00:00
kiran 2901299ff6 Sets the number of samples to all of the samples in the file when it's not specified on the command-line explicitly. GenotypeConcordance no longer a standard evaluation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5094 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 01:38:26 +00:00
fromer 466f8f8a3c Compares RBP phasing to a simple trio phasing model that can phase a child het iff both parental genotypes are known and at least one of them is not het [at EACH of the sites in the pair to be phased]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5092 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 23:43:29 +00:00
asivache 43812a28fc If among all the multiple alignments for the given read we have 'unmapped' ones (can happen with bwa 0.5.7 and maybe later versions), then discard the latter and keep only the mapped ones. Keep 'unmapped' only if it's the only alignment available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5090 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 20:07:08 +00:00
asivache 63b709d992 When remapping the read, set MAPQ, CIGAR etc to 0/null for unmapped reads. This is not required according to spec but current samtools jdk otherwise dies in STRICT validation mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5089 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:49:07 +00:00
ebanks d33162145b Moving the --sites_only argument up into the VCFWriter itself so that any walkers that write VCFs can choose not to emit genotypes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5088 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:38:16 +00:00
kiran a97184fddf Frick! Changed to refer to the *playground* version of VariantEvaluator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5087 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:33:03 +00:00
kiran a9d0772516 When evaluating JEXL expressions, don't blow up if the eval VC is null
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5085 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 18:25:03 +00:00
kiran 22e599ec76 Fixed output report to properly handle evaluation modules with TableType objects. Promoted CpG to a standard stratification. Demoted Filter to a non-standard stratification. Now, if the filter stratification is not specified, VariantEval only evaluates PASSing sites.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5084 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 17:38:21 +00:00
ebanks 2dcce58279 oneoffs walker to assess GLs at truth sites
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5083 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:59:05 +00:00
ebanks 0429301536 Added ability to output just sites (no genotypes) from UG with the --sites_only argument. Note that we do still genotype in this mode so that the INFO annotations are identical, but we strip the genotypes out of the VC right before writing to output. In other words, this is not designed to make UG go faster; the point here is to allow downstream tools not to have to parse GTs if they don't want to. Here you go, Ryan.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5081 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:52:38 +00:00
ebanks 01e032e89c Missorted BAMs are User Exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5080 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:09:39 +00:00
depristo be697d96f9 An apparently robust implementation of the file locking for distributed computation, using Lucene's file creation locking approach. It is worth trying out for those with large-scale, high-cost data sets. Details and discussion at group meeting on Wednesday. Some cleanup still needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5079 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 13:45:40 +00:00
hanna 862b299b47 Fix Picard OTF index generation issue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5077 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 03:42:46 +00:00
fromer 6ac888d26a Correct accounting for cases where first het in interval is phased
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5075 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 19:48:54 +00:00
fromer af79fa629f PROPERLY print out list of intervals and their stats
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5074 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 19:20:36 +00:00
delangel db2e2cb0ff Another trivial change to make VQSR work with indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5073 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 19:05:31 +00:00
fromer 17ba75e502 Can now print out list of intervals and their stats
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5071 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 18:36:59 +00:00
hanna 9db02059ac Fix for Ryan's issue: reads ending with indel distort the location of the
pileup, resulting in two map() calls for the same locus (and no map call for
the locus immediately following).
Fixed bug and added comprehensive unit tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5067 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 19:49:39 +00:00
fromer 61fe409211 Basic walker to count the number of (phased) hets in each exome target
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5064 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 17:53:14 +00:00
depristo c50f39a147 V3 of the distributed GATK. High-efficiency implementation. Support for status tracking for debugging and display. Still not safe for production use due to NFS filelock problem. V4 will use alternative file locking mechanism
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5063 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 16:45:07 +00:00
delangel fd864e8e3a Minimal necessary (but most likely not sufficient) changes to run VQSR on indel data: don't fill Ti/Tv fields if non-SNP, request VC only at start of position, check if isSNP() before doing snp-specific operations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5062 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 02:36:36 +00:00
depristo a51061fd96 Improved distributed processing analytics. Still not 100% ready for prime-time. More improvements incoming. Iterator claim now supports requests to obtain in a single atomic claim (one lock) multiple sequential shards, which radically reduces overhead. However, deadlocking is still possible...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5061 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 16:17:25 +00:00
ebanks 2d4bcb60a1 Don't print out alt alleles for ref calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5060 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 06:33:31 +00:00
ebanks 2ba35dc7ba Bad chain files are user errors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5059 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 06:04:36 +00:00
ebanks 2bbcc9275a Committing the fragment-based calling code. Results look great in all datasets (will show this at 1000G this week with Ryan). Note that this is an intermediate commit. The code needs to be cleaned up and the fragmentation code needs to be moved up into LocusIteratorByState. This should all happen later this week, but I don't want Ryan to have to keep running from my own personal Sting directory. The current crappy implementation adds ~10% to the runtime, but that should all go away in the next iteration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5058 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 05:04:17 +00:00
ebanks bb6999b032 Better documentation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5057 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 03:36:09 +00:00
depristo c52d2d5f79 Bug fix for SimpleTimer that didn't always convert elapsed times from milliseconds to seconds
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5055 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-22 18:50:59 +00:00
delangel a50d7f74fa Change to support plotting of indel quality as a function of covariates - for now, just call different R calling script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5053 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-22 14:09:23 +00:00
depristo 9b1b8d46aa Performance tracking of GenomeLocProcessingTrackers, as well as a marker for where to put tracker in HierarchicalMicroScheduler
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5051 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 22:24:42 +00:00
rpoplin 95d6ddc38c lastProgressPrintTime should only be updated when a progress log is printed not when a performance log is printed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5050 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 22:23:14 +00:00
ebanks 78a43faebe Adding options to warn instead of erroring out (so that you can see all errors in one shot) and to skip filtered records
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5042 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 05:24:28 +00:00
ebanks 02b5d4357f Deprecated
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5041 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 05:05:07 +00:00
ebanks c3dbbe7f91 Bug fix: don't assume users won't use arbitrary rods on the commandline
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5040 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 04:59:28 +00:00
hanna aea121a9d5 <key>=<value> tagging support for command-line arguments. Unfortunately, still
very hard to validate and still very hard to use (requires core hacking to 
support additional tags).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5038 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 00:22:42 +00:00
kshakir 8855f080c2 For the fullCallingPipeline.q:
- Reading the refseq table from the YAML if not specified on the command line.
 - Removed obsolete -bigMemQueue now that CombineVariants runs in 4g.
 - Added a -mountDir /broad/software option to work around adpr automount issues.
 - Merged the LSF preexec used for automount into the shell script used to execute tasks.
 - Using the LSF C Library to determine when jobs are complete instead of postexec.
 - Updated queue.sh to match the changes above.
 - Updated the FCPTest to match the changes above.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5036 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 22:34:43 +00:00
depristo e4ac1e6171 Removing unused file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5033 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 13:03:55 +00:00
depristo 85553cf5cb V2: cleaner, easily testable shared memory and distributed GATK job management. Serious unit testing. Very much cleaner processing. Some code cleanup remains in removing now unused classes but the system is ready for general testing. Confirmed that one can run the UG 100 ways parallel without error, but edge cases may remain.
See documentation at:

http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK#Distributed_Parallelism_.28Experimental.29

for examples on how to run this, or the testing Scala script

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5032 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:58:13 +00:00
depristo 41c8552d0a Added implements HasGenomeLocation to all relevant classes. It's now possible to write generic code for working with objects that support the getLocation() function in HasGenomeLocation. Please, if you have an object that has a location, implement this interface and start using / writing generic functions to sort, compare, etc. these objects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5031 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:54:03 +00:00
depristo cacdac3914 Major refactoring of shards. No longer uses interfaces but is now an actual object hierarchy with most of the important and common functionality pushed up to base classes. Eliminated a lot of duplicated code, and the shards are much more understandable now. Also now require a GenomeLocParser to work with their own GenomeLocs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5030 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:36:56 +00:00
hanna 7087c2f422 Very simple integration tests for basic VCF streaming functionality.
Rather than try to fork the integration test process to get a pipe source
and sink, creates a new named pipe by Runtime.exec()ing the 'mkfifo' shell
command.  We'll see whether this proves to be a reliable method for testing
streaming.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5028 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 04:38:54 +00:00
kshakir c901fb6d70 Now populating the refseq and dbsnp in awk instead of retrieving from firehose.
Added refseq table to the pipeline object.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5020 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 18:19:10 +00:00
rpoplin 24bc843ae8 Dynamically change the log message update rate so that short jobs receive frequent updates while longer running jobs receive fewer updates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5016 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 15:09:11 +00:00
rpoplin bd2af33a16 misc clean up in VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5014 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 21:04:31 +00:00
rpoplin 00453919d2 VQSR now only uses the valid polymorphic sites for training and truth sensitivity calculations. Any number of tracks whose ROD binding begins with the name truth can be used as truth sensitivity tracks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5012 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 20:48:19 +00:00
fromer 4bec93e3e4 Permit retrieval of read names for debugging purposes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5011 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 16:09:34 +00:00
depristo f8ba76d87c Incremental commit for distributed computation. Appears to work but has potential deadlock situation not yet debugged. Do not use yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5010 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 21:23:09 +00:00
kiran 2f4a436719 Throw an exception if no eval rods are specified. If one or more samples are specified, subset the 'all' VariantContext to just the specified samples. This is useful when you want to see what effect dropping certain samples will have on the metrics and you don't want to go through SelectVariants first.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5009 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 06:46:10 +00:00
ebanks 366c3a0b8f Incompatible chain files are user exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5008 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-16 05:26:47 +00:00
depristo a88708ebfa Moving GLF code to archive
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5006 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-15 22:42:42 +00:00
hanna 579e0d59fa Rewrote warning message to discourage use of unsafe mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5003 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 21:32:53 +00:00
hanna af31d02a2d Fix concurrency issue that periodically kills VariantEvalIntegrationTest --
a member field of RMDTrackBuilder was getting rebuilt every time it was
called, creating concurrency issues.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5001 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 18:52:21 +00:00
kiran 73acfa654a Fixed double-counting bug. Fixed issue where evaluation module with an update2() method wasn't getting called if the comp track was null. Added a column to the output report indicating the table name for easy greppability. Fixed an issue where, if sample-level stratification was not required, the sample-level VCs would be generated anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5000 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 14:06:43 +00:00
depristo afbea9ce59 SharedMemory and SharedFile implementations of GenomeLocProcessingTracker, along with serious unit tests that both pass. Slightly inefficient implementation but sufficient for further testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4998 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 03:14:24 +00:00
hanna bfbf75fe3e Fix error in command-line validation: don't ever allow intervaled access to unindexed read stream, no
matter what type of traversal it is.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4997 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 02:49:04 +00:00
delangel 00310c05bb Fix corner condition that happens when there are indels right at the end of a contig and there's not enough reference to build a haplotype.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4996 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 21:08:22 +00:00
hanna c0031b05ff Stamp out lazy loading in the PluginManager. This is an attempt to stamp
out the non-deterministic VariantEvalIntegrationTest errors we've been seeing.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4995 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 20:58:28 +00:00
fromer b107c97c1a Cannot have "=" sign in reason, so change to ":"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4991 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 17:23:44 +00:00
fromer b4a2112a0d Added the "previous locus" to interesting sites VCF (locus with respect to which the site is phased)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4990 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 17:19:20 +00:00
fromer e8f0ae4b09 Renamed and documented some phasing-specific classes to make their purpose clearer to someone browsing through the code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4989 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 16:17:36 +00:00
fromer ffae7bf537 Moved phasing-specific utilities to phasing sub-directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4987 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 15:38:20 +00:00
depristo 91824f478e FASTQ directory is gone
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4986 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 15:16:06 +00:00
depristo e3956148ac removing unused fastqtobam
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4985 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 14:29:32 +00:00
rpoplin ce3d226183 Reverting back to the old definition of QD because it works better with large numbers of samples. The new QD is relegated to a new annotation: sumGLbyD. Tweaks to the new HaplotypeScore based on evaluation with better QD calculation. The default qual threshold in GenerateVariantClusters is updated to be in line with the variant quality scores coming from the exact model.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4984 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 14:12:30 +00:00
hanna e0092bb160 Experimental feature: change the rate at which log messages appear on-the-fly
and enable/disable performance logs from outside the JVM process.  Making this
available for the moment; we'll see whether it ends up being useful.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4983 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 04:20:53 +00:00
carneiro 9e93091e9a -baqGOP now takes phred scaled scores instead of probabilities in the command line.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4982 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 00:06:38 +00:00
hanna 5736d2e2bb Something I should have done a long time ago: attempt to detect whitespace
after the line continuation backslash and enhance the error message if it
appears.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4981 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 23:15:08 +00:00
hanna edebbb5aa0 Fixed long-standing bug reported by Mauricio: @Arguments assigned to
primitive types are now properly validated and throw the proper
MissingArgumentValue UserException.  Before this fix, the error reported
was the infamous DePristo BSOD (Could not create module String because 
an exception of type NullPointerException occurred caused by exception null).

Thanks Mauricio!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4980 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 22:18:24 +00:00
hanna 6d855041ec Oops...forgot to commit the changes that allow primitive VCF streaming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4979 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 21:54:51 +00:00
delangel 8a6b126ea8 Several cleanups to IndelMetricsByAC:
- No longer a standard eval module to keep integration tests happy
- Remove class name overlaps with SimpleMetricsByAC so that modules don't overwrite each other's files, and to make it easier to grep results.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4978 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:35:24 +00:00
depristo 8fe5641b2e can explicitly set the now required ReferenceDataSource in unit tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4977 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 18:25:12 +00:00
depristo 468ef382b7 Vastly improved progress meter that estimates % of work done and the time remaining until the job finishes. Reordered GATK core initialization -- intervals are now created before the scheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4975 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 17:32:27 +00:00
delangel bdd382198c Necessary changes to enable HaplotypeScore annotation for indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4974 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 01:09:12 +00:00
delangel 23597a2bde Variant Eval module that collects indel statistics (basic counts and event sizes) and partitions by AC (similar to SimpleMetricsByAC in the SNP case)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4973 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-12 01:08:09 +00:00
fromer 48052907a6 A hom genotype can always be considered phased
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4972 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-11 18:48:48 +00:00
fromer c2dd956888 Moved PrintReferenceVariantsWalker to playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4971 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 22:07:41 +00:00
kshakir 8ba3a5a43f Command lines for locally run Queue jobs no longer have to be escaped differently than bsub'ed jobs.
GSA-410 Local job runs now can run command lines longer than 4096 on our linux machines.
When determining if the help text and Queue extensions need to be rebuilt, use the .class files not the .java so that GATK oneoffs are picked up correctly.
Added the most basic of all example QScripts for debugging, Hello World.
Minor updates to copy/pasted LSF code to reduce ant javadoc warnings by a third.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4970 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 21:07:29 +00:00
ebanks ee348ac9d4 Add a hidden mode to the realigner to turn off SW but still use indels other than known ones (i.e. those already in the reads)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4969 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 20:27:04 +00:00
fromer 01c2091cd9 A LocusWalker to print the haploid reference genome as a VCF file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4968 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 16:59:41 +00:00
delangel 9648399630 Boneheaded silly bug in indel caller - posterior probability computation was using priors gotten from SNP heterozygosity, not indel heterozygosity. Added the indel het. argument to the command line and hooked it up (not a radical change in calls though, just a few dubious calls around the edges fall off)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4967 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 14:56:28 +00:00
aaron b24e1134f9 unfortunately samrecord pileup also uses zero length intervals to indicate deletions; this will have to be a BED specific exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4964 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:32:50 +00:00
kshakir b34e2f733f Removed stochasticity from IndelRealigner by random sampling using a seed based on the read list.
Updated the Queue scatter/gather for read walkers to include -L unmapped on the last scatter job when intervals aren't specified, and to map it correctly when it is explicitly set.
Simplified the build.xml/ivy.xml to fix a bug reported with "ant clean dist test" where the scalac target wasn't found.
Now building all scala code at the same time, just like all java code is compiled at the same time.
Sped up the build for everyone by uncommenting a small bit of classes so that javac/scalac will not constantly launch trying to build .class files that will never compile.
Moved some source files to their expected location so that the .java/.scala -> .class is a one-to-one match, again keeping the compilers from wasting cycles.
Used <uptodate> and <touch> to skip extracting the help text and generating the GATK Queue extensions when the source files haven't been modified.
Fixed a couple errors when the <javadoc> task is run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4963 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:03:36 +00:00
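Note on the IndelRealigner change above: seeding a random number generator from the input itself makes sampling reproducible across runs. The sketch below illustrates that general idea under the assumption that reads are identified by name; the hashing scheme and function name are hypothetical and are not taken from the GATK source.

```python
import hashlib
import random
from typing import List, Sequence


def deterministic_sample(read_names: Sequence[str], k: int) -> List[str]:
    """Sample up to k reads using a seed derived from the read list itself."""
    # Hash the ordered read names so the same list always yields the same seed.
    digest = hashlib.sha256("\n".join(read_names).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.sample(list(read_names), min(k, len(read_names)))
```

Because the seed depends only on the reads, two runs over the same data pick the same subset, which keeps integration-test output stable.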
ebanks 60f45a7c49 Stupid me. Forgot to put this check in the last commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4959 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 19:16:41 +00:00
aaron 56b87da8f9 a better error message for the situation where a RMD track generates a negative length interval; the user will now see a message like "Bad input: A feature produced by the reference metadata track named "bed" at position chr1:10434-10433 has a start greater than the stop; this is an invalid position "
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4958 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 19:06:04 +00:00
ebanks 4272b824d6 unused imports
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4957 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:33:12 +00:00
chartl 3e7802a3e0 Minor changes to a qscript and the GQ constants on PrivatePermutations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4956 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:26:21 +00:00
kiran 79fcff13ff Fixed import statement that was erroneously referring to VE3 rather than VE2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4955 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 03:22:25 +00:00
ebanks f3ca2cc9de Add safety net to BAQ calculation: explicitly cast to byte/int and check for bad values
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4954 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 18:09:12 +00:00
ebanks 2ac5c52281 Better error message as per Mark
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4953 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:44:02 +00:00
ebanks e0d091b3db Die gracefully if the bam is malformed with quals that are too high
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4952 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:39:08 +00:00
kiran 3163970ad5 Updates that slipped from my last commit: fixed some imports and calls to super().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4951 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:34:40 +00:00
kiran d88fd7212f Changes to allow the primary key of a table to be hidden. Formatting changes to account for when that column is hidden.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4948 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:27:19 +00:00
kiran 307c41c128 Changes to allow the primary key of a table to be hidden. Formatting changes to account for when that column is hidden.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4947 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:26:38 +00:00
kiran fdc514ded3 Intermediate commit for VariantEval 3.0. Among the changes:
* Stratifications (by comp rod, by eval rod, novelty, filter status, etc.) have been generalized.  They are very symmetric with evaluators now.  Each stratification can have multiple states (e.g. known, novel, all).  New stratifications can be added and optionally applied.  Some new stratifications include:
  - by sample
  - by functional class
  - by CpG status

* Output is to a single file in GATKReport format, rather than having the options of CSV, R, table, etc.

* Rather than needing to state up front that the allowable variant type is a SNP or an indel, each eval record is inspected and the appropriate record type is fetched from the comp track.  (This will require a bit more testing...)

* Evaluation context (basically a single row in a VariantEval report) generation and retrieval has been overhauled.  Now, every possible configuration of stratification state is generated recursively and stored in a HashMap.  The key of the HashMap is a key that represents that exact state configuration.  When examining a comp track and eval track, this key is computed based on the data, providing easy lookup for the appropriate evaluation context.  When there are only a handful of stratification configurations, this isn't a big deal.  But when operating on a file with hundreds of samples, multiplied by 3 states for novelty, 3 states for filtration, 3 states for CpG status, etc., it becomes a very big deal.

There are still some known issues:
* When the per-sample stratification is turned off, things are getting overcounted (too many variants are showing up when compared to the VariantEval 2.0 code).  It's probably because I break out the VariantContext by sample even when not necessary, and those irrelevant contexts are still being counted.  Or my recursion is overaggressively creating evaluation contexts, and they all get added up in a weird way.  But that's why I'm committing now - so I can track down this issue without losing my work so far.

* The Jexl expressions are sometimes throwing an exception that I don't yet understand (they complain of an incorrect specification on the command-line... *after* the program has made it through a few thousand records).

* The request to have evaluations be smart enough to reject certain stratification states is not implemented yet.

There's still some work to do before I can replace VariantEval 2.0 with VariantEval 3.0, but feel free to take a look.  I'd love comments on the new code.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4946 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:20:24 +00:00
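Note on the evaluation-context lookup described in the commit above: the idea of pre-generating one context per combination of stratification states and keying a map by that combination can be sketched as follows. The commit describes recursive generation into a Java HashMap; this sketch swaps in itertools.product and a Python dict, and every name and state label other than known/novel/all is a hypothetical placeholder.

```python
from itertools import product


def build_contexts(stratifications):
    """Map every combination of stratification states to a fresh context.

    `stratifications` is a dict of name -> list of states. Each key is a tuple
    of states, ordered like the dict's keys.
    """
    names = list(stratifications)
    return {
        combo: {"strata": dict(zip(names, combo)), "count": 0}
        for combo in product(*(stratifications[name] for name in names))
    }


def context_key(stratifications, record):
    """Compute the lookup key for a record (a dict of name -> observed state)."""
    return tuple(record[name] for name in stratifications)
```

With three stratifications of three states each this yields 27 contexts, and finding the right one for a record is a single dictionary lookup rather than a scan.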
kiran e9201b81d1 A more general method for specifying samples to act on from the command-line. Supports samples specified individually on the console, a file of samples, or regular expressions to select multiple samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4945 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 14:54:56 +00:00
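Note on the sample-selection commit above: combining individually named samples, a file of sample names, and regular expressions might look like the sketch below. The function, its parameters, and the choice of full-string regex matching are assumptions made for illustration, not the GATK behavior.

```python
import re
from typing import Iterable, List


def select_samples(
    available: Iterable[str],
    names: Iterable[str] = (),
    sample_file_lines: Iterable[str] = (),
    patterns: Iterable[str] = (),
) -> List[str]:
    """Return the available samples matched by name, file entry, or regex."""
    wanted = set(names) | {line.strip() for line in sample_file_lines if line.strip()}
    regexes = [re.compile(pattern) for pattern in patterns]
    return [
        sample
        for sample in available
        if sample in wanted or any(rx.fullmatch(sample) for rx in regexes)
    ]
```

The result preserves the order of `available`, so downstream output columns stay in a predictable order regardless of how the samples were specified.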
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
aaron cba436fa2f small fix for the table codec; if you see a header line, you know you've finished parsing the header. Also some changes to return the ref ordered data pool test to using MappedStreamSegment instead of EntireStream
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4942 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 21:20:26 +00:00
fromer 4b37710bcd Added validator for phasing using read information, e.g., PacBio: ReadBasedPhasingValidationWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4940 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 20:05:56 +00:00
delangel d203f5e39a Experimental change in how we classify indels - up to now, an indel of say AA was counted as a 2-mer repeat expansion. But in reality, if the event is surrounded by A's it's really a multiple monomer expansion. So, we first reduce the indel bases in case they are made of repeated elements before classifying them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4939 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 17:13:18 +00:00
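Note on the indel classification change above: reducing indel bases to their smallest repeated element can be sketched as a search for the shortest prefix that tiles the whole string. This is an illustration of the idea with a hypothetical function name, not the committed code.

```python
def reduce_to_repeat_unit(bases: str) -> str:
    """Return the shortest unit whose repetition spells out `bases`."""
    length = len(bases)
    for size in range(1, length + 1):
        # Only unit sizes that divide the length evenly can tile the string.
        if length % size == 0 and bases[:size] * (length // size) == bases:
            return bases[:size]
    return bases
```

Under this reduction an inserted AA becomes the unit A repeated twice, so it would be classified as a monomer expansion rather than a 2-mer.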
rpoplin 4ac0590744 Fix for NaNs in the rank sum tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4938 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 15:21:30 +00:00
chartl 445ae06a7a Re-add PrivatePermutations since ACTransitionTable is a little too memory-intensive to generate all the cuts that I need
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4937 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 06:11:18 +00:00
hanna 7cdaffbe5c Create tmpdir if it doesn't exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4936 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 03:07:11 +00:00
hanna 0982d35f5b Bug fixes in streaming in Tribble data via /dev/stdin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4935 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 02:43:04 +00:00
rpoplin 23dbc5ccf3 HaplotypeScore is revamped. It now uses reads' Cigar strings when building the haplotype blocks to skip over soft-clipped bases and factor in insertions and deletions. The statistic now uses only the reads from the filtered context to build the haplotypes but it scores all reads against the two best haplotypes. The score is now computed individually for each sample's reads and then averaged together. Bug fixes throughout. The math for the base quality and mapping quality rank sum tests is fixed. The annotations remain as ExperimentalAnnotations pending more investigation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4934 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 00:28:05 +00:00
ebanks 85714621be Better interface to GenotypeLikelihoods class. Now you need to specify the format (GL vs PL) of the output string when calling getAsString(). All likelihoods are represented as GLs internally. QualByDepth no longer does its own conversion.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4933 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 21:48:14 +00:00
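Note on GL versus PL in the commit above: in VCF, GL values are log10-scaled genotype likelihoods and PL values are their phred-scaled, integer-rounded counterparts. A minimal sketch of the conversion, assuming the common convention of normalizing so the most likely genotype has PL 0; the function name is hypothetical.

```python
from typing import List, Sequence


def gls_to_pls(gls: Sequence[float]) -> List[int]:
    """Convert log10 genotype likelihoods to normalized phred-scaled PLs."""
    raw = [-10.0 * gl for gl in gls]
    best = min(raw)  # the most likely genotype gets PL 0
    return [int(round(value - best)) for value in raw]
```

Keeping GLs as the internal representation and converting only at output time avoids accumulating rounding error from repeated integer conversion.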
ebanks 96729acd0d Optional argument to put the original position into the INFO field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4930 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 19:22:44 +00:00
delangel caedfed860 Fix bug where indels were being incorrectly classified in VariantEval module
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4929 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-04 18:01:48 +00:00
hanna 8d2c14b29c Update Picard / sam-jdk at Tim's request.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:17:25 +00:00
depristo d31c658c2e Organized performance monitoring passes unit tests and is more efficient
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4924 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:09:08 +00:00
depristo c51e745bae The engine can be null in a unit test, so check for it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4923 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 01:00:52 +00:00
depristo 75a7d8a76e Trivial formatting error
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4922 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-02 23:44:36 +00:00
depristo 5539c2d9f3 --performanceLog (-PF) X.dat argument now enabled. Writes out a table (R-friendly) of the performance of the GATK over time, exactly as a more detailed version of the INFO progress meter. R script for useful plotting of the performance of the GATK over time. Will be helpful for upcoming scalability testing and debugging of memory leaks and other incremental performance problems
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4921 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-02 23:34:21 +00:00
depristo 4c9746f463 Disabled performance log intermediate commit. Will be refactored and committed to the repository along with documentation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4919 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-02 22:18:12 +00:00
hanna 3fc9862964 Unit test fixed - Tribble codecs aren't designed to be stateless, but I was
using one as though it was.  Fixed, and debug code reverted.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4917 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 17:47:52 +00:00
hanna b9cb57f4b9 A unit test is failing on bamboo in a way I can't reproduce (or even explain).
Checking in some debugging info.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4916 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 16:35:04 +00:00
hanna cba18116e4 A significant refactoring of the ROD system, done largely to simplify the process of
streaming/piping VCFs into the GATK.  Notable changes:
- Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build 
  RMDTracks and lookup codecs.
- RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now
  manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class
  are the walkers that feel the need to access the ROD system directly.  (We need to stamp out
  this access pattern.)
A few minor warts were introduced as part of this process, labeled with TODOs.  These'll be
fixed as part of the VCF streaming project.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 04:52:22 +00:00
ebanks d70483c50a Automatically filter out reads with consecutive indel operators in the CIGAR string
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4914 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 04:42:54 +00:00
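Note on the read filter above: detecting consecutive indel operators in a CIGAR string amounts to parsing the operators and checking adjacent pairs. The sketch below assumes a well-formed SAM CIGAR string and uses a hypothetical function name; it is not the GATK filter itself.

```python
import re

# One CIGAR element: a length followed by a SAM operator character.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def has_consecutive_indels(cigar: str) -> bool:
    """Return True if an insertion or deletion is directly followed by another."""
    operators = [op for _, op in _CIGAR_ELEMENT.findall(cigar)]
    return any(a in "ID" and b in "ID" for a, b in zip(operators, operators[1:]))
```

A read with CIGAR 10M2I3D20M would be flagged, while 10M2I5M3D20M would pass because a match block separates the two indels.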
ebanks 848977678d No reason to convert the GLs to a String for formatting when they're just going to be converted to PLs later. That was 5% of the UG runtime...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4913 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 22:06:19 +00:00
aaron 85f2968104 add convenience methods for RODs-for-reads: the ability to get all the RODs covering the read, regardless of their type or position on the read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4912 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 20:46:03 +00:00
depristo d7e74f8be6 Temporary phasing evaluation walker that needs to be incorporated into the newest VariantEval, whenever it is available
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4911 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-29 20:43:15 +00:00
ebanks a31f6e4e99 Need to check isBiallelic before calling getSNPSubstitutionType for the allele swap warning
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4909 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-27 20:17:14 +00:00
ebanks 8a0c07b865 Support for indels in hapmap. This was non-trivial because not only does hapmap not tell you whether the allele is an insertion or deletion, but it also has a completely different positioning strategy (rightmost base). I'll send out an email tomorrow when the new HapMap3.3 VCF is ready.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4908 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-27 07:37:46 +00:00
chartl 6ebf5b30de Transposing the table, and fixing some null pointer exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4906 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 16:22:57 +00:00
ebanks cebfd01857 Properly output .bed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4905 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 14:49:24 +00:00
depristo 464d0e18e3 Bringing us back to passing integrationtests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4904 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 14:36:11 +00:00
depristo 8c583ea405 RBP now operates correctly at non-variant sites so we can phase hom-ref genotypes with -sampleToPhase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4903 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-23 13:11:22 +00:00
delangel 376bc563d4 Trivial change to allow GenerateVariantClusters to be run on indels - not that VQSR now works on indels, far from it, but at least it's a first step and it allows us to generate cluster plots to see how well known/novel sites differentiate in their covariates (short answer: no difference/separation :( ).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4902 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 22:39:09 +00:00
hanna e313eeede8 Push command-line expansions, such as BAM list unpacking and -B tag parsing, out
into the CommandLine* classes.  This makes it easier for external functionality
(such as the VCF streamer) to use GenomeAnalysisEngine directly.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4897 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 19:00:17 +00:00
depristo 66cca7de0f renamed genotypesArePhased to isPhased, as the previous name was incorrect for several reasons. Added setPhase() to MutableGenotype. Other classes changed to reflect renaming to isPhased(). CombineVariants now supports an experimental MASTER mode where it consumes -B:master,vcf and -B:xi,vcf for any number i and updates the master with phasing information in xi.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4896 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 17:42:05 +00:00
chartl 2235245af0 PrivatePermutations generalized to compute transition counts and average probabilities (and thus was renamed). Changes in some pipelines to reflect the change. Bugfix in the batch merging pipeline (it would halt because the allele VCF for genotyping batches could become off-spec).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4894 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 15:16:15 +00:00
delangel a1653f0c83 Another major redo for indel genotyper: this time, add ability to do allele and variant discovery, and don't rely necessarily on external vcf's to provide candidate variants and alleles (e.g. by using IndelGenotyperV2). This has two major advantages: speed, and more fine-grained control of discovery process. Code is still under test and analysis but this version should be hopefully stable.
Ability to genotype candidate variants from input vcf is retained and can be turned on by command line argument but is disabled by default. 
Code, by default, will build a consensus of the most common indel event at a pileup. If that consensus allele has a count bigger than N (=5 by default), we proceed to genotype by computing probabilistic realignment, AF distribution etc. and possibly emitting a call.

Needed for this, also added ability to build haplotypes from list of alleles instead of from a variant context.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4893 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 02:38:06 +00:00
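Note on the consensus step described in the commit above: picking the most common indel event at a pileup and requiring its count to exceed N can be sketched with a counter. The event encoding (strings such as "+AC" or "-T", with None for reads carrying no indel) and the function name are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def consensus_indel(events: Iterable[Optional[str]], min_count: int = 5) -> Optional[str]:
    """Return the most common indel event if its count is bigger than min_count."""
    counts = Counter(event for event in events if event is not None)
    if not counts:
        return None
    allele, count = counts.most_common(1)[0]
    # "Bigger than N" is read here as a strict inequality.
    return allele if count > min_count else None
```

Sites where no event clears the threshold return None and would simply not proceed to genotyping.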
hanna 09c7ea879d Merging GenomeAnalysisEngine and AbstractGenomeAnalysisEngine back together.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4889 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 02:09:46 +00:00
depristo b3ac47812c No longer emits records at filtered sites, in sub-sampling mode
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4883 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:43:50 +00:00
depristo 60880b925f VC utils prune method now will keep genotype attributes as well as info keys. RBP now emits far more reduced (NO INFO, only GT:GQ:PG) records, further reducing size of phasing output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4882 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:33:14 +00:00
depristo 8604335566 Minor improvements to further reduce debugging output. When running in -samplesToPhase mode, now only including the samples to phase in the output VCF, making it very much smaller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4881 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:19:47 +00:00
depristo ff90c24f28 RBP now supports operating on a subset of samples, outputting a much reduced VCF file appropriate for merging later. Also, general optimization to avoid printing enormous amounts of data to logger.debug by using a global static variable DEBUG that conditionally allows writing to the logger. Passes integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4880 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:03:28 +00:00
depristo b7e4a015c0 static thread cache reset in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4870 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:53:10 +00:00
depristo 3bbc6a0540 Slightly more thread safe CachingIndexedFastaSequenceFile.java. Likely passes parallel testing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4869 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:05:17 +00:00
depristo 4a54f3f230 ThreadLocal version of CachingIndexedFastaSequenceFile. More efficient support for shared memory BAQ calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4865 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:44:48 +00:00
depristo 32d5397c01 Experimental support for sided annotations. Currently not more/less valuable than two-tailed testing. Future experiments are needed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4864 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:08:31 +00:00
handsake 21dc05138a Bug fixes for the bwa aligner and changes to support compiling against newer releases of the bwa code base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4863 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 14:49:15 +00:00
chartl 2bd2667516 Another privately-owned class to add before re-checking out repository
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4858 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 18:14:51 +00:00
chartl e406eb0f95 Adding a useful accessor method to TableFeature
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4856 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 18:11:51 +00:00
ebanks 8ab4704b4c Adding a command-line argument to allow missing values to evaluate as false instead of true
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4854 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 05:18:12 +00:00
ebanks 9f3e56e487 VariantAnnotator shouldn't die when multiple records occur at the same position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4853 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 04:05:47 +00:00
ebanks dabdeb729e Eric broke the build. Eric broke the build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4847 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 17:01:38 +00:00
ebanks 5c0b66cb7c 3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 16:24:28 +00:00
chartl 5e27e9162f Huh? I thought we parsed out comma-separated command line arguments into list automatically...just change the syntax of the integration test, no need to update the md5
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4843 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 11:40:27 +00:00
chartl 3e75431bc8 Thanks to Mark: VCFInfoToTable removed in favor of a more flexible walker. Slight change to the argument structure of the walker to make it play more nicely with Queue: the field list parsing is pushed into the command line system (e.g. the variable is exposed as a List<String> and not a String, so Queue doesn't have to join a list into a string only to have it broken out again). This also allows the user to specify -F field1 -F field2 -F field3 if he/she so desires.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4842 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 03:33:36 +00:00
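Note on the repeated -F argument above: letting the command-line layer collect repeated flags into a list, rather than splitting a joined string later, is a common pattern. The sketch below shows the analogous behavior with Python's argparse; it illustrates the pattern only and is not the GATK argument system.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose -F flag may be repeated to accumulate fields."""
    parser = argparse.ArgumentParser()
    # action="append" collects each occurrence of -F into one list.
    parser.add_argument("-F", dest="fields", action="append", default=[])
    return parser
```

Calling `build_parser().parse_args(["-F", "field1", "-F", "field2"])` yields a `fields` list directly, so a caller that already holds a list never has to join and re-split it.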
kshakir 01323447c6 Removed LibBat.SUB2_BSUB_BLOCK since the use of it exits the JVM.
Fixed integration tests to wait on their own for the job to run instead of using SUB2_BSUB_BLOCK.
Updated VariantRecalibrationIntegrationTests MD5s which were knocked out of sync while SUB2_BSUB_BLOCK was exiting in the middle of integration tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4840 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 19:57:20 +00:00
hanna 67c07d1a6a Fixed recently introduced multiplexer issue where DoC couldn't be written
directly to command-line.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4839 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 19:35:15 +00:00
hanna 526ae92093 Getting back to '-L unmapped':
- basic unit tests for interval sorting and merging with mix of mapped/unmapped.
- validation to ensure that locus walkers (really all non-read walkers) blow up with a user error when -L unmapped is specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4837 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 18:24:18 +00:00
ebanks afd4655674 Use @Output instead of @Argument. As a side note, Chris I'm ready for this nightmare to go away...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4835 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 17:13:15 +00:00
ebanks cf7d932a17 Fix for f***ed up BWA alignments that adhere to SAM specs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4834 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 17:12:25 +00:00
delangel a5008faca8 Bug fix: when getting variant contexts at a site, we need to get only variants that start at current location, otherwise we get duplicated records when filtering indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4830 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 19:23:10 +00:00
delangel 17db2e0e24 (forgot I hadn't committed this) - refactored IndelStatistics module and added a new inner class to compute Indel classification along with other statistics. So, we now get an extra table specifying, per sample, counts of whether indels are:
- Repeat Expansions
- Novel sequence
And for indels of size <=2 we get a per-mononuc. or dinuc. breakdown of novels and expansions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4828 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 17:43:43 +00:00
chartl cf75caf653 java changes:
VariantEvalWalker's logger is made public, so that variant eval modules can access it through the parent object.
 DesignFileGenerator comment lists how best to bind things to it, and the feature accessor is better refined to grab the genome loc. (old change)

scala changes:

convenience addAll( List[CommandLineFunction] ) added to QScript class (and thus removed from the fCPV2)
useful command line functions added to a new library package for command line functions (these are fast simple VCF command lines)
bug fixed in ProjectManagement for the case where there's only one batch to be batch-merged (not really part of the use-case, but an edge-condition that came up during pipeline testing)
first draft of a private mutations pipeline which will be elaborated in future



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4823 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-12 05:10:45 +00:00
depristo abd6ce1c77 A TiTv-free approach for cutting variants! Apparently much better than previous approach, and will work for indels and SV with truly minor modifications to the code. Will discuss with methods group on Monday.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4822 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 23:08:13 +00:00
kshakir 895cb39f41 Thanks to Platform Computing tech support, found the magical environment variable BSUB_QUIET.
Minor refactoring to add more of the CLibrary including setenv().

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4819 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 21:27:12 +00:00
depristo 5b46a900b3 Final version of BAQ calculation. default gap open is 1e-4, a good sensitive value. Useful timer class SimpleTimer added. BAQ is now live.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4818 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 19:35:12 +00:00
ebanks 491a599b59 Minor optimization
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4817 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 18:56:35 +00:00
kshakir 56433ebf6b Switched from LSF command line wrappers to JNA wrappers around the C API. Side effects:
- bsub command line is no longer fully printed out.
- extraBsubArgs hack is now a callback function updateJobRun.
Updated FullCallingPipelineTest to reflect latest changes to fullCallingPipeline.q.
Added a pipeline that tests the UGv2 runtimes at different bam counts and memory limits.
Updated VE packages that live in oneoffs to compile to oneoffs.
Added a hack to replace the deprecated symbol environ in Mac OS X 10.5+ which is needed by LSF7 on Mac.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4816 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 04:36:06 +00:00
hanna d4d3170436 Support for '-L unmapped' in read walkers. DO NOT USE THIS PATCH YET. It has been
subjected to and passes cursory testing on one dataset (and all integration tests pass).
However, there's a small library of validation checks, and unit and integration tests 
that must be added.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4813 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 19:51:48 +00:00
delangel a2d6cef181 Weird corner condition fix in indel genotyper: if there are 2 consecutive locations on candidate sites to genotype, we can get both when calling getVariantContexts and if we are triggering on an extended event - this leads to confusion and we can end up picking the wrong one. So, we require start of the vc to be the same as the start of the ref locus to be sure.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4812 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 19:34:23 +00:00
depristo 722819688a Minor utility improvements to ValidateBAQ
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4809 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 02:19:32 +00:00
depristo a63bbb2fec Optimized BAQ implementation. No longer does excessive amounts of copying of arrays. At this point I'm not 100% certain where additional performance improvements would come from
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4808 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 21:26:30 +00:00
depristo db55b2b0c6 Better testing of BAQ. Now really handles soft clipped reads properly by doing an expensive copy operation :-( will need to be transformed to a ByteBuffer in the near future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4807 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:37:00 +00:00
depristo 16e1bbd380 Hidden command line option to control BAQ gap open penalty for testing by me and eric. ValidateBAQWalker has misc. useful improvements. PrintReads now adds BAQ tags on output, if requested.
BAQ has generally useful improvements.  Refactor code to make it easier for BAQUnitTest to run.  minBaseQuality enforced on output, as well as input now.  Added BAQUnitTest that checks that the BAQ calculation is performing as expected.  Still needs to be expanded significantly.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4804 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 01:01:39 +00:00
depristo 1b6bec8e6b Trivial changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4803 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 20:06:54 +00:00
delangel ca7810f11d First major update of indel genotyper:
a) Really fix this time strand bias computation for indels, previous version was a partial fix only.
b) Change way in which we deal with bad bases at the edge of reads. Even if a base is soft clipped in CIGAR string, there may still be dangling bases with Q=2 that may throw off QUAL computation in some sites. So, we're stricter and we also trim off those bases off read edges even if they are not soft-clipped officially.
c) First feeble-minded attempt at runtime optimization - don't compute log and 10^base_qual every time. Rather, cache 10^-k/10 and log(1-10^-k/10) for all k <=60. This speeds up code about 4x.
d) Further optimization: don't compute log(10^x+10^y) but rather use softMax function recently put into ExactAFCalculationModel.
e) Skip bad reads where all Q=2 (sic)
f) Avoid log to lin and back to log conversions of genotype likelihoods - this was legacy code from back when exact model did stuff in linear domain. This improves precision overall.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4802 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 18:35:22 +00:00
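Note on ca7810f11d, items (c) and (d): the two optimizations described there can be illustrated with a minimal sketch. This is not the GATK code. The names below are hypothetical, and base-10 logarithms are an assumption, since the commit message does not say which base it uses. The idea is to precompute the per-quality terms once for every k <= 60, and to add two probabilities held in log10 space without converting them back to linear space.

```python
import math

MAX_QUAL = 60  # the commit caches values for all k <= 60

# Precompute once: error probability 10^(-k/10) for each quality k.
ERROR_PROB = [10.0 ** (-k / 10.0) for k in range(MAX_QUAL + 1)]

# Precompute once: log10(1 - 10^(-k/10)).
# k = 0 gives an error probability of 1, so the log is -infinity.
LOG10_NO_ERROR = [
    math.log10(1.0 - p) if p < 1.0 else float("-inf") for p in ERROR_PROB
]


def log10_sum(x, y):
    """Return log10(10^x + 10^y) while staying in log space."""
    hi, lo = max(x, y), min(x, y)
    if hi == float("-inf"):
        return hi  # both terms are zero probability
    # Factor out the larger term so the exponent is never positive.
    return hi + math.log10(1.0 + 10.0 ** (lo - hi))
```

Factoring out the larger term keeps the result finite even when 10^x and 10^y would both underflow to zero in linear space, for example log10_sum(-1000.0, -1000.0).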
ebanks e2d45ec2af Make Indel Realigner exceptions related to not enough space on disk or a too low file-handle limit UserExceptions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4801 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 16:37:31 +00:00
depristo 70980b659a CombineVariants no longer requires rod_priority_string
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4800 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 15:39:43 +00:00
depristo bc885b7bd0 Don't print debugging output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4799 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:57:11 +00:00
depristo c91712bd59 BAQ calculation refactoring in the GATK. Single -baq argument can be NONE, CALCULATE_AS_NECESSARY, and RECALCULATE. Walkers can control via the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the quality scores, or by only returning the baq-capped quality scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.
SAMFileWriterStub now supports BAQ writing as an internal feature.  Several walkers have the @BAQMode applied to this, with parameters that I think are reasonable.  Please look if you own these walkers, though

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4798 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:55:52 +00:00
depristo 5d2c2bd280 Just refactoring into utils/baq directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4795 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 17:43:43 +00:00
depristo 80f32712dc Tiny bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4793 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:48:33 +00:00
depristo 44feb4a362 Improved BAQ implementation. Now supports adding BAQ tags to reads on the fly with ADD_TAG_ONLY option. Caching fasta reader implementation, and changes throughout the system to enable this. Many performance improvements throughout the system due to better reference access patterns.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4792 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:29:39 +00:00
ebanks 8901e63879 Cheap optimization: don't keep calculating the log of a constant. (How did I not catch this before?)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4791 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 04:36:21 +00:00
ebanks bef48e7a42 For Chris, to make his life easier: iterate over all VCF records passed in looking for one with an ALT allele defined instead of assuming all records have one.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4789 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 02:23:38 +00:00
depristo 97c94176c0 Immediate, obvious bug fix to avoid blowing up on unmapped reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4788 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:43:39 +00:00
depristo a5b3aac864 Engine-level BAQ calculation now available in the GATK [totally experimental right now]. -baq argument to disable (NONE), to only use the tags in the BAM (USE_TAG_ONLY), use the tag when present but calculate on the fly as necessary (CALCULATE_AS_NECESSARY), and to always recalculate (RECALCULATE_ALWAYS). BAQ.java contains the complete implementation, for those interested. ValidateBAQWalker is a useful QC tool for verifying the BAQ is correct. BAQSamIterator applies BAQ to reads, as needed, in the engine. Let me know if you encounter any problems. Before prime-time, needs a caching implementation of IndexedFastaReader to avoid loading *lots* of reference data all of the time
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4787 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:23:06 +00:00
fromer b12cec4302 Added emitOnlyMNPs flag
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4785 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 20:34:17 +00:00
fromer 6d4ec7f9e7 Remove RefSeq INFO from MNPs since annotating them properly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4784 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 19:03:35 +00:00
fromer 4719bbc772 Changed dontRequireSomeSampleHasDoubleAltAllele parameter to mean that merging should only start at a polymorphic site
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4783 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 17:52:56 +00:00
ebanks ec174dc0ba As per Menachem's last commit, there's a minimally more efficient way of doing the MQ cap.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4782 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 16:37:08 +00:00
fromer 92cf7744a6 Set minMQ = max(minMQ, minBQ) for phasing since anyway we cap BQ by MQ; also, lowered MIN_BASE_QUALITY_SCORE for phasing to 17 (was previously 20)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4781 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 16:31:13 +00:00
ebanks 237ab1d489 1. As discussed in group meeting today, because we cap BQ by MQ, if MQ < minBQ then we filter the read.
2. Update to UGCalcLikelihoods for Chris: require a vcf bound to 'allele' to be provided so that we know exactly which alternate allele we should be calculating GLs for at each site.  The user is warned when the VC is not biallelic or there are multiple records at a site.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4780 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 05:57:06 +00:00
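Note on 237ab1d489, point 1: the reasoning is that once every base quality is capped at the read's mapping quality, a read with MQ < minBQ cannot contribute a single usable base, so it can be filtered up front. A minimal sketch of that observation follows. The function names are hypothetical and this is not the UnifiedGenotyper code.

```python
def capped_base_quals(base_quals, mapping_qual):
    """Cap each base quality at the read's mapping quality."""
    return [min(bq, mapping_qual) for bq in base_quals]


def usable_bases(base_quals, mapping_qual, min_bq):
    """Count bases whose capped quality still meets the min_bq threshold."""
    capped = capped_base_quals(base_quals, mapping_qual)
    return sum(1 for q in capped if q >= min_bq)
```

With min_bq = 17, a read with mapping quality 10 yields zero usable bases no matter how high its raw base qualities are, which is also why setting minMQ = max(minMQ, minBQ) in the phasing commit 92cf7744a6 loses nothing.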
delangel da6a07ad3b First round of critical fixes to indel genotyper (more to come tomorrow):
a) Avoid complete crash of caller that broke due to a recent refactoring by someone who must not be named <cough>EB<cough>... an integration test to avoid this in the future coming soon.
b) Fixed up strand bias computation for indels





git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4779 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 02:48:09 +00:00
fromer e09d6ee56b write non-MNP VariantContexts records only once (where they start)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4777 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 22:14:26 +00:00
fromer 1515bf6de9 Merged common VCF writing logic into phasing/WriteVCF.java
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4776 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 22:03:02 +00:00
asivache 4e62de4213 Added method getOriginalReadGroupId(): takes merged (in case of collision) read group id as reported by a read coming from the merged stream and returns this read's read group id as it was listed in the original input bam file.
IndelRealigner now uses this functionality to correctly un-mangle read group id's in --nWayOut mode (i.e. when we need to write reads into separate output bams with headers matching the original inputs).

Some hidden changes to IndelRealigner: purely testing and development, transparent to the users (hidden option added)

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4775 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 21:41:52 +00:00
rpoplin e5282742f9 Bug fix in CountCovariates, skip over indel records as well as SNPs in the dbsnp file. CountCovariates is now called CountCovariatesWalker. I've always hated that the name was swapped.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4774 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 18:43:24 +00:00
rpoplin 0adf505b53 We no longer look at by-hapmap validation status in the VQSR because using the HapMap VCF file is higher quality. As a side effect we now support the dbsnp 132 vcf file. ApplyVariantCuts now requires that the input VCF rod bindings begin with input, matching the other VQSR walkers. Wiki updated with information about how to obtain the hapmap and 1kg truth sets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4772 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 15:38:45 +00:00
ebanks 99b942b0b4 Removing duplicated header args
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4770 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 20:16:53 +00:00
fromer 9ac0f98d0d Fixed bug in retaining proper RefSeq records
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4768 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 18:39:02 +00:00
ebanks 7caf666f48 For Sendu: add a hidden option to allow bams to come out unsorted. We've agreed to let him deal with sorting these puppies on his own.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4767 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:56:13 +00:00
ebanks 3afa841a6a Fixing docs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4766 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:36:47 +00:00
ebanks 6a6cdc1925 Adding minor usage docs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4765 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:34:33 +00:00
ebanks 0d1c905df3 Adding UGCalcLikelihoods and UGCallVariants so that GSA members can break up the calling process into separate steps (calculate the GLs and then call off of those) - useful for Chris's new batch merger. As the docs say, these are absolutely not supported or recommended for public use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4764 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:32:26 +00:00
fromer b4ef716aaf As per Eric and Mark's suggestions, separated the segregating MNP merger (MergeMNPs) from the more general merger employed for annotation purposes (MergeSegregatingAlternateAlleles). Both use the same core MergePhasedSegregatingAlternateAllelesVCFWriter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4763 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 16:42:08 +00:00
ebanks 0892daddb0 Improvement for the TGEN folks: when running in the solid recal mode of SET_Q_ZERO_BASE_N, update the NM tag if one was present in the read to reflect the new N's in the read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4761 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 04:36:44 +00:00
asivache a22b1b04e6 SW-turbo. Kind of. This implementation is presumably equivalent to the old one (mathematically), but runs ~10 times faster: inner loops eliminated completely. The author of the original implementation should be sentenced to the galleys. Oh, that would be me...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4760 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 00:08:47 +00:00
delangel 2ac938fe4e 1)
Minor fixes to avoid crashes vs CG indel files:
- Add count for complex events, not just insertions and deletions
- Handle correctly cases of large indels falling out of bounds of histogram array: added a count of indels out of bounds and avoid exceptions.

2) Cosmetic fix for R script assessing UG calling performance: draw red y=x line on top of Simulated vs Estimated AC to get a better view of under/over-estimation of AC.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4758 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 21:08:25 +00:00
rpoplin af84462f3e The dev team has decided to change the filter that is added to records that are set to monomorphic by Beagle. It no longer lists the reference allele. Added those filters to the header of the output VCF file. Finally, we no longer use R2=NaN values coming from Beagle.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4757 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 17:19:54 +00:00
ebanks 21256909bb Not supported. I'm checking this in for Ryan only.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4756 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 16:59:18 +00:00
kshakir e21a66d876 Updated the Queue GATK generator and packaging to include more dependencies for fullCallingPipeline.q.
Set the -bigMemQueue in the FullCallingPipelineTest to GSA to avoid waiting for the week queue when it is busy.
Fixed the package definition of PipelineTest so that scalac won't recompile it every time.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4755 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 15:29:40 +00:00
aaron 7f2ded0706 belated special case fix for Menachem; if the results of a BTI and BTIMR produce an empty interval list, exception out. This would be solved long term with better handling of empty and/or null interval lists. I'll add a JIRA
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4754 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 05:49:20 +00:00
ebanks a181680814 We no longer require dbSNP files to be of the dbsnp rod-type; VCFs will do (provided they are bound to the name 'dbsnp')
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4753 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 03:25:18 +00:00
asivache 8ffea42b75 about 10% improvement in SW alignment (and hence IndelRealigner!) speed by using c-style linearized array representation for matrices instead of java 2D arrays...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4751 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 00:06:50 +00:00
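Note on 8ffea42b75: a "linearized" matrix stores all rows back to back in one flat array and computes each cell's offset by hand. The sketch below shows only the indexing scheme, with hypothetical names. The speedup reported in the commit is specific to Java arrays and is not something this snippet demonstrates.

```python
def make_matrix(n_rows, n_cols, fill=0):
    """Allocate an n_rows x n_cols matrix as one flat, row-major list."""
    return [fill] * (n_rows * n_cols)


def idx(row, col, n_cols):
    """Map a (row, col) pair to its offset in the flat list."""
    return row * n_cols + col
```

For a 3 x 4 matrix, cell (2, 1) lives at offset 2 * 4 + 1 = 9, so neighbouring cells of one row are adjacent in memory.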
aaron b03ac61e9d consolidating the checking of the RMD sequence dictionary against the reference into a single function, and adding an integration test to test that empty VCFs pass (both the indexing and the seq dictionary validation).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4750 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 00:01:56 +00:00
hanna abc13d0a90 Temporary hack: force abort with an intelligent message suggesting that users
specify -B:dbsnp,vcf <filename> if the filename passed in the --DBSNP argument
value contains 'vcf'.  We'll replace this functionality once dbSNP 132 starts
playing nicely with the tagging system.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4749 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 23:37:30 +00:00
ebanks d89e17ec8c Fare thee well, UGv1. Here come the days of UGv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4747 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 21:51:19 +00:00
fromer 727dac7b7a Added MNP annotation of the number of AA changes occurring in the SAME RefSeq entry (numAAchanges), and if this number is > 1 for any of the alt alleles (alleleHasMultAAchanges)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4746 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 21:24:30 +00:00
ebanks 222cd42ceb Have the UG engine take care of the GL to PL conversion. Note that we still use GLs for calling (since we are losing precision in high-pass and, even worse, it can affect QD), but we emit PLs in all cases. This means that calculating the GLs, emitting them to VCF, and then calling off of them (a la samtools) is absolutely, positively not ideal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4745 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 20:28:16 +00:00
ebanks 102c8b1f59 Large refactoring of the UGv2 engine so that it is now truly separated into 2 distinct phases: GL calculation and AF calculation, where each can be done independently. This is not yet enabled in UGv2 itself though because I need to work out one last issue or two. Tested on 1Mb of 1000G Aug allPops low-pass and results are identical as before. Also, making BQ capping by MQ mandatory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4744 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-28 21:36:33 +00:00
ebanks ce051e4e9a Write to stdout when no -o is provided
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4743 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-28 06:19:26 +00:00
ebanks e3e6d176df Looking over the daily error log email made me realize that there were 2 implementations of vc.modifyLocation() - the correct one in VC that didn't require lazy loading the genotype data and the bad one in VCUtils that did. Removing the implementation in VCUtils and updating the code accordingly. Also, removing createPotentiallyInvalidGenomeLoc() since no one uses it anymore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4736 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-26 18:40:34 +00:00
ebanks 35b90d2295 Don't compute SB for ref calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4735 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-26 03:54:26 +00:00
ebanks 6934f83cc7 Two changes to CombineVariants.
1. Fix: VCs were padded before the merge, but they were never unpadded afterwards.  This leaves us with a VC that doesn't meet our spec.
2. Update: instead of running the merged VC through every standard annotation (which seems really wrong, since this isn't the annotator tool), just update the chromosome count annotations (AC,AF,AN) through VCUtils.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4734 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-25 04:52:12 +00:00
fromer d775192631 Check if MNP annotation of amino acid is dependent on the MNP, or could it be obtained through some single-base variant?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4733 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 22:38:33 +00:00
rpoplin 0dd40c3684 Updating doc text
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4732 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 21:34:14 +00:00
rpoplin ed08899abc Overwhelming evidence that maxQ = 50 is now a better default than maxQ = 40 in the base quality score recalibrator, especially when combined with dbsnp build 132. Also, added option in ProduceBeagleInputWalker for Beagle-ing chromosome X calls with male samples which sets the genotype likelihood for the AB allele to zero for those samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4731 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 21:32:26 +00:00
fromer ca70ed611c Totally revamped the MNP annotation and put it in its own walker: AnnotateMNPsWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4730 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 18:05:10 +00:00
depristo 8768e1a240 Useful profiling tool that reads in a single rod and evaluates the time it takes to read the file by byte, by line, into pieces, just the sites of the vcf, and finally the full vcf. Emits a useful table for plotting with the associated R script that can be run like Rscript R/analyzeRodProfile.R table.txt table.pdf titleString
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4728 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 14:59:16 +00:00
ebanks 7a8b85dd15 Catch the JEXL exception when trying to match a variable that's not in the context - and don't filter in these cases. Now everyone can happily go back to using the stupid (and hopefully temporary) AlleleBalance filter.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4727 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 05:00:41 +00:00
ebanks caf2c21f61 Must close the writer to flush the cache
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4726 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 04:33:08 +00:00
ebanks 816c33c821 indel-related fixes to the strict validator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4725 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 04:08:34 +00:00
delangel 9cdc341be5 Trivial update for data processing paper: change syntax of output argument for Beagle by depth walker to update to new GATK format.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4724 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 01:45:44 +00:00
ebanks ea6e2218c1 1. dbsnp has some massive indels which my left-aligner was barfing on because there isn't enough reference context; fixed. 2. Lower default calling threshold to Q30 for UGv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4722 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 19:28:33 +00:00
aaron 53672361cc capture more details when something IO-related goes wrong in writing a Tribble index
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4720 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 17:06:28 +00:00
hanna 082073ca3c Stop RBP.getPileupBySample() from throwing a NullPointerException if the
sample doesn't exist -- now returns null.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4719 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 05:17:06 +00:00
kshakir 787e5d85e9 Added the ability to test pipelines in dry or live mode via 'ant pipelinetest' and 'ant pipelinetest -Dpipeline.run=run'.
Added an initial test for genotyping chr20 on ten 1000G bams.
Since tribble needs logging support too, for now setting the logging level and appending the console logger to the root logger, not just to "org.broadinstitute.sting".
Updated IntervalUtilsUnitTest to output to a temp directory and not the SVN controlled testdata directory.
Added refseq tables and dbsnps to validation data in BaseTest.
Now waiting up to two minutes for gather parts to propagate over NFS before attempting to merge the files.
Setting scatter/gather directories relative to the -run directory instead of the current directory that queue is running.
Fixed a bug where escaping test expressions didn't handle delimiters at the beginning or end of the String.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4717 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-22 22:59:42 +00:00
hanna 8ca5edf89f Fix issue where non-required file inputs can throw a NullPointerException
rather than a UserException when the input argument is specified without
an argument value. 
The magnitude of code required to fix this points to a need to give the
command-line argument system a good spring cleaning.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4714 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-22 01:49:17 +00:00
ebanks b9a59ea54f Adding Het/Hom ratio to the temp per sample metrics. Because I'm in a generous mood tonight, I'm going ahead and fixing the paths for the classes I'm touching...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4713 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-21 04:24:42 +00:00
ebanks cff7c6ddce These are user exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4712 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-21 02:08:11 +00:00
bthomas 374c0deba2 Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-19 19:59:05 +00:00
kshakir c723db1f4b Added a -summary jexl argument to VariantEval similar to -validate.
Updated the package of ValidationGenotyper to match the file location.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4710 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-19 04:42:46 +00:00
kshakir 79725f2d9c Excluding the QFunction log files from the set of files to delete on completion.
When a QGraph is empty displaying a warning instead of crashing with a JGraph internal assertion error.
Cleaned up code using the Log4J root logger and explicitly talking to a logger for Sting.
When integration tests are run detecting that the logger has already been setup so that messages aren't logged twice.
Updated from Ivy 2.2.0-rc1 to 2.2.0.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4707 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 20:22:01 +00:00
depristo 721e8cb679 VariantsToTable now supports wildcard captures. -F PREFIX* now captures all fields that begin with PREFIX, output as a comma-separated list of unique values. Added integration test for VariantsToTable since I find it so useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4706 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 18:54:59 +00:00
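Note on 721e8cb679: the wildcard capture can be pictured with the following sketch. It assumes a record's fields are available as a plain dictionary and that matching keys are visited in sorted order. Both are assumptions made for illustration, as is the function name; the commit message only says that -F PREFIX* captures every field beginning with PREFIX and emits the unique values as a comma-separated list.

```python
def capture_fields(info, field_spec):
    """Return the table cell for one -F specification.

    info is a dict of field name -> value for a single record
    (a hypothetical shape, chosen for illustration).
    """
    if not field_spec.endswith("*"):
        # Plain field name: return its value unchanged.
        return str(info[field_spec])
    prefix = field_spec[:-1]
    unique = []
    for key in sorted(info):
        if key.startswith(prefix):
            value = str(info[key])
            if value not in unique:  # keep each value only once
                unique.append(value)
    return ",".join(unique)
```

For a record with AC1=2, AC2=2, ACX=5 and AN=10, the specification AC* gives the cell 2,5, while AN gives 10.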
depristo 8cba86a69d Trivial code organization for the haplotype score
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4703 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 12:32:55 +00:00
hanna 9f356b6cd0 Package all walkers in org/broadinstitute/sting/gatk/walkers directory in release.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4702 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 02:33:45 +00:00
hanna 90711d445c Change the interface for RMDTrackBuilder, therefore always mandating the specification
of a sequence dictionary and related info.  This will hopefully eliminate the cases in
which the refseq track depends on a sequence dictionary / contig parser that hasn't been
specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4700 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 19:00:17 +00:00
fromer 367cc9135f Use VariantContext and Genotype accessor methods for attributes that will return null for unparseable data
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4699 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 18:19:56 +00:00
fromer 2f3578182a Added VERY preliminary version for merging refseq annotations as SNPs are merged
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4698 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 16:49:12 +00:00
fromer e2f7f33ce7 Added getIntegerAttribute()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4697 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 16:33:07 +00:00
depristo d86ab2becb JEXL expressions now generate exceptions, not warnings. Tools should catch the runtime exception to handle correctly. Removed unnecessary complexity from the JEXL contexts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4695 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 16:08:16 +00:00
delangel 539651de30 Initial version of Indel Statistics module for Variant Eval - not for general use yet, needs more verification and more work. Older IndelHistogram module will be obsolete with this new walker. Right now, for each sample (and for all samples), the following are computed:
- Number of insertions
- Number of deletions
- Length distribution for indels.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4694 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 15:52:01 +00:00
kshakir 01b721ab61 Passing ReviewedStingExceptions through the HMS.
Added a @Hidden experimental argument -validate to VariantEval that accepts external JEXL assertions; any assertion that does not evaluate to true throws an exception.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4692 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-16 21:50:42 +00:00
fromer 62f02bf30a Minor JAVA visibility updates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4690 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-16 15:28:58 +00:00
ebanks f1b0f3bc49 Putting my changes from earlier in the day back in after someone (rhymes with 'Dark') trounced on them with his last commit...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4687 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-16 01:55:50 +00:00
rpoplin b677080858 Initial checkin of the ValidationGenotyper. Not intended to be used by anybody yet. Only here for archival purposes at this point.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4685 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 22:33:49 +00:00
depristo ef2f6d90d2 VQSR now operates on LOD scores in the INFO field directly, and doesn't adjust the QUAL field. New format for tranches file uses LOD score. Old file format no longer supported. log10sumlog10() function, a very useful utility in MathUtils. No more ExtendedPileupElement! Robust math calculations in GMM so that no infinities are generated! HaplotypeScore refactored to enable use of filtered context. Not yet enabled... InferredContext getDouble and getInteger arguments now parse values from Strings if necessary
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4684 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 22:19:22 +00:00
hanna 5b83942cee - Fix DepthOfCoverage so that, when it abuses the ROD system by instantiating a track in onTraversalDone, it also supplies the correct sequence dictionary and parser.
- Changed RMDTrackBuilder to use SequenceDictionaryUtils.validateDictionaries for ref <-> ROD sequence dictionary validation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4683 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 20:34:04 +00:00
ebanks 2af508ef83 Better docs, as requested by Matt
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4681 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:24:15 +00:00
depristo 62be55376b no longer useful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4677 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 17:53:33 +00:00
ebanks 35382468ee Better error checking/output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4676 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 16:36:34 +00:00
depristo 7a3a464959 Finally, the logic is right
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4674 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 14:02:09 +00:00
depristo 8d66637fc2 Bug fix for VariantsToTable with filtered records
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4673 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 13:49:16 +00:00
depristo d76b87d6e3 Useful debug file output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4672 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 13:36:52 +00:00
ebanks 28142408ff Refactoring so that all counting in UGv2 is done on the filtered context. In particular, tests for empty pileups and too many spanning deletions now use the correct counts. Also, -all_bases mode now trumps all; this one is for you, chartl.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4671 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 05:01:12 +00:00
ebanks c7229abbf7 Get rid of 'meaningless and random values' that prevent Sendu from merging PG lines. I have to admit that he did have a good point there.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4670 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 03:35:12 +00:00
delangel cb1e8ad43a Temp bug fix for indel genotyper: if there are two or more variant contexts at a site, just choose the first one containing an indel and genotype that. There might be cases where IGv2 emits 2 indel variant contexts at the same ref location, which made us fail there. A better solution will be to form underlying haplotypes supported by reads and compute likelihoods of that.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4667 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-14 00:21:54 +00:00
depristo 82f9327b5e Throw the right exception
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4666 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-13 22:18:42 +00:00
depristo 44d0cb6cde New version of cutting routines for VQSR. Old code removed. Working unit tests. Best practice with testng integration test (everyone look at it). Walker test now allows you to not specify no. input files, if it can infer input counts from MD5s
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4664 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-13 16:19:56 +00:00
kshakir 673fa841a4 Updated PluginManager so that during testing Queue can dynamically compile and load separately multiple class directories into the same class loader.
Removed obsolete usages of PackageUtils with updated PluginManager.
Ported Queue interval utilities written in scala over to Sting's java IntervalUtils.
Added a very basic integration test to ensure that the fullCallingPipeline.q compiles.
Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test).
While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1".
Upgraded to scala 2.8.1 and updated calls to deprecated functions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 20:14:28 +00:00
depristo 988da428ae Bug fix for old style tranches file. ApplyVariantCuts moved over, and passes integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4657 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 14:38:26 +00:00
depristo c5f8c4dd0d VariantEval test for tranches file, plus cutting over VE to use the generic Tranches framework
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4656 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 13:52:40 +00:00
ebanks 69de3e51bf Better precision for the calculated AF value. Now looks at the total number of samples to determine how much precision is necessary. Also, changing default min BQ used for calling in UGv2 to Q17.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4655 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 08:31:40 +00:00
depristo ec83a4b765 Initial commit, without any tool changes, of a new infrastructure for determining tranches. This new version walks up from the lowest quality SNPs and determines Ti/Tv. This is marginally more stable than moving in the other direction when there are few novel variants (exomes). Can make a substantial difference in the size of the call set (10-20%). I'll hook it into the main system now. Includes a new class Tranche, isolated read/writing utilities that are now tested in TestVariantRecalibrator, which should be moved to UnitTest as soon as I can figure out how to do this on my mac.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4654 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 23:52:49 +00:00
depristo ed6396ed43 No longer getting the inet, it seems to potentially hang the JVM
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4653 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 23:49:42 +00:00
ebanks 2f6666a988 Correcting traversal statistics
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4652 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 22:46:58 +00:00
depristo dbde721dd0 Bug fix for filtered records
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4651 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 18:54:51 +00:00
aaron 698e5cf345 for GATK style codecs, make sure we fill in their GenomeLocParser from the RMDIndexer
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4650 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 18:44:15 +00:00
delangel 2f3be24a00 Improvement in exact allele frequency calculation model (still under test, but this is definitely better than what I had before). Instead of approximating log(10^x+10^y) as max(x,y), approximate full Jacobian formula max(x,y)+log(1+10^-abs(x-y)) with static lookup table for the second term.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4647 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 01:22:35 +00:00
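The approximation named in this commit, max(x,y) + log10(1 + 10^-|x-y|) with the second term read from a static lookup table, can be sketched as below. This is a minimal illustration under stated assumptions, not the committed code: the table size, the cutoff `MAX_DIFF`, and the function name are hypothetical.

```python
import math

# Hypothetical table parameters, chosen only for illustration.
MAX_DIFF = 8.0        # beyond this gap the correction term is treated as 0
TABLE_SIZE = 8000     # number of steps between 0 and MAX_DIFF
STEP = MAX_DIFF / TABLE_SIZE

# Precompute log10(1 + 10^-d) for d = 0, STEP, 2*STEP, ..., MAX_DIFF.
CORRECTION = [math.log10(1.0 + 10.0 ** (-i * STEP)) for i in range(TABLE_SIZE + 1)]


def approx_log10_sum(x, y):
    """Approximate log10(10^x + 10^y) without calling log or pow at run time."""
    diff = abs(x - y)
    if diff >= MAX_DIFF:
        # The smaller term contributes less than about 1e-8 in log10 units.
        return max(x, y)
    index = min(int(round(diff / STEP)), TABLE_SIZE)
    return max(x, y) + CORRECTION[index]
```

With x = y the table contributes log10(2), which the plain max(x,y) approximation drops entirely; that is the case where the correction matters most.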
asivache 2e0296fef9 NWayOut logic slightly changed: 1) results.list file is gone; 2) now with -nWayOut one can specify either a) suffix to attach to every output file (i.e. cleaned reads from inputK.bam will be sent to inputK.suffix.bam) or b) *.map tab-separated file that must list <input_name> <output_name> mappings, one per line, for every input file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4645 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 20:32:16 +00:00
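The two -nWayOut modes described in this commit (a suffix, or a tab-separated map of input to output names) can be sketched as follows. The helper names are hypothetical, and keeping the input's directory in suffix mode is an assumption; the commit does not say where the outputs are written.

```python
import os


def parse_map_lines(lines):
    """Parse '<input_name>\t<output_name>' lines into a dict (hypothetical helper)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        input_name, output_name = line.split("\t")
        mapping[input_name] = output_name
    return mapping


def output_name_for(input_bam, suffix=None, mapping=None):
    """Return the output file name for one input BAM."""
    if mapping is not None:
        # Map mode: every input file must be listed.
        if input_bam not in mapping:
            raise KeyError("no output mapping for " + input_bam)
        return mapping[input_bam]
    # Suffix mode: inputK.bam -> inputK.<suffix>.bam
    base, ext = os.path.splitext(input_bam)
    return base + "." + suffix.strip(".") + ext
```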
asivache a1adfb91ce And now @Hidden tags are really in place :-/
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4644 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 20:28:40 +00:00
asivache 68ce55148e (pseudo-)genotyping functionality added: force-emits calls (including REF) at specified locations. Currently @Hidden for testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4643 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 20:25:40 +00:00
hanna 8e36a07bea Convert GenomeLocParser into an instance variable. This change is required
for anything that needs to be simultaneously aware of multiple references, eg
Queue's interval sharding code, liftover support, distributed GATK etc.  

GenomeLocParser instances must now be used to create/parse GenomeLocs.
GenomeLocParser instances are available in walkers by calling either

-getToolkit().getGenomeLocParser()
or
-refContext.getGenomeLocParser()

This is an intermediate change; GenomeLocParser will eventually be merged
with the reference, but we're not clear exactly how to do that yet.  This
will become clearer when contig aliasing is implemented.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 17:59:50 +00:00
depristo 4759fdd2ac V1 of read and variant simulator and assessor. SimulateReadsForVariants generates BAM and VCF with given combinations of variant and read properties. AssessSimulatedPerformance produces a table suitable for analysis in R
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4637 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-08 21:01:33 +00:00
aaron 97db593efb making my last commit message actually true
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4636 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-07 18:26:23 +00:00
aaron be499fc986 making the reference optional (the GATK will set it on the first run if it's not included), and setting the seq index if they do supply it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4635 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-07 18:15:31 +00:00
ebanks e05af54f3e Found the cause of 80% of our non-called FNs: an excess of filtered bases were causing us to choose the wrong alternate allele. More details to dev team.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4634 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-07 03:39:57 +00:00
aaron 2a8c97a4a7 better error catching, as well as allowing for default index naming, <filename>.idx
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4633 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-06 19:12:19 +00:00
aaron cb2e26a004 by request, an indexer tool to create Tribble style indexes outside of the GATK
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4632 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-06 18:59:06 +00:00
depristo bbb890dd6c Bug fix for variants in VCF header fetching to avoid null pointer when a VariantContext tribble codec doesn't have a header
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4630 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-05 12:43:25 +00:00
ebanks c9dbd8f80a Bug fix for Tim: all point events must be treated equally
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4629 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-05 03:42:51 +00:00
rpoplin 913db5d1ab Unfortunately when annotating sites with the UG the -G None option was wiping out the single annotations added by -A options
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4625 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-04 19:27:23 +00:00
ebanks 816c86776e Walker description was wrong and it was bothering me
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4624 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-04 02:17:09 +00:00
ebanks 87f6738d4c Deprecated
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4623 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-04 02:07:40 +00:00
chartl 42e9987e69 Bug fix to GenotypeConcordance. AC metrics get instantiated based on number of eval samples; if Comp has more samples, we can see AC indices outside the bounds of the array.
Bug fix to LiftoverVariants - no barfing at reference sites.

AlleleFrequencyComparison - local changes added to make sure parsing works properly

Added HammingDistance annotation. Mostly useless. But only mostly.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4622 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-03 19:23:03 +00:00
fromer 3d27defe93 Fixed output stats (percentage denominator)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4621 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-03 18:47:06 +00:00
ebanks 4e109f58bf In preparation for Ryan's jumping into SLOD: getting rid of bad hack to ensure P(AF=i) is calculated in the strand-specific cases. With Mark's recent changes this is no longer necessary and just makes the code slower.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4620 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-03 03:44:59 +00:00
fromer 22d64f77ff Added hidden --outputMultipleBaseCountsFile option to detect cases where a single read has more than one base at the same position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4619 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-03 03:22:48 +00:00
fromer a885ecf046 When merging MNPs, the phased flag and the phase quality (PQ) are determined simultaneously
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4613 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-02 14:44:26 +00:00
hanna 861ee3e37a Changing testing framework from junit -> testng, for its enhanced configurability.
Initial test to see how Bamboo will respond.  More detailed email to follow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4609 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 21:31:44 +00:00
asivache fe3f78e1d3 make it full (absolute) path for the file names recorded in results.list
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4608 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 20:53:51 +00:00
asivache 2ac5e55130 typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4607 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 20:38:02 +00:00
asivache 0e6dd38936 In n-way-out mode, added printing names of all the output files into 'results.list' file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4606 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 20:37:38 +00:00
fromer 64599d1074 Added debugging message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4605 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 19:51:42 +00:00
fromer 639ecdc931 Noted in comment that using a single sample in MergePhasedSegregatingAlternateAllelesVCFWriter does NOT update any of the INFO fields, though this could be changed in the future...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4604 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 19:02:52 +00:00
fromer 8439f0aa61 Check for VCFConstants.MISSING_VALUE_v4 when retrieving INFO fields and consider such values as non-existent
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4603 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 17:51:35 +00:00
asivache aadd230636 N-Way-Out is back. Now uses SAMReadID to identify each read's source bam, so should be reliable. Interface is sort of ugly for now: to generate output file names, .bam is stripped from input file names, then the value of -nWayOut argument is pasted on (and all the output files are written into the current dir).
Unrelated change: in the sorted-target mode (when we read sorted target intervals one by one from a file), one can now specify multiple semicolon-separated interval files (all must be sorted). Not hugely useful probably, but makes --targetIntervals always process its values in exactly the same way, so we are consistent (it has been already taking ;-separated args in unsorted mode)

NwayIntervalMergingIterator: reads in multiple sorted GenomeLoc input streams (iterators) and presents them as a single sorted and merged stream

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4602 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 16:06:51 +00:00
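The idea behind NwayIntervalMergingIterator, presenting several sorted interval streams as one sorted and merged stream, can be sketched as below. This is not the committed Java class: intervals are modeled as hypothetical (contig_index, start, stop) tuples, and merging adjacent as well as overlapping intervals is an assumption.

```python
import heapq


def merge_sorted_intervals(*streams):
    """Yield merged intervals from several individually sorted streams.

    Each stream yields (contig_index, start, stop) tuples in sorted order.
    """
    current = None
    # heapq.merge lazily interleaves the already-sorted inputs.
    for contig, start, stop in heapq.merge(*streams):
        if current is not None and contig == current[0] and start <= current[2] + 1:
            # Overlapping or adjacent on the same contig: extend the open interval.
            current = (contig, current[1], max(current[2], stop))
        else:
            if current is not None:
                yield current
            current = (contig, start, stop)
    if current is not None:
        yield current
```

Because the inputs are consumed lazily, only one pending interval per stream plus the currently open merged interval need to be held in memory.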
depristo 23cb399a88 Reasonable first pass at a correct SB calculation. Simple utilities to support it. VariantsToTable no longer prints filtered sites by default. New non-standard variant eval module to print comp sites not present in eval (FN finder)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4601 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-31 12:41:52 +00:00
delangel 30fae5cf18 Major redo of exact AF computation for UnifiedGenotyperV2. Fact of life is, there's no way we can compute an exact QUAL field and keep performing the AF computation in linear probability space. In good sites with lots of samples, the ratio of Pr(AC=K*|D) to Pr(AC=0|D) can be 10^1500 or some ridiculous large number like that, which no double can represent. So, we abandon probability space and work now in log likelihood space, which has several major repercussions:
a) Sites are numerically well behaved now, but another hard fact of life is that the AF iteration is defined in linear Pr space, not in log likelihood space, and the math doesn't work out in log space. So, we need to convert back and forth from lin to log space.
b) As a consequence of a), the code got a major slowdown, and calling the 629 samples was about 15 times slower than before (sic).
c) To solve b), log10 of integers are now cached at init, and numerical approximations are now made. Most importantly, I'm using the approximation that log(exp(a) + exp(b)) ~= max(a,b) which seems almost inconsequential in practical performance but reduces computation time to what it was before. More detailed analyses are forthcoming. This approximation can be refined further on to avoid expensive log-exp conversions if further profiling and analysis deems it necessary.

Also, two other issues were solved:
a) Strand bias computation was actually wrong in the case where the optimal AC was bigger than max(forward reads,reverse reads). Now the code is exactly as buggy as the grid search model (all bugs are equal, but some are more equal than others)
b) Genotype likelihoods are now computed in a better way and if a likelihood < 0 we don't just cap to 0 but do something a bit smarter.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4600 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-31 01:26:04 +00:00
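The numerical point in this commit, that a ratio such as 10^1500 cannot be held in a double but is unremarkable as a log10 value, can be illustrated with a small sketch. The function names are hypothetical and this is not the UnifiedGenotyperV2 code; it only shows normalization done entirely in log10 space.

```python
import math


def log10_sum(log10_values):
    """Return log10(sum(10^v)) without leaving log space for the largest term."""
    largest = max(log10_values)
    if largest == float("-inf"):
        return largest
    # Every exponent is <= 0 after subtracting the maximum, so nothing overflows;
    # very small terms simply underflow to 0.0.
    return largest + math.log10(sum(10.0 ** (v - largest) for v in log10_values))


def log10_normalize(log10_values):
    """Turn unnormalized log10 likelihoods into log10 posteriors."""
    total = log10_sum(log10_values)
    return [v - total for v in log10_values]
```

Given unnormalized log10 values of 0 for AC=0 and 1500 for the best AC, the normalized log10 posterior of AC=0 is simply -1500, a number that has no usable representation in linear probability space.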
hanna d492621122 The TraversalEngine's habit of hanging onto old ROD states seems to have a bad
interaction with Tribble.  In Tribble, keeping these references in memory until
the shard is flushed means keeping one 512K character buffer per object in
memory.  Fixed by purging the reference to the object at the end of the 
shard traversal.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4599 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-29 17:09:58 +00:00
ebanks 1c056ea791 Users can now use VariantAnnotator to add annotations from one VCF to another. For example, if you want to annotate your target VCF with the AC field value from the rod bound to CEU1kg, you can specify -E CEU1kg.AC and records will be annotated with CEU1kg.AC=N when a record exists in that rod at the given position.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4598 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-29 16:38:31 +00:00
ebanks 1b3fc8ddd2 Doing things too quickly is also naughty. Thanks, Andrey. Now, we're even.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4597 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-29 14:50:04 +00:00
ebanks 58f7b4c595 Naughty use of assertions means that malformed records are not caught.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4596 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-29 14:41:38 +00:00
delangel 9a60e72364 Trivial change to LeftAlignVariants: make walker return number of aligned variants on map(), and print out the # of aligned variants at the end of the traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4595 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-29 02:03:36 +00:00
hanna 2f8057bf24 Cleanup for multithreading memory leak during integration tests...unregister MXBean at end
of traversal to avoid holding a reference to the microscheduler, which holds a reference to
the engine, which in turn holds a reference to the walker, which itself holds a reference to
all the data aggregated during the course of the traversal.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4594 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-28 18:37:42 +00:00
depristo 860de05a7c Bug fix for PL vs. GL in header. PL now truly default output for UGv2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4592 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-28 12:39:18 +00:00
depristo 9782dde3dd Bug fix for PL vs. GL in header. PL now truly default output for UGv2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4591 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-28 12:38:48 +00:00
ebanks fe3cfb067c very minor cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4590 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-28 02:11:33 +00:00
depristo cbce3e3c83 General support for both GL (log10) and PL (phred-scaled) genotype likelihoods. All walkers now use the Tribble GenotypeLikelihoods object for parsing VCFs with genotype likelihood fields. Please use GenotypeLikelihoods object from now on for seamless support for GL and PL tags. UGv2 now uses PL by default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4589 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-28 01:48:47 +00:00
fromer 15183ed778 Reduced header to single sample when useSingleSample arg is given (to prevent lots of pointless no-calls)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4588 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 23:02:10 +00:00
fromer 34538bf2b3 Added ability to focus only on a single sample and/or emit only merged records in MNP merger
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4587 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 20:41:05 +00:00
kshakir 5cdd7a7ba4 There's no such thing as a sam index, so the GATK extension generator doesn't need to add an @Input for them.
Updated a call to swapExt to specify the directory.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4586 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 20:39:03 +00:00
hanna 4c23b1fe9c Get rid of the static cache of ArgumentTypeDescriptors by making them an integral part of the
parsing engine.  Hugely lowers our memory footprint in integrationtests, but not yet enough to 
run Mark's new parallelized VariantEvalIntegrationTests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4585 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 19:44:55 +00:00
ebanks e112df20df Use a sorting VCF writer because records can flip positions during left-alignment
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4583 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 06:33:03 +00:00
ebanks 708e973911 Adding a walker to left-align indels in a VCF file (was able to reuse code from AlignmentUtils to do the hard part). The code correctly updates the alleles if they change. This makes it much easier to compare our indel calls to e.g. CG or dbSNP.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4582 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 06:08:26 +00:00
ebanks ec442086ec Minor refactoring of the cleaner allows me to add a trivial walker that left aligns the indels present in reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4581 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 03:39:10 +00:00
ebanks ffc0ed2b32 Renamed getName() to getSource() in VariantContext to be more accurate
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4579 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 02:21:41 +00:00
ebanks 52fc023d80 Added convenience methods to check/get the ID of the VariantContext
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4578 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 01:56:58 +00:00
fromer a7af1a164b Updated MNP merging to merge VC records if any sample has a haplotype of ALT-ALT, since this could possibly change annotations. Note that, besides the "interesting" case of an ALT-ALT MNP in a pair of HET sites, this could even occur if two records are hom-var (irrespective of using phasing). Note also that this procedure may generate more than one ALT allele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4577 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 01:50:36 +00:00
depristo e02aac0743 No longer print out 0 reads were filtered out... message when there were no reads seen at all
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4575 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 20:22:16 +00:00
depristo b085648141 Parallelized VariantEval. Refactored output to support parallel output style. Minor improvements to testing framework to enable easy executeTestParallel to run -nt 1 and -nt 4 by default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4574 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 20:21:38 +00:00
kshakir 8211cee0b2 Queue UI Improvements:
- Forcing user to set the temp directory via -Djava.io.tmpdir to avoid filling up /tmp.
- By default deleting job outputs tagged as intermediate.
- Defaulting pipeline to scatter count 1 (no reads deleted).
- Cleaning up temp classes even when scripting fails.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4573 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 19:49:08 +00:00
ebanks cedceb33cd My only experience with getting external groups (GAP,dbSNP) to use VCF has been painful at best, so I'm not holding my breath to get indels for CG in VCF. To that end, here's a oneoffs walker to convert from CG format to VCF for all 'del' & 'ins' types (but not 'sub' types, since they're too complex to code up in VCF and I don't care about them for now). rs ids are included.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4572 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 17:53:14 +00:00
ebanks 071799453c More complete fix to previous commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4571 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-25 20:47:37 +00:00
ebanks 67a776d53c Yikes! VariantEval was always loading genotypes unnecessarily when no sample list was provided because the order of the checks in the if statement wasn't optimal. This results in a massive performance penalty when running with many-sample VCFs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4570 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-25 20:30:23 +00:00
ebanks 0d97394c4f Add capability to liftover to do the right thing when sections of the genome are reverse complemented. This does not work for indels (we don't try to reverse complement) because we need to figure out what the hell to do about the fact that the 'base to the left' that we automatically add on will be wrong because the location of the indel actually changes when reverse complemented. Sheesh.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4569 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-25 20:03:03 +00:00
fromer c357ec775a Trivially phases any hom site (since it is always correct to continue the previous haplotypes by appending the same allele onto both haplotypes)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4568 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-25 16:58:41 +00:00
rpoplin da64183854 Fix for the case of the truth VCF file having multiple SNPs at the same locus.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4567 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-25 15:04:50 +00:00
hanna 3039c0de3c Retire old ROD syntax.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4564 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 23:52:11 +00:00
depristo 78e71c4167 Fisher exact makes a return. Seems to be working properly. Currently tagged as a work in progress. Needs to take the filtered context to be truly correct.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4561 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 20:35:44 +00:00
fromer f06f955e06 Added count of number of mergeable records (within specified distance cutoff)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4560 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 20:11:15 +00:00
depristo 84b6d2926b Useful walker that creates a new interval list with only the intervals overlapping the input sites list. Really a one-off walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4559 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 19:55:04 +00:00
depristo 78b4a1c240 VariantsToTable now supports the virtual TRANSITION field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4558 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 19:53:46 +00:00
hanna e6d61197e6 Disable OTF indexing when writing indices for temporary VCFs when running
with -nt option.  When last I checked in, Ryan was seeing a ~25% speedup 
per shard by not indexing.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4556 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 17:40:37 +00:00
depristo e6b008f87c Fixed >= vs. > test leading to failure to tolerate dynamic indexes that are created at *exactly* the instant the output VCF is closed too
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4555 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 16:11:14 +00:00
ebanks 72c5b75460 Tribble exceptions can be generated outside of the normal codec parsing code because we now lazy load the VCF genotype fields. I'm not sure how else to account for this (to make sure they show up as user errors and not GATK system errors) besides catching them here.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4554 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 15:22:17 +00:00
delangel e24f7fec47 Fixed indel genotyper which broke yet again because we can't just call context.getBasePileup() without checking again for its existence in the first place.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4553 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 15:17:11 +00:00
ebanks c0b4317311 Er, here's the right fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4552 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 15:08:25 +00:00
ebanks 181f901126 Fix for Ryan: don't pull reference sequence for the portions of reads that extend beyond the contig boundaries
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4551 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 14:38:26 +00:00
ebanks 9f76aed515 Fix for IDs 5zP7jJeffK2sdPH1BH4JBVSrQztVEDKP and nX0cuBjoqBW4NQFpM6dE13KpkCuYFpZu
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4550 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 14:05:27 +00:00
hanna d4feb99d9a For parallel ROD traversals, simplified reference sharding. Will replace
with a more sensible strategy for sharding w/o BAMs at some point after
ASHG.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4549 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 05:08:15 +00:00
fromer 60f88866dd Uses VCFConstants instead of hard-coded constants
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4547 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:49:01 +00:00
fromer 883b8ff80e Removed flush() method from VCFWriter interface; added takeOwnershipOfInner parameter in constructor of wrapper VCFWriters to designate if the Writer should close the inner Writer it receives on construction
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4546 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:48:00 +00:00
fromer 1ea43be976 Removed flush() method from VCFWriter interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4545 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:46:42 +00:00
chartl 3566ad2146 Wrong if statement.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4544 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 17:37:45 +00:00
chartl bf17f92b64 Do not look for samples in dbsnp binding
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4543 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 17:36:38 +00:00
ebanks 225cf49128 Implementing reference confidence estimate in UGv2 as per UGv1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4542 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 16:57:59 +00:00
delangel cf9c9ae241 Three important updates for Dindel genotyper:
a) Fix it up because it broke with a recent checkin to annotate vcf with unfiltered depth.
b) Printout of ref/alt alleles in output vcf was incorrect because the start/stop positions of associated GenomeLoc were incorrectly computed in case of a deletion.
c) Redid the Beagle input/output walkers so that they no longer assume that ref is a single base or that variant is a vcf, and generalized them to be indel-capable, so now the Beagle walkers can be used for indels as well.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4541 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 16:00:16 +00:00
ebanks 8f38ebf98e Throw a user exception when using the clustered SNP filter in the presence of ref calls. It's unfortunate, but until we get a windowed ROD context this is just too much of a headache to support.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4537 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 02:44:10 +00:00
kshakir 88a0d77433 Changed parsing engine to store the order of the argument bindings based on their definition in the class, moving "-T" to the front of Queue command lines.
Queue GATK generated .intervals is now a List(File) again removing special case handling in the generator.
Instead of using @Scatter annotation, using ScatterFunction instance to determine if a job can be scattered.
Implemented special VcfGatherFunction which only uses the header from the first file, even if the other files differ in their headers.
Added a -deleteIntermediates to Queue to delete the outputs from intermediate commands after a successful run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4536 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 21:43:52 +00:00
ebanks 91049269c2 Optimizations across the board, with help from Guillermo, Matt, and JProfiler. Too tired to give details now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4535 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 20:47:41 +00:00
fromer f76865abbc ReadBackedPhasing now uses a SortedVCFWriter to simplify, and has the ability to merge phased SNPs into MNPs on the fly [turned off by default]; MergeSegregatingPolymorphismsWalker can also do this as a post-processing step; Integration tests for MergeSegregatingPolymorphismsWalker were also added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4534 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 20:27:10 +00:00
fromer e8079399ac Added flush() method to VCFWriters
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4533 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 20:23:22 +00:00
fromer 00726b6c4b Added mergeIntoMNPs to merge successive VCF records into a single MNP VCF [if possible]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4532 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 19:40:26 +00:00
fromer 55230ce5f3 Added startsBefore, startsAfter, and minDistance [calculates distance between any pair of bases in the two GenomeLocs]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4531 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 19:12:34 +00:00
ebanks 4f77581087 More optimizations for HaplotypeScore: pulling final constants out of loops
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4530 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 17:40:57 +00:00
hanna 20fac43521 Add extra logging to the GATK run report at the start of metrics aggregation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4529 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 17:32:51 +00:00
ebanks a205900eff Naughty use of Strings in HaplotypeScore literally doubled the runtime of Unified Genotyper. Moved over to bytes and no longer allow Strings in the Haplotype util class. New round of profiling on tap for tomorrow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4528 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 03:32:21 +00:00
depristo f9541b78d3 Timing of traversal now starts at the start of the traversal, so the rate is reasonable right off the bat. For example, we now see: INFO 22:45:02,476 TraversalEngine - [TRAVERSAL STARTING]; INFO 22:45:32,484 TraversalEngine - [PROGRESS] Traversed to 2:50850686, processing 18,646 sites in 30.05 secs (1611.50 secs per 1M sites)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4527 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 02:47:34 +00:00
depristo f7ce18553e GenotypeConcordance now prints interesting sites more nicely. RMDTrackBuilder now uses the root class FeatureSource, not BasicFeatureSource.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4525 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 00:29:02 +00:00
ebanks 7a291a8ff3 First pass at a VCF validator. Will test more tonight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4524 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 19:55:49 +00:00
chartl 341e93ee12 The reference fixer seems to have munged the OMNI rather than making it better. Looks like some sites need to only have the ref and alt bases swapped, and others need to have the genotypes swapped as well? E.g.
some subset need
A  C  1/1   -->  C  A  0/0

while another subset need
A  C  1/1   -->  C  A  1/1

it's unclear how big these subsets are (or even if one is empty). What I do know is, doing the first one totally screws up concordance metrics for the 421-sample chip. So either something else needs to be done, or there's a bug in this walker. Until I know for sure, I've added an initialize exception to disable this thing...



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4523 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 12:50:24 +00:00
ebanks 5251f49a90 Including Marian Thieme's BaseCounts class (with some modifications)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4522 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 03:07:30 +00:00
hanna c5f105d050 Fix boneheaded mistake in the new interval filtering code I added on Sunday.
Sorry everyone.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4521 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 01:20:12 +00:00
ebanks 524cb8257c Renaming for accuracy
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4519 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 18:11:07 +00:00
ebanks 0fe504b748 Use filtered depth for Exact model (just like grid search)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4518 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 18:08:31 +00:00
ebanks d54d9880d7 Now that G's new genotyping algorithm is live, I've cleaned up the code to completely separate the grid search from the exact model. AlleleFrequencyCalculationModel is now completely abstract.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4517 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 18:04:06 +00:00
ebanks 80e5ac65b4 CAP_BASE_QUALITY needs to be included in the clone() method for it to be usable in UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4516 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 03:11:03 +00:00
hanna 6af9532090 Fix for GATK slowdowns at the ends of intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4514 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 23:21:23 +00:00
chartl 5889138f4a *facepalm*
forgot to add the samples to the header. How could the VCFWriter let me get away with something so boneheaded?!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4513 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 05:36:29 +00:00
chartl 2bc5971ca1 Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it.
Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. 

**IMPORTANT** I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However, this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC as being one class but recalculating it casts it to another.

I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 03:18:01 +00:00
ebanks 7aa030a9a4 Hmm. Apparently variants can get lifted over to different chromosomes. Who knew? Reverting changes from a couple of days ago. The only way to do this correctly (without requiring lots of memory) is to turn off on-the-fly indexing for this walker. Integration tests cover this now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4510 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 02:54:12 +00:00
chartl 8b2d387643 Added in an eval module that calculates the dispersion histograms between eval and comp (e.g. M_{i,j} = # of times eval observed to have AC i, comp AC j -- for af it's i/100 vs j/100 )
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4507 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 19:07:43 +00:00
ebanks f78ff08e2b This is less correct than my previous change but it's what UGv1 does and now is not the right time to start mucking with things.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4506 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 18:56:45 +00:00
ebanks 471c18054f Fix for SB calculation: the best overall AF might not have any mass when just looking at reads from a single strand. We need to compute the best AF for each stratification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4505 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 17:51:18 +00:00
asivache 42c3d74432 bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4503 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 16:27:40 +00:00
chartl c9d473edee More changes to Variant Eval and Genotype Concordance (passes all integration tests):
1: -sample can now include a file, which will be parsed for sample-name entries
2: If you request a sample to run analysis on, but it is not present in any of your RODs, VEW will exception out
3: Change added to parse Integer, String, and List<Integer> type Allele Count annotations (error otherwise)
4 [slightly problematic]: The count objects now maintain row-keys in order, as the keys were taking an inordinate amount of time in onTraversalDone (multiple calls to getRowKeys(), so many multiple sorts of the same underlying unsorted object, very bad)

There is a legacy comparison object which is unused which I will strip out soon.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4502 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 12:40:36 +00:00
ebanks 9f54170dff Hooking up the liftover tool to the new on-the-fly sorting VCF writer so that records can now get emitted in order.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4499 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 07:27:01 +00:00
ebanks d41c252b13 Looking over the calling results with Ryan, it's clear that while the grid search optimization (ignoring samples that are clearly ref) can work for assigning genotypes, it cannot be used for calculating P(AF>0). There's too much area under the likelihood curve that gets lost and the QUALs are negatively affected. However, testing showed that this only slightly affects runtime (~15 minutes per 1Mbase for the 1kg allpops). The optimization does remain for genotyping.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4498 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 19:06:32 +00:00
ebanks 2606e67cf1 Reverting Matt's change from yesterday which I accidentally blew away when trying to cope with the stupid svn update issues we've been plagued with recently.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4495 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 14:40:42 +00:00
ebanks cfb33d8e12 Filtering optimizations are now live for UGv2. Instead of re-computing filtered bases at every locus, they are computed just once per read and stored in the read itself. Eyeballing the results on the ~600 sample set from 1kg, we cut out ~40% of the runtime! QUALs are now sometimes different from UGv1 because I noticed a bug in v1 where samples with spanning deletions only were assigned ref calls instead of no-calls which ever so slightly affects the QUAL. Not a big deal though.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4494 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 05:04:28 +00:00
chartl 4ac636e288 Minor change: when tabulating concordance by AC, ignore sites with multiple segregating alleles in the population, at least for now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4493 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 01:35:33 +00:00
chartl 7c9ef59d65 This is simultaneously a minor and major change to VariantEval, so take heed:
The core walker has been modified so that when variant contexts (eval and comp) are subset to command-line-specified sample(s), the chromosome count annotations (AC/AN/AF) are altered to reflect the AC/AN/AF of only those samples involved in the comparison. No more getting AC500 when you're comparing a 10-sample overlap. Interestingly enough, this didn't break any integration tests.

GenotypeConcordance now has two additional tables: Allele Count Statistics, and Allele Count Summary Statistics. These work exactly identically to the Sample Statistics and Sample Summary Statistics tables, except that the partition being used is no longer the sample, but instead the allele count of the variant sites. These tables stratify by both eval and comp ACs, e.g.

evalAC0
evalAC1
evalAC2
compAC0
compAC1
compAC2

Differences with previous integration tests were verified to only be in the Allele Count tables (by grepping them out of the diff); a new test has been added for the simple case of an AC=1 site in the eval becoming an AC=2 site in the comp.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4491 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 22:26:15 +00:00
hanna 83b8676b69 Hack to fix mysterious disappearing read attributes. Ultimately caused
by the fact that the GATKSAMRecord, by design, needs to both inherit from 
SAMRecord and wrap a 'member' SAMRecord, and method calls that aren't
implemented as explicit passthroughs can compromise the content of the
SAMRecord in subtle ways.

Will be automatically fixed when Picard moves to a lightweight SAMRecord
interface rather than the current heavyweight implementation.  But in 
the short-term, there's no obvious fix.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4489 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 19:06:54 +00:00
depristo da29fcdb68 No longer writes the index to disk twice. Bug fixes for closing VCFWriters throughout the codebase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4488 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 14:26:06 +00:00
aaron 28a1020c89 comment out debugging line that was clogging the performance test output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4487 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 03:26:55 +00:00
aaron 272ac2ae4a more fixes for tests broken by indexing-on-the-fly; I think this should do it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4486 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 01:54:32 +00:00
hanna ed39af53cd Fix for exception when trying to load reference segment for a read that aligns
to 0 bases.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4485 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-12 23:50:51 +00:00
ebanks fe9f128631 Better fix for earlier bug.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4484 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-12 19:21:33 +00:00
aaron ff0df1a2da A fix for an integration test that was broken by on-the-fly indexing. Also, better reporting of Tribble exceptions in GATK integration tests. Trying to get the tests back up and running...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4483 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-12 18:39:56 +00:00
ebanks 69652e08c6 Bug fix for reads that completely fall within an insertion: the I cigar string element was 1 base too long.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4482 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-12 14:46:21 +00:00
kiran f348ca2976 Now processes VCF files with repeated loci without crashing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4481 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-12 04:36:07 +00:00
ebanks fd8351cd49 Get rid of useless test/'optimization' that was carried over from UGv1. New code is (minimally) faster with same results.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4478 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-11 04:04:07 +00:00
ebanks f28523e7de Implemented SB for UGv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4477 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-11 03:56:01 +00:00
hanna 7008a469dc Update MalformedReadFilter to pass reads that have cigar strings like 40S36I
that have 0 aligned bases in the genome.  We'll have to fix walkers as faults
appear.

Also added JIRA GSA-406: finer-grained control of MalformedReadFilter: want
to exception out by default in these cases but pass them with a warning with
a corresponding -U flag.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4476 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-11 03:01:04 +00:00
ebanks 530875817f Experimental code for better filtering of bases in sam records. Not hooked up yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4475 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-11 02:19:51 +00:00
ebanks a0de269c4b Better message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4474 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-10 20:11:51 +00:00
rpoplin 0a4cf02a52 Fix for index out of bounds exception in VR.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4473 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-10 17:35:15 +00:00
depristo 116309b3c3 More test cases for UG integration test. We currently fail doing multi-threaded gzip output, FYI
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4472 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 20:22:12 +00:00
depristo 38a67fed63 High performance version of standard vcf writer. New general static Tribble class for common constants, including general .idx constant and functions to get standard index name for a given file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4471 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 19:53:21 +00:00
fromer bdd3a9752e Changed min MQ and BQ to 20 (for phasing)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4469 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 19:27:45 +00:00
asivache 05500d1a8d An iterator wrapper/adapter: takes GenomeLoc iterators 1 and 2 and traverses intersections of intervals from 1 with intervals from 2. Both 1 and 2 must be SORTED and NON_OVERLAPPING, but this iterator does NOT perform any checks, so if these conditions are not met, the behavior is unspecified
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4468 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 16:34:00 +00:00
asivache 253d528e49 not ready for commit yet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4467 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:30:55 +00:00
asivache 4f2f33b42a fix method invocation to conform to new API; this version of the code will compile but new functionality is still not fully in
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4466 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:30:26 +00:00
asivache cece19d4d2 not ready for commit yet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4465 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:14:54 +00:00
asivache 39e373af6e deleting accidentally committed junk
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4464 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:13:01 +00:00
asivache b3d81984aa renaming MergingIterator to RODMergingIterator as it is more appropriate for this specialized implementation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4462 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 14:10:11 +00:00
asivache 77dddd0afa renaming MergingIterator to RODMergingIterator as it is more appropriate for this specialized implementation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4461 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 14:08:28 +00:00
chartl 21ec44339d Somewhat major update. Changes:
- ProduceBeagleInputWalker
 + Now takes a validation ROD and a prior to give it, will use those genotypes in place of the variant genotypes if both are present
 + Takes a bootstrap argument -- can use some given %age of the validation sites
 + Optionally takes a bootstrap output argument -- re-prints the validation VCF, filtering those sites used as part of the bootstrap
- BeagleOutputToVCFWalker
 + Now filters sites where the genotypes have been reverted to hom ref
 + Now calls in to the new VCUtils to calculate AC/AN

- Queue
 + New pipeline libraries for easy qscript creation, still a work in progress, but this is a considerable prototype
 + full calling pipeline v2 uses the above libraries
 + minor changes to some of my own scripts
 + no more need for contig interval lists, these will be parsed out of your normal interval list when it is provided



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4459 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 13:30:28 +00:00
ebanks 97b153f2fa Quick fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4457 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 06:10:52 +00:00
ebanks acd238f3f2 For Chris: pull out the chromosome counting code into VCUtils so that other tools can make use of it. Transitioned SelectVariants over to use it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4456 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 04:37:54 +00:00
delangel 3838823262 Two ugly hopefully temporary fixes for new genotyping model:
a) In Indel genotyper: we can't deal yet with extended events correctly and we are still triggering at each extended event, which results in repeated records in a vcf. So, to avoid this, keep track of the start positions of candidate variants we've visited, and if we've visited a variant before we don't do it again.
b) Avoid infinite terms in QUAL and in genotype likelihoods which can happen if posterior AF happens to be exactly zero. For now, hard-code a minimum value of each term of the posterior AF likelihood to be -300 (ie 1e-300 in lin space). This can be solved with better and smarter log-to-lin conversions and some precision fixes in AF calculation.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4455 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 00:53:16 +00:00
rpoplin 0de658534d Removed the qScale arguments in VariantRecalibrator. It is smarter about how it tries to find a cut so the arbitrary scale factor hopefully is no longer necessary. Now the recalibrated variant quality score more accurately reflects our believed lod of the call.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4451 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 18:04:57 +00:00
fromer ee00dcb79d 1. Phasing now ignores bases without minimum base quality (BQ) and minimum mapping quality (MQ); 2. The probability of a non-called base is now divided by 3, to evenly split up the error probability over the non-called bases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4450 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 17:40:59 +00:00
fromer f8f1cc45a3 Now ReadBackedPhasing caps Base Quality by Mapping Quality
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4445 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:48:57 +00:00
scalvo bda427f078 Change specification of AnnotationInputTable, and fix 2 bugs.
Previous output spec contained 3 columns:
 haplotypeReference,haplotypeAlternate,haplotypeStrand
where haplotypeReference was always on the + strand, and haplotypeAlternate was on the strand specified by haplotypeStrand.

The new specification contains 3 columns:
 haplotypeReference,haplotypeAlternate,transcriptStrand
where haplotypeRef and haplotypeAlt are required to be on the + strand.  transcriptStrand now specifies the strand of the transcript, which is needed for interpreting the haplotypes.

Bugfix #1: fix incorrect assignment of variantCodon and variantAA
(Previously variantCodon was incorrectly set to referenceCodon)

Bugfix #2: fix incorrect codingCoordStr values for - strands (bug reported by Giulio Genovese), and incorrect usage of "m." for mitochondrial transcripts (bug reported by Steve Hershman)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4444 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:46:09 +00:00
scalvo b5c127e643 Removed HAPLOTYPE_STRAND_COLUMN; Previously, GenomicAnnotation allowed a user to specify the strand of the haplotypeAlternate, and would reverseComplement the haplotypeAlternate if HAPLOTYPE_STRAND_COLUMN was "-". The new specification does not allow this functionality, and instead requires both the reference and the alternate haplotypes to be on the + strand (as in VCF format).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4443 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:37:41 +00:00
kshakir ca5db821ce Added the ability to Queue to run scala functions inside the JVM. NOTE: Extend from InProcessFunction instead of CommandLineFunction to use this functionality.
Queue now submits new LSF jobs only after previous functions have completed successfully.
When the Queue process is shut down (e.g. via Control-C), it sends a bkill command for any running jobs.
Ported commands like creating directories and scatter/gather interval list to scala functions.
Updates to LSF status tracking by porting the python to internally generated bash scripts.
Temporarily disabled job name submission to LSF.  Plus side is that the full command is now available in "bjobs -w".  TODO: Put back jobName passing to LSF based on an option?
Changed BaseTest to allow scala to access paths to references.
Changed the extension generator to default the analysis name to the walker "name".

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4442 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 18:29:56 +00:00
ebanks 3c5dc675ab For Guillermo: only decide that something is a clear reference call if it is at least 10 times as likely as the next best genotype
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4441 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 15:16:41 +00:00
depristo 00491fcd2e The "not writing GATK Run Report" message is now shown only if you are running with debug enabled
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4437 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:09:21 +00:00
rpoplin 69485d6a7a Added command line argument for the max value of the allele count prior in VariantRecalibrator (--max_ac_prior). Default value increased to 0.99 from 0.95.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4436 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:00:53 +00:00
ebanks 3d564f4a29 reverting an accidental change from the dindel merge
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4434 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 03:08:09 +00:00
ebanks b5e148140b Officially fixed the UG priors; updated the default min MQ/BQs to pipeline values of q20 and min calling threshold to Q50
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4431 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 18:35:36 +00:00
fromer c6668bd49c Fixed bug in phasing, where mapping probability was incorrectly raised to the power of number of non-null bases [instead, it is just multiplied into phasing probability once]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4430 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 17:07:31 +00:00
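The bug described in commit c6668bd49c can be illustrated with a small sketch. It assumes a read contributes one probability per phased site, with None marking a null base; the function and argument names are hypothetical and do not come from the phasing code itself.

```python
def phasing_weight_buggy(base_probs, mapping_prob):
    """Multiplies the mapping probability in once per non-null base, which
    effectively raises it to the power of the number of non-null bases."""
    weight = 1.0
    for p in base_probs:
        if p is not None:
            weight *= p * mapping_prob
    return weight


def phasing_weight_fixed(base_probs, mapping_prob):
    """Multiplies the mapping probability into the phasing probability
    exactly once, regardless of how many bases the read covers."""
    weight = mapping_prob
    for p in base_probs:
        if p is not None:
            weight *= p
    return weight
```

For a read with two non-null bases and a mapping probability of 0.5, the buggy version applies a factor of 0.25 where the fixed version applies 0.5.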
hanna 250c18e679 Error message fixes for the following issues:
nvjpM4yOwQAu3fNGxi4oXLuVpKn6aAlf,1GL0OuXK2xKQfvbu34tWYgbojSVSLo0l,
ehEGBJOfgc4V7qj8W0Homf5ICuVK5Sm3,cZsreLm1CbY3aYKZhV7DOSvQNwur41zp,
GlrlyGEyP9kJDIRCQNFQp7BGJBXSzdDJ,hyz1uiHXr39ANmdZu9K1epOSX8EL3mDw,
q0n4EucZESCI4LZhQik306zD4VAuH2cb.  

Messages:
camrhG5tHzlY9WUSEVpVZGkU1tyJqKb5,s0OX2g7nYRctJxyFoQCa6clac9IsjHyi,
THIAtjllvYNlnTmiMnJEIHd2Ju4gqQIO,jwVk3JYZJNHloW7HO4LeGxFexknqro0v,
BFNRGOGmGGJNNPZqgeF1ikTNFfskbyLc,...

Were fixed in 4392.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4428 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 03:37:13 +00:00
ebanks aa00801108 remove reference to -mrl
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4423 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 17:27:01 +00:00
chartl f978c25b9d Perhaps both, Eric. Perhaps both.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4422 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 13:56:04 +00:00
chartl 0eb777612a Swap "." over to VCFConstants.MISSING_DEPTH_v3
Why v3, you ask? Why not? Simply because v2 was a String so old and clunky, the sun would fizzle out and grow cold before any VCF could be successfully parsed.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4421 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 13:41:41 +00:00
chartl 74087c44ae Fixed a bug which caused a parsing exception when there was a variant with a dp field of ".", e.g. "GT:DP 0/1:." -- which can happen when using imputation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4420 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 12:37:36 +00:00
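A minimal sketch of tolerating a missing depth value such as "GT:DP 0/1:.", as described in commit 74087c44ae. It is not the GATK VCF codec; the function name and the choice of None for a missing value are assumptions made for illustration.

```python
def parse_genotype(format_keys, sample_field):
    """Split a VCF FORMAT string and one sample column into a dict.

    Hypothetical helper: a DP value of "." (missing) becomes None instead
    of being passed to int(), which would raise a ValueError.
    """
    keys = format_keys.split(":")
    values = sample_field.split(":")
    record = dict(zip(keys, values))
    dp = record.get("DP")
    record["DP"] = None if dp in (None, ".") else int(dp)
    return record
```

Calling parse_genotype("GT:DP", "0/1:.") yields a record whose DP is None, while "0/1:12" yields an integer depth of 12.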
ebanks 6448753cf7 Renamed SequenomValidationConvertor to VariantValidationAssessor, since it no longer handles ped/sequenom files (it works on vcfs/variantcontexts instead). Updated all of the wiki docs, including adding instructions on how to convert ped files to vcf, a la Shaun Purcell. We now officially no longer support ped files, everyone. Other misc cleanup in the code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4419 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 02:11:38 +00:00
ebanks d8db48204e Fix typo and tell people not to post user errors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4415 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-03 18:58:03 +00:00
ebanks 490e5e1b0f Better error when bad ref bases are provided
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4414 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-03 05:40:37 +00:00
aaron 64b7b3f83b fix for a recent change to the indexing code where we ignore the results of locking the file (this is bad), and as a result don't write the index; this should fix the build.
Off to Yosemite in 4 hours, enjoy the week gsa folks!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4410 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-02 04:35:11 +00:00
depristo 7551ba8249 Trivial refactoring in preparation for on-the-fly indexing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4409 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 22:32:59 +00:00
rpoplin 2f7892601c Useful debugging argument added to VariantRecalibrator to only use sites whose qual field is above --qual
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4406 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 21:08:55 +00:00
hanna 575c38fc04 Accidentally failed to commit a file; adding the missing file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4405 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 20:26:51 +00:00
delangel d4398f2686 silly bug fix: if I'm to do a short term hack to avoid -infinity likelihoods I might as well do it right.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4403 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 18:39:45 +00:00
hanna 8d25a5f9f2 A mechanism for supplying attribution text -- mainly useful for external
walkers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4402 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 18:31:19 +00:00
delangel e920badcc4 Temporary fix for case where genotype likelihoods are exactly (1,0,0) or (0,1,0) etc. at a site with new indel genotyper: this would make us blow up when converting to log space and try to assign genotypes at a site. A more robust solution is in the works.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4401 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 17:43:43 +00:00
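The failure mode in commit e920badcc4 comes from taking the logarithm of a likelihood that is exactly zero. The commit does not say how the temporary fix works, so the sketch below only shows one common workaround, clamping to a small floor before converting to log space. The function name and the floor value are assumptions.

```python
import math


def to_log10(likelihoods, floor=1e-300):
    """Convert linear-space likelihoods to log10 space.

    Hypothetical helper: values are clamped to `floor` first, so an exact
    zero such as in (1, 0, 0) maps to a large negative number instead of
    -infinity (or, in Python, a ValueError from math.log10).
    """
    return [math.log10(max(p, floor)) for p in likelihoods]
```

With this clamp, to_log10((1.0, 0.0, 0.0)) returns three finite values, so downstream genotype assignment can still compare them.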
rpoplin b83fdf8a17 Bug fix in AnalyzeAnnotations. Be sure the site is a biallelic, unfiltered SNP.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4400 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 13:09:46 +00:00
delangel fa9c21c020 More fixes for exact AF calculation model in new unified genotyper:
a) Fixed bugs in new dynamic programming-based genotyper
b) Fixed up temp hack that handles extended pileups for now.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4398 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 02:32:50 +00:00
delangel eb67aee732 bug fix: forgot to uncomment code to compute genotype likelihoods
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4397 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 21:38:22 +00:00
delangel ece694d0af Next iteration on new UG framework:
- Brought over exact AF estimation from branch (which is now dead). Exact model is default in UnifiedGenotyperV2.
- Implemented completely new genotyping algorithm given best AF estimate using dynamic programming, which in theory should be better than both greedy search and any HWE-based genotyper.
- Integrated and added new Dindel likelihood estimation model.
- Corrected annotators that would call readBasePileup: since we can be annotating extended events, best way is to interrogate context for kind of pileup and either readBasePileup or readExtendedEventPileup.

All changes above except last one are still in playground since they require more testing.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4396 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 21:33:59 +00:00
hanna bf7fd08810 Fix newly-introduced bug in the PluginManager/DynamicClassResolutionException
where, when the system can't find a plugin of the correct name, the system
prefers to crap all over itself and throw an unintelligible NullPointerException
rather than displaying an intelligent error.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4393 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 19:07:05 +00:00
hanna 14e19f4605 (Slightly) better exception text when SAM/BAM output file can't be created.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4392 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 18:43:22 +00:00
hanna 1fb8c86f6d Looks like we've got two competing models for an empty interval list: null and
the empty list.  Score another victory for the integration tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4391 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 17:11:47 +00:00
hanna 78343be52c At some time in the recent past, we lost our ability to process the '-L all'
argument.  Brought it back, and added an integration test to make sure it
stays around.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4390 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 15:58:43 +00:00
delangel e80742e72f Use -o as argument for output file in ProduceBeagleInputWalker, to be consistent with other walkers (you're welcome, chartl :)).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4386 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 22:46:39 +00:00
hanna 732aa32758 Every Sting app from now on will be forced into the US English locale.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4385 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 21:55:21 +00:00
fromer 20ffe484bc Added detection and INFO field marking of phasing inconsistencies (and optional filtration using --filterInconsistentSites)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4384 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 19:28:56 +00:00
rpoplin a6c7de95c8 By using the AC info field instead of parsing the genotypes we cut 78% off the runtime of VariantRecalibrator. There is a new argument to force the parsing of genotypes if necessary. Various other optimizations throughout.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4383 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 18:56:50 +00:00
ebanks 2d1265771f Fix for G: make sure to generate the genotype conformations in the grid for the target frequency when not using grid search for anything except the conformations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4382 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 16:44:53 +00:00
delangel 4556e3b273 First iteration in filling up exact AF calculation with new refactored UG. Code computes EM iterations of exact AF spectrum and returns to caller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4381 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 16:21:54 +00:00
ebanks 0d71dff928 Small bug fix to the new UG (need to initialize the entire posteriors array) means that we now also get results identical to the old UG when calling with 60 samples in the pilot1 data. Now that I'm happier with UGv2, I've transitioned it to use the correct AF priors instead of the busted ones still in the old UG.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4379 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 14:24:50 +00:00
hanna eee134baf2 Chris found a bug in the downsampler where, if the number of reads entering
the pileup at the next alignment start is large, we don't add as many of those
incoming reads as we should.  No integration tests were affected.

Thanks, Chris!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4378 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 11:18:12 +00:00
ebanks 0ec07ad99a Initial version of refactored Unified Genotyper. Using SNP genotype likelihoods and GRID_SEARCH AF estimation models, achieves the exact same results as original UG on 1-2 samples with the exception of strand bias (not implemented yet); other than that I have no idea. Needs tons more testing. Do not use. For Guillermo only.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4377 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 08:42:25 +00:00
kshakir 6df7f9318f For enums generate the full path to the Enum type to avoid collisions such as enum Model and enum Model used in the same class.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4376 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 05:28:59 +00:00
fromer e322e71c2f Restored SVN history for phasing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4373 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 00:02:02 +00:00
fromer 720aaca8a0 Trying to restore SVN history for phasing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4372 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 23:50:28 +00:00
fromer bf88117ead Trying to restore SVN history for phasing directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4371 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 23:48:24 +00:00
fromer dfb5143a41 Restore folder
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4370 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 23:46:07 +00:00
fromer 7c909bef82 Moved phasing classes out of playground! The code is still under production, though...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4369 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 23:21:28 +00:00
fromer 8d8980e8eb Fixed phasing algorithm to: 1. More correctly weed out irrelevant reads and sites; 2. Crudely flag sites with large phase discrepancies between reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4368 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 23:02:53 +00:00
chartl 5a5c72c80d Accidentally committed some debug output to PackageUtils, reverting change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4367 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 21:58:42 +00:00
chartl 862c94c8ce Small change for Matt -- output partition types in lexicographic order.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4365 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 20:08:03 +00:00
ebanks 7ad87d328d Make sure to uppercase ref bases since they aren't coming from the engine
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4364 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 19:05:46 +00:00
bthomas 96cccafb0d Adding a few helper methods for accessing sample metadata, and associated unit tests. These are motivated by discussion with Ryan about how he'll use sample metadata in VariantEvalwalker - hopefully will make it easier for him. Methods are:
-- getToolkit().subContextFromSampleProperty(): filters a VariantContext to genotypes that come from samples that have a given property value
-- getToolkit().getSamplesWithProperty(): gets all samples with a given property
-- getToolkit().getSamplesFromVariantContext(): sample objects that are referenced by name in a VariantContext



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4361 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 02:16:25 +00:00
ebanks 1034853a84 Adding 'solexa' to list of known/supported platforms
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4357 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-27 02:38:38 +00:00
aaron 70f03a7113 first pass of well-formatted tribble exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4352 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-25 03:29:33 +00:00