Commit Graph

4122 Commits (57353294cc63f9d9580557c1b27bcf21d1dab6eb)

Author SHA1 Message Date
kshakir 2ef66af903 Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high.
Moved the BamListWriter from FCP to ListWriterFunction in the Queue core.
Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s.
Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:33:58 +00:00
asivache 04d66a7d0d Updated the integration test's MD5s to reflect the fact that assay sequences were previously designed incorrectly for indels; the bug is now fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5120 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:00:22 +00:00
scalvo 5934b9cb82 Augment function isChrM by allowing "CRS" in addition to "chrM" or "MT", as a standard contig name indicating the mitochondrial chromosome. CRS stands for Cambridge Reference Sequence and is the standard in the field.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5119 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 22:45:45 +00:00
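A minimal sketch of the contig-name check this commit describes. The function name `is_chr_m` and the exact, case-sensitive comparison are assumptions for illustration; the real `isChrM` implementation is not shown in this log.

```python
# Contig names the commit message lists as denoting the mitochondrial chromosome.
MITO_CONTIG_NAMES = frozenset({"chrM", "MT", "CRS"})


def is_chr_m(contig_name: str) -> bool:
    """Return True if the contig name is one of the accepted mitochondrial names.

    Assumption: names are compared exactly (case-sensitive).
    """
    return contig_name in MITO_CONTIG_NAMES
```

Keeping the accepted names in one set means a further alias would only need a one-line change.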
asivache 7af0532292 An attempt to have more intelligent sorting of RODs. Tested with maf only so far. Should be able to reference-sort dbsnp, bed and vcf as well, bugs notwithstanding. Very simple, brute-force implementation using SortingCollection. Should I have used tribble indexing machinery instead?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5118 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 22:10:07 +00:00
asivache fa8963522b Ignore header line if it happens to be passed to the codec again, instead of crashing on it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5116 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 21:44:33 +00:00
asivache 8d389e149f Now can deal with input files that contain multiple copies of the same event. Only one assay sequence will be designed for each distinct variant, redundant variants will be discarded. Redundancy is defined as same start, same variant type, same ref and alt alleles (it does not matter, e.g., what the sample was as we do not record sample information anywhere).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5115 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 21:42:29 +00:00
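The redundancy rule in this commit (same start, same variant type, same ref and alt alleles, sample ignored) could be sketched as follows. The `Variant` record and its field names are hypothetical, and including the contig in the key is an assumption, since "same start" presumably means the same position on the same contig.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical record type; field names are illustrative only."""
    contig: str
    start: int
    variant_type: str          # e.g. "SNP", "INS", "DEL"
    ref: str
    alts: Tuple[str, ...]
    sample: str                # deliberately ignored when comparing


def distinct_variants(variants: Iterable[Variant]) -> List[Variant]:
    """Keep the first occurrence of each distinct variant, in input order."""
    seen = set()
    kept = []
    for v in variants:
        # Redundancy key: position, type, and alleles; the sample is not part of it.
        key = (v.contig, v.start, v.variant_type, v.ref, v.alts)
        if key not in seen:
            seen.add(key)
            kept.append(v)
    return kept
```

With this key, two records that differ only in `sample` collapse to one, so a single assay sequence would be designed for them.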
fromer f2de39d661 Calculates phase concordance rates between trio and RBP-phasing tracks, stratified by trio status (Het3, non-Het3)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5114 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 20:50:01 +00:00
fromer ffd5f407a5 Retain only a single walker to perform calculation of haplotype extents
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5110 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 18:33:32 +00:00
depristo 2182b8c7e2 Better query start / stop function that directly parses the cigar string, unlike the previous version. Now properly handles H (hard-clipped) reads. Added -baq OFF and -baq RECALCULATE integration tests on all three 1KG technologies. Please let me know if this new code somehow fails.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5108 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 15:08:21 +00:00
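One plausible reading of "query start / stop" is the span of aligned bases within the stored read sequence. The sketch below computes that span straight from a CIGAR string. It is an illustration, not the committed code: the function name and the 0-based, half-open return convention are assumptions. It relies on the SAM convention that `H` (hard clip) and `P` (padding) consume no stored read bases, `S` (soft clip) consumes stored bases that are not aligned, and `D`/`N` consume only reference.

```python
import re
from typing import Optional, Tuple

CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def query_start_stop(cigar: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (start, stop) of the aligned bases in the stored read sequence.

    Coordinates are 0-based and half-open. Returns (None, None) if the CIGAR
    contains no aligned read bases.
    """
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    pos = 0                      # index into the stored read bases
    start = stop = None
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "S":
            pos += length        # soft clip: bases are stored but not aligned
        elif op in "MI=X":
            if start is None:
                start = pos      # first aligned read base
            pos += length
            stop = pos           # one past the last aligned read base so far
        # H and P consume no stored bases; D and N consume reference only.
    return start, stop
```

For example, `query_start_stop("5H3S10M2I4M6S2H")` skips the hard clips entirely and counts the three soft-clipped bases before the first aligned base.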
kiran 9cb1ae384c Constant precision for floating point numbers. Added integration test - carries over tests from VariantEval with the necessary modifications to command-line arguments and md5s. Disabled use of 'synchronized' keyword because I clearly don't get how that keyword is supposed to work yet...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5107 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 05:19:18 +00:00
depristo f29bb0639b Documentation and cleanup of the distributed GATK implementation. Detailed documentation -- given that Matt will be extending the system in the near future -- about how the locking and processing trackers work. Added error trapping to note that distributed, shared-memory parallelism isn't yet implemented, instead of just not working silently. General utility function for the analysis of distributedGATK operation in the analysis directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5106 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:40:09 +00:00
asivache f036a178f1 Added support for MAF features. So far works for MAF Lite only, annotated MAF is NOT TESTED yet AT ALL.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5105 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:20:46 +00:00
fromer 91e4bb0285 Added walker to calculate haplotype lengths for ALL fragments produced by stitching together phased sites (actually, stitching together everything BUT unphased het sites)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5104 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:09:20 +00:00
asivache ac3fd567b4 Ugly off-by-one error fixed in building design sequences for indels: the event position is immediately *before* the event, so the ref base at the current locus is the base immediately *before* the [ref/alt] element
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5103 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 02:53:03 +00:00
kiran 3e9f185dad Fixed issue with GenotypeConcordance being initialized incorrectly when the first seen comptrack had no samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5102 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 01:12:27 +00:00
kiran 58f0ecff89 Fixes to support evaluations with TableType elements - each such object now gets a separate entry in the output table. Added codon degeneracy stratification. Handle null elements in reports (useful for debugging).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5101 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 22:09:59 +00:00
hanna a264b16358 Patch from Brett (with minor tweaking by me) to expose all the relationships
of a particular sample in hash format.  Thanks, Brett!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5100 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 21:46:13 +00:00
fromer 9c728979cc In order to calculate haplotype lengths of trio+RBP, I implemented a simple trio phaser as an option to ComparePhasingToTrioPhasingNoRecombination, which already decides if the trio could theoretically phase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5099 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 20:17:48 +00:00
depristo 5ed128f839 Slightly more tolerant timing setting. Main() method in GenomeLocProcessTracker to generate timing data for trackers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5097 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 15:16:07 +00:00
depristo 61c29d550d Fix for NullPointer where a run starts but there's nothing to do (no shards) and reduceInit() wasn't being called correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5096 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 15:15:10 +00:00
depristo f522eb2848 Previous tests were just too big...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5095 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 13:48:38 +00:00
kiran 2901299ff6 Sets the number of samples to all of the samples in the file when it's not specified on the command line explicitly. GenotypeConcordance is no longer a standard evaluation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5094 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 01:38:26 +00:00
hanna 4a33cdacde Some basic integration tests detecting breakage in OTF BAM index generation.
Doing it manually for the moment so that there's at least something testing
this capability; will followup eventually with Mark to see whether we can
shape the VCF index generation code in such a way that it supports BAM index
testing as well.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5093 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 23:48:04 +00:00
fromer 466f8f8a3c Compares RBP phasing to a simple trio phasing model that can phase a child het iff both parental genotypes are known and at least one of them is not het [at EACH of the sites in the pair to be phased]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5092 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 23:43:29 +00:00
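The phasing condition stated in this commit can be written down directly. This is a sketch of the stated rule only, with hypothetical names and a hypothetical genotype representation (a pair of allele strings, or `None` when the genotype is unknown). It does not model Mendelian violations or anything else the actual walker may check.

```python
from typing import Optional, Sequence, Tuple

Genotype = Tuple[str, str]                      # hypothetical diploid genotype
ParentPair = Tuple[Optional[Genotype], Optional[Genotype]]


def is_het(gt: Genotype) -> bool:
    """A genotype is heterozygous when its two alleles differ."""
    return gt[0] != gt[1]


def trio_can_phase_site(mother: Optional[Genotype],
                        father: Optional[Genotype]) -> bool:
    """Both parental genotypes are known and at least one of them is not het."""
    if mother is None or father is None:
        return False
    return not (is_het(mother) and is_het(father))


def trio_can_phase_pair(sites: Sequence[ParentPair]) -> bool:
    """The condition must hold at EACH site in the pair to be phased."""
    return all(trio_can_phase_site(m, f) for m, f in sites)
```

A pair of child het sites would count as trio-phasable under this rule only if neither site has two heterozygous parents and no parental genotype is missing.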
asivache 43812a28fc If among all the multiple alignments for the given read we have 'unmapped' ones (can happen with bwa 0.5.7 and maybe later versions), then discard the latter and keep only the mapped ones. Keep 'unmapped' only if it's the only alignment available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5090 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 20:07:08 +00:00
asivache 63b709d992 When remapping the read, set MAPQ, CIGAR etc to 0/null for unmapped reads. This is not required according to spec but current samtools jdk otherwise dies in STRICT validation mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5089 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:49:07 +00:00
ebanks d33162145b Moving the --sites_only argument up into the VCFWriter itself so that any walkers that write VCFs can choose not to emit genotypes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5088 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:38:16 +00:00
kiran a97184fddf Frick! Changed to refer to the *playground* version of VariantEvaluator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5087 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:33:03 +00:00
kiran a9d0772516 When evaluating JEXL expressions, don't blow up if the eval VC is null
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5085 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 18:25:03 +00:00
kiran 22e599ec76 Fixed output report to properly handle evaluation modules with TableType objects. Promoted CpG to a standard stratification. Demoted Filter to a non-standard stratification. Now, if the filter stratification is not specified, VariantEval only evaluates PASSing sites.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5084 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 17:38:21 +00:00
ebanks 2dcce58279 oneoffs walker to assess GLs at truth sites
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5083 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:59:05 +00:00
ebanks dfc5a3d1f3 added integration test for --sites_only option
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5082 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:58:15 +00:00
ebanks 0429301536 Added ability to output just sites (no genotypes) from UG with the --sites_only argument. Note that we do still genotype in this mode so that the INFO annotations are identical, but we strip the genotypes out of the VC right before writing to output. In other words, this is not designed to make UG go faster; the point here is to allow downstream tools not to have to parse GTs if they don't want to. Here you go, Ryan.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5081 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:52:38 +00:00
ebanks 01e032e89c Missorted BAMs are User Exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5080 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 14:09:39 +00:00
depristo be697d96f9 An apparently robust implementation of the file locking for distributed computation, using Lucene's file creation locking approach. It is worth trying out for those with large-scale, high-cost data sets. Details and discussion at group meeting on Wednesday. Some cleanup still needed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5079 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 13:45:40 +00:00
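Locking by file creation generally works by treating the atomic, exclusive creation of a lock file as acquiring the lock and its deletion as releasing it. The sketch below illustrates that general idea in Python. It is not the GATK or Lucene code, the function names are hypothetical, and it makes no claim about behaviour on any particular network filesystem.

```python
import os


def try_acquire(lock_path: str) -> bool:
    """Try to take the lock by exclusively creating the lock file.

    O_CREAT | O_EXCL makes creation fail if the file already exists, so at
    most one caller succeeds until the file is removed again.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False             # another process already holds the lock
    os.close(fd)
    return True


def release(lock_path: str) -> None:
    """Release the lock by deleting the lock file."""
    os.remove(lock_path)
```

A caller would typically retry `try_acquire` with a delay until it succeeds, and call `release` in a `finally` block so that a failure while holding the lock does not leave the file behind.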
hanna 862b299b47 Fix Picard OTF index generation issue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5077 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 03:42:46 +00:00
fromer 6ac888d26a Correct accounting for cases where first het in interval is phased
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5075 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 19:48:54 +00:00
fromer af79fa629f PROPERLY print out list of intervals and their stats
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5074 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 19:20:36 +00:00
delangel db2e2cb0ff Another trivial change to make VQSR work with indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5073 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 19:05:31 +00:00
fromer 17ba75e502 Can now print out list of intervals and their stats
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5071 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 18:36:59 +00:00
kshakir 9923e05e0a Moved MD5 utils from WalkerTest to BaseTest for use by PipelineTests.
Moved VariantEval validation from FCPTest to PipelineTest.
Cleaned up some duplicate code for writing temp files during tests.
Moved FCPTest to playground namespace to match move for FCP.q.
Added a basic HelloWorldPipelineTest for the HelloWorld QScript. 
Moved duplicated error handling from JobRunners into the FunctionEdge.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5068 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 04:11:49 +00:00
hanna 9db02059ac Fix for Ryan's issue: reads ending with indel distort the location of the
pileup, resulting in two map() calls for the same locus (and no map call for
the locus immediately following).
Fixed bug and added comprehensive unit tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5067 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 19:49:39 +00:00
fromer 61fe409211 Basic walker to count the number of (phased) hets in each exome target
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5064 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 17:53:14 +00:00
depristo c50f39a147 V3 of the distributed GATK. High-efficiency implementation. Support for status tracking for debugging and display. Still not safe for production use due to NFS filelock problem. V4 will use alternative file locking mechanism
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5063 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 16:45:07 +00:00
delangel fd864e8e3a Minimal necessary (but most likely not sufficient) changes to run VQSR on indel data: don't fill Ti/Tv fields if non-SNP, request VC only at start of position, check if isSNP() before doing snp-specific operations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5062 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-24 02:36:36 +00:00
depristo a51061fd96 Improved distributed processing analytics. Still not 100% ready for prime-time. More improvements incoming. Iterator claim now supports requests to obtain in a single atomic claim (one lock) multiple sequential shards, which radically reduces overhead. However, deadlocking is still possible...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5061 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 16:17:25 +00:00
ebanks 2d4bcb60a1 Don't print out alt alleles for ref calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5060 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 06:33:31 +00:00
ebanks 2ba35dc7ba Bad chain files are user errors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5059 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 06:04:36 +00:00
ebanks 2bbcc9275a Committing the fragment-based calling code. Results look great in all datasets (will show this at 1000G this week with Ryan). Note that this is an intermediate commit. The code needs to be cleaned up and the fragmentation code needs to be moved up into LocusIteratorByState. This should all happen later this week, but I don't want Ryan to have to keep running from my own personal Sting directory. The current crappy implementation adds ~10% to the runtime, but that should all go away in the next iteration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5058 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 05:04:17 +00:00
ebanks bb6999b032 Better documentation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5057 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 03:36:09 +00:00