gatk-3.8

Commit Graph

Author	SHA1	Message	Date
rpoplin	dffa46b380	BAM files created by TableRecalibration now have the version number and list of covariates used appended to their header with a new 'PG' tag. Eventually the entire list of command line args will be put in there as well. Big thanks to Matt and Aaron. The integration test uses the --no_pg_tag so that the md5 doesn't change every time the version number changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2148 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 20:53:57 +00:00
aaron	8fbc0c8473	fix for bug GSA-234: fasta index files couldn't handle anything but letters, numbers, or spaces in the contig name git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2147 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 19:19:47 +00:00
andrewk	3fca23cd16	Added a stub treeReduce function for debugging multi-threaded execution. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2146 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 18:51:19 +00:00
rpoplin	277e6d6b32	Further optimizations of TableRecalibration. This completes my goal of having the only math done in the map function be addition, subtraction and rounding the quality score to an integer. Everything else has been moved to the initialize method and only done once. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2145 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 18:21:57 +00:00
andrewk	e4546f802c	Accumulates coverage across hybrid selection bait intervals to assess effect of bait adjacency. Requires input bait intervals that have an overhang beyond the actual bait interval to capture coverage data at these points. Outputs R parseable file that has all data in lists and then does some basic plotting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2144 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 18:12:34 +00:00
andrewk	e5106c9924	Hybrid selection performance statistics now include counts of the number of adjacent baits (0,1,2) using OverlapDetector and optionally include assayed bait quantities input via interval lists. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2143 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 18:07:23 +00:00
ebanks	87c1860398	I'm not sure I believe it, but JProfiler claims that calling FourBaseProbs.isVerbose() was taking 5% of my runtime... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2142 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 17:00:32 +00:00
ebanks	b3f561710f	Optimizations: 1. Only do calculations in UG for alternate allele with highest sum of quality scores (note that this also constitutes a bug fix for a precision problem we were having). 2. Avoid using Strings in DiploidGenotype when we can (it was taking 1.5% of my compute according to JProfiler) UG now runs in half the time for JOINT_ESTIMATE model. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2141 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 16:27:39 +00:00
rpoplin	a59e5b5e1a	Added dbSNP sanity check to CountCovariates. If the mismatch rate is too low at dbSNP sites it warns the user that the dbSNP file is suspicious. Added option in CountCovariates and TableRecalibration to ignore read group IDs and collapse them together. Also, if the read group is null the walkers no longer crash with NullPointerException but instead warn the user that the read group and platform are defaulting to some values. Default window size in MinimumNQSCovariate is 5 (two bases in either direction) based on rereading of Chris's analysis. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2140 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 16:16:44 +00:00
alecw	e5e6d515c3	Fix misunderstanding of GenomeLoc interval git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2138 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 15:12:49 +00:00
ebanks	cb6d6f2686	Very minor performance improvements git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2137 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 05:21:07 +00:00
ebanks	c90bea39a1	read.getReadString().charAt(offset) --> read.getReadBases()[offset] [As a courtesy I fixed all instances once I was updating GenotypeLikelihoods] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2136 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 04:25:19 +00:00
ebanks	ec321abd7b	Added ability to filter on the QUAL field git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2135 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 04:08:22 +00:00
ebanks	36d493e645	All standard annotations now inherit from StandardVariantAnnotation. Users can specify whether they want all annotations, just the standard annotations, or specific annotations. When calling in from another walker, the default is just the standard ones. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2134 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 03:55:12 +00:00
ebanks	ee5093d2c6	-Added VariantFiltration integration tests -Added integration test for GLFs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2133 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 02:36:27 +00:00
ebanks	be6a549e7b	Added the capability to allow expressions in an integration test command (i.e. -filter 'foo') by escaping them in the command. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2132 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 02:34:48 +00:00
hanna	4837fe919c	Convenience changes. If no -BWT option is specified, pull the BWT location from the reference. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2130 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 22:46:05 +00:00
rpoplin	9e4eadc37c	CountCovariates v2.0.2: Added a --process_nth_locus <int> argument to only use every Nth covered locus when creating the recalibration table. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2129 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 22:07:38 +00:00
ebanks	ed4cf3de57	Check that we're biallelic before calling isSNP() git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2127 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:20:48 +00:00
rpoplin	5744a1d968	The covariates don't care about SAMRecord's anymore - Cleaning up the import statements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2126 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:10:12 +00:00
chartl	23983b2fd8	New annotation: ResidualQuality Computes a metric for how much error is left that isn't explained by ref or snp bases. This is the sum of Q scores, weighted by the proportion of non-ref non-snp bases to non-snp bases. Reported in Log space. Update to the integration test so bamboo doesn't look as though someone murdered it with a spork git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2124 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:04:01 +00:00
ebanks	70059a0fc9	Refactored joint estimation model to allow subclasses to overload PofD calculation over all frequencies. Pooled model now takes only 20% of time that it used to. Added integration test for pooled model and updated other joint estimation tests to be more comprehensive now that they are faster. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2123 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:03:38 +00:00
rpoplin	7f947f6b60	Updated recalibrator integration tests to use all three platforms as well as a bam with multi-platform reads intermingled. CountCovariates v2.0.1: Once again uses a read filter to filter out zero mapping quality reads. Added --sorted_output option to output the table recalibration file in sorted order git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2122 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 19:51:36 +00:00
ebanks	14bf6ce83c	1. Newest version of the joint estimation model. Faster than previous version and now qscores can get to be > 39.8 for hets. 2. More sanity checks in annotations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2119 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 17:05:50 +00:00
hanna	ee2abd30c4	Count the best alignments and emit them to a file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2118 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 16:37:59 +00:00
rpoplin	1d46de6d34	The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 14:58:33 +00:00
ebanks	dfe7d69471	1. VCF: don't print slod if it's never set 2. UG: don't print slod if lods are infinite (todo: figure out a good guess instead) 3. UG: if probF is 0 for both of 2 alt alleles (because of precision), use log values to discriminate git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2116 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 02:55:43 +00:00
ebanks	753cb100a3	Add checks for weird situations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2115 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 02:14:25 +00:00
ebanks	04d6ac940c	Always print out VCF header - not just when there is genotype data present. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2114 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 01:44:10 +00:00
ebanks	bf935a6ab1	1. Fixed bug in PrimaryBaseSecondaryBaseSymmetry code (not checking for null before trying to access object's methods) which was causing Integration Tests to fail. 2. Retired allele frequency range from UG, which wasn't very useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2113 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 01:31:48 +00:00
rpoplin	b24240664f	Reduced the number of calls to new ArrayList() in TableRecalibration. This results in a speed up of perhaps up to 6 percent (timed trials are hard). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2112 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-22 17:24:31 +00:00
hanna	c9c4999354	BWA: odds and ends. Get rid of some spurious debug code that was accidentally checked in. Add a better way to write out unmapped reads (thanks Kiran!) Add a pre-built version of the shared library to the repository for early adoption. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2111 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-22 15:26:07 +00:00
depristo	9c206abb97	removing unnecessary printing git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2110 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-22 12:41:48 +00:00
chartl	59416ae06a	This is an annotation adapted from one that Mark Daly suggested some time ago. Right now it calculates: - For all reference bases, the proportion of their second best bases that support the SNP - the proportion of non-reference bases that support the SNP and reports the difference between the two. Initially I was taking depth into account as well, but that did not appear to work as nicely as I'd like (even at 20,000x depth, if 95% of the non-reference bases are C, and 98% of the reference second-best-bases are C, then we would want to be suspicious of it; but perhaps slightly less so than if the depth were only 20...) Anyway it's now available. I'm not sure how useful it will be, but I spawned the FHS annotation jobs again, so we'll see. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2109 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-22 00:47:49 +00:00
rpoplin	98f921fe24	The refactored CountCovariates now hashes the read object into a HashMap which holds all the properties the covariates pull out of the read over and over again such as read group string, bases string and its complement string, quality scores, etc. This results in a big speed up. CountCovariatesRefactored is now just slightly slower than CountCovariates (perhaps 1.07x according to my latest time trial). Thanks to Alec for suggesting IdentityHashMap. CycleCovariate now warns the user that it is defaulting to the Solexa definition of cycle when the platform string pulled out of the read is unrecognized instead of halting with an Exception. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2108 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-21 20:38:17 +00:00
depristo	27122f7f97	Performance improvements for pooled caller. Now possible to actually run on real data in a finite amount of time. Minor changes to GL interface (making strandIndex public) to support cached calculations in pooled caller. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2107 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-21 15:07:40 +00:00
ebanks	797bb83209	New VariantFiltration. Wiki docs are updated. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2105 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 19:50:26 +00:00
hanna	a78bc60c0f	Minor tweak to improve ease-of-use of iterator system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2104 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 18:24:19 +00:00
hanna	4fbb6d05d0	Refactoring. Push the revisions to the common aligner interface down into the aligner base classes. Hack the managed implementation to support the new interface. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2103 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 17:08:09 +00:00
ebanks	d84444200b	The Unified Genotyper now sorts the sample names in the vcf that it outputs. [There was no reason to enforce that every VCF being output from the GATK should have the samples sorted, since someone might want them ordered non-alphabetically] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2102 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 16:13:18 +00:00
hanna	38a030f2ba	Finishing off data transfer conduits for single alignment generator. Misc bug fixes elsewhere. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2101 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 15:21:59 +00:00
ebanks	2a5349d886	VariantAnnotator now adds dbsnp id if a dbsnp rod is supplied and it's not already set for a record git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2100 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 03:26:09 +00:00
ebanks	b434c1c240	Check for null entries before adding git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2099 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 03:12:20 +00:00
depristo	82fd824c4d	Continuing improvements to unified genotyper git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2098 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 01:39:29 +00:00
aaron	33dcfc858d	updates to the paper genotyper based on Mark's comments. There's still more work to do, including more testing. Also a 250% improvement in the getBases() and getQuals() of BasicPileup, which was nearly all of the runtime for the genotyper (using primitives instead of objects when possible). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2097 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 23:06:49 +00:00
rpoplin	22aaf8c5e0	Added the old recalibrator integration tests to the refactored recalibrator sitting in playground. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2096 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 22:43:28 +00:00
hanna	a95302fe98	Single alignment generator, another checkpoint. Does generate single alignments, but some of the data still needs to be plumbed through and it may leak memory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2095 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 21:20:03 +00:00
hanna	a972b2769f	Checkpoint. Add first phase of single alignment interface. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2094 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 19:03:43 +00:00
aaron	6ba1f3321d	Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord. Updated the integration tests that were failing due to different ordering of genotyping entries in VCF; I'll check in the VCF diff tool I wrote when I get a cycle or two. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 18:17:47 +00:00
chartl	b4babb82eb	adding an extra bit of data to come out of CTT (number of chips with actual data) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2091 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 17:46:10 +00:00
alecw	7623b39927	Add rodPicardDbSNP git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2088 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 17:27:46 +00:00
alecw	b2b4ff7eca	Cache SAMReadGroup rather than get it twice git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2087 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 17:27:18 +00:00
depristo	eeb3a3fffb	comments for Aaron git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2081 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 12:56:04 +00:00
aaron	7997455f38	first go of the genotyper for the GATK paper. More testing and review tomorrow to call it done. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2080 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 07:55:24 +00:00
ebanks	7b957d3e2e	Make the whining from Khalid's office stop already git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2079 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 03:04:48 +00:00
hanna	85bc9d3e91	(Hopefully) temporary hack: load contig information by contig name rather than contig id to avoid off-by-one errors. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2078 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 23:33:27 +00:00
rpoplin	0fbd81766b	CountCovariates now uses any rod of type VariationRod with the name dbsnp as the source of known variant sites to skip over. It also grabs the platform string out of the read group when deciding which algorithm to use to calculate machine cycle. In this way it can now handle multi-platform bams. I added a new covariate: PositionCovariate. This is simply the offset regardless of which platform the read came from. This will be useful for comparing between the two covariates. Finally, this message serves as a warning that I will be killing the old recalibrator tomorrow after I've updated and verified new integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2077 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 23:03:47 +00:00
ebanks	f667bed7fc	-Don't annotate allele balance or on-off genotype if there's no genotype data -If qscore is infinity (because of precision) make a best guess instead git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2076 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 22:01:32 +00:00
ebanks	087e01a439	minor changes for --noSLOD git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2074 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 18:48:01 +00:00
ebanks	a70cf2b763	A bunch of changes needed to make outputting pooled calls possible git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2073 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 18:42:57 +00:00
ebanks	0a35c8e0ba	1. The joint estimation model now constrains genotypes to be AA, AB, or BB only (i.e. to use a single alternate allele). Note that this doesn't work for the old models (point estimate or SSG) because calculations aren't divided by alternate allele. 2. Allele frequency spectrum is not emitted for single samples (since it doesn't make sense). 3. If in pooled mode, throw an exception if pool size isn't set appropriately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2072 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:43:15 +00:00
chartl	405c6bf2c1	VariantEval genotype concordance for pools! Integration test coming soon git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2071 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:24:54 +00:00
depristo	6fe1c337ff	Pileup cleanup; pooled caller v1 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:03:48 +00:00
rpoplin	f0a234ab29	TableRecalibration is now much smarter about hashing calculations, taking advantage of the sequential recalibration formulation. Instead of hashing RecalDatums it hashes the empirical quality score itself. This cuts the runtime by 20 percent. TableRecalibration also now skips over reads with zero mapping quality (outputs them to the new bam but doesn't touch their base quality scores). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2069 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 16:47:44 +00:00
chartl	be31d7f4cc	Added - a walker that outputs relevant information about false negatives given a bunch of hapmap individuals and corresponding integration tests for it. This will output for hapmap variant sites: chromosome position ref allele variant allele number of variant alleles of the individuals depth of coverage power to detect singletons at lod 3 number of variant bases seen whether or not variant was called git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2068 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 15:47:52 +00:00
chartl	b68d6e06b7	Rollback of the previous "fix" and implementation of the real fix. We totally do want to annotate the call if called by another walker. Totally boneheaded misinterpretation of what the code was doing -- Eric, please forgive me for being an idiot. Instead, change the StingException to what it really should be -- an IllegalStateException, which is not coincidentally already handled by the calling function. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2067 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 06:09:24 +00:00
chartl	95f1be94c0	Fix for the broken build: do not attempt to annotate if UnifiedGenotyper is called from another walker! Why this didn't break the build earlier I have no idea. Ultimately, there should be a better way of interfacing UG with another walker -- what if some other walker wants the annotations from UG? But since we're calling map directly -- and the annotations don't get returned directly from map -- this needs to be handled differently, while the map function should ultimately return the LOD score or quality under the GCM alone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2066 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 05:56:31 +00:00
ebanks	9fb50e9bd9	Further refactoring so that pooled calling will work. Okay, Mark, you should be all set. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2065 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 00:18:13 +00:00
chartl	539f6f15e5	Added -- Second base skew annotations and integration tests. Nothing need be given except -A SecondBaseSkew; the statistic it annotates calls with is a chi-square statistic given by the deviation of the observed proportion of reference second-best-bases from the expected 1/3. Future additions may be to ask that the deviation be instead from a given transition table. A big note for all users: All IllegalStateExceptions from the variation ROD (e.g. the RodGeliText) are dealt with SILENTLY. I understand this isn't optimal, but I'd rather simply not annotate a non-bi-allelic site than fail completely (there are quite a few such sites even on the regions over which the integration test has been written). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2064 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 00:11:13 +00:00
depristo	42a0bbaf46	Minor reformatting for pooled calling git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2063 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 22:06:11 +00:00
rpoplin	ec1a870905	Working with byte arrays is faster than working with Strings so the Covariates now take in byte arrays. None of the Covariates themselves used the reference base so I removed it. DinucCovariate now returns a Dinuc object which implements Comparable instead of returning a String because it was too slow. CountCovariates now uses a read filter to filter out unmapped reads and allows the user to specify -cov all which will use all of the available covariates, of which there are 7 now. If no covariates are specified it defaults to ReadGroup and QualityScore, the two required covariates. Initial code in place to leave SOLID bases alone if they have bad color space quality. TableRecalibration uses @Requires to tell the GATK to not give the reference bases since they weren't being used for anything. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2062 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 21:50:52 +00:00
ebanks	4d9c826766	Integration tests actually run on real data now. <tries to hide sheepish grin> git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2061 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 21:04:14 +00:00
ebanks	a048f5cdf1	-Refactored JointEstimation code so that pooled calling will work -Use phred-scale for fisher strand test -Use only 2N allele frequency estimation points git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2059 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 20:21:15 +00:00
chartl	43bd4c8e8f	Ignoring deletions in the primary pileup by default was causing the primary pileup to become shorter than the secondary pileup when building up the secondary base pileup string. This fix makes sure to include the primary Ds within the pileup so that not only are the pileups guaranteed to be the same size, the same offsets will truly correspond with the same read. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2058 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 17:20:13 +00:00
aaron	aece7fa4c7	a convenience method to join a map into a single string, which I need for some VCF work. Added some documentation to the join method as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2057 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 16:50:01 +00:00
asivache	21729d9311	Do not print debug message when debug mode is not requested!! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2056 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 20:28:41 +00:00
rpoplin	967215066d	The old CountCovariates now warns the user if they didn't supply a dbSNP rod file. Thanks Kiran for the use case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2055 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 19:16:46 +00:00
rpoplin	eb07c7f7f8	CountCovariates now warns the user if they didn't supply a dbSNP rod file. Thanks Kiran for the use case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2054 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 18:44:54 +00:00
ebanks	4558375575	Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated. VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper. UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it). This is a fairly all-encompassing check in. It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout. All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing. Stage 2 of the process will happen later this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 02:41:20 +00:00
hanna	ce5034dc5d	Finally reinstate the iterator-style interface. Get rid of some scaffolding code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2052 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 02:34:19 +00:00
kiran	103763fc84	An accessor for the VCF header git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2051 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-15 09:28:25 +00:00
kiran	97ed945797	Example code for a bug in the VCF implementation. See JIRA entry at http://jira.broadinstitute.org:8008/browse/GSA-225 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2050 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-15 09:27:12 +00:00
rpoplin	88fd762436	The -rf argument is now being used for read filter and is colliding with my walkers. Changed mine to -recalFile git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2048 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-14 19:37:46 +00:00
rpoplin	b05119987c	Clarified some of the comments in the individual covariates now that things have been moved around to speed up the code. In general most error checking and adjustments to the data are done per read instead of per base. This means that functionality was moved out of the covariate modules and into CovariateCounterWalker and TableRecalibrationWalker. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2047 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-14 18:44:05 +00:00
rpoplin	672472789e	Added some documentation to the helper classes. Fixed an error case in TableRecalibrationWalker. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2046 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-14 18:13:43 +00:00
hanna	15c14add4d	Repackage the aligner for better partitioning. The C aligner, for example, is now partitioned from the Java aligner, and both are partitioned from the more general-purpose BWT reader. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2045 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 22:55:27 +00:00
rpoplin	d1b525b428	Default window size for NQS covariate is 3 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2040 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 19:24:27 +00:00
rpoplin	394c839974	Implemented NQS covariate. Extended Cycle covariate to handle 454 and SOLID reads. Added a Primer Round covariate for SOLID reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2039 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 19:22:21 +00:00
ebanks	bf451873ff	1. Bug fix: check that AF=0 doesn't contain more probability than 1-fraction 2. Fix for Kiran: allow UG to call SNPs at deletion sites; we'll add an annotation to the VariantAnnotator for deletions at the locus (next week). 3. Added integration tests for joint estimation model git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2038 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 18:02:18 +00:00
asivache	1be36ca959	Bug fix: when cleanedReadIterator is initialized, it gets immediately set to the contig of the first cleaned read; when the first uncleaned read coming in is on a lower contig, this would trigger 'readNextContig' with that lower contig as an argument. As a result, the whole cleaned reads file would be read through to the end and no cleaned reads would ever be seen by the code afterwards. Now we do not call readNextContig if the (uncleaned) read's contig is lower than the current contig already loaded into cleanedReadIterator. The 'readNextContig' method now also throws an exception if the requested contig is less than the currently loaded one. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2037 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 15:41:26 +00:00
rpoplin	b1376e4216	structure refactored throughout for performance improvements git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2036 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 15:41:09 +00:00
depristo	cff31f2d06	comments for eric git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2035 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 14:19:31 +00:00
aaron	234bb71747	changed the toVariation() method to take a reference base, instead of using the reference base loaded from the underlying data source (if it was reference aware). Also changed some isVariant() methods which weren't using the passed in ref base. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2034 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 06:54:38 +00:00
ebanks	902cf84448	Bug fix: if the most likely allele frequency is 0, don't make a variant call (even if the Qscore for AF=1/n > threshold) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2033 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 04:10:32 +00:00
ebanks	555fb975de	1. Print out allele frequency range (from joint estimation model only). 2. Don't print verbose output from SLOD calculation (it's just a repeat of previous output). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2032 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 03:59:13 +00:00
mmelgar	72825c4848	A walker that generates a table of secondary base counts in a bam file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2031 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 02:11:23 +00:00
hanna	8145ed4672	Take 2, updating picard with bug fix for bam files containing no reads. Just stomped on the existing md5s because that's what Eric told me to do. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2029 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 22:52:08 +00:00
ebanks	61b5fb82ce	2 major changes: 1. Add dbsnp RS ID to VCF output from genotyper; to do this I needed to fix the dbsnp rod which did not correctly return this value. 2. Remove AlleleBalanceBacked and instead generalize the arbitrary info fields backing VCFs (and potentially others) in preparation for refactoring VariantFiltration next week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2028 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 22:51:49 +00:00
mmelgar	3742a05760	Now can read E2 or SQ tag. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2027 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 15:18:21 +00:00
aaron	c3c001e02e	cleanup of the traversal output code git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2026 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 06:18:10 +00:00
ebanks	0922400ca9	Don't try to calculate ratios when DoC is zero (which happens when calls are made by an LD-aware genotyper) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2025 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 02:51:44 +00:00
ebanks	697d7e02c8	Remove the lazy initialize functionality. When no calls are made by the genotyper, we still want a vcf file to be output with valid header. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2024 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 02:14:50 +00:00
hanna	2ea85fb62b	Fix some problematic command-line argument naming and descriptions. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2023 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 02:12:26 +00:00
hanna	0c2a957ae0	Better configuration support. Now supports everything that people have expressed interest in except edit distance. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2021 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-11 20:54:49 +00:00
depristo	6c9f86bb4d	Removed unnecessary output and added debugging print() routine git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2020 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-11 18:37:36 +00:00
ebanks	578dcc54a4	Don't create a record if ref=N git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2018 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-11 04:32:17 +00:00
hanna	8406325247	New Picard is breaking one of the integration tests. Revert until we find out whether the cause is legit. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2017 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-11 03:59:32 +00:00
hanna	499e7d1d75	Push forward some more delicate merging routines. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2016 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-11 03:07:34 +00:00
hanna	bae4d3f7ea	Updated Picard with fix for Doug Voet. Thanks Alec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2015 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-11 02:01:08 +00:00
hanna	2e4782f202	Command-line arguments for SamReadFilters. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2014 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 23:36:17 +00:00
rpoplin	a13cbe1df0	The refactored recalibrator now passes the integration tests as well as my own validation tests. I'm ready to have other people start jamming on the files. I'll make an updated wiki page soon. The refactored recalibrator is currently a bit slower than the old one but there were a lot of great, easy ideas today for how to improve it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2013 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 22:20:06 +00:00
hanna	2cf9670d1e	Allow users to directly specify filters from the command-line, applicable to any walker. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2012 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 18:40:16 +00:00
ebanks	6a37090529	Output changes for VCF and UG: 1. Don't cap q-scores at 99 2. Scale SLOD to allow more resolution in the output 3. UG outputs weighted allele balance (AB) and on-off genotype (OO) info fields for het genotype calls (works for joint estimation model and SSG) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2011 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 16:31:31 +00:00
rpoplin	1e7ddd2d9f	Added a validateOldRecalibrator option to CovariateCounterWalker which reorders the output to match the old recalibrator exactly. This facilitates direct comparison of output. Changed the -cov argument slightly to require the user to specify both ReadGroupCovariate and QualityScoreCovariate to make it more clear to the user which covariates are being used. Some speed up improvements throughout. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2010 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 15:55:56 +00:00
depristo	7e30fe230a	oops, missing file git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2009 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 13:25:18 +00:00
depristo	d316cbad4c	VariantFiltration now accepts a VCF rod in addition to an input geli. It will then annotate this VCF file with filtering information in the INFO field too. --OnlyAnnotate will not write in filtering output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2008 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 13:24:58 +00:00
aaron	f9819d5f13	a little clean-up git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2007 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 06:18:34 +00:00
aaron	2ed423ed56	print the current location in read walkers (in addition to the number of reads processed), along with some refactoring to support the change. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2006 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 05:57:01 +00:00
ebanks	c9c3cf477a	Based on feedback from Kiran, we now uniquify sample names as sample.rodName (instead of sample.1, sample.2, ...) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2005 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 02:41:37 +00:00
ebanks	2fa2ae43ec	Enough people have found this useful, so... Moving Callset Concordance tool to core and adding integration test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2003 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 20:59:18 +00:00
ebanks	3793519bd4	-Added convenience method to VCF record to tell if it's a no call and have rodVCF use it before querying for info fields -Don't restrict info fields to 2-letter keys [about to move these to core] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2002 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 20:52:51 +00:00
rpoplin	740a5484c4	Added some documentation to the code, most especially to CovariateCounterWalker but various comments added throughout. Also changed the HashMap data structure to accept an estimated initial capacity. This had a very modest improvement to the speed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2001 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 20:13:56 +00:00
ebanks	74751a8ed3	-Some minor fixes to get accurate vcf record merging done -Improvement to snp genotype concordance test And with that, it looks like I get revision #2000. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2000 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 06:40:55 +00:00
ebanks	ab705565cf	Completely refactored the Callset Concordance code. Now, it takes in VCF rods and emits a single VCF file which has merged calls from all inputs and is annotated (in the INFO fields) with the appropriate concordance test(s). Still needs a bit of polish... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1999 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 05:03:13 +00:00
ebanks	bc6f24e88f	Added VCFUtils which contains some useful VCF-related functions (e.g. ability to merge VCF records). Also, various minor improvements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1998 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:53:32 +00:00
ebanks	cff645e98b	convenience method to deal with genotypes that are unsorted (e.g. CA vs. AC) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1997 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:45:49 +00:00
kiran	7fde6c0bf4	One more output tweak. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1996 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:42:55 +00:00
kiran	00a7113d7a	Tweaks to formatting of output table. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1995 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:33:36 +00:00
ebanks	7ce0df76f8	Added accessors to the rod data sources so that walkers can access the name/file/type triplets for input rods. This is necessary if e.g. you want to create a vcf writer based on all of the samples being input. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1994 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:25:39 +00:00
ebanks	d07f3bb6f6	Added methods to get strand bias and to test if record has allele freq or bias fields set. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1993 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:20:35 +00:00
kiran	3313b0ddb4	Fixed a minor bug where the lodThreshold wasn't being printed in the header. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1992 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-08 16:51:36 +00:00
kiran	95d381efe2	Optionally computes the error rate using the best base and a random base. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1991 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-08 16:47:34 +00:00
kiran	567f5758d2	Optionally lists read depths by read group. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1990 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-08 16:39:19 +00:00
kiran	a679bdde18	FindContaminatingReadGroupsWalker lists read groups in a single-sample BAM file that appear to be contaminants by searching for evidence of systematic underperformance at likely homozygous-variant sites. Procedure: 1. Sites that are likely homozygous-variant but are called as heterozygous are identified. 2. For each site and read group, we compute the proportion of bases in the pileup supporting an alternate allele. 3. A one-sample, left-tailed t-test is performed with the null hypothesis being that the alternate allele distribution has a mean of 0.95 and the alternate hypothesis being that the true mean is statistically significantly less than expected (pValue < 1e-9). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1989 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-08 16:36:39 +00:00
kiran	2225d8176e	A convenience class for maintaining a dynamically growing table of values with access to the elements by named row and column identifiers. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1988 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-08 16:34:35 +00:00
hanna	21c5f543fa	Fix sharding bug -- loci to which >100,000 (= 1 shard) reads are assigned an alignment start will confuse the sharding system and cause it to return duplicate reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1987 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-08 14:27:26 +00:00
rpoplin	84ba604611	Sequential quality score calculation is now in place in the refactored recalibrator and matches the quality scores calculated by the old recalibrator exactly, at least on the small sets of data used so far. Validation, documentation, and optimization work is ongoing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1985 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-07 15:55:16 +00:00
depristo	bf1bc94060	Fixes for PooledConcordance bugs and lack of safety checking git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1984 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-07 01:54:10 +00:00
rpoplin	66d4a995e6	Initial check in of refactored Recalibrator. The new walkers are called CountCovariatesRefactored and TableRecalibrationRefactored. More work is needed to finish up the sequential calculation and to document the code sufficiently. These files are not ready to be used by other people quite yet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1982 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-06 22:33:55 +00:00
ebanks	6fdfc97db6	Added optional field DP to VCF output for Mark. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1981 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-06 20:03:22 +00:00
ebanks	0a55fa5bb1	Completely refactored the Genotype Concordance module(s). Now PooledConcordance and GenotypeConcordance inherit from the same super class (and can therefore share data structures and functionality). Also, they now use ConcordanceTruthTable to keep track of necessary info. GenotypeConcordance passes integration tests. PooledConcordance needs to be finished by Chris. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1979 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-06 16:27:16 +00:00
ebanks	d549347f25	Refactored GenotypeLikelihoods to use an underlying 4-base model. It needs to be modified a bit and then hooked up to a pooled model, but that is now possible. At this point, there is no difference to the Unified Genotyper. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1978 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-05 21:59:25 +00:00
jmaguire	4d3871c655	don't flush anymore. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1977 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-05 19:11:51 +00:00
aaron	aacd72854f	a fix for a bug Andrey discovered: in read-based interval traversals we're duplicating reads in rare cases. The problem was that to accommodate a bug in SAM JDK indexing, we were forced to add one to the stop of our QueryOverlapping() calls to ensure we always got all of the overlapping reads. Added a PlusOneFixIterator that wraps other iterators, and eliminates reads that start outside of our intended interval (interval stop - 1). Updated and checked BamToFastqIntegrationTest MD5 sums. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1976 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-05 05:26:33 +00:00
hanna	43c3ee61d5	Fix minor mapping quality bug. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1973 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-04 14:33:23 +00:00
ebanks	a545859c62	Joint Estimation model now emits a reasonable slod git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1969 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 21:12:42 +00:00
ebanks	11d950abe0	No longer allow the lod_threshold argument - use confidence instead. Have UG output qscores in all cases. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1968 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 16:18:51 +00:00
asivache	2fb45dbd73	Make window size a command line argument git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1967 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 16:13:35 +00:00
asivache	55f61b1f88	Bug fix in adjustment of the shift position. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1966 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 16:08:11 +00:00
depristo	5d5dc989e7	improvements to VCF and variant eval support of VCF -- now listens to the filter field git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1963 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 12:09:30 +00:00
hanna	c63af32fc7	The BWA/C bindings were triggering the local aligner to repeatedly reload the ref genome. Make sure the reference genome is cached. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1961 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 00:01:55 +00:00
ebanks	3a33401822	2nd stage of the genotyper output refactoring is complete. Now, all output is generalized and all of the intelligence lies where it is supposed to. Next stage is syncing up old and new models and making sure we're outputting exactly what we should. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1960 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-02 22:43:08 +00:00
aaron	ba67c7f02b	added a warning for those using bed files; we properly convert bed to the internal representation but the user needs to be aware that any output will be one-based closed intervals git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1959 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-02 21:09:18 +00:00
aaron	b71b66bd88	the underlying parameter is a float so we need to use Float.valueOf() instead; Noticed by external user Hou Huabin git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1958 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-02 20:22:25 +00:00
hanna	5a510e6d98	New PackageUtils interferes with the packaging utility. Revert until Aaron and I can get together to make this work. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1957 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-02 19:14:14 +00:00
aaron	de6ae51f7e	Scala walkers can now be built and run like any other walker in the GATK. Added the getUrlsForClasspath to PackageUtils; the Reflections package isn't getting the manifest files from jars in the classpath, and so we weren't seeing any walkers outside of the GenomeAnalysisTK.jar. A couple of notes: -Commented out BaseTransitionTableCalculator.scala because it won't build; Chris could you fix this one (or kill it if it's not needed). -Removed the PrintReadsScala walker; moved the code over to a ScalaCountLoci walker (which is what the code was really doing). -Added configuration items to the ivy xml file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1956 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-02 06:02:41 +00:00
hanna	1896f334d9	Fixed collection of bugs in reads aligning to multiple locations. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1955 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-02 04:02:09 +00:00
ebanks	af6d0003f8	-Generalized the GenotypeConcordance module to deal with any number of individuals (although it will default to its old behavior if the -samples argument is left out). -Make rods return the appropriate type of Genotype calls from getGenotype(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1954 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-01 05:35:47 +00:00
hanna	b95165e39c	Make alignment (temporarily) part of main GenomeAnalysisTK.jar. Add some extra logging errors on failure. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1953 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-01 00:33:18 +00:00
asivache	4b0796ba58	After fixing a few glitches and bugs, this version finally works as intended git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1952 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-31 04:59:58 +00:00
depristo	7d0ac7c6f2	Fix for long-term VariantEval bug plus new integration test to catch it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1951 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-31 00:00:33 +00:00
asivache	ea8d5c7077	Some internal refactoring. Now "safely" ignores duplicate records (NOT duplicate reads but rather malformed bam files!) resulting from the bug/feature in CleanedReadInjector. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1949 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 17:50:51 +00:00
hanna	a3da475c88	Documentation and cleanup. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1946 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 15:40:28 +00:00
hanna	2d15891719	Created walkers for alignment, validation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1945 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 15:04:07 +00:00
ebanks	51fffc7f69	Comments for Ryan (which also apply to ReadQualityScoreWalker). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1944 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 14:44:04 +00:00
ebanks	ccd7440730	We can actually make this a bit simpler (and faster) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1943 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 04:21:03 +00:00
ebanks	1b6333e4ab	Enough people have asked for this that it just needed to get written. One can now split up any number of sets into an N-way Venn (although it doesn't check for discordance in the calls, so you'll still want to use SimpleVenn for 2-way comparisons). Wiki docs are updated. To do: update to use Ryan's generic hash map when it's ready for public use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1942 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 04:08:45 +00:00
ebanks	4bdb5b03bd	tell UnifiedGenotyper to return calls at all bases git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1941 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 03:10:44 +00:00
ebanks	4ee1d6f733	-Have the calculation models determine whether a call passes the lod/confidence thresholds (as opposed to returning everything and letting the UG decide); this way, walkers which call map() will get only the good calls. -Do the right thing in all models for all-base-mode (for Kiran). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1940 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 02:35:51 +00:00
ebanks	64ac956885	Okay, I caved in: CallsetConcordance now gets possible concordance types by looking at classes that implement ConcordanceType instead of having them hard-coded in. Thanks to Kiran this was pretty easy... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1939 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-30 00:32:26 +00:00
hanna	1f0d852a48	Fix bug where alignments with indels would be busted because bwa reverses the read bases to undo a previous read base reverse that doesn't occur in the libbwa codepath. Also fixed some memory management issues. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1938 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 21:33:13 +00:00
asivache	e3b4d4cbed	Genotyper reimplemented. Does the same thing, at least for now, but internal data structures redesign enables collecting various statistics for indel-containing/reference-matching reads. The statistics are not yet used by the caller itself to make a better judgement w.r.t. the validity of the calls it makes, but they are now printed into the output stream (--verbose). The statistics (for both normal and tumor) include: indel observation count/total coverage, av. number of mismatches per indel-containing and per ref-matching read, av. mapping quality, av. mismatch rate and av. base quality within an NQS window around the indel, numbers of indel and ref observations per strand. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1936 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 19:09:16 +00:00
hanna	f04b80d7db	Fixed epic memory leak. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1934 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 16:32:43 +00:00
ebanks	1c4ca9d383	-Mark just reminded me: actually force the ref/loc to be immutable -VCF writer should be blind to the score/confidence/lod value - just print the thing out as is git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1932 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 13:41:53 +00:00
ebanks	5cdbdd9e5b	now that the design is stable, pull the setReference and setLocation methods back out of Genotype and stick them into constructors of implementing classes git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1931 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 13:27:37 +00:00
ebanks	3091443dc7	Sweeping changes to the genotype output system, as per several discussions with Matt & Aaron. Some things still need to be changed, but it will entail some more design decisions first (which means I get to bug M&A again tomorrow!). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1930 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 03:46:41 +00:00
depristo	86573177d1	Reverting rod walkers to use underlying refwalker implementation while we work on ROD2 and reenable the system. Added some serious sparse file parsing to variant eval tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1929 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 01:04:37 +00:00
hanna	c9a3707cfd	Initial version of BWA/C bindings. Still lots of squirrels roaming the code. - Some cigar strings aren't right. - Memory leaks. - BWA codebase changes aren't committed to BWA tree. - Aligner interface butchered to support BWA/C-style alignments. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1928 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-28 21:37:49 +00:00
chartl	c4359bc340	Whoops. Forgot the implements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1927 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-28 19:59:57 +00:00
aaron	5a3bd50537	adding error log reporting to the GATK, and a stream based output method for the argument collection git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1926 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-28 19:56:05 +00:00
chartl	863d3023d5	IndelCounterWalker -- a new little walker that counts indels over a region (want to see what kind of havoc BWA may be resulting in). Don't know when BasicPileup.indelPileup() was written, but kudos to whoever wrote it. BTTJ - remove 'N's from previous base analysis -- even if both read and ref are 'N' (which does happen, occasionally) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1925 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-28 19:50:50 +00:00
aaron	04e9a494e9	removed the GenotypesBacked interface, which is currently unused. Also cleaned up some documentation lines git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1924 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-28 18:08:14 +00:00
rpoplin	06ff81efe5	Added NeighborhoodQualityWalker.java and ReadQualityScoreWalker.java which are used to calculate a read quality score based on attributes of the read and the reads in the neighborhood. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1922 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-28 13:24:11 +00:00
depristo	68fa6da788	Initial graph-based reference implementation and alignment assessor. Not suitable for public use git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1921 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 21:54:47 +00:00
depristo	31d143a841	now only needs READS git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1920 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 21:54:14 +00:00
depristo	ef2ea79994	code cleanup and containsStartPosition function git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1919 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 21:53:40 +00:00
depristo	186a8dd698	Trivial protection for null value git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1918 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 21:52:52 +00:00
depristo	be333da9c0	charSeq2byteSeq -- convert a char[] to a byte[] for convenience git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1917 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 21:52:23 +00:00
chartl	4192b093b8	More robust error handling with parallelization + usePreviousBase. Added forceReadBasesToMatchRef to use in conjunction with nPreviousReadBases as a less stringent approximation of usePreviousBases (requiring previous pileups only had mismatches, and that read mapping quality be high was throwing everything away) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1916 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 17:20:44 +00:00
chartl	31d5df2859	Previous base now checks that the read matches the reference in the previous base window. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1915 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 15:58:20 +00:00
depristo	726378be8b	Almost ready to stop doing eager decoding; waiting on Eric git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1914 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 15:28:05 +00:00
ebanks	e96b1791ab	Need to check for biallelic snp or exception gets thrown. Also, update to new tracker calls. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1913 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-27 02:43:43 +00:00
aaron	3fb3773098	a fix for traverse duplicates bug: GSA-202. Also removed some debugging output from FastaAltRef walker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1912 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-26 20:18:55 +00:00
hanna	a1e8a532ad	Support for initialize() and onTraversalDone() output from parallelized walkers. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1911 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-26 20:18:31 +00:00
chartl	62c1001790	BTTJ is now correct. What a terrible waste of time, turns out I'd just reversed the header. Because of this the MD5 had to be updated in the tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1910 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-26 19:24:18 +00:00
sjia	24c7f694e6	Handles allele frequencies for any specified population, changed user input for mismatch filter options git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1909 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-25 22:51:56 +00:00
chartl	db9419df49	@ Hack to allow output from onTraversalDone() git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1908 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-25 15:19:04 +00:00
ebanks	75ad6bbef7	Check that map isn't being called passing in null arguments. (This seems wrong; see JIRA entry GSA-211) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1907 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-25 02:30:36 +00:00
depristo	b4f55df600	Bugfix for Jason F git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1906 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-24 22:09:27 +00:00
hanna	65b98470f3	Temporary fix: have RodLocusView manage and close its RODs. Really the relationship between these two classes needs to be rethought; see JIRA GSA-207. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1904 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-23 16:00:12 +00:00
aaron	ad1fc511b1	intermediate commit for some changes in the Variation system, so Eric can go ahead with his changes. Everything is pretty set, but the Variation interface could use a convenience method that joins all the alternate alleles. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1903 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-23 06:31:15 +00:00
ebanks	6c338eccb8	Joint Estimation model now emits calls in all formats. The whole GenotypeCall framework needs to be changed, but this will work for the time being. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1902 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-23 03:07:28 +00:00
chartl	a6dc8cd44e	BTTC is now Tree Reducible allowing for parallelization. Integration test comment changed to reflect actual date of last md5 update. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1901 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-22 23:19:29 +00:00
hanna	2e552eb5a1	Validates intervals against sequence dictionary header bounds. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1900 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-22 19:31:15 +00:00
ebanks	54c61c663c	-Cleanup of the Joint Estimation code -Don't print verbose/debugging output to logger, but instead specify a file in the argument collection (and then we only need to print conditionally) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1899 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-22 15:25:29 +00:00
asivache	2cab4c68d4	Added method: isCodingExon(). Returns true if position is simultaneously within an exon AND within coding interval of any single transcript from the list. The old method of detecting coding positions as isExon() && isCoding() is buggy, as the position could be in the UTR part of one transcript (isExon() is true), and within coding region bounds (but not in the exon) of another transcript (isCoding() is true). As a result UTR positions would be erroneously annotated as coding. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1898 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-22 14:55:07 +00:00
chartl	af761fb9bd	Base transition table now forces epsilon/3 (three-state) model for the unified genotyper. Verified to be identical with changing the default model to being epsilon/3. This of course changes the observed counts, so the integration test has been updated. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1897 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-21 21:18:26 +00:00
ebanks	55fa1cfa06	-Renamed new calculation model and worked out some significant changes with Mark -Allow walkers calling the UG to pass in their own argument collections git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1896 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-21 20:49:36 +00:00
chartl	8e3f72ced9	BTTJ - Code refactoring (major) - passes integration test VariantEvalWalker - whoops, wrote PooledGenotypeAnalysis rather than PooledAnalysis, now passes tests again - PooledFrequencyAnalysis - don't bother initializing matrices if this isn't a pool git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1895 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-21 19:04:51 +00:00
depristo	15a1849758	notes for chartl git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1894 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-21 18:31:31 +00:00
chartl	77863d4940	@PowerBelowFrequency + Changes to doc @ BasicPoolVariantAnalysis + use char rather than ReferenceContext + calculate # alleles @ PooledFrequencyAnalysis + breakdown of call metrics by estimated number of alleles in pool @ VariantEvalWalker + add PooledFrequencyAnalysis to analysis set @ PooledGenotypeConcordance + correctly calculate maximal allele frequency for output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1893 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-21 15:17:11 +00:00
chartl	967128035e	Make command line args default to false. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1892 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-21 13:59:35 +00:00
ebanks	9b9744109c	Mark's new unified calculation model is now officially implemented. Because it doesn't actually use EM, it's no longer a subclass of the EM model. Note that you can't use it just yet because it doesn't actually emit calls (just prints to logger). I need to deal with general UG output tomorrow. Hold off until then, Mark, and then you can go wild. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1891 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-21 02:39:23 +00:00
depristo	caa3187af8	Enabling correct high-performance ROD walker and moved VariantEval over to it. Performance improvements in variantEval in general. See wiki for full description git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1890 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 23:31:13 +00:00
chartl	4a8a6468be	Use read group as a condition for confusion tables. With an integration test. Changed BaseTransitionTable to comparable objects for consistent ordering of output ( e.g. so the integration test doesn't yell so much ) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1889 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 19:39:32 +00:00
chartl	b83df5616a	Change for lower-case references (always compare upper case bases) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1888 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 17:36:31 +00:00
chartl	3b1fabeff0	Major code refactoring: @ Pooled utils & power - Removed two of the power walkers leaving only PowerBelowFrequency, added some additional flags on PowerBelowFrequency to give it some of the behavior that PowerAndCoverage had - Removed a number of PoolUtils variables and methods that were used in those walkers or simply not used - Removed AnalyzePowerWalker (un-necessary) - Changed the location of Quad/Squad/ReadOffsetQuad into poolseq @NQS - Deleted all walkers but the minimum NQS walker, refactored not to use LocalMapType @ BaseTransitionTable - Added a slew of new integration tests for different flaggable and integral parameters - (Scala) just a System.out that was added and commented out (no actual code change) - (Java) changed a < to <= and a boolean formula Chris git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1887 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 14:58:04 +00:00
aaron	4be6bb8e92	added a check to ensure the eval track variation is bi-allelic. Also changed some string constants over to enums. For some reason my check-ins from home wouldn't work last night, so this is the actual changes for 1884. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1886 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 14:15:33 +00:00
depristo	449a6ba75a	Deleting lots of code as part of my cleanup. More classes tagged for removal. Many more walkers have their days numbered. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1885 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 12:23:36 +00:00
aaron	d749a5eb5f	added a check to ensure the eval track variation is bi-allelic. Also changed some string constants over to enums git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1884 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 04:56:51 +00:00
ebanks	b8ab77c91c	Don't filter out reads without proper read groups. Instead, allow the user (or another walker calling UG) to specify an assumed sample to use (but then we assume single-sample mode). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1883 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 01:30:53 +00:00
depristo	a8a2c1a2a1	Replaced SSG with UG in packaging utils. Minor performance and formatting improvements for ClipReads git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1882 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 01:19:58 +00:00
ebanks	c29924e7cf	Reverting previous change. Aaron, it's all yours... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1881 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 00:55:24 +00:00
aaron	d21b582b18	memory leak, where the Resource Pool was releasing based on the value and not the key, resulting in the resourceAssignments map growing with each additional shard git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1880 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 00:39:42 +00:00
ebanks	761a730758	assertBiAllelic -> assertMultiAllelic. Chris, if this breaks an integration test, you get it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1879 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-20 00:09:46 +00:00
depristo	2a26bb42dd	Softclipping support in clip reads walker. Minor improvement to WalkerTest -- now can specify file extensions for tmp files. Matt -- I couldn't easily create non-presorted SAM file. The softclipper has an impact on this. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1878 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-19 21:54:53 +00:00
chartl	055a99fb05	Change in ordering for a disjunction. Walker will no longer try to calculate number of simple mismatches in the pileup if the pileup includes 'N's. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1877 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-19 18:24:14 +00:00
aaron	cfa86d52c2	ensure that in the indel case we don't allow identification as both an insertion and deletion at the same location in the VCF ROD git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1875 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-19 18:21:00 +00:00
chartl	3d50c72d74	Forgot a dumb little System.out.println. You will be flooded with "This read will not be used." statements until, overwhelmed, you give in to my demands. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1874 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-19 16:13:48 +00:00
chartl	225ef52973	Now produces same output as the Scala walker for unconditioned tables (no 2bb, no previous base, etc.) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1873 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-19 16:10:44 +00:00
ebanks	51f9ec0a5c	subtract largest posterior value from all values; this hopefully solves any precision issues git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1870 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-18 05:20:15 +00:00
ebanks	b9e8867287	-push allele frequency and genotype likelihood variable definitions down into the subclasses so that they can use different data structures -use slightly more stringent stability metric -better integration test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1869 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-18 04:22:17 +00:00
depristo	d6385e0d88	simpleComplement function() in BaseUtils. Generic framework for clipping reads along with tests. Support for Q score based clipping, sequence-specific clipping (not1), and clipping of ranges of bases (cycles 1-5, 10-15 for example). Can write out clipped bases as Ns, quality scores as 0s, or in the future will support softclipping the bases themselves. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1868 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 22:29:35 +00:00
chartl	ad777a9c14	@BasicPileup - made the counts public so they can be used @PoolUtils - split reads by indel/simple base @BaseTransitionTable - complete refactoring, nicer now @UnifiedArgumentCollection - added PoolSize as an argument @UnifiedGenotyper - checks to ensure pooled sequencing uses the appropriate model @GenotypeCalculationModel - instantiates with the new PoolSize argument git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1867 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 21:56:56 +00:00
andrewk	bdb34fcf38	Updated integration tests for VariantEval. Hooray for IT! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1866 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 20:00:29 +00:00
hanna	85a4fbc256	Bumping version of Picard for firehose compatibility. Integration tests were validated against svn rev 1861, before the wonder twins committed their changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1864 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 19:38:56 +00:00
aaron	8aacc43203	VCF output now emits no calls as ./. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1863 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 18:51:31 +00:00
andrewk	d1a4cd2f73	Added ValidationData analysis type to VariantEvalWalker; this eval takes a GFF file with validated truth data positions (bound to "validation") and calculates the accuracy of the genotype calls bound to "eval". git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1862 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 15:39:08 +00:00
ebanks	418e007ca6	A cleaner interface: now everyone can use UG's initialize method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1860 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 14:09:16 +00:00
aaron	96972c3a5c	a fix for a bug Eric found: if your first call contains fewer samples than calls at other loci, your VCFHeader got set up incorrectly. Also moved a bunch of Lists over to Sets for consistency. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1859 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 04:57:50 +00:00
aaron	a69ea9b57c	Cleaning up the VCF code, adding lots of tests for a variety of edge cases. Two issues are still outstanding: updating the no call string with the standard 1000g decided on today, and fixing Eric's issue where not all the VCF sample names are present initially. also: their, I hope your happy Eric, from now on I'll try not to flout my awesomest grammer in the future accept when I need to illicit a strong response :-) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1858 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 04:11:34 +00:00
ebanks	b82c3b6040	Better error output (and fixed spelling mistakes) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1857 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 01:01:45 +00:00
ebanks	993c567bd8	I had to remove some of my more aggressive optimizations, as they were causing us to get slightly different results from MSG. Results in only small cost to running time. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1856 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-16 00:59:32 +00:00
asivache	7d7ff09f54	throw an exception if read has no associated read group git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1855 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 18:11:32 +00:00
chartl	b9544d3f89	Output formatting change (very slight) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1854 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 16:47:29 +00:00
hanna	839c5d66bc	Read uints directly into longs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1853 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 16:15:11 +00:00
hanna	ce38fa7c81	Breaking the signed int glass ceiling; stage 1: convert critical ints to longs. Code cleanup and documentation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1852 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 15:28:56 +00:00
kcibul	79993be46c	changed blank gene name to UNKNOWN git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1851 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 13:47:00 +00:00
depristo	0c2016c19a	Improved error messages -- now easier to read, points to the GATK Error Messages wiki, and avoids double printing of stack traces git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1850 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 12:07:44 +00:00
aaron	a9094c835c	clean-up and fixes to the VCF input git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1849 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 04:53:59 +00:00
ebanks	a32470cea1	Deal with the fact that walkers can call UG's init/map functions directly. We need to filter contexts in that case since the calling walkers don't get UG's traversal-level filters. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1848 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 02:31:45 +00:00
hanna	8dca236958	Base-packed reader cleanup. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1847 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-15 01:26:23 +00:00
hanna	316b30ee56	On the road to human: make sure the suffix array will fit in a Java array. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1846 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 21:45:35 +00:00
ebanks	e740e7a7ce	Because walkers call UG's map function, we need to move the actual writing out to UG's reduce function. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1845 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 20:49:26 +00:00
kcibul	825e6c7a4d	added calculation for bases over 2x,10x,20x,30x plus gene name git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1844 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 20:32:26 +00:00
aaron	727b69fce0	catch null output destinations earlier git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1843 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 20:07:15 +00:00
chartl	1f66738c8e	Fix a hashing function bug. Ignore reads with non-reference bases in the pileup. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1842 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 19:41:26 +00:00
hanna	72c34f11dd	Bug fixing for BWA output formats. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1841 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 19:32:22 +00:00
aaron	60183229ab	the oldest java mistake in the book... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1840 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 19:32:13 +00:00
ebanks	52d2e0ca07	All walkers now use read.getReadGroup() git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1839 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 19:27:40 +00:00
chartl	0a09fa4d5c	Rename to distinguish this transition table calculator from the scala version. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1838 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 18:52:21 +00:00
chartl	1d055011bd	Getting rid of this so I can rename it without the world blowing up. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1837 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 18:45:11 +00:00
aaron	eb90e5c4d7	changes to VCF output, and updated MD5's in the integration tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1836 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 18:42:48 +00:00
ebanks	89771fef05	-Use read.getReadGroup() -Add another filter for read groups for Chris git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1835 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 18:08:32 +00:00
ebanks	311ab8da5a	A helper class to create the masks for the sequenom design maker. This project is now officially done. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1834 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 17:28:51 +00:00
hanna	3553fc9ec0	Preparing for human -- support bwa output files directly rather than relying on a custom fixed sa interval. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1833 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 17:17:46 +00:00
ebanks	0c95d6906f	Merge both versions of the Sequenom assay design maker: use Jared's base code and add in indels. (Jared, this still emits the same output for SNPs as your original version.) Remove all sequenom stuff from the FastaAlternateReferenceMaker so it can just concentrate on making alternate references... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1831 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 17:11:45 +00:00
ebanks	49af5269e5	Jared: feel free to change or revert, but until we move over to UG version... Only print out positions with at least one non-ref call git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1830 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 17:08:57 +00:00
chartl	f5a2e6dd50	Fix! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1829 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 16:15:20 +00:00
ebanks	f2886d88e0	We now emit genotype calls git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1828 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 02:49:56 +00:00
ebanks	1b214c0de5	Fixed logic: throw exception if contigs are NOT equal git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1827 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 02:48:44 +00:00
ebanks	aeca14d052	On our side of 5CC, we spell multi M-U-L-T-I. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1826 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 01:41:25 +00:00
ebanks	c9c8fd1fef	Added the discovery LOD score to the meta data git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1825 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-14 01:24:06 +00:00
hanna	a76fac4687	Cleanup existing speedups. Minor performance improvements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1823 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 21:51:18 +00:00
hanna	837ae1d33a	Optimization: from 22k reads/min - 30k reads/min. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1822 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 20:59:29 +00:00
ebanks	96b8499a31	Remodeled version of the UnifiedGenotyper. We currently get identical lods and slods as MultiSampleCaller (except slods for ref calls, as I discussed with Jared) and are a bit faster in my few test cases. Single-sample mode still emulates SSG. The remaining to do items: 1. more testing still needed 2. we currently only output lods/slods, but I need to emit actual calls 3. stubs are in place for Mark's proposed version of the EM calculation and now I need to add the actual code. More check-ins coming soon... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1821 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 20:27:01 +00:00
ebanks	b28446acac	Multi-sample calls now have associated meta-data (SLOD, allele freq), which will soon actually be used... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1820 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 20:08:43 +00:00
hanna	db642fd08b	Optimization: from 10k reads/sec - 22k reads/sec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1819 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 18:07:15 +00:00
aaron	77499e35ac	fixes for GSA-199: Need easier way to write binary outputs to standard output. GLF and VCF now have stream constructors, and can get dumped to standard out. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1818 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 15:50:20 +00:00
hanna	f37564e63a	Our BWA is now looking at roughly the same number of candidate alignments as BWA/C. Performance is now at 11k reads / min, still a long way from BWA/C. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1817 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 15:50:04 +00:00
chartl	8d0e057d83	I got bored today and decided to write the confusion matrix calculator. At present it is untested. I'm submitting it to subversion to make sure I have previous revision to revert back to. This is a calculator that will calculate: P[ True base is X | read base mismatches, secondary base is Y, previous K bases are Z1,Z2,...ZK ] where the number of previous reference bases to take into account is user-defined. The secondary base is optional as well. --usePreviousBases k tells the walker to use the k previous reference bases in the transition table --useSecondaryBase tells the walker to use the secondary base at a locus in the transition table these can be used together. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1816 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 02:55:29 +00:00
ebanks	be92a1e603	Don't try to close if the lazy initialize hasn't triggered git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1815 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-13 01:20:25 +00:00
chartl	ec83bc6ec5	This somehow didn't make it into subversion the last time. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1814 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-12 21:11:13 +00:00
chartl	ecbb11e017	Modified PowerBelowFrequency to ignore reads below a user-defined mapping quality. Request from Jason Flannick. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1813 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-12 20:59:24 +00:00
chartl	ec68ae3bc5	Added a filter that will split the read set by a threshold of mapping quality (Request from Jason Flannick) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1812 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-12 20:58:37 +00:00
chartl	0d73fe69e7	Recalibrator by NQS. Had this puppy running all afternoon. Thing had got through 100,000,000 reads before I decided to delete my sting tree. sigh, a little more delay. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1811 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-12 20:55:02 +00:00
chartl	ee0afba0af	Recalibration stuff... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1810 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-12 20:51:39 +00:00
ebanks	caf689821f	added method to get normalized posteriors git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1809 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-12 02:33:22 +00:00
ebanks	cf7a26759d	-use the getReadGroup() function that was added to picard for us -clean up some include lines git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1808 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-12 01:39:32 +00:00
hanna	d844d1c496	SAMFileWriters specified as command-line arguments were sometimes incorrectly altering the default short name. Make sure short name is not specified if shortName is not specified but fullName is. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1807 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 19:16:46 +00:00
hanna	da084357db	Fixed minor typo in output message. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1806 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 18:56:54 +00:00
aaron	62c484b57a	Fixes for GSA-201, where enumerated types in command line arguments had to be defined as all uppercase for the system to work. Also a little playground walker that changes the sort order flag of a BAM file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1805 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 18:11:32 +00:00
hanna	32d55eb2ff	Fix issue Eric was seeing with java.lang.Error in unmap0. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1804 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 17:46:56 +00:00
ebanks	9f3482ef11	VCF is both a multi- and single- sample format, so we shouldn't be throwing an exception when used for SS git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1803 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 17:43:26 +00:00
jmaguire	d9f5a314ac	avoid an out of memory error by not putting more than 5000 reads in the cache. on pilot1 at least those are crazy loci anyway. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1802 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 14:56:55 +00:00
hanna	f4b6afb42c	JVM issue id 5092131 (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5092131 ) was causing OOM issues with the new mmapping fasta file reader during large jobs. Temporarily reverting the reader until a workaround can be found. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1801 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 04:45:46 +00:00
chartl	6d7f4481e4	Changed traversal type slightly git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1800 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-09 04:11:48 +00:00
ebanks	a9f3d46fa8	Your time has come, SSG. Fare thee well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1799 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 20:27:56 +00:00
jmaguire	8fdb8922b8	now output in the exact format that works with sequenom software. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1798 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 20:06:27 +00:00
aaron	98e3a0bf1a	VCF can now be emitted from SSG. The basics are there (the genotype, read depth, our error estimate), but more fields need to be added for each record as necessary. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1797 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 19:50:04 +00:00
hanna	95f24d671d	Fixed 'visualization' of reads that didn't match bwa's alignments exactly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1796 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 19:45:30 +00:00
kiran	29ad6cd876	Made redundant by BCMMarkDupes git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1795 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 18:47:20 +00:00
kiran	94d82d1915	Matthew Bainbridge's duplicate removal utility for 454 data. This code should eventually be moved into a read walker. For now, it's being introduced into the repository as-is (well, with one minor change to make the handling of command-line arguments a little more straightforward). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1794 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 18:32:37 +00:00
ebanks	15bf014e0b	logger.info -> logger.debug (don't want to risk filling up my log on genome-wide calls) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1792 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 17:53:11 +00:00
chartl	f89a89ffe3	Use of AlleleFrequency as an input to PowerAndCoverage is deprecated by the new walker. Reverting to the standard "power at 1 allele" calculation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1788 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 16:07:45 +00:00
chartl	ae05f5c7ad	Fixin the header. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1787 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 15:49:28 +00:00
chartl	11ff1e09b8	A new power walker for the user to feed in a number of alleles. Call that number k. Output is: Locus Power_for_k_alleles Power_for_k-2_alleles Power_for_k-2_alleles ... Power_for_1_allele This was a request from Jason Flannick & the T2DB group. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1786 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 15:35:35 +00:00
ebanks	04fe50cadd	* We no longer have a separate model for the single-sample case. * For now, a single sample input will be special-cased in the EM model - but that will change when the EM model degenerates to the single sample output with a single sample as input. For now, the EM code for multi-samples isn't finished; I'm planning on checking that in soon. The SingleSampleIntegrationTest now uses the UnifiedCaller instead of SSG, and so should all of you. More on that in a separate email. Other minor cleanups added too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1785 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 14:08:57 +00:00
jmaguire	32128e093a	misc. changes to get the numbers back to the baseline while keeping the speedup. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1784 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 12:27:07 +00:00
jmaguire	d38a0d04b9	fix a snp mask offset error. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1783 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 12:25:40 +00:00
kiran	829e99413b	Rescores a variant after removing duplicates (defined very strictly as reads with the same start points). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1782 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 03:07:36 +00:00
hanna	fcb6a992c8	Switched IndexedFastaSequenceFile over to use memory mapping to load data rather than the loop-with-small block size. Performance improvements in loading refs are extreme; segments can be loaded in <1ms. chr1 in its entirety can be loaded in 1.5sec (down from 30sec). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1781 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-08 00:07:15 +00:00
jmaguire	02d2492d68	Simple tool for picking sequenom probes for SNPs. Can be extended to indels if necessary. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1780 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 23:46:41 +00:00
ebanks	1905b5defa	Hash by chromosome for now to reduce memory. This is a temporary solution until we decide how to retire the Injector for good. Also, with Picard's latest changes, we need to make sure we don't double-close the sam writer. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1779 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 20:06:25 +00:00
ebanks	f9a1598d75	Reformatting git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1778 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 20:03:34 +00:00
ebanks	203c626fc2	A wrapper around the GenotypeLikelihoods class for the UnifiedGenotyper. This wrapper incorporates both strand-based likelihoods and a combined likelihoods over both strands. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1777 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 19:57:37 +00:00
sjia	5bdcc2b4dc	Included HLA class 2 genes in CreatePedFileWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1776 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 18:46:51 +00:00
sjia	8f896b734f	Included HLA class 2 genes in CreatePedFileWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1775 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 18:28:01 +00:00
aaron	f9a0eefe4b	GELI_BINARY is now functional, and can be used as a variant type in SSG (-vf=GELI_BINARY). Also fixed the max mapping quality column in both GELI output formats, which we haven't been outputting correctly until now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1774 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 18:20:34 +00:00
chartl	225b9bccc1	Modifications to NQSClusteredZScoreWalker to output empirical mismatch rates on bins by both Z-score and reported Q-score, rather than averaging over all Q-score bins for each Z-score. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1773 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-07 13:45:12 +00:00
depristo	8dd0924b37	Minor performance improvements to VariantEval -- now all of the CPU time is spent dealing with the ROD system... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1772 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-06 23:40:30 +00:00
aaron	4554ca1b28	more cleanup, deprecated the old genotype, corrected SNPCallsFromGenotypes' imports and two other classes that depend on it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1771 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-06 19:09:27 +00:00
aaron	3aec76136f	Removing the AllelicVariant interface, which is replaced by the Variation interface. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1770 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-06 17:44:24 +00:00
depristo	1bd0c3c145	variant eval allows non Variation rod objects git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1768 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-06 13:04:26 +00:00
aaron	66fc8ea444	GSA-182: Adding support for BED interval files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1767 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-06 02:45:31 +00:00
hanna	aec83b401d	SSG multithreading doesn't play well with some I/O changes made since I last svn up'd. Reverting until I can find the reason. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1766 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-05 19:48:57 +00:00
hanna	8a503c86b6	Code supporting SSG proof-of-concept shared memory parallelism. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1765 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-05 18:56:16 +00:00
ebanks	fb619bd593	-Refactoring: make GenotypeCalculationModel constructors empty so that they don't have to be updated every time we add a new parameter; instead put that logic in the super class's initialize method (making everything protected so that only the factory can access them) -Adding initial version of Multi-sample calculation model. This still needs much work: it needs to be cleaned up and finished. Right now, it (purposely) throws a RuntimeException after completing the EM loop. Also: -Fix logic in GenotypeLikelihoods.setPriors -Add logger to the models for output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1764 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-05 18:10:36 +00:00
sjia	98076db6b4	Modified CreatePedFileWalker to output PED file given HLA allele names git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1763 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-05 03:06:42 +00:00
hanna	56bc4fa21a	Fixed bug where not all alignments were returned if read aligned to multiple locations. Enhanced test suite to validate all alignments. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1762 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-04 18:20:20 +00:00
hanna	05aa928e3e	Fix off-by-number-of-deletions issue with negative strand reads. Improved performance by factor of 2.5x. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1761 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-03 21:55:18 +00:00
chartl	7605ee500c	Idiocy! All tests were being disabled because I forgot the instanceof git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1760 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-02 20:04:56 +00:00
chartl	88d0890cc3	Made PooledGenotypeConcordance a standard test in VariantEval git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1759 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-02 20:03:31 +00:00
aaron	7fc4472e6d	A big fix for MergingSamRecordIterator, where we weren't handling the comparisons of SAMRecords correctly (we weren't applying the new reference index first, so sometimes the MT contig would be ID 23, sometimes 24 in different records). Also a fix to the GLF tests, and a correction to PrintReadsWalker to remove the close() on the output source, since the source handles that itself (otherwise you get a double close). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1758 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-02 19:35:35 +00:00
chartl	68cb2ee54b	Tweaks to parameters for NQS analysis walkers; change to PowerAndCoverage for Jason Flannick (can input the number of alleles to compute power for, i.e. doubletons, tripletons, rather than statically checking singletons). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1757 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-02 19:11:27 +00:00
ebanks	53a4bd7f51	A better understanding of what's going on means no need for clearing the cache git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1755 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-02 18:07:46 +00:00
aaron	e885cc4b21	changes for corrected GLF likelihood output, along with better tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1754 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-01 20:45:05 +00:00
hanna	2309d19f6f	Bug fix from Michael Ross: mark second read in sequence as second of pair. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1753 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-01 14:34:36 +00:00
aaron	2e4949c4d6	Rev'ing Picard, which includes the update to get all the reads in the query region (GSA-173). With it come a bunch of fixes, including retiring the FourBaseRecaller code, and updated md5 for some walker tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1751 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-30 20:37:59 +00:00
ebanks	303972aa4b	Yup, I broke the build... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1750 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-30 20:20:43 +00:00
ebanks	841d25cc44	Added ability to set the priors after construction (and requiring a flushing of the likelihoods cache) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1749 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-30 19:55:49 +00:00
hanna	665951f9f0	Support negative strand alignments. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1748 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-30 18:10:26 +00:00
hanna	d3b1732cca	Start of refactoring effort. Make construction of alignment object simpler. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1747 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-30 15:19:31 +00:00
hanna	70e1aef550	Better integrate the @ArgumentCollection into the command-line argument parser. Walkers can now specify their own @ArgumentCollections. Also cleaned up a bit of the CommandLineProgram template method pattern to minimize duplicate code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1746 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 22:23:19 +00:00
aaron	b1c321f161	Adjusted Genotype concordance to more accurately use the new Genotyping code, fixed the VCF rod, and temp. fix the build by reintroducing Shermans ReadCigarFormatter git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1745 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 21:28:21 +00:00
sjia	9b78a789e2	HLA Caller 2.0 Walkers: CalculateBaseLikelihoodsWalker.java walks through reads calculates likelihoods using SSG at each base position CalculateAlleleLikelihoodsWalker.java walks through HLA dictionary and calculates likelihoods for allele pairs given output of CalculateBaseLikelihoodsWalker.java CalculatePhaseLikelihoodsWalker.java walks through reads and calculates likelihoods score for allele pairs given phase information File Readers: BaseLikelihoodsFileReader.java reads text file of likelihoods outputted by SSG FrequencyFileReader.java reads text file of HLA allele frequencies PolymorphicSitesFileReader.java reads text file of polymorphic sites in the HLA dictionary SAMFileReader.java reads a sam file (used to read HLA dictionary when in another walker) SimilarityFileReader.java reads a text file of how similar each read is to the closest HLA allele (used to filter misaligned reads) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1744 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 20:45:55 +00:00
chartl	281a77c981	Bugfix. isMismatch() was actually computing isMatch(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1743 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 20:04:59 +00:00
chartl	e28b45688c	More NQS Related Walkers to play with git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1742 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 20:01:04 +00:00
ebanks	9ef80e3c3c	One minor addition: to incorporate Pooled calling (and to be as general as possible), we allow the genotype calculation model to use rods if it wants. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1741 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 17:05:59 +00:00
ebanks	19bfe43173	First pass at a unified caller, being checked in now so Mark can give feedback if he chooses and so Matt can debug issues with the ArgumentCollection class. Some notes: 1. This design should be flexible enough to include pooled calling (for now) after discussions with Chris. 2. Using the unified caller with the SingleSampleCalculationModel emits the exact same output as SSG over all of chr20 for NA12878. Additionally, when we include the "max deletions allowed at a locus" argument (so we don't try to call SNPs at deletion sites), it removed 233 SNP calls in chr20 that were clearly indel artifacts. 3. The MultiSampleEMCalculationModel is still a work in progress and will be checked in later this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1740 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 16:48:15 +00:00
ebanks	8bd345ba00	Generalized deletions in pileup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1739 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 15:58:43 +00:00
andrewk	6134f49e3c	Convert de novo SNP caller to run using parent1 and parent2 BAM files (by splitting contexts by reader using getMergedReadGroupsByReaders) instead of geli files providing a large speed-up and obviating the need for large whole-genome geli files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1738 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 06:42:21 +00:00
andrewk	5dab95aa5a	Fix getMergedReadGroupsByReaders so that it provides read groups in the same way Picard does so that it works correctly when input read files have no clashes in their read groups and retain their original read group names. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1737 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-29 06:35:50 +00:00
andrewk	5662a88ee1	Cosmetic change to list sampling functions: the typical usage of n and k were reversed. No change in functionality of the classes has been made and unit tests still pass. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1736 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-28 18:12:32 +00:00
aaron	39598f1f0a	switching the concordance walker over to the new Variation system git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1735 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-28 15:46:36 +00:00
asivache	bce2f0d7cf	Now instantiates the list of alternative consensuses to evaluate as LinkedHashSet to guarantee iterator traversal order. Old implementation used HashSet and exhibited unstable behavior when two alt consensuses turned out to be equally good: depending on the run conditions (including size of the interval set being cleaned??), either one could be seen first and selected as the 'best' one git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1734 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-28 06:15:46 +00:00
asivache	663175e868	Bug fix: when jumping onto next contig (chromosome), the walker was erasing last mismatch interval from the previous chr it was still holding without printing it; now it gets printed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1733 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 22:24:34 +00:00
asivache	92c6efabb7	moving IndelGenotyper out of playground git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1732 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 19:44:49 +00:00
asivache	aec61c558b	moving IndelGenotyper out from playground git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1731 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 19:43:53 +00:00
chartl	fe6d810515	Some basic commits that I've been sitting on for a while now: @ PooledGenotypeConcordance - changes to output, now also reports false-negatives and false-positives as interesting sites. It's been like this in my directory for ages, just never committed. @NQSExtendedGroupsCovariantWalker - change for formatting. @NQSTabularDistributionWalker - breaks out the full (window_size)-dimensional empirical error rate distribution by the window. So if you've got a window of size 3; the quality score sequences 22 25 23 and 22 25 24 have their own bins (each of the 40^3 sequences get one) for match and mismatch counts. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1730 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 19:35:50 +00:00
sjia	f7684d9e1b	ImputeAllelesWalker fills missing portions of HLA dictionary based on best allele matches git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1729 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 18:51:46 +00:00
sjia	235de38c2e	Updates to FindClosestAlleleWalker and CreateHaplotypesWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1728 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 16:41:58 +00:00
aaron	2b7d39035a	switched over the FastaAlternateReferenceWalker to the Variation system git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1726 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 16:09:43 +00:00
aaron	7ffc1d97ef	Cut DeNovoSNPWalker over to the new Variation system, some renaming of methods on the Variation interface, and some corrections on the interface. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1724 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-25 04:35:52 +00:00
depristo	392152f149	1000x performance improvements to MSG for crisis control git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1723 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 23:44:33 +00:00
hanna	44879c81b0	Add in weights. Massive performance improvements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1722 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 23:19:15 +00:00
hanna	3b79f9eddc	Support 'N's and other mismatch characters in the reference. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1721 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 21:41:30 +00:00
hanna	08e8d2183a	Indels supported. Variable gap penalties are not yet taken into account. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1720 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 21:03:02 +00:00
aaron	d2af26e81f	Pooled EM SNP Rod converted over to the Variation interface git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1719 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 16:33:11 +00:00
ebanks	97105ac001	We need to return a null RODRecordList when the default value is null (as opposed to a list with a single null value), because that's what everyone is expecting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1718 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 16:23:12 +00:00
ebanks	d4b40bc06f	Filter for reads with missing read groups so we can safely assume all reads have valid read groups git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1717 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 16:10:26 +00:00
ebanks	90de2e0cde	Added ability to specify whether you want to use a point estimate or fair coin test calculation; for now you can use either but fair coin test is still experimental as it needs to be parametrized correctly. This job will hopefully be done by the future Bioinformatic Analyst... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1716 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 15:29:50 +00:00
aaron	d262cbd41c	changes to add VCF to the rod system, fix VCF output in VariantsToVCF, and some other minor changes git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1715 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 15:16:11 +00:00
sjia	1ee8ba590c	Reads cigar files git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1713 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 03:14:10 +00:00
sjia	9422156e09	Finds closest allele for each read in bam file git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1712 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 03:12:20 +00:00
sjia	5c5151c4e7	Creates ped file from reads git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1711 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-24 02:48:29 +00:00
hanna	b0ec7fc144	More comprehensive testing of BWT (mismatches only) module, and lots of bug fixes. Limitations: 1) Can't handle RC alignments. 2) Can't handle indels. 3) Can't handle N's in reference bases. 4) Stops at first hit. Ran BWT over a test suite of 800k Ecoli reads. After removing alignments with indels / reads with Ns, the remaining reads were aligned with quality 'equal to' that of the alignment stored in the BAM file. In this case 'equal' quality is <= mismatches to the reference as the existing alignment stored in the BAM file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1710 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 23:44:59 +00:00
sjia	b446b3f1b6	CreateHaplotypeWalker now gives correct output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1709 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 21:13:52 +00:00
aaron	eeb14ec717	a couple of light changes to GenomeLocSortedSet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1708 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 20:38:53 +00:00
sjia	3916e165fb	New walker to output haplotypes for each read (for SNP analysis or imputation, etc) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1707 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 20:26:43 +00:00
ebanks	423a3ee894	Added a sequenom rod to empower Carrie to convert 1KG validation SNPs to sequenom format git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1706 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 20:22:09 +00:00
chartl	63f3d45ca4	fixing the build git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1705 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 20:04:09 +00:00
chartl	540e1b971f	And we fix one boneheaded mistake, which was actually causing the problem; though the last change was still correct. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1704 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 19:26:45 +00:00
chartl	124ca68fa8	And an IMMEDIATE minor fix (want neighborhood quality > base quality to be represented correctly) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1703 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 19:21:09 +00:00
chartl	8cdb78ebee	More sophisticated version of the NQSCovariantWalker - modified to be more explicit about how much higher the quality score of a particular base is than the quality score of its neighbors. The granularity of the binning jumps from 32 groups to 860 groups. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1702 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 19:18:24 +00:00
hanna	856bbd0320	Let Picard specify the default compression level. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1701 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 19:01:48 +00:00
aaron	f783cb30e0	adding an interface so that the current @Requires with ROD annotations work in walkers like VariantEval git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1700 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 18:24:05 +00:00
hanna	ebfbe56b43	Make sure compression level always gets pushed into SAMFileWriterFactory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1699 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 18:20:26 +00:00
asivache	fa87dd386d	Now uses rodRefSeq in its new reincarnation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1698 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 18:19:36 +00:00
asivache	bf7cd66d53	New, simpler rodRefSeq. Fully relies on the ROD system standard mechanisms. Multiple transcripts over a given location will be now returned by the ROD system itself as RodRecordList<rodRefSeq>; and yes, rodRefSeq does represent a single transcript record now and implements Transcript interface git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1697 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 18:18:25 +00:00
asivache	8fa4c93f5a	Transcript is now simply an interface git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1696 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 18:13:31 +00:00
asivache	fe36289e44	No one needs this, probably... Old experimental code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1695 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 18:11:50 +00:00
asivache	1bd4c0077c	Now that ROD system supports overlapping RODs, we do not need rodRefSeq to be too smart and read in all the overlapping records (transcripts) on its own; leave it to the generic ROD mechanism. PARTIAL commit; new, simpler rodRefSeq will reappear in a seq. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1694 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 18:11:16 +00:00
sjia	aa66074a0e	Compares each read to the HLA dictionary and outputs closest allele, as well as other stats git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1693 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-23 16:17:23 +00:00
aaron	11c32b588f	fixing VariantEvalWalkerIntegrationTest md5 sums, a couple comment changes, and a little bit of cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1690 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-22 20:54:47 +00:00
ebanks	0748d80baa	Added a convenience method in rodDbSNP to deal with Andrey's changes to the rod. Now you can just ask for the first real SNP rod from the list and not have to think about how it works. CountCovariates uses it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1688 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-22 20:15:40 +00:00
hanna	14477bb48e	Unidirectional alignments with mismatches now working. Significant refactoring will be required. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1686 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-22 19:05:10 +00:00
sjia	22932042ea	Combined Scores, bug fixed for printing HLA-C git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1685 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-22 18:28:16 +00:00
ebanks	682b765536	bug: need to upper case chars so that == works throughout git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1684 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-22 18:20:43 +00:00
asivache	d7d0b270d1	now supports blacklisting lanes (with -BL option will ignore reads from any of the specified lanes) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1682 348d0f76-0448-11de-a6fe-93d51630548a	2009-09-22 16:46:57 +00:00

2070 Commits (c8c5c176cd8e4fca1db3ecb5cfda7807a1fbf649)