gatk-3.8

Commit Graph

Author	SHA1	Message	Date
aaron	6941c81bfa	reverting revision 3522 to the old code until we fix the tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3524 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 19:25:02 +00:00
weisburd	adc4c4e577	Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3522 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 18:11:43 +00:00
aaron	ad98512f6c	adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:21:12 +00:00
ebanks	9b2fcc4711	Refactoring of the annotation system: 1. VA is now a ROD walker so it no longer requires reads (needs a little more testing) 2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs) 3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong. Fixed the headers too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:05:51 +00:00
depristo	e2b41082af	GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integrationtest data. Note that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-02 22:26:32 +00:00
ebanks	4a555827aa	Removing more toUpperCase sanity checks git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3471 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-02 14:38:39 +00:00
depristo	2b02324587	Support for detecting and automatically excluding reads reading into the adaptor sequence and, if desired, also only showing the first pair when two reads overlap in the fragment. Not enabled, an intermediate check in before updating and verifying the impact on locus walkers everywhere. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3465 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-30 18:00:12 +00:00
aaron	871cf0f4f6	Call out ROD types by their record type, instead of the codec type (which was clumsy). So instead of: @Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFCodec.class)) you'd say: @Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFRecord.class)) Which is more in-line with what was done before. All instances in the existing codebase should be switched over. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3457 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-28 14:52:44 +00:00
depristo	f2e7582cfc	Reorganization of SW code for clarity. Total failure at raw optimization. Discovered that ~50% of reads being cleaned were perfect reference matches. New code comes with flag to look at NM field and not clean perfect matches. Can be turned off with command line option (needed for 1KG bams with bad NM fields). Going to rerun cleaning jobs due to accidental rebuilding of stable codebase and loss of 2 days of runtime. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3452 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-27 23:16:00 +00:00
aaron	cded9ec985	adding a command line option, -etd (enable threaded debugging), that uses a custom thread pool class to catch exceptions thrown inside of a thread. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3450 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-27 21:57:56 +00:00
depristo	dfc36c1e95	Restructuring of the mandatory read filters for traversals. Now everything uses ReadFilters, even for the required filters like being mapped for LocusWalkers. Statistics now tracked for each read filter used during the traversal and info emitted in INFO at the end. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3445 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-26 22:12:25 +00:00
depristo	5928047d8b	Optimization of reference window calculation to use bytes not char and no uppercasing since reference and read bases are always uppercase now. Should remove some ~5% of runtime of UG. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3438 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-26 14:10:26 +00:00
ebanks	ae6c014884	Fixed UG parallelization bug. Better integration test to catch this in the future. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3432 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-25 21:03:45 +00:00
ebanks	772f558ae0	Massive change to the indel realigner code. We now properly deal with soft-clipped reads. Also, improved left-alignment code. Small change for Ryan to get hard-clipped reads working for the recalibrator. PLEASE DO NOT RELEASE THIS WEEK. I still have some more testing to do and need Mark to run WG jobs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3430 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-25 20:04:33 +00:00
depristo	a10fca0d5c	Genotyper now is using bytes not chars. Passes all tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3406 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 21:02:44 +00:00
aaron	b543dd4ac4	more aggressive checks for the locking, and some more documentation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3404 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 16:16:36 +00:00
depristo	727822adb4	BaseUtils has more clear distinction between byte and char routines. All char routines are @Deprecated now. Please use bytes. Better organization of reverse(), now in Utils not BaseUtils. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3400 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 14:05:13 +00:00
depristo	6ce3835622	Removing unused methods in QualityUtils; ReferenceContext now converting all bases to upper case, but can be disabled with static boolean git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3399 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 12:38:06 +00:00
depristo	5abac5c057	A few more char -> byte cleanups git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3398 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 00:02:06 +00:00
depristo	8a725b6c93	Restructuring of ReferenceContext and ReadWalkers to accept a ReferenceContext. Now ReferenceContext is byte[] backed not char[]. Please no more chars for the reference. All of the tests pass now. Coming check-ins are going to clean up the char / byte problems in the GATK git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3397 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-19 23:27:55 +00:00
hanna	017ab6b690	Experimental versions of downsampler and Ryan's deduper are now available either as walker attributes or from the command-line. Not ready yet! Downsampling/deduping works in a general sense, but this approach has not been completely optimized or validated. Use with caution. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3392 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-19 05:40:05 +00:00
weisburd	2f3933148d	Added fast split(str, delimiter) method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3384 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-19 03:37:26 +00:00
aaron	7cfb9ff3dc	updates for Tribble 82, fixes for Ryans case where multiple processes would attempt to read/write to the same index, and a couple other Tribble-centric bug fixes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3382 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-18 19:34:45 +00:00
hanna	0791beab8f	Checking in downsampling iterator alongside LocusIteratorByState, and removing the reference implementation. Also implemented a heap size monitor that can be used to programmatically report the current heap size. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3367 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-17 21:00:44 +00:00
aaron	2c55ac1374	fixes for parallel processing problems with Tribble, a small bug in the resource pool, and some more documentation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3349 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-12 06:13:26 +00:00
hanna	76efa757f0	Switched over to reviewed version of Picard patch. In process, did some optimization to the IntervalSharder which improved startup time 5-10x when dynamically merging many BAMs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3331 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-08 14:12:22 +00:00
depristo	504103bd15	Misc. additions to correct utilities git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3329 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-07 21:34:18 +00:00
aaron	06ea65e60b	again for JIRA GSA-320 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3319 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-07 03:47:58 +00:00
aaron	ac9b32db88	a bug fix for Kiran; putting JIRA in for better type determination system for the new Tribble tracks so this doesn't happen again. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3318 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-07 03:31:43 +00:00
hanna	4e0019b04f	Repair code that sorts and merges intervals. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3317 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 22:37:25 +00:00
ebanks	0e58fb7cc0	Moved over to be a walker inside the GATK git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3313 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 18:28:03 +00:00
aaron	78409dca0d	turned off the progress output from tribble when making an index, and fixing a case where the index file isn't writable so we instead make the index in memory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3312 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 16:36:58 +00:00
ebanks	bacc507a48	Don't worry about sorting anymore in the liftover tool. That will come later. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3311 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 15:00:30 +00:00
ebanks	2975e3a4e8	picard Intervals don't sort right - switching to GenomeLocs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3308 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 03:50:28 +00:00
ebanks	1a99fb9318	First pass at liftover tool. Passing buck over to Aaron... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3306 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 20:38:19 +00:00
aaron	a0d71540df	speed-up for VCF, adding code to the VCF reader to automagically make an index if one doesn't already exist, and a change to the VCF writer unit test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3305 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 20:19:42 +00:00
aaron	6bbcc47b5d	removing some out-of-date RODs and some unused genotype writer formats git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3304 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 19:07:13 +00:00
aaron	a68f3b2e9c	VCF moved over to tribble. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3302 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 17:28:48 +00:00
ebanks	64640d6b17	Complete the switch statement to deal with all possible cigar operators for Kris. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3299 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 13:41:05 +00:00
weisburd	8b2ce128b5	Optimized the join(..) method. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3280 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-30 15:55:07 +00:00
aaron	64c5f287c5	fixes for edge-cases when using reflections to find classes outside of the main jar. Will push as a patch to reflections git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3264 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-27 17:46:46 +00:00
aaron	c647153b10	Adding Jama for Ryan. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3262 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-27 14:30:36 +00:00
aaron	f6468f9143	a fix for a bug we've worked around in the reflections package: previously it didn't find classes that weren't in the main jar. Fixed in this version. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3261 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-27 04:49:49 +00:00
ebanks	42bcca1010	Pulling out the left-alignment code for indels so that other walkers can use it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3251 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-23 16:12:34 +00:00
aaron	536f22f3bd	adding VC adaptor for GELI, along with unit tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3243 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-23 05:28:39 +00:00
hanna	32d86cf457	Rev the reservoir downsampler to support partitioning through a functor. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3232 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-21 19:50:26 +00:00
asivache	1373fee278	Because of the ugly VCF format, generic addCall() method of GenotypeWriter interface acquired an additional parameter, explicitly specified reference base (in VCF it's the base immediately before the event in case of indels, so we got to pass it). All implementing classes are modified to accommodate the change. VCFGenotypeWriterAdapter now explicitly uses the passed reference base instead of deriving it from VariantContext (in SNP mode as well!), other writers simply ignore that additional argument. SimpleIndelCalculationModel now WORKS (or rather, it does produce calls :) ) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3228 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-21 18:19:03 +00:00
asivache	6fda78f93f	Always return deleted bases in upper case git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3218 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 19:17:40 +00:00
asivache	52a570637d	Always keep event bases in upper case git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3217 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 19:16:39 +00:00
aaron	80c4f88a72	removing the Variation interface. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3216 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 18:56:45 +00:00
hanna	c1e53d407d	The copyright tag that I copied/pasted from a LaTeX document into IntelliJ had unicode quote characters embedded in it. These characters were invisible inside IntelliJ but cause compile warnings for Ryan and Aaron, who for whatever reason have a different default charset. Fixed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3203 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 15:26:32 +00:00
aaron	b5f6f54968	Almost done removing any trace of the old Variation and Genotype interfaces. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3202 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 14:52:15 +00:00
hanna	1bc26f69e9	An attempt to cleanup the Utils directory. Email to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3198 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-19 23:00:08 +00:00
hanna	c08936d6f4	Added a reservoir downsampler which can sample elements in an iterator uniformly from a stream (see Vitter 1985). Thanks to Eric and Andrey for the pointer. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3197 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-19 20:48:14 +00:00
aaron	e11ca74eb5	removing some outdated ROD classes (PooledEMSNPROD and SangerSNPROD), removing an out-of-date interface (VariantBackedByBenotype), and moving AnalyzeAnnotationWalker over to VariationContext. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3188 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-16 18:59:29 +00:00
asivache	6dc1275cfb	Utility method added: getQualsInCycleOrder(read) - examines the read and returns its quals in the order the machine read them (i.e. always from cycle 1 to cycle N). Simply inverts quals if the read happens to be rc-aligned :) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3183 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-16 00:15:57 +00:00
aaron	e682460c1f	add a fix so that XL arguments won't cancel out -BTI arguments, fixed a bug for Ben where the ROD -> interval list conversion was throwing an exception, and some old code removal. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3174 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-15 16:31:43 +00:00
hanna	8573b0bc6f	Refactoring intervals, separating the process of parsing interval lists, sorting and merging interval lists, and creating RODs from intervals. This gives Doug the ability to keep using our interval list parsing code when sorting intervals on our behalf. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3159 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-13 15:50:38 +00:00
ebanks	3f2455e346	Better error message as suggested by James P git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3141 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-09 05:52:53 +00:00
aaron	12e4f88ca7	a little bit more clean-up git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3122 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-05 20:49:06 +00:00
aaron	df7e7921ce	removing some unused code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3121 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-05 19:30:08 +00:00
bthomas	b4f6f54502	Reorganizing the way interval arguments are processed Most of the changes occur in GenomeAnalysisEngine.java and GenomeLocParser.java: -- parseIntervalRegion and parseGenomeLocs combined into parseIntervalArguments -- initializeIntervals modified -- some helper functions deprecated for cleanliness Includes new set of unit tests, GenomeAnalysisEngineTest.java New restrictions: -- all interval arguments are now checked to be on the reference contig -- all interval files must have one of the following extensions: .picard, .bed, .list, .intervals, .interval_list git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3106 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-01 12:47:48 +00:00
aaron	c3c6e632d1	support for two new VCF header info field value-types, Flag (for fields that are just boolean truths), and Character (for single character info fields). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3105 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-01 03:11:32 +00:00
aaron	3d3d19a6a7	the last-mile commit for Tribble integration. The system is now ready for Tribble to be turned on, as soon as we've removed any dependencies in the ROD code on interfaces that aren't in the Tribble library (i.e. the Variation or Genotype interface on RODs). All of the walkers should be up to date. a caveat: for anyone asking for all of the ROD's back from the RefMetaDataTracker (if you're not using the facilities to get the track by name), you'll now be getting back a collection of GATKFeature objects. This object will contain the track name, and a method for getting the underlying object (getUnderlyingObject()), which will be the traditional RodVCF, rodDbSNP, etc. This layer is needed so we can integrate Tribble tracks (which don't natively have names). Calls that ask for RODs by name will still get back the traditional reference ordered data objects (RodVCF, rodDbSNP, etc). Sorry for the inconvenience! More changes to come, but this is by far the largest (and has the greatest effect on end users). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3104 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-31 22:39:56 +00:00
hanna	400684542c	Revisions to take into account finalization of Picard patch: naming changes, better definition of public interfaces. This won't be the last Picard patch, but it should be the last big one. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3096 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-30 19:28:14 +00:00
hanna	85037ab13f	Fix for Kiran's sharding issue (Invalid GZIP header). General cleanup of Picard patch, including move of some of the Picard private classes we use to Picard public. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3087 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-29 03:21:27 +00:00
depristo	b8ab74a6dc	Minor useful changes to BaseUtils and MathUtils to support a new haplotype score annotation that determines the two most likely haplotypes over an interval and scores variants by their consistency with a diploid model. Appears to be useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3085 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-28 21:45:22 +00:00
ebanks	47e30aba92	Rods for reads hooked up into the cleaner git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3070 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-24 18:17:56 +00:00
ebanks	49117819f5	For the cleaner to clean, it must beat the entropy produced by the aligner (and not just the raw reads). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3068 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-24 15:21:58 +00:00
aaron	a69b8555dd	Geli to variant context. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3063 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-23 06:45:29 +00:00
aaron	eafdd047f7	GLF to variant context. Added some methods in GLF to aid testing; and added a test that reads GLF, converts to VC, writes GLF and reads back to compare. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3062 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-23 03:43:25 +00:00
hanna	3767adb0bb	Processing intervals as they stream in means much lower memory usage and quicker runtime. Making change as minimal as possible to avoid conflicts with BT's incoming patch. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3061 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-22 22:04:45 +00:00
ebanks	0097106938	VariantFiltration can now filter specific samples. This is NOT an ideal implementation. One day when we have lots of free time (or a greater desire), we will implement this correctly and sophisticatedly using all the power of JEXL. For now, though, this will have to do. Docs coming tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3060 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-22 20:45:11 +00:00
depristo	076d21d394	Minor bug workaround in GenotypeConcordance module (see todo). General platform read filter. You can say -rl Platform illumina to remove all SLX reads git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3054 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-22 02:47:09 +00:00
ebanks	c88a2a3027	Fixing/cleaning up the vcf merge util git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3047 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 15:13:32 +00:00
depristo	56092a0fc2	Slight cleanup for mathutils git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3042 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 13:18:08 +00:00
ebanks	03480c955c	And now the UnifiedGenotyper can officially annotate genotype (FORMAT) fields too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3039 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 04:58:37 +00:00
ebanks	e757f6f078	Missing value for arbitrary format entries is empty string (need to revisit at some point, but it will require updating the VCF spec). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3038 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 03:56:27 +00:00
ebanks	0311980668	The VariantAnnotator can now officially annotate genotype (FORMAT) fields. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3037 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 03:30:14 +00:00
ebanks	ee0e833616	Some significant changes to the annotator: 1. Annotations can now be "decorated" with any arbitrary interface description - not just standard or experimental. 2. Users can now not only specify specific annotations to use, but also the interface names from #1. Any number of them can be specified, e.g. -G Standard -G Experimental -A RankSumTest. 3. These same arguments can be used with the Unified Genotyper for when it calls into the Annotator. 4. There are now two types of annotations: those that are applied to the INFO field and those that are applied to specific genotypes (the FORMAT field) in the VCF (however, I haven't implemented any of these latter annotations just yet; coming soon). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3029 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-18 05:38:32 +00:00
rpoplin	58a31bab6a	Variant optimizer now outputs VCF files via ApplyVariantClustersWalker. Documentation to be added to the wiki. It is ready to be used by other people but only with great caution. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3028 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 20:41:42 +00:00
hanna	d9398dc347	Remove some of the restrictions on getStart() and getStop(); getStart() and getStop() now do the minimum validation rather than the more rigorous only-within-the-contig-bounds header validation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3027 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 19:39:30 +00:00
ebanks	ded4ba8966	Let's make artificial reads that actually adhere to the specs... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3022 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 16:51:42 +00:00
bthomas	5b34bb9ab0	Adding three minor new features: + -L all now walks over all intervals + if a -L argument is passed with a .list extension, and file does not exist, returns a File Not Found error instead of "bad interval" error. We plan to soon revisit interval lists and generate a concrete list of filenames, so this is likely temporary. + Error is thrown if the start position on an interval is higher number than the end position. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3021 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 16:24:10 +00:00
ebanks	4340601c26	-Pushed base quals back down into SAMRecord; if -OQ is used, the SAMRecord quals get updated automatically -Better integration test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3020 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 16:00:10 +00:00
ebanks	1fd909cdaf	Fix for Kiran: -1 is a valid value for genotype qualities in VCF, so VariantContext shouldn't die. Cleaned up the relevant VCF code while I was in there. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3015 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 00:20:15 +00:00
ebanks	586f87fa35	Quick fix git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3007 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-16 02:59:26 +00:00
ebanks	202231141c	-Push the --use_original_qualities argument into the engine. -Check that base and qual strings are the same lengths -Fix one more bug in the clipper. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3006 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-16 02:06:11 +00:00
ebanks	411d25c8d1	-Integration tests for walkers that use original quals. -framework for pushing -OQ into GATK (not done) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3004 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-15 18:46:31 +00:00
kcibul	9f519af06d	new method to filter out overlapping PE reads git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3002 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-15 15:40:09 +00:00
depristo	4dd7c5972c	Unit tests for -XL arguments; expt. annotation calculating the GC content within 100 bp of the current SNP git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2997 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-14 21:08:14 +00:00
aaron	ecb59f5d0d	removed old tests and old code git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2995 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 22:57:01 +00:00
depristo	e7eae9b61d	High performance, correct implementation of -XL exclusion lists. Enjoy. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2994 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 22:39:20 +00:00
aaron	88a48821ea	removed the dependence on removeRegion() in GenomeLocSortedSet git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2993 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 22:35:49 +00:00
aaron	1eb5f97255	fixed dropping single base intervals from deleteRegion, moving onto performance fixes. (stop - start is length-1 on closed intervals, so we need to check greater than OR equals to zero) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2990 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 19:14:21 +00:00
hanna	a7ba88e649	Rework the way the MicroScheduler handles locus shards to handle intervals that span shards with less memory consumption. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2981 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-11 18:40:31 +00:00
aaron	dde9fd8a15	some rods-for-reads cleaning and performance improvements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2979 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-10 22:54:58 +00:00
depristo	486bef9318	Support for validationRate calculation in variant eval 2; better error messages for failed genome loc parsing; tolerance to odd whitespace in plinkrod, and fix for monomorphic sites in vcf2variantcontext. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2976 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-10 16:25:16 +00:00
ebanks	c85ed1ce90	Plumbing is now in place to emit indel calls from the UnifiedGenotyper. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2975 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-10 04:30:12 +00:00
ebanks	5a20bf0e64	3 changes to UG which break integration tests: 1. emit AA,AB,BB likelihoods in the FORMAT field for Mark 2. remove constraint that genotype alleles (in the GT field) need to be lexicographically sorted. 3. Add bam file(s) used by genotyper to header for Kiran git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2963 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-09 17:16:47 +00:00
ebanks	9f3b99c11b	Moving UnifiedGenotyper and VariantAnnotator over to VariantContext system. Removing obsolete genotyping classes. First stage of removing dependence on old Genotype class. More changes to come. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2960 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-09 03:41:07 +00:00
hanna	1ef1091f7c	Cleanup and simplification of read interval sharding. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2944 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-05 23:34:38 +00:00
ebanks	0dd65461a1	Various improvements to plink, variant context, and VCF code. We almost completely support indels. Not yet done with plink stuff. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2926 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-04 17:58:01 +00:00
chartl	6759acbdef	Coverage statistics now fully implements DepthOfCoverage functionality, including the ability to print base counts. Minor changes to BaseUtils to support 'N' and 'D' characters. PickSequenomProbes now has the option to not print the whole window as part of the probe name (e.g. you just see PROJECT_NAME|CHR_POS and not PROJECT_NAME|CHR_POS_CHR_PROBESTART-PROBEND). Full integration tests for CoverageStatistics are forthcoming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2924 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-04 15:00:02 +00:00
aaron	ca2cd9d4f5	a little clean-up: move setting the bases of generated reads into Artificial SAM Utils now that the clean read injector test is gone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2919 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-03 16:31:45 +00:00
aaron	790d2a7776	adding the initial ROD for Reads support; more convenience methods in ReadMetaDataTracker to come. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2918 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-03 15:56:44 +00:00
ebanks	0e9a6826b0	Update to VCF code to get it up to spec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2917 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-03 06:12:42 +00:00
ebanks	5f3c80d9aa	1. To make indel calls, we need to get rid of the SNP-centricity of our code. First step is to have the reference be a String, not a char in the Genotype. Note that this is just a temporary patch until the genotype code is ported over to use VariantContext. 2. Significant refactoring of Plink code to work in the rods and use VariantContext. More coming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2913 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-02 20:26:40 +00:00
kcibul	7578678f99	refactored to provide a sum of mismatch quality scores capability as well (used by Cancer) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2911 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-02 16:40:03 +00:00
aaron	246fa28386	RODs for reads phase 2: modified RODRecordList to implement List<ReferenceOrderedDatum> so I could stub it out for testing, added a FlashBackIterator which is needed to prevent the ResourcePool from opening infinity+1 iterators, and some other interfaces to make unit testing much smoother. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2892 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-25 22:48:55 +00:00
hanna	199b43fcf2	Reduce by interval alterations to interface with new sharding system. This checkin will be followed by a simplification of some of the locus traversal code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2886 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-25 00:16:50 +00:00
aaron	fef1154fc8	starting on RODs for Reads: made RODRecordList implement list<RODatum> (so we can sub in fake lists during testing), and removed unnecessary generic-ness. Removed BrokenRODSimulator, which isn't being used. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2884 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-24 22:11:53 +00:00
aaron	5546aa4416	adding code to deal with the off-spec situation where our minimum likelihood is above the GLF max of 255. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2871 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-22 22:27:39 +00:00
alecw	b236714c8a	Optimization - Added method to Covariates: void getValues( SAMRecord read, Comparable[] comparable ) which takes an array of size (at least) read.getReadLength() and fills it with covariate values for all positions in the given read. Made CovariateCounterWalker and TableRecalibrationWalker use this method instead of calling getValue(..) for each covariate and each offset. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2863 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-22 17:35:25 +00:00
aaron	33ae256186	a start to some of the infrastructure for Tribble, including dynamic detection of new RMD; not nearly wired in or complete yet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2855 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-18 18:43:52 +00:00
ebanks	79ab7affda	- Change sortOnDisk option to sortInMemory - Fix horrible cleaner bug - Trivial optimizations to cleaner code - more significant ones coming soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2850 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-17 20:52:57 +00:00
aaron	653f70efa2	added methods to validate an interval before you try to make a GenomeLoc: boolean validGenomeLoc(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2846 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-16 20:35:35 +00:00
rpoplin	3de72daa88	Removing an accidentally added import statement. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2818 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-10 15:54:24 +00:00
rpoplin	0b1e243a7b	CountCovariates now sorts the list of standard covariate classes coming from PackageUtils.getClassesImplementingInterface(). As a result some of the integration tests now make use of -standard git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2817 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-10 15:52:20 +00:00
depristo	934d4b93a2	VariantContext to VCF converter. BeagleROD, and phasing of VCF calls. Integration tests galore :-) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2814 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-09 19:02:25 +00:00
depristo	94f892ad42	VCF->beagle and VCF phasing using beagle input. Appears to work fairly well. VariantContexts now support phased genotypes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2812 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-09 01:22:05 +00:00
kshakir	fc810a1800	Updated VCF Reader to parse VCFs according to the VCFv3.3 spec. Column headers are tab separated since sample names might have spaces. Updated test files in /humgen/gsa-scr1/GATK_Data/Validation_Data/*.vcf to remove spaces except for when they are supposed to be in the sample name. Added @Test before VCFReaderTest.testHeaderNoRecords() git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2809 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-08 22:55:59 +00:00
hanna	21369869b7	Extend regex that supports every 'word' character to use any printable character except ':'. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2807 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-08 03:29:55 +00:00
depristo	af8c47fc2f	Fixing up testVariantContext for integration tests for variant context. Printing of VCs and genotypes now stable using sorting. Cleaned up comments in quality score by strand. RefMetaDataTracker now directly allows walkers to obtain VariantContexts using the simple Collection<VariantContext> getAllVariantContexts(GenomeLoc curLocation, EnumSet<VariantContext.Type> allowedTypes, boolean requireStartHere, boolean takeFirstOnly) function. VCF and dbSNP VariantContexts now officially supported. Other important types can be added to the adaptor system in refdata package. Integration tests later today git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2791 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-05 15:42:54 +00:00
ebanks	83b9d63d59	1. Added functionality to the data sources to allow engine to get mapping from input files to (merged) read group ids from those files. 2. Used said mapping to implement N-way-in,N-way-out functionality in the new indel cleaner. Still needs more testing (to be done after vacation but preliminary tests look good). 3. Fixes to VCF validator: ignore case when testing VCF reference base against true reference base and allow quals of -1 (as per spec). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2773 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-04 04:12:49 +00:00
chartl	2c4f709f6f	Bunch of oneoff stuff that I don't want to lose. Also: VCFRecord - "." dbsnp-ID entries now taken into account (thought these were represented as null; but I guess not) VCFGenotypeRecord - added a replaceFormat option; since intersecting Broad/BC call sets required genotype formats also be intersected (no changing on-the-fly) VCFCombine - altered doc to instruct user to give complete priority list (was throwing exception if not) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2760 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-01 21:35:10 +00:00
asivache	421282cfa3	Convenience method: getMappingFilteredPileup(int minMapQ) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2759 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-01 21:19:53 +00:00
depristo	d9671dffba	Documentation for VariantContext. Please read it and start using it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2756 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-01 17:49:51 +00:00
chartl	236764b249	Major (and useful) changes to MultiSampleConcordance: 1) Now cares about Genotype filtering. If it is flagged as filtered, it can count as a FP/FN/TP; but goes into a "non-confident genotype" bin, rather than het/hom. 2) Can give it a Genotype Confidence flag (-GC) which will automatically filter genotypes in the way above for quality > Q for "-GC Q" 3) Can give it an -assumeRef flag. For sites only in the truth VCF (that don't even appear in the variant VCF), that locus will be treated as confident ref calls for all individuals in the variant VCF; and the calculators updated accordingly. *** Important: Default behavior is that sites unique to the truth VCF are considered no-call sites for the variant. This flag can help get around that; however the safest way to run this is to have a variant VCF with calls at each and every locus, if that is possible. VCFGenotypeRecord -- added an isFiltered() call to automate looking up the FILTERED flag for VCF v3.3 SimpleVCFIntersectWalker - basic outline for a walker I'm working on tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2747 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-30 01:18:31 +00:00
aaron	ac2a207b0b	added a wrapper exception for anything that goes wrong in VCF parsing; this way the problematic file line is emitted, no matter what happens. Makes debugging a lot easier, especially in large files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2739 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 19:58:51 +00:00
chartl	d57a86ad41	Not nearly as badass as it looks. The problem I mentioned yesterday with "bleeding in" of samples comes from VCFUtils and SampleUtils looking for all VCF-class RODs in the tracker, and stealing the name from them. I have introduced a new HapmapVCF - type rod for use when you want to protect your VCF header from being infected by the samples in a bound hapmap VCF. Changes are as follows: VCFRecord - minor change to adapt isNovel() to the case where the dbsnp ID field is empty, but the info field has DB=1 HapmapVCFRod - introduced for the reason at the top RODRecordIterator - was: catch ( Exception e ) { throw new StingException("long ass message") } is now: catch ( Exception e ) { throw new StingException("long ass message",e) } to permit full stack ejaculation. RodVCF - Now with more brackets! ReferenceOrderedData - registering HapmapVCF as a bindable string VariantAnnotator - There's an extra space on a line. And some new brackets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2733 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 15:19:50 +00:00
hanna	3d922a019f	Basic support for very simple index-driven locus traversals. Interface has been changed to support batched intervals in a single shard, but intervals are not yet compressed into a single shard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2730 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 03:14:26 +00:00
chartl	7a10c40fb3	Much clearer (and, like, not totally incorrect) implementation of isNovel git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2725 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 21:16:21 +00:00
chartl	8de6a8d246	Lots of changes; all to do something relatively minor. 1) Changed VCF/RodVCF to allow for inquiries to whether or not the site is novel; isNovel() looks at the ID field, and those members of the info field that indicate membership in dbsnp, hapmap2, or hapmap3; and if none can be found, returns true. 2) Changed VariantAnnotator to annotate hapmap2 and hapmap3, if you bind rods to it with those names. Works in the same way as DBSNP does -- if you give it a rod named "hapmap2" it'll annotate membership in it. -- Passes integration tests 3) Changed UnifiedGenotyper to do the same thing (since it uses Annotations as a subroutine) -- Passes integration tests 4) Changed MultiSampleConcordanceWalker to take a flag --ignoreKnownSites (or -novels) to examine concordance only on sites that are not marked as in dbSNP or in Hapmap in the variant VCF 5) Changed VCFConcordanceCalculator (the object MultiSampleConcordanceWalker runs on) to output Concordant_Het_Calls and Concordant_Hom_Calls separately, rather than combined as Concordant_Calls 6) AlleleBalanceHistogramWalker -- I don't know what i did to this thing. I've been jerry rigging System.outs to do stuff it was never really intended to do; so there's probably some dumb System.out.print("HI I AM AT LOCUS:"+loc) stuck somewhere. It compiles at any rate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2724 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 21:06:56 +00:00
depristo	956b570c8e	V5 improvements to VariantContext. Now fully supports genotypes. Filtering enabled. Significant tests throughout system. Support for rebuilding variant contexts from subsets of genotypes. Some code cleanup around repository git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2721 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 18:37:17 +00:00
ebanks	1dd9996f3a	New realigner now completely uses bytes, plus misc fixes. Still not ready for use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2719 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 04:17:20 +00:00
ebanks	fddca032bb	Initial commit of v2.0 of the cleaner. DO NOT USE. (this means you, Chris) Cleaned up SW code and started moving over everything to use byte[] instead of String or char[]. Added a wrapper class for SAMFileWriter that allows for adding reads out of order. Not even close to done, but I need to commit now to sync up with Andrey. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2712 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-27 21:36:42 +00:00
hanna	fa3589e5c5	Update our error messages to point to getsatisfaction.com/gsa. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2706 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-27 19:16:28 +00:00
hanna	022601b1a5	Warnings for walkers w/o Javadoc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2683 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-25 20:34:50 +00:00
hanna	d25a2fe120	Better handling of enums by the command-line argument system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2647 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 21:36:46 +00:00
hanna	1e9fe2a334	Clean up error output when enums have missing arguments. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2645 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 19:48:26 +00:00
aaron	8d1d37302c	a quick change to GLF to keep as much precision in our likelihoods as long as possible, before we put it into byte space. Sanger was doing a diff at low coverage and noticed our calls didn't contain as much precision as theirs. Updated the MD5 for unified genotyper output. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2644 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 19:36:49 +00:00
hanna	908d399670	Bug fix for help text / version number - help text retriever was crashing in the debugger if help text hadn't been built. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2643 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 19:18:19 +00:00
hanna	8dafd26100	Print out the current version number in the application header. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2633 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-19 21:58:36 +00:00
hanna	1488578617	Working with Aaron to get svnversion running within the build system. This change will break the build. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2628 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-19 16:55:42 +00:00
depristo	41392f8ff5	functions for setting genotype records and alternate bases; function for getting all rods implementing VCF git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2611 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-16 20:19:43 +00:00
hanna	ac4756db20	Add the svn version on the fly to the version number properties. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2607 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-16 00:28:01 +00:00
hanna	420cef4094	Added version numbers to the help doclet extractor. Since the help system is behaving more like a resource bundle at this point, changed it over to use the Java ResourceBundle support classes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2606 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 23:31:29 +00:00
hanna	930082314a	Put a major.minor version into the GATK Javadoc for reading. Also, update some straggler packages to the new package-info.java format introduced in 1.5. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2604 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 21:48:30 +00:00
ebanks	b911b7df82	Fixing the AC annotation to be in line with the VCF spec git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2593 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 18:28:52 +00:00
rpoplin	70df30fc1b	Added method to AlignmentUtils which takes a read's cigar and the refBases char array given to a ReadWalker and returns the aligned reference char array. Bug fix in solid_recal_modes to use this aligned reference array. Recalibrator version number is no longer separate for each of the two walkers. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2589 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 15:36:59 +00:00
ebanks	2a116bb5d6	Made the VCF validator a simple rod walker instead of having it be in a separate package. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2588 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 06:39:06 +00:00
aaron	db9570ae29	Looks bigger than it is: * Moved GATKArgumentCollection into gatk.arguments folder to clean up the main folder, also added some associated argument classes (most of the changes). * Added code the argument parsing system for default enums, we needed this so we could preserve the current unsafe flag, and at the same time allow finer grained control of unsafe operations. You can now specify: "-U" (for all unsafe operations), "-U ALLOW_UNINDEXED_BAM" (only allow unindexed BAMs), "-U NO_READ_ORDER_VERIFICATION", etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2586 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 00:14:35 +00:00
asivache	d85461c463	MergingIterator completely re-done. Now it is not a generic class (sorry guys), but rather it is tailored for merging ROD tracks. This implementation peeks the locations of next ROD annotations in each track, but does not actually read these RODs from underlying streams until the location is reached and it is time to actually return the object. Now underlying ROD track iterators (registered in the resource pool!) are not advanced prematurely past the current position and all the way to the next ROD record wherever it is, so that the sharding system can reuse them. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2582 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-14 17:43:36 +00:00
ebanks	a082b948a3	Support throughout for S and N cigar elements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2579 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-14 03:45:42 +00:00
ebanks	8ca5bba738	We emit genotype data in the VCF record if the format string instructs us to (regardless of whether or not genotypes are provided - this was the wrong test). SequenomToVCF now correctly has no-calls when probes fail. Re-enabled SequenomToVCF integration test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2572 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-13 15:40:27 +00:00
chartl	6d1107a4ed	Update to SequenomToVCF Output changing slightly so integration test disabled temporarily git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2571 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-13 15:32:05 +00:00
ebanks	f99586f91b	Added integration test for beagle and verbose output in UG. Minor cleanup of VCFRecord code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2570 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-13 03:55:24 +00:00
ebanks	040fdfee61	Cleaned up the interface to VCFRecord. It's now possible (and easy) to create records and then write them with a VCFWriter. I've updated HapMap2VCF to use the new interface; Chris agreed to take care of Sequenom2VCF. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2558 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-11 21:42:12 +00:00
chartl	dfa3c3b875	Added: SequenomToVCF - Takes a sequenom ped file and converts it to a VCF file with the proper metrics for QC. It's currently a rough draft, but is working as expected on a test ped file, which is included as an integration test. Modified: VCFGenotypeCall -- added a cloneCall() method that returns a clone of the call Hapmap2VCF -- removed a VCFGenotypeCall object that gets instantiated and modified but never used (caused me all kinds of confusion when I was basing SequenomToVCF off of it) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2554 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-11 17:17:21 +00:00
ebanks	971834ca90	Added a walker to the vcf tools compilation: one that combines vcf records. Both merges and unions are supported (see documentation... when it gets written this week). Also, moved some code that pulls samples out of rods from VCFUtils into SampleUtils. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2552 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-10 06:45:11 +00:00
ebanks	b468369dfa	-UG's call into VariantAnnotator now uses the full alignment context (as opposed to the filtered one) -MQ0 annotation is now standard again -Added AC and AN annotations to VCF output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2545 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-08 05:40:42 +00:00
rpoplin	5f58492401	A rogue QualityUtils.MAX_REASONABLE_Q_SCORE managed to get through my previous bug fix. It should instead check the command line -maxQ argument. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2540 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 21:17:39 +00:00
ebanks	9a658e6b18	-Fixed VCF header line bug -Added useful trim() method for Strings for characters other than whitespace git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2538 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 17:51:41 +00:00
ebanks	b643a513bb	Minor interface change for VCFGenotypeRecord. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2537 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 16:48:09 +00:00
depristo	076481f786	Fixes to mergeVCF -- now correctly supports merging of filter fields. Also removed incorrect hasFilteringCodes() function. Updated integration tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2535 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 14:50:13 +00:00
ebanks	6c739e30e0	1. Removing an old version of the Genotype interface which is no longer being used. Needed to do this now so that the naming conflicts would cease. 2. Adding a preliminary version of the new Genotype/Allele interface (putting it into refdata/ as the VariantContext really only applies to rods) with updates to VariantContext. This is by no means complete - further updates coming tomorrow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2533 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 05:51:10 +00:00
depristo	a9245a58e2	Fix for incorrect exception throwing in VCFRecord. It is reasonable to ask for the non-ref allele freq at all ref sites. Was only passing in tests because isReference was broken git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2532 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 01:18:30 +00:00
depristo	7215526810	Fix to isReference() in VCFRecord. Change to VariantCounter to correctly counter only non-genotype variants, as well as update to VariantEvalWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2531 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 00:03:29 +00:00
andrewk	6c4ac9e663	Updated HapMap2VCF to use the VCFGenotypeWriterAdapter interface; fixed bug in VCFParameters that affects VariantsToVCF and HapMap2VCF when reference is lower-cased; added integration test for HapMap2VCF that checks for the lower-case issue by testing against Hg18 region that has lower-cased bases git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2530 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 21:27:11 +00:00
chartl	a32245f7d2	Modifications: QualityUtils - Stole the BaseUtils code for flipping reads around and applied it to quality scores SecondBaseSkew - Nothing's really different, just a commented line Additions (experimental annotations for future development of second-base annotation) I DO NOT INTEND FOR ANYONE TO USE THESE - ProportionOfNonrefBasesSupportingSNP - ProportionOfSNPSecondBasesSupportingRef - ProportionOfRefSecondBasesSupportingSNP + I hope these are self-explanatory - QualityAdjustedSecondBaseLod + Adjust lod-score by 10*log10[P[second bases are as observed]] Added walker: QualityScoreByStrand - oneoff project that's being saved if i ever need it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2527 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 19:18:07 +00:00
asivache	eb899741e1	reverting last changes. no cacheing git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2526 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 18:59:37 +00:00
asivache	a17d725c35	Cache pileup bases and mapping quals after first call to getBases() and getMappingQuals(), respectively. Subsequent calls to these method will return cached arrays. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2525 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 18:05:00 +00:00
ebanks	d6fb19bb67	Don't hard-code base qual max git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2524 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 17:21:44 +00:00
depristo	592749a7c1	isNBase method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2513 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 15:01:51 +00:00
depristo	5ce11c3dad	toString method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2512 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 15:01:20 +00:00
depristo	bca3d1b943	useful convenience function to get a genotype associated with a particular sample git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2510 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 14:53:56 +00:00
depristo	ec774f62be	Some checking to protect the BasicGenotype git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2509 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 14:53:24 +00:00
ebanks	ed2fff13aa	-Misc improvements to VCF code -Small fix to callset concordance git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2497 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-04 02:28:47 +00:00
ebanks	7b702b086f	You don't need to be bi-allelic to have a non-ref alt allele frequency, but you do have to be a variant. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2495 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-03 22:02:39 +00:00
asivache	a41cb0701b	Now can generate verbose String representation of deletions (e.g. "-AAT") if reference bases are provided as an argument to getEventStringWithCounts(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2488 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 21:54:50 +00:00
asivache	89791d730e	Compute and cache the length of the longest deletion observed at the site; ReadBackedExtendedEventPileup now has a getter to access that value. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2487 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 21:19:39 +00:00
rpoplin	80658fd99e	AnalyzeCovariates gets the same performance improvements as the recalibrator. NHashMap class is removed completely. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2483 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 18:10:10 +00:00
rpoplin	9b2733a54a	Misc clean up in the recalibrator related to the nested hash map implementation. CountCovariates no longer creates the full flattened set of keys and iterates over them. The output csv file is in sorted order by default now but there is a new option -unsorted which can be used to save a little bit of run time. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2482 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 16:58:04 +00:00
asivache	8330058216	method added: getEventStringsWithCounts() Returns list of Pairs <String,Integer>, where each pair consists of a unique indel event observed at the site and the total number of observations of that event. String representation for insertions is verbose (e.g. +ACT), while deletions are represented as "5D" (since read backed pileup has no reference information, so we can not get actual sequence of deleted bases) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2479 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 22:41:58 +00:00
asivache	cf3e59eb4a	back to archive git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2478 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 22:00:38 +00:00
asivache	295d16572e	synch; will go back to archive in a sec git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2477 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 22:00:03 +00:00
rpoplin	96c4929b3c	Recalibrator now uses NestedHashMap instead of NHashMap. The keys are now nested hash maps instead of Lists of Comparables. These results in a big speed up (thanks Tim!). There is still a little bit of clean up to do, but everything works now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2474 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 21:01:32 +00:00
asivache	f445745c56	Pileup element and corresponding container class tweaked for representing pileups of extended events (indels) at a given locus. There's some redundancy with PileupElement and ReadBackedPileup (should we rename them to BasePileupElement and ReadBackedBasePileup?), so that abstracting a basic interface/abstract base from these classes can be considered in the future git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2469 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 20:03:39 +00:00
depristo	87e863b48d	Removed unused routines in duputils; duplicatequals to archive; docs for new duplicate traversal code; general code cleanup; bug fixes for combineduplicates; integration tests for combine duplicates walker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2468 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 19:46:29 +00:00
ebanks	5fdf17fccb	Removed the VCF "NS" annotation (which wasn't working for pooled calls anyways) since it's ambiguous and not useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2465 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 17:30:47 +00:00
hanna	e32174fbc4	UnifiedGenotyper now works without -varout or -vf set. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2464 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 16:46:24 +00:00
ebanks	aeb34758e6	Adding a validation stringency to the VCF writers (which defaults to STRICT). If set to SILENT, it will not throw an exception for (reasonable) off-spec requests but will instead ignore such requests and silently move on. This change allows the pooled calculation model to work correctly with multiple threads. Boys, the Genotyper is now officially parallelized. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2462 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 15:33:53 +00:00
depristo	fcc80e8632	Completely rewritten duplicate traversal, more free of bugs, with integration tests for count duplicates walker validated on a TCGA hybrid capture lane. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2458 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 23:56:49 +00:00
rpoplin	92e3682991	Moved NHashMap to sting/utils git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2452 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 20:57:32 +00:00
ebanks	b1ac4b81d5	Optimization: look up diploid genotypes from a static matrix instead of creating them on the fly (with String.format); bases no longer need to be ordered appropriately git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2448 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 17:28:51 +00:00
ebanks	d2770f380c	Writing calls to standard out now works again (it got broken when we introduced parallelization) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2446 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-27 04:36:45 +00:00
ebanks	0571d9dcb9	Point MAX_QUAL_SCORE to SAMUtils.MAX_PHRED_SCORE. Also, array size for caches should be max score + 1. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2444 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-24 20:47:32 +00:00
aaron	b134e0052f	added changes to the code to allow different types of interval merging, 1: all overlapping and abutting intervals merged (ALL), 2: just overlapping, not abutting intervals (OVERLAPPING_ONLY), 3: no merging (NONE). This option is not currently allowed, it will throw an exception. Once we're more certain that unmerged lists are going to work in all cases in the GATK, we'll enable that. The command line option is --interval_merging or -im git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2437 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:59:14 +00:00
alecw	159778416c	In TableRecalibrationWalker, update UQ tag if it was present in the original SAMRecord. This required a new sam.jar, which caused some other files to need to be changed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2435 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:42:36 +00:00
hanna	0d890e1bf0	Rework Eric's output management code given that the behavior of the UG changes drastically depending on its output format. Current implementation is probably a bit overkill-ish and we can whittle this down to what's absolutely necessary. Writing VCFs to the 'out' protected printstream may not work at this moment. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2425 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-22 00:33:43 +00:00
ebanks	cf303810d3	VCF reader now creates the correct type of header line for each header type git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2423 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 20:39:06 +00:00
hanna	b780ffb34a	Add a getFormat() method to get the output format from the writer. The need for this call suggests that I may be thinking about the typing of the GenotypeWriter object the wrong way. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2418 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 01:46:26 +00:00
hanna	11cbfcec9c	Get rid of backlink from ArgumentDefinitions to ArgumentSources. This will help in the future with multiple source -> single definition mapping sets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2417 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 00:39:36 +00:00
aaron	7e0f69dab5	Changed the GLF record to store its contig name and position in each record instead of in the Reader. Integration tests all stay the same. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2410 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 22:54:56 +00:00
ebanks	4ea31fd949	Pushed header initialization out of the GenotypeWriter constructors and into a writeHeader method, in preparation for parallelization. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2406 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 19:16:41 +00:00
ebanks	eeddf0d08e	Adding sample utils for convenience methods to pull out samples from e.g. SAMFileHeader or Genotype objects git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2405 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 18:51:21 +00:00
ebanks	4f59bfd513	Updates to the various GenotypeWriters to make them do simple things like write records (plus allow GLFReader to close). Adding first pass of stub and storage classes for the GenotypeWriters so that UG can be parallelizable. Not hooked up yet, so UG is unchanged. The mergeInto() code in the storage class is ugly, but it's all Tribble's fault. We can clean it up later if this whole thing works. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2400 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 07:20:23 +00:00
ebanks	94f5edb68a	1. Fixed VCFGenotypeRecord bug (it needs to emit fields in the order specified by the GenotypeFormatString) 2. isNoCall() added to Genotype interface so that we can distinguish between ref and no calls (all we had before was isVariant()) 3. Added Hardy-Weinberg annotation; still experimental - not working yet so don't use it. 4. Move 'output type' argument out of the UnifiedArgumentCollection and into the UnifiedGenotyper, in preparation for parallelization. 5. Improved some of the UG integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2398 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 04:14:14 +00:00
rpoplin	6fbf77be95	Updating the two solid_recal_mode options to also change the previous base since solid aligner prefers single color mismatch alignments over true SNP alignments. COUNT_AS_MISMATCH mode has been removed completely. The default mode is now SET_Q_ZERO. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2394 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-17 20:07:26 +00:00
ebanks	bb92e31118	Optimizations: 1. push the ReadBackedPileup filtering up into the ReadFilters for read-based filters 2. stop querying the cigar for its length (just do it once) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2381 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-16 21:39:58 +00:00
ebanks	bb312814a2	UG is now officially in the business of making good SNP calls (as opposed to being hyper-aggressive in its calls and expecting the end-user to filter). Bad/suspicious bases/reads (high mismatch rate, low MQ, low BQ, bad mates) are now filtered out by default (and not used for the annotations either), although this can all be turned off. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2373 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-16 17:28:09 +00:00
depristo	0d2a761460	Bugfix for minBaseQuality to ignore deletion reads. LocusMismatch walker now allows us to skip every nth eligible site git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2357 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 14:38:39 +00:00
ebanks	bf7bab754e	Made getPileupWithoutMappingQualityZeroReads() and getPileupWithoutDeletions() more efficient, per Mark's cue. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2356 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 04:35:21 +00:00
ebanks	874552ff75	Pull the genotype (and genotype quality) calculation out of the VCF code and into the Genotyper. [Also, enable Mark's new UG arguments] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2355 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 04:29:28 +00:00
depristo	2cbc85cc7a	min mapping quality and min base quality arguments for UG git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2354 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 03:57:27 +00:00
depristo	1da97ebb85	Walker for calculating non-independent base errors, v1. Will be moved to somewhere not in core git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2352 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 02:40:15 +00:00
chartl	b42fc905e8	Added - new tests (Hapmap was re-added) Modified - Hapmap now takes a -q command to filter out variants by quality Modified - MathUtils - cumBinomialProbLog now uses BigDecimal to handle some numerical imprecisions Modified - PowerBelowFrequency - returns 0.0 if called with a negative number (can't be done from inside the walker itself, but since it's called elsewhere one can't be too careful) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2350 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-14 21:57:20 +00:00
asivache	bd7b07f3f1	added PrimitivePair.Long and a few shortcut utility methods to PrimitivePairs: add(pair), subtract(pair), assignFrom(pair) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2347 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-14 00:15:44 +00:00
ebanks	97618663ef	Refactored and generalized the VCF header info code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2346 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-13 21:02:45 +00:00
ebanks	bd2a46ab4c	I want to move over to hpprojects tonight, so I'm checking in various changes all in one go: 1. Initial code for annotating calls with the base mismatch rate within a reference window (still needs analysis). 2. Move error checking code from rodVCF to VCFRecord. 3. More improvements to SNP Genotype callset concordance. 4. Fixed some comments in Variation/Genotype git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2341 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-13 02:52:18 +00:00
hanna	6955b5bf53	Cleanup of the doc system, and introduce Kiran's concept of a detailed summary below the specific command-line arguments for the walker. Also introduced @help.summary to override summary descriptions if required. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2337 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-12 04:04:37 +00:00
hanna	cdfe204d19	Incorporated feedback from Kiran. Use the Javadoc first sentence extraction capability to just show the first sentence from each line of Javadoc. @help.description can still be used to produce exceptionally verbose descriptions. Also increased the line width as much as I could tolerate (100 characters -> 120 characters). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2336 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-11 21:59:55 +00:00
aaron	09811b9f34	Now that we always output the VCF header, make sure that we correctly handle the situation where there are no records in the file. Added unit tests as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2333 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-11 19:51:05 +00:00
depristo	8f7554d44f	A few improvements to pooled concordance calculations. Now will show you FN with the -V option. BasicGenotype now prints out a reasonable representation with toString git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2320 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 23:09:10 +00:00
ebanks	2869270c11	Fixed deletion depth calculation plus mis-spelling in ReadBackedPileup method. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2315 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 21:11:42 +00:00
hanna	5eac510b2f	Refactor the code I gave Eric yesterday to output command line arguments. Convert it from a completely wonky solution to a slightly less wonky solution that will work in more cases. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2310 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 18:57:54 +00:00
ebanks	a45adadf1f	VCFGenotypeRecord already defines all the methods needed to be SampleBacked, so let's annotate it as being SampleBacked. This way, when used as a generic Genotype, sample data can be retrieved. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2305 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 04:16:21 +00:00
ebanks	4e54b91ce4	UG now outputs the FORMAT header fields when there's genotype data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2294 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-09 16:31:07 +00:00
ebanks	7a76e13459	Better explanation in the exception being thrown. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2291 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-09 03:59:36 +00:00
ebanks	717eb1de96	- Depth annotation now includes MQ0 reads - Removed MQ0 annotation - Updated RMS MQ annotation to use new pileup - UG now outputs all of its arguments as key/value pairs in the header (for VCF) - Cleaned up VCFGenotypeWriterAdapter interface a bit git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2288 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-09 02:53:00 +00:00
ebanks	e8822a3fb4	Stage 3 of Variation refactoring: We are now VCF3.3 compliant. (Only a few more stages left. Sigh.) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2287 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-08 21:43:28 +00:00
hanna	9e2f831206	A bit of cleanup in preparation for Picard patch. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2286 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-08 16:09:04 +00:00
hanna	d3b78338da	Get rid of characters in the docs that aren't universally compatible with character sets used throughout the group. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2285 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-07 21:41:07 +00:00
hanna	d75d3a361a	Clean up some of the walker help output based on additional experience and feedback received. Also, add a flag to build.xml to disable generation of docs on demand (use ant -Ddisable.doc=true to disable docs). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2284 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-07 21:33:11 +00:00
hanna	a3e88c0b1c	Cleanup results of bad merge. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2281 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-07 19:30:49 +00:00
hanna	10be5a5de9	Move some files around to reflect our growing help infrastructure. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2280 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-07 19:23:12 +00:00
rpoplin	1d5b9883db	Added --solid_recal_mode argument to experiment with different ways of dealing with solid reference bias. Currently the default option is DO_NOTHING which means use the same behavior as the old recalibrator. Eventually the new methods in RecalDataManager will be moved over to a SolidUtils class. Added transition and transversion methods to BaseUtils that work like simpleComplement, used with the color space in my solid methods. Also, initial check-in of HomopolymerCovariate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2276 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-07 14:26:27 +00:00
hanna	8089aa3c50	Adding support to override the help text. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2273 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-07 00:16:26 +00:00
ebanks	c0528cd88e	Updated the CallsetConcordance classes to use new VCF Variation code... and uncovered a whole bunch of VCF bugs in the process. I'm not convinced that I got them all, so I'll unit test like crazy when the refactoring is done. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2272 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-06 11:43:40 +00:00
ebanks	b6f8e33f4c	Stage 2 of Variation refactoring: VCFRecord now implements Variation, VCFGenotypeRecord now implements Genotype. Because of this change, RodVCF is now just a wrapper around the VCFRecord and does nothing else. Also, one can call toVariation on the VCFGenotypeRecord and it returns the VCFRecord. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2271 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-06 06:48:03 +00:00
hanna	3b440e0dbc	Add a taglet to allow users to override the display name in command-line help. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2270 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-06 04:12:10 +00:00
ebanks	08f2214f14	Stage 1 of massive Variation/Genotype refactoring. This stage consists only of the code originating in the Genotyper and flowing through to the genotype writers. I haven't finished refactoring the writers and haven't even touched the readers at all. The major changes here are that 1. Variations which are BackedByGenotypes are now correctly associated with those Genotypes 2. Genotypes which have an associated Variation can actually be associated with it (and then return it when toVariation() is called). The only integration tests which need to be updated are MSG-related (because the refactoring now made it easy for me to prevent MSG from emitting tri-allelic sites). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2269 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-06 03:12:41 +00:00
hanna	b04de77952	First pass at a reorganized walker info display. Groups walkers by package and displays walker data extracted from the JavaDoc. Needs a bit of help, both in content and flexibility of package naming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2267 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-04 23:24:29 +00:00
depristo	07b88621c5	Improved RankSum calculations and RankSum annotation. Much more meaningful git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2266 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-04 22:16:40 +00:00
hanna	4c147329a9	Turn javadoc comments for packages and classes into key/value pairs in a properties file. Embed the properties file in GenomeAnalysisTK.jar. Still no support for actually displaying the archived javadoc. Also change the approach to providing package javadocs: retired the deprecated package.html file in favor of Java1.5-style package-info.java. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2263 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-04 20:08:41 +00:00
ebanks	b05e73a914	Finished implementation of the Wilcoxon Rank Sum Test thanks to Tim Fennell (calculating the normal approximation) and Nick Patterson (dithering to break tie bands). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2255 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-04 04:04:39 +00:00
ebanks	9da5cc25ad	More archiving (with permission from Andrey) plus a move to core. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2242 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-03 15:40:27 +00:00
aaron	b3bdcd0e60	make sure we close the error log stream in CommandLineProgram if it's opened; unit tests and clean-up for BasicVariation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2241 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-03 06:59:27 +00:00
ebanks	2c83f2f2bc	Move MSG - plus now obsolete classes which it depends on -- to oneoffprojects (with permission from Jared). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2224 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 20:04:22 +00:00
ebanks	2838629724	-VCF writer now checks whether the allele frequency has been set before trying to write it out. -Renamed methods to be more consistent. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2214 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 16:25:32 +00:00
depristo	6231637615	fixes for VariantAnnotations and second bases. Misc. removal of failing (and unstable) integration tests that require rereview git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2213 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 15:41:35 +00:00
jmaguire	adf8f1f8b3	Add an InputStream constructor, which is immensely useful for various reasons. Also a minor performance optimization. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2201 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 17:25:00 +00:00
ebanks	084337087e	Removing deprecated code and walkers for which I had the green light from repository. Moved piecemealannotator and secondarybases to archive. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2195 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 05:58:20 +00:00
ebanks	7c6c490652	An unfinished implementation of the Wilcoxon rank sum test and a variant annotation that uses it. I need to merge and update this code with Tim's implementation somehow - but that won't happen until later this week, so I'm committing this before I accidentally blow it away. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2193 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 04:56:17 +00:00
ebanks	00f15ea909	Improved performance of deletion-free pileup and added mapping-quality-zero-free pileup convenience method. Finished converting genotyper and annotator code to new ReadBackedPileup system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2192 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 04:50:47 +00:00
depristo	e793e62fc9	minor code cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2189 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-30 20:57:20 +00:00
ebanks	add2fa7ab4	more use of new ReadBackedPileup optimizations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2187 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-30 20:04:01 +00:00
ebanks	a184d28ce9	Completing the optimization started by Matt: we now wrap SAMRecords and SAMReadGroupRecords with our own versions which cache oft-used variables (e.g. platform, readString, strand flag). All walkers automagically get this speedup since the wrapping occurs in the engine. I note that all integration/unit tests pass except for BaseTransitionTableCalculatorJava, which is already broken. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2182 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-30 17:39:29 +00:00
depristo	75b61a3663	Updated, optimized ReadBackedPileup. Updated test that was breaking the build -- it created a pileup from reads without bases... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2169 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 23:30:39 +00:00
depristo	db40e28e54	ReadBackedPileup in all its glory. Documented, aligned with the output of LocusIteratorByState, and caching common outputs for performance git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2165 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 20:54:44 +00:00
depristo	03342c1fdd	Restructuring and interface change to ReadBackedPileup. We no longer support the Pileup interface, the BasicPileup static methods, and the ReadBackedPileup class. Now everything is a ReadBackedPileup and all methods to manipulate pileups are off of it. Also provides the recommended iterable() interface of pileup elements so you can use the syntax for (PileupElement p : pileup) and access directly from p.getBase() and p.getQual() and p.getSecondBase(). Only a few straggler walkers use the old style interface -- but those walkers will be retired soon. Documentation coming in the AM. Please everyone use the new syntax, it's safer, and will be more efficient as soon as the LocusIteratorByState directly emits the ReadBackedPileup for the Alignment context, as opposed to the current interface. In the process of the change over, discovered several bugs in the second-best base code due to things getting out of sync, but these changes were resolved manually. All other integrationtests passed without modification. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2154 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 03:51:41 +00:00
ebanks	3484f652e7	1. Variation is now passed to VariantAnnotator along with the List of Genotypes so non-genotype calls have access to all relevant info. 2. Killed OnOffGenoype 3. SpanningDeletions is now SpanningDeletionFraction git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2151 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 21:47:20 +00:00
ebanks	e05cb346f3	GenotypeLocusData now extends Variation. Also, Variations should be INSERTIONs or DELETIONs (and not just INDELs). Technically, VCF records can be indels now. More changes coming git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2150 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 21:07:55 +00:00
aaron	8fbc0c8473	fix for bug GSA-234: fasta index files couldn't handle anything but letters, numbers, or spaces in the contig name git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2147 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 19:19:47 +00:00
ebanks	b3f561710f	Optimizations: 1. Only do calculations in UG for alternate allele with highest sum of quality scores (note that this also constitutes a bug fix for a precision problem we were having). 2. Avoid using Strings in DiploidGenotype when we can (it was taking 1.5% of my compute according to JProfiler) UG now runs in half the time for JOINT_ESTIMATE model. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2141 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 16:27:39 +00:00
ebanks	cb6d6f2686	Very minor performance improvements git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2137 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 05:21:07 +00:00
ebanks	c90bea39a1	read.getReadString().charAt(offset) --> read.getReadBases()[offset] [As a courtesy I fixed all instances once I was updating GenotypeLikelihoods] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2136 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 04:25:19 +00:00
ebanks	be6a549e7b	Added the capability to allow expressions in an integration test command (i.e. -filter 'foo') by escaping them in the command. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2132 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 02:34:48 +00:00
ebanks	dfe7d69471	1. VCF: don't print slod if it's never set 2. UG: don't print slod if lods are infinite (todo: figure out a good guess instead) 3. UG: if probF=0 for 2 alt alleles are both 0 (because of precision), use log values to discriminate git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2116 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 02:55:43 +00:00
ebanks	04d6ac940c	Always print out VCF header - not just when there is genotype data present. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2114 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 01:44:10 +00:00
ebanks	bf935a6ab1	1. Fixed bug in PrimaryBaseSecondaryBaseSymmetry code (not checking for null before trying to access object's methods) which was causing Integration Tests to fail. 2. Retired allele frequency range from UG, which wasn't very useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2113 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 01:31:48 +00:00
aaron	33dcfc858d	updates to the paper genotyper based on Mark's comments. There's still more work to do, including more testing. Also a 250% improvement in the getBases() and getQuals() of BasicPileup, which was nearly all of the runtime for the genotyper (using primitives instead of objects when possible). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2097 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 23:06:49 +00:00
aaron	6ba1f3321d	Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord. Updated the integration tests that were failing due to different ordering of genotyping entries in VCF, I'll check in the VCF diff tool I wrote when I get a cycle or two. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 18:17:47 +00:00
ebanks	a70cf2b763	A bunch of changes needed to make outputting pooled calls possible git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2073 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 18:42:57 +00:00
ebanks	0a35c8e0ba	1. The joint estimation model now constrains genotypes to be AA,AB,or BB only (i.e. to use a single alternate allele). Note that this doesn't work for the old models (point estimate or SSG) because calculations aren't divided by alternate allele. 2. Allele frequency spectrum is not emitted for single samples (since it doesn't make sense). 3. If in pooled mode, throw an exception if pool size isn't set appropriately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2072 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:43:15 +00:00
depristo	6fe1c337ff	Pileup cleanup; pooled caller v1 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:03:48 +00:00
chartl	43bd4c8e8f	Ignoring deletions in the primary pileup by default was causing the primary pileup to become shorter than the secondary pileup when building up the secondary base pileup string. This fix makes sure to include the primary Ds within the pileup so that not only are the pileups guaranteed to be the same size, the same offsets will truly correspond with the same read. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2058 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 17:20:13 +00:00
aaron	aece7fa4c7	a convenience method to join a map into a single string, which I need for some VCF work. Added some documentation to the join method as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2057 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 16:50:01 +00:00
ebanks	4558375575	Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated. VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper. UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it). This is a fairly all-encompassing check in. It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout. All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing. Stage 2 of the process will happen later this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 02:41:20 +00:00
depristo	cff31f2d06	comments for eric git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2035 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 14:19:31 +00:00
aaron	234bb71747	changed the toVariation() method to take a reference base, instead of using the reference base loaded from the underlying data source (if it was reference aware). Also changed some isVariant() methods which weren't using the passed in ref base. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2034 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 06:54:38 +00:00
ebanks	555fb975de	1. Print out allele frequency range (from joint estimation model only). 2. Don't print verbose output from SLOD calculation (it's just a repeat of previous output). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2032 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 03:59:13 +00:00
ebanks	61b5fb82ce	2 major changes: 1. Add dbsnp RS ID to VCF output from genotyper; to do this I needed to fix the dbsnp rod which did not correctly return this value. 2. Remove AlleleBalanceBacked and instead generalize the arbitrary info fields backing VCFs (and potentially others) in preparation for refactoring VariantFiltration next week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2028 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 22:51:49 +00:00
mmelgar	3742a05760	Now can read E2 or SQ tag. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2027 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 15:18:21 +00:00
aaron	c3c001e02e	cleanup of the traversal output code git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2026 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 06:18:10 +00:00
ebanks	697d7e02c8	Remove the lazy initialize functionality. When no calls are made by the genotyper, we still want a vcf file to be output with valid header. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2024 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 02:14:50 +00:00
depristo	6c9f86bb4d	Removed unnecessary output and added debugging print() routine git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2020 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-11 18:37:36 +00:00
hanna	2cf9670d1e	Allow users to directly specify filters from the command-line, applicable to any walker. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2012 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 18:40:16 +00:00
ebanks	6a37090529	Output changes for VCF and UG: 1. Don't cap q-scores at 99 2. Scale SLOD to allow more resolution in the output 3. UG outputs weighted allele balance (AB) and on-off genotype (OO) info fields for het genotype calls (works for joint estimation model and SSG) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2011 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 16:31:31 +00:00
depristo	7e30fe230a	oops, missing file git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2009 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 13:25:18 +00:00
aaron	2ed423ed56	print the current location in read walkers (in addition to the number of reads processed), along with some refactoring to support the change. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2006 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 05:57:01 +00:00
ebanks	c9c3cf477a	Based on feedback from Kiran, we now uniquify sample names as sample.rodName (instead of sample.1, sample.2, ...) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2005 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 02:41:37 +00:00
ebanks	3793519bd4	-Added convenience method to VCF record to tell if it's a no call and have rodVCF use it before querying for info fields -Don't restrict info fields to 2-letter keys [about to move these to core] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2002 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 20:52:51 +00:00
ebanks	74751a8ed3	-Some minor fixes to get accurate vcf record merging done -Improvement to snp genotype concordance test. And with that, it looks like I get revision #2000. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2000 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 06:40:55 +00:00
ebanks	bc6f24e88f	Added VCFUtils which contains some useful VCF-related functions (e.g. ability to merge VCF records). Also, various minor improvements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1998 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:53:32 +00:00
ebanks	cff645e98b	convenience method to deal with genotypes that are unsorted (e.g. CA vs. AC) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1997 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 04:45:49 +00:00
ebanks	6fdfc97db6	Added optional field DP to VCF output for Mark. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1981 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-06 20:03:22 +00:00
depristo	5d5dc989e7	improvements to VCF and variant eval support of VCF -- now listens to the filter field git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1963 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 12:09:30 +00:00
ebanks	3a33401822	2nd stage of the genotyper output refactoring is complete. Now, all output is generalized and all of the intelligence lies where it is supposed to. Next stage is syncing up old and new models and making sure we're outputting exactly what we should. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1960 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-02 22:43:08 +00:00

937 Commits (96fe540d667328459e2e4fab74ffffe6ce2f39d6)