gatk-3.8

Commit Graph

Author	SHA1	Message	Date
hanna	2953c9f069	Efficiency improvement requested by the Picard team in IndexedFastaSequenceFile: improve the memory efficiency (and loading time) of long reference sequences by better controlling the input buffer size. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3665 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-29 07:22:07 +00:00
delangel	ed71e53dd4	1) Initial complete version of VCF4 writer. There are still issues (see below) but at least this version is fully functional. It incorporates getting rid of intermediate VCFRecord so we now operate from VariantContext objects directly to VCF 4.0 output. See VCF4WriterTestWalker for usage example: it just amounts to adding vcfWriter.add(vc,ref.getBases()) in walker. add() method in VCFWriter is polymorphic and can also take a VCFRecord, although eventually this should be obsolete. addRecord is still supported so all backward compatibility is maintained. Resulting VCF4.0 are still not perfect, so additional changes are in progress. Specifically: a) INFO codes of length 0 (e.g. HM, DB) are not emitted correctly (they should emit just "HM" but now they emit "HM=1"). b) Genotype values that are specified as Integer in header are ignored in type and are printed out as Doubles. Both issues should be corrected with better header parsing. 2) Check in ability of Beagle to mask an additional percentage of genotype likelihoods (0 by default), for testing purposes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3664 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 23:54:38 +00:00
hanna	3a9d426ca8	Added hasPileupBeenDownsampled() boolean to ReadBackedPileup, so that a pileup can report whether or not (but not how much) it's been downsampled. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3649 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 04:56:33 +00:00
hanna	003dd4de3e	Rev Picard with performance enhancements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3615 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 22:54:23 +00:00
bthomas	9d6a341d15	Fixing the error messages thrown with bad interval arguments. I simplified the exception handling and made the messages more verbose. Note: the -L argument takes both interval strings and filenames. If you specify an interval string that is also a file, an error will be thrown to move the file: i.e. if you have a file "chr1" in the parent directory, GATK will ask you to move/delete it. But, this only happens with interval string arguments, NOT with intervals that are contained in files, which is the majority of use cases. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3602 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 21:49:41 +00:00
bthomas	300a18b85f	Updating the way reference data is processed, so GATK creates the .fasta.fai and .dict files automatically. If either (or both) don't exist, GATK will create them in the same folder as the fasta file. If it can't write the file, GATK will fail with a message to create them manually. Note that this functionality will only work if the directory with the fasta is writeable. GATK will fail if the directory is read-only and either the .fasta.fai or .dict files don't exist. In the future, we could have these references be created in memory, but we decided against it this time. Locking was also added to ReferenceDataSource so no issues come up while running multiple GATKs on the same reference: we don't want one process to be half-finished and another try to read it. So, you could see error messages related to locking. See ReferenceDataSource.java for explanation of the locking strategy. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3601 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-21 21:42:42 +00:00
hanna	1d50fc7087	Misc bug fixes: fix tracking of nInsertions with sample-split pileup constructor. Fix performance issue building up pileups from pileups of individual sample data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3598 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-20 20:32:27 +00:00
hanna	f18ac069e2	A refactoring / unification of ReadBackedPileup and ReadBackedExtendedEventPileup. Provides a cleaner interface with extended events inheriting all of the basic RBP functionality. Implementation is still slightly messy, but should allow users to provide separate implementations of methods for sample split pileups and unsplit pileups for efficiency's sake. Methods not covered by unit/integration tests have not been sufficiently tested yet. Unit tests will follow this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3597 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-20 04:42:26 +00:00
hanna	52477bd9e6	Add some missing methods to the pileup architecture. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3588 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 15:03:08 +00:00
aaron	b978d5946b	adding changes for VCF 4, mostly in the way we handle VCF headers. The header fields are now aware of the differences between different VCF formats. There was also a bunch of clean-up of out-of-spec VCF used in the tests (mismatched VCF file format fields, etc), and updates to the associated integration tests. Also some logging statements for BTI. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3584 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 08:23:23 +00:00
hanna	612c3fdd9d	First pass at eliminating the old sharding system. Classes required for the original sharding system are gone where I could identify them, but hierarchies that split to support two sharding systems have not yet been taken apart. @Eric: ~4k lines. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 20:17:31 +00:00
hanna	db1383d0b2	Rev the latest version of Picard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3575 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 23:55:07 +00:00
ebanks	f003703912	Allow specification of particular rods for pulling out sample names. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3570 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 19:37:09 +00:00
asivache	671ac00748	A simple utility class that implements a merging Iterator<GenomeLoc> built over an interval or bed file (this is NOT a rod, but rather a direct line-by-line file reader that converts strings to genome locs on the fly and merges overlapping intervals) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3546 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:54:37 +00:00
asivache	7b7d3341f0	trivial refactoring: isFile renamed to isIntervalFile and made public git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3541 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 14:02:23 +00:00
hanna	c3b68cc58d	Rethinking DownsamplingLocusIteratorByState with a flattened read structure. Samples are kept independent while processing, and only merged back in a priority queue if necessary in a special variant of the ReadBackedPileup. This code is not live yet except in the case of naive deduping. Downsampling by sample temporarily disabled, and the ReadBackedPileup variant is sketchy and not well integrated with StratifiedAlignmentContext or the walkers. Cleanup to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3540 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-13 01:47:02 +00:00
asivache	e6d8faf293	making 'parseLocation' public static - as simple as the logic is, it's better kept in one place and I need it! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3537 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-11 18:19:59 +00:00
weisburd	3b375cb237	Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..) - attempt 2 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3529 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 20:54:36 +00:00
bthomas	99b684ea89	Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 20:10:23 +00:00
aaron	6941c81bfa	reverting revision 3522 to the old code until we fix the tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3524 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 19:25:02 +00:00
weisburd	adc4c4e577	Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3522 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 18:11:43 +00:00
aaron	ad98512f6c	adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:21:12 +00:00
ebanks	9b2fcc4711	Refactoring of the annotation system: 1. VA is now a ROD walker so it no longer requires reads (needs a little more testing) 2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs) 3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong. Fixed the headers too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:05:51 +00:00
depristo	e2b41082af	GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integrationtest data. Note that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-02 22:26:32 +00:00
ebanks	4a555827aa	Removing more toUpperCase sanity checks git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3471 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-02 14:38:39 +00:00
depristo	2b02324587	Support for detecting and automatically excluding reads reading into the adaptor sequence and, if desired, also only showing the first pair when two reads overlap in the fragment. Not enabled, an intermediate check in before updating and verifying the impact on locus walkers everywhere. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3465 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-30 18:00:12 +00:00
aaron	871cf0f4f6	Call out ROD types by their record type, instead of the codec type (which was clumsy). So instead of: @Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFCodec.class)) you'd say: @Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFRecord.class)) Which is more in-line with what was done before. All instances in the existing codebase should be switched over. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3457 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-28 14:52:44 +00:00
depristo	f2e7582cfc	Reorganization of SW code for clarity. Total failure at raw optimization. Discovered that ~50% of reads being cleaned were perfect reference matches. New code comes with flag to look at NM field and not clean perfect matches. Can be turned off with command line option (needed for 1KG bams with bad NM fields). Going to rerun cleaning jobs due to accidental rebuilding of stable codebase and loss of 2 days of runtime. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3452 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-27 23:16:00 +00:00
aaron	cded9ec985	adding a command line option, -etd (enable threaded debugging), that uses a custom thread pool class to catch exceptions thrown inside of a thread. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3450 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-27 21:57:56 +00:00
depristo	dfc36c1e95	Restructuring of the mandatory read filters for traversals. Now everything uses ReadFilters, even for the required filters like being mapped for LocusWalkers. Statistics now tracked for each read filter used during the traversal and info emitted in INFO at the end. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3445 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-26 22:12:25 +00:00
depristo	5928047d8b	Optimization of reference window calculation to use bytes not chars and no uppercasing since reference and read bases are always uppercase now. Should remove some ~5% of runtime of UG. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3438 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-26 14:10:26 +00:00
ebanks	ae6c014884	Fixed UG parallelization bug. Better integration test to catch this in the future. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3432 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-25 21:03:45 +00:00
ebanks	772f558ae0	Massive change to the indel realigner code. We now properly deal with soft-clipped reads. Also, improved left-alignment code. Small change for Ryan to get hard-clipped reads working for the recalibrator. PLEASE DO NOT RELEASE THIS WEEK. I still have some more testing to do and need Mark to run WG jobs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3430 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-25 20:04:33 +00:00
depristo	a10fca0d5c	Genotyper now is using bytes not chars. Passes all tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3406 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 21:02:44 +00:00
aaron	b543dd4ac4	more aggressive checks for the locking, and some more documentation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3404 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 16:16:36 +00:00
depristo	727822adb4	BaseUtils has a clearer distinction between byte and char routines. All char routines are @Deprecated now. Please use bytes. Better organization of reverse(), now in Utils not BaseUtils. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3400 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 14:05:13 +00:00
depristo	6ce3835622	Removing unused methods in QualityUtils; ReferenceContext now converting all bases to upper case, but can be disabled with static boolean git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3399 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 12:38:06 +00:00
depristo	5abac5c057	A few more char -> byte cleanups git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3398 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-20 00:02:06 +00:00
depristo	8a725b6c93	Restructuring of ReferenceContext and ReadWalkers to accept a ReferenceContext. Now ReferenceContext is byte[] backed not char[]. Please no more chars for the reference. All of the tests pass now. Coming check-ins are going to clean up the char / byte problems in the GATK git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3397 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-19 23:27:55 +00:00
hanna	017ab6b690	Experimental versions of downsampler and Ryan's deduper are now available either as walker attributes or from the command-line. Not ready yet! Downsampling/deduping works in a general sense, but this approach has not been completely optimized or validated. Use with caution. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3392 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-19 05:40:05 +00:00
weisburd	2f3933148d	Added fast split(str, delimiter) method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3384 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-19 03:37:26 +00:00
aaron	7cfb9ff3dc	updates for Tribble 82, fixes for Ryan's case where multiple processes would attempt to read/write to the same index, and a couple other Tribble-centric bug fixes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3382 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-18 19:34:45 +00:00
hanna	0791beab8f	Checking in downsampling iterator alongside LocusIteratorByState, and removing the reference implementation. Also implemented a heap size monitor that can be used to programmatically report the current heap size. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3367 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-17 21:00:44 +00:00
aaron	2c55ac1374	fixes for parallel processing problems with Tribble, a small bug in the resource pool, and some more documentation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3349 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-12 06:13:26 +00:00
hanna	76efa757f0	Switched over to reviewed version of Picard patch. In process, did some optimization to the IntervalSharder which improved startup time 5-10x when dynamically merging many BAMs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3331 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-08 14:12:22 +00:00
depristo	504103bd15	Misc. additions to correct utilities git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3329 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-07 21:34:18 +00:00
aaron	06ea65e60b	again for JIRA GSA-320 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3319 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-07 03:47:58 +00:00
aaron	ac9b32db88	a bug fix for Kiran; putting JIRA in for better type determination system for the new Tribble tracks so this doesn't happen again. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3318 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-07 03:31:43 +00:00
hanna	4e0019b04f	Repair code that sorts and merges intervals. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3317 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 22:37:25 +00:00
ebanks	0e58fb7cc0	Moved over to be a walker inside the GATK git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3313 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 18:28:03 +00:00
aaron	78409dca0d	turned off the progress output from tribble when making an index, and fixing a case where the index file isn't writable so we instead make the index in memory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3312 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 16:36:58 +00:00
ebanks	bacc507a48	Don't worry about sorting anymore in the liftover tool. That will come later. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3311 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 15:00:30 +00:00
ebanks	2975e3a4e8	picard Intervals don't sort right - switching to GenomeLocs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3308 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-06 03:50:28 +00:00
ebanks	1a99fb9318	First pass at liftover tool. Passing buck over to Aaron... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3306 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 20:38:19 +00:00
aaron	a0d71540df	speed-up for VCF, adding code to the VCF reader to automagically make an index if one doesn't already exist, and a change to the VCF writer unit test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3305 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 20:19:42 +00:00
aaron	6bbcc47b5d	removing some out-of-date RODs and some unused genotype writer formats git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3304 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 19:07:13 +00:00
aaron	a68f3b2e9c	VCF moved over to tribble. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3302 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 17:28:48 +00:00
ebanks	64640d6b17	Complete the switch statement to deal with all possible cigar operators for Kris. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3299 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-05 13:41:05 +00:00
weisburd	8b2ce128b5	Optimized the join(..) method. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3280 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-30 15:55:07 +00:00
aaron	64c5f287c5	fixes for edge-cases when using reflections to find classes outside of the main jar. Will push as a patch to reflections git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3264 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-27 17:46:46 +00:00
aaron	c647153b10	Adding Jama for Ryan. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3262 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-27 14:30:36 +00:00
aaron	f6468f9143	a fix for a bug we've worked around in the reflections package: previously it didn't find classes that weren't in the main jar. Fixed in this version. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3261 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-27 04:49:49 +00:00
ebanks	42bcca1010	Pulling out the left-alignment code for indels so that other walkers can use it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3251 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-23 16:12:34 +00:00
aaron	536f22f3bd	adding VC adaptor for GELI, along with unit tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3243 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-23 05:28:39 +00:00
hanna	32d86cf457	Rev the reservoir downsampler to support partitioning through a functor. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3232 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-21 19:50:26 +00:00
asivache	1373fee278	Because of the ugly VCF format, generic addCall() method of GenotypeWriter interface acquired an additional parameter, explicitly specified reference base (in VCF it's the base immediately before the event in case of indels, so we got to pass it). All implementing classes are modified to accommodate the change. VCFGenotypeWriterAdapter now explicitly uses the passed reference base instead of deriving it from VariantContext (in SNP mode as well!), other writers simply ignore that additional argument. SimpleIndelCalculationModel now WORKS (or rather, it does produce calls :) ) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3228 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-21 18:19:03 +00:00
asivache	6fda78f93f	Always return deleted bases in upper case git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3218 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 19:17:40 +00:00
asivache	52a570637d	Always keep event bases in upper case git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3217 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 19:16:39 +00:00
aaron	80c4f88a72	removing the Variation interface. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3216 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 18:56:45 +00:00
hanna	c1e53d407d	The copyright tag that I copied/pasted from a LaTeX document into IntelliJ had unicode quote characters embedded in it. These characters were invisible inside IntelliJ but cause compile warnings for Ryan and Aaron, who for whatever reason have a different default charset. Fixed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3203 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 15:26:32 +00:00
aaron	b5f6f54968	Almost done removing any trace of the old Variation and Genotype interfaces. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3202 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 14:52:15 +00:00
hanna	1bc26f69e9	An attempt to cleanup the Utils directory. Email to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3198 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-19 23:00:08 +00:00
hanna	c08936d6f4	Added a reservoir downsampler which can sample elements in an iterator uniformly from a stream (see Vitter 1985). Thanks to Eric and Andrey for the pointer. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3197 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-19 20:48:14 +00:00
aaron	e11ca74eb5	removing some outdated ROD classes (PooledEMSNPROD and SangerSNPROD), removing an out-of-date interface (VariantBackedByGenotype), and moving AnalyzeAnnotationWalker over to VariationContext. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3188 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-16 18:59:29 +00:00
asivache	6dc1275cfb	Utility method added: getQualsInCycleOrder(read) - examines the read and returns its quals in the order the machine read them (i.e. always from cycle 1 to cycle N). Simply inverts quals if the read happens to be rc-aligned :) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3183 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-16 00:15:57 +00:00
aaron	e682460c1f	add a fix so that XL arguments won't cancel out -BTI arguments, fixed a bug for Ben where the ROD -> interval list conversion was throwing an exception, and some old code removal. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3174 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-15 16:31:43 +00:00
hanna	8573b0bc6f	Refactoring intervals, separating the process of parsing interval lists, sorting and merging interval lists, and creating RODs from intervals. This gives Doug the ability to keep using our interval list parsing code when sorting intervals on our behalf. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3159 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-13 15:50:38 +00:00
ebanks	3f2455e346	Better error message as suggested by James P git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3141 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-09 05:52:53 +00:00
aaron	12e4f88ca7	a little bit more clean-up git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3122 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-05 20:49:06 +00:00
aaron	df7e7921ce	removing some unused code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3121 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-05 19:30:08 +00:00
bthomas	b4f6f54502	Reorganizing the way interval arguments are processed. Most of the changes occur in GenomeAnalysisEngine.java and GenomeLocParser.java: -- parseIntervalRegion and parseGenomeLocs combined into parseIntervalArguments -- initializeIntervals modified -- some helper functions deprecated for cleanliness. Includes new set of unit tests, GenomeAnalysisEngineTest.java. New restrictions: -- all interval arguments are now checked to be on the reference contig -- all interval files must have one of the following extensions: .picard, .bed, .list, .intervals, .interval_list git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3106 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-01 12:47:48 +00:00
aaron	c3c6e632d1	support for two new VCF header info field value-types, Flag (for fields that are just boolean truths), and Character (for single character info fields). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3105 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-01 03:11:32 +00:00
aaron	3d3d19a6a7	the last-mile commit for Tribble integration. The system is now ready for Tribble to be turned on, as soon as we've removed any dependencies in the ROD code on interfaces that aren't in the Tribble library (i.e. the Variation or Genotype interface on RODs). All of the walkers should be up to date. A caveat: for anyone asking for all of the RODs back from the RefMetaDataTracker (if you're not using the facilities to get the track by name), you'll now be getting back a collection of GATKFeature objects. This object will contain the track name, and a method for getting the underlying object (getUnderlyingObject()), which will be the traditional RodVCF, rodDbSNP, etc. This layer is needed so we can integrate Tribble tracks (which don't natively have names). Calls that ask for RODs by name will still get back the traditional reference ordered data objects (RodVCF, rodDbSNP, etc). Sorry for the inconvenience! More changes to come, but this is by far the largest (and has the greatest effect on end users). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3104 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-31 22:39:56 +00:00
hanna	400684542c	Revisions to take into account finalization of Picard patch: naming changes, better definition of public interfaces. This won't be the last Picard patch, but it should be the last big one. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3096 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-30 19:28:14 +00:00
hanna	85037ab13f	Fix for Kiran's sharding issue (Invalid GZIP header). General cleanup of Picard patch, including move of some of the Picard private classes we use to Picard public. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3087 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-29 03:21:27 +00:00
depristo	b8ab74a6dc	Minor useful changes to BaseUtils and MathUtils to support a new haplotype score annotation that determines the two most likely haplotypes over an interval and scores variants by their consistency with a diploid model. Appears to be useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3085 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-28 21:45:22 +00:00
ebanks	47e30aba92	Rods for reads hooked up into the cleaner git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3070 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-24 18:17:56 +00:00
ebanks	49117819f5	For the cleaner to clean, it must beat the entropy produced by the aligner (and not just the raw reads). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3068 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-24 15:21:58 +00:00
aaron	a69b8555dd	Geli to variant context. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3063 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-23 06:45:29 +00:00
aaron	eafdd047f7	GLF to variant context. Added some methods in GLF to aid testing; and added a test that reads GLF, converts to VC, writes GLF and reads back to compare. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3062 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-23 03:43:25 +00:00
hanna	3767adb0bb	Processing intervals as they stream in means much lower memory usage and quicker runtime. Making change as minimal as possible to avoid conflicts with BT's incoming patch. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3061 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-22 22:04:45 +00:00
ebanks	0097106938	VariantFiltration can now filter specific samples. This is NOT an ideal implementation. One day when we have lots of free time (or a greater desire), we will implement this correctly and sophisticatedly using all the power of JEXL. For now, though, this will have to do. Docs coming tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3060 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-22 20:45:11 +00:00
depristo	076d21d394	Minor bug workaround in GenotypeConcordance module (see todo). General platform read filter. You can say -rl Platform illumina to remove all SLX reads git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3054 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-22 02:47:09 +00:00
ebanks	c88a2a3027	Fixing/cleaning up the vcf merge util git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3047 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 15:13:32 +00:00
depristo	56092a0fc2	Slight cleanup for mathutils git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3042 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 13:18:08 +00:00
ebanks	03480c955c	And now the UnifiedGenotyper can officially annotate genotype (FORMAT) fields too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3039 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 04:58:37 +00:00
ebanks	e757f6f078	Missing value for arbitrary format entries is empty string (need to revisit at some point, but it will require updating the VCF spec). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3038 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 03:56:27 +00:00
ebanks	0311980668	The VariantAnnotator can now officially annotate genotype (FORMAT) fields. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3037 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 03:30:14 +00:00
ebanks	ee0e833616	Some significant changes to the annotator: 1. Annotations can now be "decorated" with any arbitrary interface description - not just standard or experimental. 2. Users can now not only specify specific annotations to use, but also the interface names from #1. Any number of them can be specified, e.g. -G Standard -G Experimental -A RankSumTest. 3. These same arguments can be used with the Unified Genotyper for when it calls into the Annotator. 4. There are now two types of annotations: those that are applied to the INFO field and those that are applied to specific genotypes (the FORMAT field) in the VCF (however, I haven't implemented any of these latter annotations just yet; coming soon). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3029 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-18 05:38:32 +00:00
rpoplin	58a31bab6a	Variant optimizer now outputs VCF files via ApplyVariantClustersWalker. Documentation to be added to the wiki. It is ready to be used by other people but only with great caution. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3028 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 20:41:42 +00:00
hanna	d9398dc347	Remove some of the restrictions on getStart() and getStop(); getStart() and getStop() now do the minimum validation rather than the more rigorous only-within-the-contig-bounds header validation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3027 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 19:39:30 +00:00
ebanks	ded4ba8966	Let's make artificial reads that actually adhere to the specs... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3022 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 16:51:42 +00:00
bthomas	5b34bb9ab0	Adding three minor new features: + -L all now walks over all intervals + if a -L argument is passed with a .list extension, and file does not exist, returns a File Not Found error instead of "bad interval" error. We plan to soon revisit interval lists and generate a concrete list of filenames, so this is likely temporary. + Error is thrown if the start position on an interval is higher number than the end position. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3021 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 16:24:10 +00:00
ebanks	4340601c26	-Pushed base quals back down into SAMRecord; if -OQ is used, the SAMRecord quals get updated automatically -Better integration test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3020 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 16:00:10 +00:00
ebanks	1fd909cdaf	Fix for Kiran: -1 is a valid value for genotype qualities in VCF, so VariantContext shouldn't die. Cleaned up the relevant VCF code while I was in there. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3015 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-17 00:20:15 +00:00
ebanks	586f87fa35	Quick fix git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3007 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-16 02:59:26 +00:00
ebanks	202231141c	-Push the --use_original_qualities argument into the engine. -Check that base and qual strings are the same lengths -Fix one more bug in the clipper. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3006 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-16 02:06:11 +00:00
ebanks	411d25c8d1	-Integration tests for walkers that use original quals. -framework for pushing -OQ into GATK (not done) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3004 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-15 18:46:31 +00:00
kcibul	9f519af06d	new method to filter out overlapping PE reads git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3002 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-15 15:40:09 +00:00
depristo	4dd7c5972c	Unit tests for -XL arguments; expt. annotation calculating the GC content within 100 bp of the current SNP git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2997 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-14 21:08:14 +00:00
aaron	ecb59f5d0d	removed old tests and old code git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2995 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 22:57:01 +00:00
depristo	e7eae9b61d	High performance, correct implementation of -XL exclusion lists. Enjoy. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2994 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 22:39:20 +00:00
aaron	88a48821ea	removed the dependence on removeRegion() in GenomeLocSortedSet git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2993 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 22:35:49 +00:00
aaron	1eb5f97255	fixed dropping single base intervals from deleteRegion, moving onto performance fixes. (stop - start is length-1 on closed intervals, so we need to check greater than OR equals to zero) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2990 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-12 19:14:21 +00:00
hanna	a7ba88e649	Rework the way the MicroScheduler handles locus shards to handle intervals that span shards with less memory consumption. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2981 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-11 18:40:31 +00:00
aaron	dde9fd8a15	some rods-for-reads cleaning and performance improvements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2979 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-10 22:54:58 +00:00
depristo	486bef9318	Support for validationRate calculation in variant eval 2; better error messages for failed genome loc parsing; tolerance to odd whitespace in plinkrod, and fix for monomorphic sites in vcf2variantcontext. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2976 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-10 16:25:16 +00:00
ebanks	c85ed1ce90	Plumbing is now in place to emit indel calls from the UnifiedGenotyper. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2975 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-10 04:30:12 +00:00
ebanks	5a20bf0e64	3 changes to UG which break integration tests: 1. emit AA,AB,BB likelihoods in the FORMAT field for Mark 2. remove constraint that genotype alleles (in the GT field) need to be lexicographically sorted. 3. Add bam file(s) used by genotyper to header for Kiran git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2963 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-09 17:16:47 +00:00
ebanks	9f3b99c11b	Moving UnifiedGenotyper and VariantAnnotator over to VariantContext system. Removing obsolete genotyping classes. First stage of removing dependence on old Genotype class. More changes to come. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2960 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-09 03:41:07 +00:00
hanna	1ef1091f7c	Cleanup and simplification of read interval sharding. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2944 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-05 23:34:38 +00:00
ebanks	0dd65461a1	Various improvements to plink, variant context, and VCF code. We almost completely support indels. Not yet done with plink stuff. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2926 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-04 17:58:01 +00:00
chartl	6759acbdef	Coverage statistics now fully implements DepthOfCoverage functionality, including the ability to print base counts. Minor changes to BaseUtils to support 'N' and 'D' characters. PickSequenomProbes now has the option to not print the whole window as part of the probe name (e.g. you just see PROJECT_NAME\|CHR_POS and not PROJECT_NAME\|CHR_POS_CHR_PROBESTART-PROBEND). Full integration tests for CoverageStatistics are forthcoming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2924 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-04 15:00:02 +00:00
aaron	ca2cd9d4f5	a little clean-up: move setting the bases of generated reads into Artificial SAM Utils now that the clean read injector test is gone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2919 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-03 16:31:45 +00:00
aaron	790d2a7776	adding the initial ROD for Reads support; more convenience methods in ReadMetaDataTracker to come. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2918 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-03 15:56:44 +00:00
ebanks	0e9a6826b0	Update to VCF code to get it up to spec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2917 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-03 06:12:42 +00:00
ebanks	5f3c80d9aa	1. To make indel calls, we need to get rid of the SNP-centricity of our code. First step is to have the reference be a String, not a char in the Genotype. Note that this is just a temporary patch until the genotype code is ported over to use VariantContext. 2. Significant refactoring of Plink code to work in the rods and use VariantContext. More coming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2913 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-02 20:26:40 +00:00
kcibul	7578678f99	refactored to provide a sum of mismatch quality scores capability as well (used by Cancer) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2911 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-02 16:40:03 +00:00
aaron	246fa28386	RODs for reads phase 2: modified RODRecordList to implement List<ReferenceOrderedDatum> so I could stub it out for testing, added a FlashBackIterator which is needed to prevent the ResourcePool from opening infinity+1 iterators, and some other interfaces to make unit testing much smoother. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2892 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-25 22:48:55 +00:00
hanna	199b43fcf2	Reduce by interval alterations to interface with new sharding system. This checkin will be followed by a simplification of some of the locus traversal code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2886 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-25 00:16:50 +00:00
aaron	fef1154fc8	starting on RODs for Reads: made RODRecordList implement list<RODatum> (so we can sub in fake lists during testing), and removed unnecessary generic-ness. Removed BrokenRODSimulator, which isn't being used. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2884 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-24 22:11:53 +00:00
aaron	5546aa4416	adding code to deal with the off-spec situation where our minimum likelihood is above the GLF max of 255. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2871 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-22 22:27:39 +00:00
alecw	b236714c8a	Optimization - Added method to Covariates: void getValues( SAMRecord read, Comparable[] comparable ) which takes an array of size (at least) read.getReadLength() and fills it with covariate values for all positions in the given read. Made CovariateCounterWalker and TableRecalibrationWalker use this method instead of calling getValue(..) for each covariate and each offset. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2863 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-22 17:35:25 +00:00
aaron	33ae256186	a start to some of the infrastructure for Tribble, including dynamic detection of new RMD; not nearly wired in or complete yet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2855 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-18 18:43:52 +00:00
ebanks	79ab7affda	- Change sortOnDisk option to sortInMemory - Fix horrible cleaner bug - Trivial optimizations to cleaner code - more significant ones coming soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2850 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-17 20:52:57 +00:00
aaron	653f70efa2	added methods to validate an interval before you try to make a GenomeLoc: boolean validGenomeLoc(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2846 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-16 20:35:35 +00:00
rpoplin	3de72daa88	Removing an accidentally added import statement. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2818 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-10 15:54:24 +00:00
rpoplin	0b1e243a7b	CountCovariates now sorts the list of standard covariate classes coming from PackageUtils.getClassesImplementingInterface(). As a result some of the integration tests now make use of -standard git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2817 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-10 15:52:20 +00:00
depristo	934d4b93a2	VariantContext to VCF converter. BeagleROD, and phasing of VCF calls. Integration tests galore :-) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2814 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-09 19:02:25 +00:00
depristo	94f892ad42	VCF->beagle and VCF phasing using beagle input. Appears to work fairly well. VariantContexts now support phased genotypes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2812 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-09 01:22:05 +00:00
kshakir	fc810a1800	Updated VCF Reader to parse VCFs according to the VCFv3.3 spec. Column headers are tab separated since sample names might have spaces. Updated test files in /humgen/gsa-scr1/GATK_Data/Validation_Data/*.vcf to remove spaces except for when they are supposed to be in the sample name. Added @Test before VCFReaderTest.testHeaderNoRecords() git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2809 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-08 22:55:59 +00:00
hanna	21369869b7	Extend regex that supports every 'word' character to use any printable character except ':'. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2807 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-08 03:29:55 +00:00
depristo	af8c47fc2f	Fixing up testVariantContext for integration tests for variant context. Printing of VCs and genotypes now stable using sorting. Cleaned up comments in quality score by strand. RefMetaDataTracker now directly allows walkers to obtain VariantContexts using the simple Collection<VariantContext> getAllVariantContexts(GenomeLoc curLocation, EnumSet<VariantContext.Type> allowedTypes, boolean requireStartHere, boolean takeFirstOnly) function. VCF and dbSNP VariantContexts now officially supported. Other important types can be added to the adaptor system in refdata package. Integration tests later today git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2791 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-05 15:42:54 +00:00
ebanks	83b9d63d59	1. Added functionality to the data sources to allow engine to get mapping from input files to (merged) read group ids from those files. 2. Used said mapping to implement N-way-in,N-way-out functionality in the new indel cleaner. Still needs more testing (to be done after vacation but preliminary tests look good). 3. Fixes to VCF validator: ignore case when testing VCF reference base against true reference base and allow quals of -1 (as per spec). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2773 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-04 04:12:49 +00:00
chartl	2c4f709f6f	Bunch of oneoff stuff that I don't want to lose. Also: VCFRecord - "." dbsnp-ID entries now taken into account (thought these were represented as null; but I guess not) VCFGenotypeRecord - added a replaceFormat option; since intersecting Broad/BC call sets required genotype formats also be intersected (no changing on-the-fly) VCFCombine - altered doc to instruct user to give complete priority list (was throwing exception if not) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2760 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-01 21:35:10 +00:00
asivache	421282cfa3	Convenience method: getMappingFilteredPileup(int minMapQ) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2759 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-01 21:19:53 +00:00
depristo	d9671dffba	Documentation for VariantContext. Please read it and start using it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2756 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-01 17:49:51 +00:00
chartl	236764b249	Major (and useful) changes to MultiSampleConcordance: 1) Now cares about Genotype filtering. If it is flagged as filtered, it can count as a FP/FN/TP; but goes into a "non-confident genotype" bin, rather than het/hom. 2) Can give it a Genotype Confidence flag (-GC) which will automatically filter genotypes in the way above for quality > Q for "-GC Q" 3) Can give it an -assumeRef flag. For sites only in the truth VCF (that don't even appear in the variant VCF), that locus will be treated as confident ref calls for all individuals in the variant VCF; and the calculators updated accordingly. *** Important: Default behavior is that sites unique to the truth VCF are considered no-call sites for the variant. This flag can help get around that; however the safest way to run this is to have a variant VCF with calls at each and every locus, if that is possible. VCFGenotypeRecord -- added an isFiltered() call to automate looking up the FILTERED flag for VCF v3.3 SimpleVCFIntersectWalker - basic outline for a walker I'm working on tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2747 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-30 01:18:31 +00:00
aaron	ac2a207b0b	added a wrapper exception for anything that goes wrong in VCF parsing; this way the problematic file line is emitted, no matter what happens. Makes debugging a lot easier, especially in large files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2739 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 19:58:51 +00:00
chartl	d57a86ad41	Not nearly as badass as it looks. The problem I mentioned yesterday with "bleeding in" of samples comes from VCFUtils and SampleUtils looking for all VCF-class RODs in the tracker, and stealing the name from them. I have introduced a new HapmapVCF - type rod for use when you want to protect your VCF header from being infected by the samples in a bound hapmap VCF. Changes are as follows: VCFRecord - minor change to adapt isNovel() to the case where the dbsnp ID field is empty, but the info field has DB=1 HapmapVCFRod - introduced for the reason at the top RODRecordIterator - was: catch ( Exception e ) { throw new StingException("long ass message") } is now: catch ( Exception e ) { throw new StingException("long ass message",e) } to permit full stack ejaculation. RodVCF - Now with more brackets! ReferenceOrderedData - registering HapmapVCF as a bindable string VariantAnnotator - There's an extra space on a line. And some new brackets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2733 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 15:19:50 +00:00
hanna	3d922a019f	Basic support for very simple index-driven locus traversals. Interface has been changed to support batched intervals in a single shard, but intervals are not yet compressed into a single shard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2730 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 03:14:26 +00:00
chartl	7a10c40fb3	Much clearer (and, like, not totally incorrect) implementation of isNovel git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2725 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 21:16:21 +00:00
chartl	8de6a8d246	Lots of changes; all to do something relatively minor. 1) Changed VCF/RodVCF to allow for inquiries to whether or not the site is novel; isNovel() looks at the ID field, and those members of the info field that indicate membership in dbsnp, hapmap2, or hapmap3; and if none can be found, returns true. 2) Changed VariantAnnotator to annotate hapmap2 and hapmap3, if you bind rods to it with those names. Works in the same way as DBSNP does -- if you give it a rod named "hapmap2" it'll annotate membership in it. -- Passes integration tests 3) Changed UnifiedGenotyper to do the same thing (since it uses Annotations as a subroutine) -- Passes integration tests 4) Changed MultiSampleConcordanceWalker to take a flag --ignoreKnownSites (or -novels) to examine concordance only on sites that are not marked as in dbSNP or in Hapmap in the variant VCF 5) Changed VCFConcordanceCalculator (the object MultiSampleConcordanceWalker runs on) to output Concordant_Het_Calls and Concordant_Hom_Calls separately, rather than combined as Concordant_Calls 6) AlleleBalanceHistogramWalker -- I don't know what i did to this thing. I've been jerry rigging System.outs to do stuff it was never really intended to do; so there's probably some dumb System.out.print("HI I AM AT LOCUS:"+loc) stuck somewhere. It compiles at any rate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2724 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 21:06:56 +00:00
depristo	956b570c8e	V5 improvements to VariantContext. Now fully supports genotypes. Filtering enabled. Significant tests throughout system. Support for rebuilding variant contexts from subsets of genotypes. Some code cleanup around repository git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2721 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 18:37:17 +00:00
ebanks	1dd9996f3a	New realigner now completely uses bytes, plus misc fixes. Still not ready for use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2719 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-28 04:17:20 +00:00
ebanks	fddca032bb	Initial commit of v2.0 of the cleaner. DO NOT USE. (this means you, Chris) Cleaned up SW code and started moving over everything to use byte[] instead of String or char[]. Added a wrapper class for SAMFileWriter that allows for adding reads out of order. Not even close to done, but I need to commit now to sync up with Andrey. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2712 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-27 21:36:42 +00:00
hanna	fa3589e5c5	Update our error messages to point to getsatisfaction.com/gsa. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2706 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-27 19:16:28 +00:00
hanna	022601b1a5	Warnings for walkers w/o Javadoc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2683 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-25 20:34:50 +00:00
hanna	d25a2fe120	Better handling of enums by the command-line argument system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2647 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 21:36:46 +00:00
hanna	1e9fe2a334	Clean up error output when enums have missing arguments. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2645 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 19:48:26 +00:00
aaron	8d1d37302c	a quick change to GLF to keep as much precision in our likelihoods as long as possible, before we put it into byte space. Sanger was doing a diff at low coverage and noticed our calls didn't contain as much precision as theirs. Updated the MD5 for unified genotyper output. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2644 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 19:36:49 +00:00
hanna	908d399670	Bug fix for help text / version number - help text retriever was crashing in the debugger if help text hadn't been built. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2643 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-20 19:18:19 +00:00
hanna	8dafd26100	Print out the current version number in the application header. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2633 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-19 21:58:36 +00:00
hanna	1488578617	Working with Aaron to get svnversion running within the build system. This change will break the build. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2628 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-19 16:55:42 +00:00
depristo	41392f8ff5	functions for setting genotype records and alternate bases; function for getting all rods implementing VCF git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2611 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-16 20:19:43 +00:00
hanna	ac4756db20	Add the svn version on the fly to the version number properties. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2607 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-16 00:28:01 +00:00
hanna	420cef4094	Added version numbers to the help doclet extractor. Since the help system is behaving more like a resource bundle at this point, changed it over to use the Java ResourceBundle support classes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2606 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 23:31:29 +00:00
hanna	930082314a	Put a major.minor version into the GATK Javadoc for reading. Also, update some straggler packages to the new package-info.java format introduced in 1.5. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2604 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 21:48:30 +00:00
ebanks	b911b7df82	Fixing the AC annotation to be in line with the VCF spec git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2593 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 18:28:52 +00:00
rpoplin	70df30fc1b	Added method to AlignmentUtils which takes a read's cigar and the refBases char array given to a ReadWalker and returns the aligned reference char array. Bug fix in solid_recal_modes to use this aligned reference array. Recalibrator version number is no longer separate for each of the two walkers. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2589 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 15:36:59 +00:00
ebanks	2a116bb5d6	Made the VCF validator a simple rod walker instead of having it be in a separate package. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2588 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 06:39:06 +00:00
aaron	db9570ae29	Looks bigger than it is: * Moved GATKArgumentCollection into gatk.arguments folder to clean up the main folder, also added some associated argument classes (most of the changes). * Added code to the argument parsing system for default enums, we needed this so we could preserve the current unsafe flag, and at the same time allow finer grained control of unsafe operations. You can now specify: "-U" (for all unsafe operations), "-U ALLOW_UNINDEXED_BAM" (only allow unindexed BAMs), "-U NO_READ_ORDER_VERIFICATION", etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2586 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 00:14:35 +00:00
asivache	d85461c463	MergingIterator completely re-done. Now it is not a generic class (sorry guys), but rather it is tailored for merging ROD tracks. This implementation peeks the locations of next ROD annotations in each track, but does not actually read these RODs from underlying streams until the location is reached and it is time to actually return the object. Now underlying ROD track iterators (registered in the resource pool!) are not advanced prematurely past the current position and all the way to the next ROD record wherever it is, so that the sharding system can reuse them. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2582 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-14 17:43:36 +00:00
ebanks	a082b948a3	Support throughout for S and N cigar elements. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2579 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-14 03:45:42 +00:00
ebanks	8ca5bba738	We emit genotype data in the VCF record if the format string instructs us to (regardless of whether or not genotypes are provided - this was the wrong test). SequenomToVCF now correctly has no-calls when probes fail. Re-enabled SequenomToVCF integration test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2572 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-13 15:40:27 +00:00
chartl	6d1107a4ed	Update to SequenomToVCF Output changing slightly so integration test disabled temporarily git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2571 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-13 15:32:05 +00:00
ebanks	f99586f91b	Added integration test for beagle and verbose output in UG. Minor cleanup of VCFRecord code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2570 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-13 03:55:24 +00:00
ebanks	040fdfee61	Cleaned up the interface to VCFRecord. It's now possible (and easy) to create records and then write them with a VCFWriter. I've updated HapMap2VCF to use the new interface; Chris agreed to take care of Sequenom2VCF. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2558 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-11 21:42:12 +00:00
chartl	dfa3c3b875	Added: SequenomToVCF - Takes a sequenom ped file and converts it to a VCF file with the proper metrics for QC. It's currently a rough draft, but is working as expected on a test ped file, which is included as an integration test. Modified: VCFGenotypeCall -- added a cloneCall() method that returns a clone of the call Hapmap2VCF -- removed a VCFGenotypeCall object that gets instantiated and modified but never used (caused me all kinds of confusion when I was basing SequenomToVCF off of it) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2554 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-11 17:17:21 +00:00
ebanks	971834ca90	Added a walker to the vcf tools compilation: one that combines vcf records. Both merges and unions are supported (see documentation... when it gets written this week). Also, moved some code that pulls samples out of rods from VCFUtils into SampleUtils. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2552 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-10 06:45:11 +00:00
ebanks	b468369dfa	-UG's call into VariantAnnotator now uses the full alignment context (as opposed to the filtered one) -MQ0 annotation is now standard again -Added AC and AN annotations to VCF output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2545 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-08 05:40:42 +00:00
rpoplin	5f58492401	A rogue QualityUtils.MAX_REASONABLE_Q_SCORE managed to get through my previous bug fix. It should instead check the command line -maxQ argument. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2540 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 21:17:39 +00:00
ebanks	9a658e6b18	-Fixed VCF header line bug -Added useful trim() method for Strings for characters other than whitespace git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2538 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 17:51:41 +00:00
ebanks	b643a513bb	Minor interface change for VCFGenotypeRecord. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2537 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 16:48:09 +00:00
depristo	076481f786	Fixes to mergeVCF -- now correctly supports merging of filter fields. Also removed incorrect hasFilteringCodes() function. Updated integration tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2535 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 14:50:13 +00:00
ebanks	6c739e30e0	1. Removing an old version of the Genotype interface which is no longer being used. Needed to do this now so that the naming conflicts would cease. 2. Adding a preliminary version of the new Genotype/Allele interface (putting it into refdata/ as the VariantContext really only applies to rods) with updates to VariantContext. This is by no means complete - further updates coming tomorrow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2533 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 05:51:10 +00:00
depristo	a9245a58e2	Fix for incorrect exception throwing in VCFRecord. It is reasonable to ask for the non-ref allele freq at all ref sites. Was only passing in tests because isReference was broken git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2532 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 01:18:30 +00:00
depristo	7215526810	Fix to isReference() in VCFRecord. Change to VariantCounter to correctly counter only non-genotype variants, as well as update to VariantEvalWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2531 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-07 00:03:29 +00:00
andrewk	6c4ac9e663	Updated HapMap2VCF to use the VCFGenotypeWriterAdapter interface; fixed bug in VCFParameters that affects VariantsToVCF and HapMap2VCF when reference is lower-cased; added integration test for HapMap2VCF that checks for the lower-case issue by testing against Hg18 region that has lower-cased bases git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2530 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 21:27:11 +00:00
chartl	a32245f7d2	Modifications: QualityUtils - Stole the BaseUtils code for flipping reads around and applied it to quality scores SecondBaseSkew - Nothing's really different, just a commented line Additions (experimental annotations for future development of second-base annotation) I DO NOT INTEND FOR ANYONE TO USE THESE - ProportionOfNonrefBasesSupportingSNP - ProportionOfSNPSecondBasesSupportingRef - ProportionOfRefSecondBasesSupportingSNP + I hope these are self-explanatory - QualityAdjustedSecondBaseLod + Adjust lod-score by 10*log10[P[second bases are as observed]] Added walker: QualityScoreByStrand - oneoff project that's being saved if i ever need it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2527 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 19:18:07 +00:00
asivache	eb899741e1	reverting last changes. no caching git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2526 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 18:59:37 +00:00
asivache	a17d725c35	Cache pileup bases and mapping quals after first call to getBases() and getMappingQuals(), respectively. Subsequent calls to these method will return cached arrays. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2525 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 18:05:00 +00:00
ebanks	d6fb19bb67	Don't hard-code base qual max git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2524 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 17:21:44 +00:00
depristo	592749a7c1	isNBase method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2513 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 15:01:51 +00:00
depristo	5ce11c3dad	toString method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2512 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 15:01:20 +00:00
depristo	bca3d1b943	useful convenience function to get a genotype associated with a particular sample git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2510 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 14:53:56 +00:00
depristo	ec774f62be	Some checking to protect the BasicGenotype git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2509 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-06 14:53:24 +00:00
ebanks	ed2fff13aa	-Misc improvements to VCF code -Small fix to callset concordance git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2497 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-04 02:28:47 +00:00
ebanks	7b702b086f	You don't need to be bi-allelic to have a non-ref alt allele frequency, but you do have to be a variant. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2495 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-03 22:02:39 +00:00
asivache	a41cb0701b	Now can generate verbose String representation of deletions (e.g. "-AAT") if reference bases are provided as an argument to getEventStringWithCounts(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2488 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 21:54:50 +00:00
asivache	89791d730e	Compute and cache the length of the longest deletion observed at the site; ReadBackedExtendedEventPileup now has a getter to access that value. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2487 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 21:19:39 +00:00
rpoplin	80658fd99e	AnalyzeCovariates gets the same performance improvements as the recalibrator. NHashMap class is removed completely. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2483 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 18:10:10 +00:00
rpoplin	9b2733a54a	Misc clean up in the recalibrator related to the nested hash map implementation. CountCovariates no longer creates the full flattened set of keys and iterates over them. The output csv file is in sorted order by default now but there is a new option -unsorted which can be used to save a little bit of run time. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2482 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 16:58:04 +00:00
asivache	8330058216	method added: getEventStringsWithCounts() Returns list of Pairs <String,Integer>, where each pair consists of a unique indel event observed at the site and the total number of observations of that event. String representation for insertions is verbose (e.g. +ACT), while deletions are represented as "5D" (since read backed pileup has no reference information, so we can not get actual sequence of deleted bases) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2479 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 22:41:58 +00:00
asivache	cf3e59eb4a	back to archive git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2478 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 22:00:38 +00:00
asivache	295d16572e	synch; will go back to archive in a sec git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2477 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 22:00:03 +00:00
rpoplin	96c4929b3c	Recalibrator now uses NestedHashMap instead of NHashMap. The keys are now nested hash maps instead of Lists of Comparables. This results in a big speed up (thanks Tim!). There is still a little bit of clean up to do, but everything works now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2474 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 21:01:32 +00:00
asivache	f445745c56	Pileup element and corresponding container class tweaked for representing pileups of extended events (indels) at a given locus. There's some redundancy with PileupElement and ReadBackedPileup (should we rename them to BasePileupElement and ReadBackedBasePileup?), so that abstracting a basic interface/abstract base from these classes can be considered in the future git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2469 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 20:03:39 +00:00
depristo	87e863b48d	Removed used routines in duputils; duplicatequals to archive; docs for new duplicate traversal code; general code cleanup; bug fixes for combineduplicates; integration tests for combine duplicates walker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2468 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 19:46:29 +00:00
ebanks	5fdf17fccb	Removed the VCF "NS" annotation (which wasn't working for pooled calls anyways) since it's ambiguous and not useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2465 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 17:30:47 +00:00
hanna	e32174fbc4	UnifiedGenotyper now works without -varout or -vf set. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2464 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 16:46:24 +00:00
ebanks	aeb34758e6	Adding a validation stringency to the VCF writers (which defaults to STRICT). If set to SILENT, it will not throw an exception for (reasonable) off-spec requests but will instead ignore such requests and silently move on. This change allows the pooled calculation model to work correctly with multiple threads. Boys, the Genotyper is now officially parallelized. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2462 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-29 15:33:53 +00:00
depristo	fcc80e8632	Completely rewritten duplicate traversal, more free of bugs, with integration tests for count duplicates walker validated on a TCGA hybrid capture lane. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2458 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 23:56:49 +00:00
rpoplin	92e3682991	Moved NHashMap to sting/utils git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2452 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 20:57:32 +00:00
ebanks	b1ac4b81d5	Optimization: look up diploid genotypes from a static matrix instead of creating them on the fly (with String.format); bases no longer need to be ordered appropriately git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2448 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 17:28:51 +00:00
ebanks	d2770f380c	Writing calls to standard out now works again (it got broken when we introduced parallelization) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2446 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-27 04:36:45 +00:00
ebanks	0571d9dcb9	Point MAX_QUAL_SCORE to SAMUtils.MAX_PHRED_SCORE. Also, array size for caches should be max score + 1. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2444 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-24 20:47:32 +00:00
aaron	b134e0052f	added changes to the code to allow different types of interval merging, 1: all overlapping and abutting intervals merged (ALL), 2: just overlapping, not abutting intervals (OVERLAPPING_ONLY), 3: no merging (NONE). This option is not currently allowed, it will throw an exception. Once we're more certain that unmerged lists are going to work in all cases in the GATK, we'll enable that. The command line option is --interval_merging or -im git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2437 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:59:14 +00:00
alecw	159778416c	In TableRecalibrationWalker, update UQ tag if it was present in the original SAMRecord. This required a new sam.jar, which caused some other files to need to be changed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2435 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:42:36 +00:00
hanna	0d890e1bf0	Rework Eric's output management code given that the behavior of the UG changes drastically depending on its output format. Current implementation is probably a bit overkill-ish and we can whittle this down to what's absolutely necessary. Writing VCFs to the 'out' protected printstream may not work at this moment. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2425 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-22 00:33:43 +00:00
ebanks	cf303810d3	VCF reader now creates the correct type of header line for each header type git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2423 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 20:39:06 +00:00
hanna	b780ffb34a	Add a getFormat() method to get the output format from the writer. The need for this call suggests that I may be thinking about the typing of the GenotypeWriter object the wrong way. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2418 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 01:46:26 +00:00
hanna	11cbfcec9c	Get rid of backlink from ArgumentDefinitions to ArgumentSources. This will help in the future with multiple source -> single definition mapping sets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2417 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 00:39:36 +00:00
aaron	7e0f69dab5	Changed the GLF record to store its contig name and position in each record instead of in the Reader. Integration tests all stay the same. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2410 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 22:54:56 +00:00
ebanks	4ea31fd949	Pushed header initialization out of the GenotypeWriter constructors and into a writeHeader method, in preparation for parallelization. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2406 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 19:16:41 +00:00
ebanks	eeddf0d08e	Adding sample utils for convenience methods to pull out samples from e.g. SAMFileHeader or Genotype objects git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2405 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 18:51:21 +00:00
ebanks	4f59bfd513	Updates to the various GenotypeWriters to make them do simple things like write records (plus allow GLFReader to close). Adding first pass of stub and storage classes for the GenotypeWriters so that UG can be parallelizable. Not hooked up yet, so UG is unchanged. The mergeInto() code in the storage class is ugly, but it's all Tribble's fault. We can clean it up later if this whole thing works. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2400 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 07:20:23 +00:00
ebanks	94f5edb68a	1. Fixed VCFGenotypeRecord bug (it needs to emit fields in the order specified by the GenotypeFormatString) 2. isNoCall() added to Genotype interface so that we can distinguish between ref and no calls (all we had before was isVariant()) 3. Added Hardy-Weinberg annotation; still experimental - not working yet so don't use it. 4. Move 'output type' argument out of the UnifiedArgumentCollection and into the UnifiedGenotyper, in preparation for parallelization. 5. Improved some of the UG integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2398 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 04:14:14 +00:00
rpoplin	6fbf77be95	Updating the two solid_recal_mode options to also change the previous base since solid aligner prefers single color mismatch alignments over true SNP alignments. COUNT_AS_MISMATCH mode has been removed completely. The default mode is now SET_Q_ZERO. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2394 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-17 20:07:26 +00:00
ebanks	bb92e31118	Optimizations: 1. push the ReadBackedPileup filtering up into the ReadFilters for read-based filters 2. stop querying the cigar for its length (just do it once) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2381 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-16 21:39:58 +00:00
ebanks	bb312814a2	UG is now officially in the business of making good SNP calls (as opposed to being hyper-aggressive in its calls and expecting the end-user to filter). Bad/suspicious bases/reads (high mismatch rate, low MQ, low BQ, bad mates) are now filtered out by default (and not used for the annotations either), although this can all be turned off. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2373 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-16 17:28:09 +00:00
depristo	0d2a761460	Bugfix for minBaseQuality to ignore deletion reads. LocusMismatch walker now allows us to skip every nth eligible site git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2357 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 14:38:39 +00:00
ebanks	bf7bab754e	Made getPileupWithoutMappingQualityZeroReads() and getPileupWithoutDeletions() more efficient, per Mark's cue. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2356 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 04:35:21 +00:00
ebanks	874552ff75	Pull the genotype (and genotype quality) calculation out of the VCF code and into the Genotyper. [Also, enable Mark's new UG arguments] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2355 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 04:29:28 +00:00
depristo	2cbc85cc7a	min mapping quality and min base quality arguments for UG git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2354 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 03:57:27 +00:00
depristo	1da97ebb85	Walker for calculating non-independent base errors, v1. Will be moved to somewhere not in core git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2352 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-15 02:40:15 +00:00
chartl	b42fc905e8	Added - new tests (Hapmap was re-added) Modified - Hapmap now takes a -q command to filter out variants by quality Modified - MathUtils - cumBinomialProbLog now uses BigDecimal to handle some numerical imprecisions Modified - PowerBelowFrequency - returns 0.0 if called with a negative number (can't be done from inside the walker itself, but since it's called elsewhere one can't be too careful) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2350 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-14 21:57:20 +00:00
asivache	bd7b07f3f1	added PrimitivePair.Long and a few shortcut utility methods to PrimitivePairs: add(pair), subtract(pair), assignFrom(pair) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2347 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-14 00:15:44 +00:00
ebanks	97618663ef	Refactored and generalized the VCF header info code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2346 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-13 21:02:45 +00:00
ebanks	bd2a46ab4c	I want to move over to hpprojects tonight, so I'm checking in various changes all in one go: 1. Initial code for annotating calls with the base mismatch rate within a reference window (still needs analysis). 2. Move error checking code from rodVCF to VCFRecord. 3. More improvements to SNP Genotype callset concordance. 4. Fixed some comments in Variation/Genotype git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2341 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-13 02:52:18 +00:00
hanna	6955b5bf53	Cleanup of the doc system, and introduce Kiran's concept of a detailed summary below the specific command-line arguments for the walker. Also introduced @help.summary to override summary descriptions if required. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2337 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-12 04:04:37 +00:00
hanna	cdfe204d19	Incorporated feedback from Kiran. Use the Javadoc first sentence extraction capability to just show the first sentence from each line of Javadoc. @help.description can still be used to produce exceptionally verbose descriptions. Also increased the line width as much as I could tolerate (100 characters -> 120 characters). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2336 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-11 21:59:55 +00:00
aaron	09811b9f34	Now that we always output the VCF header, make sure that we correctly handle the situation where there are no records in the file. Added unit tests as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2333 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-11 19:51:05 +00:00
depristo	8f7554d44f	A few improvements to pooled concordance calculations. Now will show you FN with the -V option. BasicGenotype now prints out a reasonable representation with toString git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2320 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 23:09:10 +00:00
ebanks	2869270c11	Fixed deletion depth calculation plus mis-spelling in ReadBackedPileup method. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2315 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 21:11:42 +00:00
hanna	5eac510b2f	Refactor the code I gave Eric yesterday to output command line arguments. Convert it from a completely wonky solution to a slightly less wonky solution that will work in more cases. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2310 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 18:57:54 +00:00
ebanks	a45adadf1f	VCFGenotypeRecord already defines all the methods needed to be SampleBacked, so let's annotate it as being SampleBacked. This way, when used as a generic Genotype, sample data can be retrieved. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2305 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-10 04:16:21 +00:00
ebanks	4e54b91ce4	UG now outputs the FORMAT header fields when there's genotype data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2294 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-09 16:31:07 +00:00
ebanks	7a76e13459	Better explanation in the exception being thrown. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2291 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-09 03:59:36 +00:00
ebanks	717eb1de96	- Depth annotation now includes MQ0 reads - Removed MQ0 annotation - Updated RMS MQ annotation to use new pileup - UG now outputs all of its arguments as key/value pairs in the header (for VCF) - Cleaned up VCFGenotypeWriterAdapter interface a bit git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2288 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-09 02:53:00 +00:00

906 Commits (f29bb0639b76da656e368317553d103070862d23)