gatk-3.8

Commit Graph

Author	SHA1	Message	Date
hanna	0bb6b9a91a	Locus iterators were implemented in a peekable style, which meant that a locus and its three or four nearest neighbors could be in memory at once. Tweaking the iterators to ensure that previous AlignmentContexts don't have strong references which means that the garbage collector can work effectively to help us trundle through these regions. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5820 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-18 21:40:40 +00:00
hanna	a38b2be329	Fix for old, broken invariant where unmapped reads are represented by null rather than an empty BAMFileSpan. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5819 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-18 20:57:38 +00:00
carneiro	ebcd333ed8	Quick small updates: SelectVariants: typo MethodsDevelopmentPipeline: Added CEU Trio WGS dataset git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5818 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-18 20:08:39 +00:00
rpoplin	4b00fd2688	Adding User Exception to VQSR for the case of trying to cluster with an annotation that doesn't exist in the input VCF git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5816 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-18 19:47:51 +00:00
rpoplin	d698c87bbf	More UserExceptions and warnings in VQSR. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5813 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-16 19:03:21 +00:00
kshakir	541b5f7a80	Somehow checked in a version that was building extensions for everything ("") instead of selected packages. Fixed. Also added more logging when extension generation fails. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5812 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-16 16:58:37 +00:00
delangel	a27e8b1dc6	Bug fix - use correct variable to retrieve from map. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5811 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-16 15:32:58 +00:00
rpoplin	d925f76edc	Cutting down on the number of info lines in VQSR so that I can read the warning messages git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5810 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-16 13:35:51 +00:00
delangel	5a7444e186	First step in refactoring UG way of storing indel likelihoods - main motive is that rank sum annotations require per-read quality or likelihood information, and even the question "what allele of a variant is present in a read" which is trivial for SNPs may not be that straightforward for indels. This step just changes storage of likelihoods so now we have, instead of an internal matrix, a class member which stores, as a hash table, a mapping from pileup element to an (allele, likelihood) pair. There's no functional change aside from internal data storage. As a bonus, we get for free a 2-3x improvement in speed in calling because redundant likelihood computations are removed. Next step will hook this up to, and redefine annotation engine interaction with UG for indel case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5809 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-15 23:04:11 +00:00
depristo	3ccc08ace4	Now emits siteType = {SNP,INDEL}. Doesn't work (and may never actually work) for indels under current extended event system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5808 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-15 19:16:09 +00:00
depristo	75db4705ab	Added splitContextByReadGroup() and fixed bug in getPileupForReadGroup() that resulted in a NPE when no reads were present for a read group. Added doc string for getNBoundRodTracks(). Intermediate commit for CalibrateGenotypeLikelihoods and GenotypeConcordanceTable, so I have a record of my work. Not ready for public consumption. Really looking forward to making local commits so I can track my progress without needing to push incomplete functionality up to the server. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5807 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-15 17:36:07 +00:00
delangel	fa75efb6ac	Backing off - need to change pileup interface for rank sum tests before indels can be annotated with them git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5804 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-13 21:54:54 +00:00
asivache	befbcd274b	Computes additional stats we want to use later for filtering: median and mad for indel position with respect to starts and ends of all the reads that support it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5803 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-13 21:19:58 +00:00
asivache	5c889580c4	Change of logic: if "read" (sequence 2) sticks out beyond the boundary of the ref (sequence 1) it is aligned to, the extra bases on the left or on the right will be softclipped in the cigar generated for such an alignment, rather than added to the first/last M block. This also affects alignment offset: if read starts before the ref (used to be represented by a negative offset), the cigar now will start with S, and the returned offset (alignment start) will be 0. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5802 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-13 21:12:54 +00:00
delangel	d4ca8d94fa	Trivial change to allow indels to be annotated by rank sum tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5801 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-13 20:24:08 +00:00
hanna	03452c15c0	Cleanup GATKBAMIndex unit test to allow a more efficient access pattern for FindLargeShards. Runtime of FindLargeShards on papuan dataset is now 75min. GATK proper should benefit as well, although the benefits might be so small as to not be measurable. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5798 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-12 21:50:33 +00:00
depristo	db1f9af679	Now supports multiple records in allele at sites that genotype as reference git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5796 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-12 17:36:27 +00:00
rpoplin	a22e98a2c4	Yikes. Fixing the build git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5794 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-12 01:52:35 +00:00
rpoplin	40797f9d45	Ensuring a minimum number of variants when clustering with bad variants. Better error message when Matrix library fails to calculate inverse. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5793 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-12 01:48:37 +00:00
kshakir	a20d257773	Generating extensions for org.broadinstitute.sting.gatk.datasources.reads.utilities, including FindLargeShards. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5792 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-12 00:49:31 +00:00
carneiro	fb1be2653c	A succinct walker that reports GC content by interval. Taking down two old implementations of the same thing from oneoffs. Documentation added to the wiki. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5790 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-11 18:53:11 +00:00
depristo	9a1d0d7076	Simple bug fix to allow multiple records at same site when genotyping given alleles. Takes only the first record (respecting filters, SNP type, etc), and issues a warning if there is more than one valid record at a site git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5789 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-11 14:17:14 +00:00
ebanks	dfdef2d29b	PLEASE READ ME! In order to prepare for the upcoming changes to VCF4, we felt it was best to split up the vcf3 and vcf4 codecs (vcf4 is not backwards compatible to vcf3 and certain changes are too complex to handle in both codecs). Using the 'VCF' rod type in the GATK will now throw a UserException for vcf3.2 or vcf3.3 files telling you to use the 'VCF3' type instead (and vice versa). Integration/unit tests have been updated. For programmers: note that there is currently a lot of code duplication in the two codecs (although I pulled out the easy stuff to a VCFCodecUtils class); however WE ARE FREEZING THE VCF3 CODEC AND WILL NO LONGER MAKE CHANGES TO IT. All updates/improvements will be targeted to the vcf4 codec only as vcf3 is there only to be able to read legacy files. People should really be using vcf4 files only. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5787 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-11 12:07:44 +00:00
delangel	852e555c00	Fix broken functionality from previous commit. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5786 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-10 18:38:25 +00:00
ebanks	8d47d2e813	Fix for Tim. It was possible for the constrained mate fixer to dump its cache in the middle of a given realignment (so the IndelRealigner was playing by the rules). No longer possible. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5785 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-10 16:48:24 +00:00
delangel	3c364279f4	Add simple ability to create "X out of N" combined files: if a site is present in at least X input rods, it gets output, otherwise it's skipped, controlled with argument -minN. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5783 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-10 15:27:18 +00:00
hanna	f275be6968	A 'fat shard' finder. Cranks through the indices of a BAM file or list of BAM files looking for outliers (outliers right now are defined naively as shards whose sizes are more than 5 stddevs away from the mean). Runs in 13 minutes per chromosome on 707 low pass whole genome BAMs -- not great, but much faster than running UG on the same region to discover anomalies. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5782 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-10 12:56:47 +00:00
kshakir	7d21350a17	Fixed import. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5780 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-09 18:07:40 +00:00
asivache	0861451726	Print on multiple rows in standalone command line mode when the sequences are too long git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5779 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-09 13:51:00 +00:00
ebanks	bf40351094	Minor update git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5778 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-08 03:48:37 +00:00
ebanks	15c7bd82a5	Fix for IndelRealigner memory problem. Now the Constrained mate fixing writer is told whether a read has been modified and, if it wasn't, can dump it when the cache needs to get flushed at places with tons of coverage. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5777 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-06 19:34:41 +00:00
rpoplin	d8a761bbbd	Warn the user if trying to train with too few variants git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5776 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-06 17:47:58 +00:00
hanna	c2e8c460cb	Factor out all testing dependencies into a separate test configuration and only download that test configuration when running unit/integration tests. This means that the build will (hopefully) never break because it can't fetch a file that isn't required for the GATK to run. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5775 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-05 22:42:11 +00:00
rpoplin	b94d8dae17	Removing requirement of providing known track in VQSR for the non-humans. Updating placement of legend on tranche plot. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5773 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-05 20:24:06 +00:00
delangel	7d7ce6cf00	Two embarrassing bug fixes: a) Forgot to convert from phred to log-prob when computing gap penalties from recal table. b) Forgot to uncomment code to correctly deal with hard-clipped bases in a read. But because of this, had to do a short term workaround to at least temporarily return class from hardClipAdaptorSequence to GATKSAMRecord. Otherwise, I get exceptions when casting because somehow some reads in HiSeq get to be SAMRecord (which GATKSAMRecord inherits from) but some reads get to be BAMRecords (which can't be cast into GATKSAMRecord), not sure why. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5771 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-05 17:08:34 +00:00
kshakir	28b897d5de	Fixed O(N^2) operation when scattering interval files. Cleaned up intervals contig count function. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5768 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-05 03:32:35 +00:00
carneiro	3882d1b9c0	fixing the build \o/ git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5767 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-05 00:57:49 +00:00
kshakir	8ad547e6c2	Fixed another interval bug where dividing up N intervals into N parts wasn't working. Minor updates to the FCPTest to match the changes due to using the old indel caller. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5766 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-04 20:49:35 +00:00
hanna	5c6965575e	Some refactoring that Mauricio and I worked through together. Changed filters to extend from org.broadinstitute.sting.gatk.filters.ReadFilter rather than directly from net.sf.picard.filter.SamRecordFilter, which allows us to add an initialize(GATKEngine) method so that filters can do any initialization they'd like based on CL arguments, SAM headers, etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5760 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-04 19:29:08 +00:00
carneiro	b66c6dced1	- No longer prints out non confident calls (they were leading to tables that don't add up and confusing some Pacbio folk). - Added sensitivity and Specificity to the report. - With the changes in genotype likelihoods, the indel analysis only happens if the BAM file also has an extended event. Not great, but at least it's not broken. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5759 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-04 19:26:55 +00:00
carneiro	7ed8b4ddb0	Making sure CalculateLikelihoodsAndGenotypes returns an empty variant context when 'EMIT_ALL_SITES' and 'GENOTYPE_GIVEN_ALLELES' are being used, now for indels too! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5756 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-04 18:04:56 +00:00
rpoplin	6c7a0adc76	Updating VariantGaussianMixtureModelUnitTest to use truth sensitivity cutting git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5750 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-04 13:56:01 +00:00
delangel	a19389528d	Bring back from the dead the old likelihoods model for indels, which has worse performance but is about 4x faster. Enabled with argument -GSA_PRODUCTION_ONLY in UG git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5748 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 22:38:33 +00:00
carneiro	e5cc0f4eec	Added 'specificity' to variant eval's Validation Report evaluator. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5742 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 20:48:30 +00:00
rpoplin	b88dec387c	clean up from VQSR movement git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5741 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 20:35:30 +00:00
rpoplin	23cd3a7a5d	Moving VQSR v2 to core. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5740 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 20:20:06 +00:00
rpoplin	44a717f63a	Good bye VQSR v1. This commit will break the build. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5739 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 20:09:52 +00:00
hanna	2dacf1b2b2	Better header support when running R's read.table(...,header=T). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5738 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 19:56:20 +00:00
hanna	ad8c786b2d	Now more easily R-parseable. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5737 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 19:30:50 +00:00
rpoplin	5bade81c6d	Adding tranche plot generation back to VQSR git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5736 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 19:26:26 +00:00
rpoplin	e73720c2db	Updating VQSLOD annotation description git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5735 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 19:01:08 +00:00
rpoplin	11052918d9	Better exception text for common error in VQSR. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5734 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 18:37:25 +00:00
rpoplin	4bbce42861	Renaming ContrastiveRecalibrator --> VariantRecalibrator in preparation for move to core git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5733 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 18:12:47 +00:00
rpoplin	6323fb8673	misc cleanup in VQSR git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5732 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 18:00:22 +00:00
hanna	f3bd11a02e	Dress up some formatting issues. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5731 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 17:35:18 +00:00
hanna	9c809ed68e	A walker to analyze the memory consumption of reference, reads, and RODs at each base both in bytes and as a percentage of the used heap size. May be a bit buggy at this point; there are a lot of metrics around the Java heap and I'm not completely sure that the metrics I'm outputting are exactly the ones that I'm looking for. Also fixed a documentation bug in my Sizeof class. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5730 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 17:08:15 +00:00
ebanks	d4cbd8691c	Make the default that we only output SNPs (so that when I make another release we don't get flooded with questions about why the UG is all of a sudden so slow) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5729 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 16:38:55 +00:00
rpoplin	70f8ab6f89	Adding AF bin stratification for VariantEval. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5728 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 15:22:50 +00:00
hanna	870e65a685	Fixing a build failure because I want to be completely sure that the code I checked in immediately following the build breaking code passes integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5727 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-03 02:09:53 +00:00
hanna	411980a50a	Performance enhancements in GATKBAMIndex. Not sure these will assist in a normal use case, but they cut startup times and memory allocation noise in the profiler, making my profiling time more productive. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5726 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-02 20:48:16 +00:00
delangel	422d4ceeea	removed useless file - no need for tableRecalibration, right now everything is done in PairHMMIndelErrorModel git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5725 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-02 20:35:44 +00:00
delangel	2a80ffa2ee	Totally experimental, barely usable, not-to-be-used-yet implementation of an "Indel Quality Recalibrator". Idea is that any indel that's not in input dbsnp is treated as an artifact, and then a csv is built with # of indels and # of observations as a function of each input covariate (initially, only cycle, read group and homopolymer run are useful). Then, when computing likelihoods of indels based on input haplotypes we compute gap penalties based on value of covariates at read. Feature is disabled by default with hidden arguments. TBD if usefulness of feature is worth the extra time and pain. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5724 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-02 20:31:43 +00:00
rpoplin	3224bbe750	New visualization output for VQSR. It creates the R script file on the fly and then runs Rscript on it. Adding 1000G Project consensus code. First pass of having VQSR work with missing data by marginalizing over the missing dimension for that data point (thanks Chris and Bob for ideas). Updated math functions to use apache math commons instead of approximations from wikipedia. New parameters available for the priors based on further reading in Bishop and looking at the new visualizations. Updated integration test to use more modern files. Updated MDCP to use new best practices w.r.t. annotations. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5723 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-02 19:14:42 +00:00
ebanks	fcf8cff64a	We didn't actually support all of these extensions. Updated to be accurate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5722 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-02 19:03:46 +00:00
carneiro	34092fd32f	minor update... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5716 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-29 21:29:01 +00:00
carneiro	36ac8beee1	Making the GATK unpredictably random... through an option! set -ndrs if you want the GATK to be really random (non-deterministic). Engine option, available to every walker. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5715 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-29 19:29:08 +00:00
carneiro	f97e7d2fb4	Walker that calculates the percentage of bases that are covered to at least 20x. Very useful! In oneoffs until someone else thinks it's as useful as I think it is ;) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5714 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-29 19:19:39 +00:00
ebanks	deed7c47a1	Continuing the epic fail, some of our existing integration tests were wrong because of the lazy loading failure. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5712 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-29 17:54:41 +00:00
ebanks	ab9ffb1a74	Epic failure on the lazy loading of genotypes: if the input VCF had its samples unsorted and we used a walker that didn't require genotypes, then we would sort the samples but not load genotypes (and therefore the genotypes wouldn't match the samples anymore). Added simple integration test to cover this case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5711 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-29 16:03:45 +00:00
hanna	96571b55be	Disable caching of ReadShards by the GenomeLocProcessingTracker (at least temporarily). Unfortunately this does not completely fix the IndelRealigner exception that Ryan is seeing, but it helps things quite a bit. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5710 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-29 13:59:34 +00:00
carneiro	a5b96e0e04	I have to remember that this is Java, not C. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5709 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-28 17:40:14 +00:00
rpoplin	b7334dcc1e	Rank sum test annotations are the Z-scores from the test instead of the p-value. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5707 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-28 14:35:00 +00:00
ebanks	45081c32d7	continuing from last night, the integration tests weren't covering the right behavior either git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5706 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-28 13:30:57 +00:00
ebanks	f34e6d5b8c	Somewhere along the way someone broke this tool and failed to update the documentation to boot. Fixing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5705 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-28 03:16:20 +00:00
ebanks	ae8f3f2cde	Check for bad reference bases before creating simple/'empty' VCs. Updated the code in the indel GL model to be consistent and to use the existing utility in the Allele class. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5704 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 23:55:20 +00:00
depristo	6cce3e00f3	A test walker that does consensus compression of deep read data sets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5702 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 22:00:48 +00:00
rpoplin	3907377f37	When genotyping given alleles, for multiallelic sites we go back to the reads and use the alternate base with the highest sum of quality scores instead of taking the first alternate allele from the vcf file git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5701 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 21:31:09 +00:00
droazen	6e9e766a71	The tighter interval validation wasn't interacting well with unmapped intervals -- altered the validation methods to not throw an error for unmapped intervals. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5700 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 20:56:46 +00:00
hanna	6d5e45b5c6	Revbump Picard dependencies at Tim/Kathleen's request. Exclude anonymous classes from PluginManager. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5699 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 20:38:05 +00:00
droazen	d650efd40a	Fix for bug GSA-449: Intervals that are not in GATK format are not validated to the same standard as GATK format intervals. Full validation against contig bounds is now performed for all intervals, regardless of their source. Also fixed a few tests for validation exclusions that were backwards. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5698 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 18:12:10 +00:00
kshakir	df35a143b2	Removed -debug/--debug_mode. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5697 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 10:56:39 +00:00
hanna	27495a0c64	Killed quiet mode. Should probably kill debugMode as well, but Queue's using it. Will check with Khalid tomorrow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5695 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 04:17:36 +00:00
hanna	f3dacd3c40	Use ByteBuffer.allocateDirect() instead of ByteBuffer.allocate(). ByteBuffer.allocateDirect() behaves like Java NIO MappedByteBuffers in that it consumes address space, which counts against our virtual memory allocation; but cannot be destroyed or otherwise freed. This was definitely contributing to the LSF failures that I was seeing, but I'm not yet convinced that it's the sole source of these virtual memory 'leaks'. More tomorrow as the results of my whole exome tests start to roll in. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5693 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-27 02:01:11 +00:00
chartl	7afeb1ab17	Removing broken imports (boo) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5692 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-26 18:55:25 +00:00
rpoplin	379f837e82	RankSum z-scores are looking quite good, so RIP Wilcoxon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5691 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-26 18:34:39 +00:00
chartl	bc3fd70b0a	Removing the old association walker, switching test to just validate that MannWhitneyU is doing the right thing. Unit tests still pass. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5690 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-26 18:05:19 +00:00
kshakir	f619dd3ca7	Refactored IntervalUtils used to parse and scatter intervals for Queue. Scattering non-contig interval lists by number of loci in the intervals instead of just number of intervals. Queue caches the list of locs and how to split them up instead of reloading them from disk repeatedly. TODO: general purpose function to divide data evenly. Skip over comments when parsing picard analysis files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5687 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-26 00:06:00 +00:00
hanna	57a4700299	Ported small BAM performance test suite to the Google Caliper microbenchmarking suite. Looks promising, but I'm still not sure that GC is a good long-term solution. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5683 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-22 22:09:17 +00:00
chartl	a56a2dfdb7	Nothing to see here. Move along. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5681 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-22 15:01:02 +00:00
delangel	600617a63c	Enabled code to deal with hard-clipping adaptor sequence when processing reads in pileup in indel caller. Proven now that changes are minimal (4 less calls in NA12878 chr20, quals slightly different), minor changes in vcf fields in integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5679 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-22 14:10:33 +00:00
chartl	88735a8c9b	Adding in a delta to try and better measure effect size -- equivalent to looking at the lower end of the N^th percentile confidence interval. Kind of a hacky way to add it in, the infrastructure is about due for a streamlining rewrite. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5676 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-22 03:53:33 +00:00
hanna	7428ae338a	A fix for Marian Thieme's NPE in the new sharding system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5675 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-21 19:47:14 +00:00
chartl	5b9a8555cd	Queue graph time is currently of O(n^m) where n = num jobs, m = num unique base files. This script therefore was running in order 1200^16, which I don't think would finish before the heat death of the universe. For now, push down the number of files to 1 and gather them outside of Queue, once I've fixed up scatter-gather in core, outputs can be uncommented. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5674 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-21 12:56:25 +00:00
ebanks	cbcdfc584d	Moving out of core and into playground git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5671 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-21 02:30:22 +00:00
depristo	cc78027bd3	Two optimizations. Even more aggressive printProgress meter optimization to only even consider doing work once every 1000 cycles through the engine. Second, GenomeLocParser now uses a single indirection around the contigInfo variable. This class uses a last used cache to retrieve efficiently contig information instead of always returning to the underlying SAMSequenceDictionary hashmap to make genome locs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5670 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-21 01:31:26 +00:00
depristo	29857f5ba6	Fix for instability in output of fasta alternative reference maker when snpmask and snp files are provided and have overlapping records. The order of the records changed due to optimization of the refmetadatatracker, and uncovered this non-determinism. Now preferentially masks out includes sites from snps before considering masking out sites in snpmask git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5669 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 21:54:09 +00:00
kshakir	8619f49d20	Added a utility method to retrieve the contig lengths for WG chunking. Added a rudimentary GATKReportParser for parsing VE3 results. Re-enabled the FCPTest using VE3, the GATKRP, and the PicardAggregationUtils. The tag type for .rod files is DBSNP, not ROD. More explicit return types on implicit methods. Added null checks for implicit string to/from file conversions. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5668 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 19:22:21 +00:00
delangel	59dd79faab	One more optimization: don't use Math.round(), but do my own rounding/casting. UG now about 40% faster calling indels, 30-35% faster calling snp's+indels simultaneously. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5667 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 19:15:58 +00:00
delangel	246d8190b5	Round one of "easy" zero-effort optimizations to UG's indel caller. Mostly inline functions, avoid repeated computation and try to optimize SoftMaxPair() which is by far the biggest runtime hog. More to come... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5666 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 18:57:34 +00:00
depristo	a8f8077d7a	Simple optimizations for cases where there is no data or RODs at sites, such as with the FastaStats walker. private static immutable Lists and Maps in underlying data structures that have no associated data. Also, avoiding a double map.get() in the low-level genome loc parser. RefMetaDataTracker is now git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5664 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 10:52:16 +00:00
hanna	54660a8c25	Fix requested by Lee Lichtenstein: first check to see whether it's time for a progress message, then aggregate metrics. Makes the overhead of printProgress in RealignerTargetCreator go from >20% to ~3%. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5663 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 03:22:48 +00:00
hanna	49550e257f	Fix for JamesP's issue. This issue appeared because of a design flaw in the interface between SAMDataSource and IntervalSharder that needs to stay around until the original BAM sharder is retired. Will add a JIRA to fix design flaw. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5661 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-19 00:52:13 +00:00
depristo	541c9109b3	V1 of GATK Resource Bundling system git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5659 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-18 19:23:45 +00:00
ebanks	673772a522	Catch samtools exceptions and make them 'BAM Exceptions' asking the user to run Picard's validator and re-index the file before posting anything to the forum. Let's see whether this helps or not. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5658 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-18 03:52:43 +00:00
ebanks	e97a5ca161	Rename 'verbose' argument to 'debug_file'. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5657 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-18 03:17:13 +00:00
chartl	e28fc21642	Spurious associations can develop from including ambiguous reads in these tests. Perhaps MQ0 reads shouldn't be used for anything except MQ0, but the best way to do that is to restructure the code, so for now I'll put it off. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5656 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-17 23:17:03 +00:00
ebanks	49ea07acce	My fixes to Tribble yesterday revealed that some of the test VCFs for integration tests were actually malformed. Also, Guillermo updated the b37 dbSNP VCF and that broke some tests. Should be good for now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5655 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-17 03:39:11 +00:00
chartl	e5ef8388fc	BatchMerge - AlleleVCF --> AllelesVCF, this (combined with Eric's fix) will solve James P.'s forum issue. After viewing results on real case/control data from RAW -- it's really working quite well. ReadIndels, however, needs to use a T-test rather than a U-test, especially in deep coverage (at indel sites, the reads with indels will have mostly the same number of CIGAR indel elements -- one -- which doesn't really play nicely with the UTest when sample sets are large). Modified ReadsLargeInsertSize to be a two-way test (e.g. ReadsLarge and ReadsSmall). BaseQualityScore also suffers from the same issue as read indels, so switching over to a T-test in that case as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5653 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 22:03:16 +00:00
ebanks	1c32deb108	For some reason I wasn't allowing expressions to be used with the -all argument. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5652 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 20:59:10 +00:00
corin	2cf6a06503	Throwing an error if INFO fields arguments contain whitespace. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5651 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 20:52:55 +00:00
corin	fce6d25075	Moved the reference ID to a meta data field for validity declaration. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5650 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 20:28:56 +00:00
corin	59215dab48	Now writes results to a minimal vcf with annotations included in the INFO field. Must be run with -NO_HEADER to totally remove header for the most bare bones vcf; otherwise also includes command line meta data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5649 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 20:14:02 +00:00
ebanks	fe26954ac6	Minimal support for reading in VCF4.1 files. Added TODOs that need to be fixed or cleaned up to truly support this version. VCF constants updated. Lower-case bases permitted. Please let's make sure to refactor once we're ready to support it for good. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5648 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 18:59:37 +00:00
ebanks	7e9051ea25	The solution to James's bug was just to clean up the code and simplify it. What happened was that functionality that got put into UGCalcLikelihoods was then generalized into the UG engine but then never removed from UGCalcLikelihoods. This knowingly breaks the batch merger, but Chris said he'll take care of it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5647 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 18:05:10 +00:00
hanna	0d7cca169e	Sigh. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5645 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 14:37:24 +00:00
hanna	0965020804	Screwed up the doc string. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5644 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 14:30:20 +00:00
hanna	be3bad1f61	Low-memory sharding is now enabled by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5643 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-15 14:22:07 +00:00
ebanks	2830dc70b7	UG can still return null in certain nasty cases git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5642 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-14 20:11:17 +00:00
fromer	8e0f5bc5a5	Prevent NullPointerException in cases where SNP is filtered git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5641 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-14 19:59:59 +00:00
depristo	ee94af3539	Oops, left out of earlier commit git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5640 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-14 18:21:16 +00:00
depristo	8ed9c0f518	VariantsToTable now blows up by default if you ask for a field that isn't present in a record. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5636 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-14 14:42:43 +00:00
fromer	b3cd14d10a	Since GCcontentIntervalWalker no longer uses any ROD, turn it into a LocusWalker that traverses by REFERENCE git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5635 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-14 03:15:09 +00:00
aaron	2089c3bdef	removing; should have gone to the CGA repo git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5633 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 22:17:45 +00:00
aaron	da6f2d3c9d	adding the capseg tools to the new walker repo git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5632 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 22:11:08 +00:00
kshakir	4bb573b1f5	Centralizing a bunch of Broad specific utility functions from code scattered in GSA-Firehose, PipelineTest, custom QScripts, etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5631 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 21:29:02 +00:00
ebanks	91d308fc6d	temporary patch until Picard (hopefully) fixes the NM calculation to deal with reads that align off the end of the contig git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5630 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 19:18:18 +00:00
ebanks	fa6468d167	Remove the adaptor sequence clipping read filter because it is dangerous (it breaks LocusIteratorByState). We'll bring it back to life when ReadTransformers are created. Instead, have the utility code return a new clipped SAMRecord (necessary so that we don't break SNP calling in UG when the indel caller tries to hard-clip the reads). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5629 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 18:47:47 +00:00
hanna	5849e112e1	Fix exception in block weighting minus function. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5628 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 17:07:04 +00:00
hanna	a36adf0c6b	Request from the cancer team -- guarantee via javadoc that the returned read metrics are actually a clone, which they can do with as they wish. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5626 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 15:10:46 +00:00
delangel	06b1497902	Corrected bad merge. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5625 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 15:02:09 +00:00
delangel	9134bf3129	Long-forgotten change I neglected to commit a while back: add ability for SelectVariants to extract either SNPs or Indels from combined vcf file. Not the ideal place to do it but it's important to at least have something to split vcfs now that we call snp's and indels combined. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5624 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 14:58:44 +00:00
chartl	8e0d191a70	Added a walker to help sort out which samples in a region are giving signal. Lots of reused code that shouldn't be. Will refactor later. Also fixed an "issue" with InsertSizeDistribution -- apparently for mate pairs, the first mate (karyotypically) will have a POSITIVE insert size, and the second a NEGATIVE insert size -- thus the insert size distribution was being conflated with enrichment/depletion of first-in-pair or second-in-pair reads. Gah. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5623 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 13:53:31 +00:00
chartl	efe6c539ac	Re-enabling disabled test. Apparently T-tests are very picky about your using an unbiased variance. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5622 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 03:05:50 +00:00
chartl	42bc003f46	Oops. I'll need to look at this, I think it was accidentally enabled. Disabling for now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5621 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 00:54:52 +00:00
hanna	22a11e41e1	Rewrite of GATKBAMIndex to avoid mmaps causing false reports of heavy memory usage. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5620 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-12 23:49:58 +00:00
chartl	36d8f55286	Use the 'standard' arcsine transform git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5619 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-12 23:11:45 +00:00
chartl	8125b8b901	Old changes to the exome VQSR search. SGA updated to include new proportion-based insert size test. Major fix for dichotomization test: MathUtils now optionally ignores NaN values for sums, averages, variances. In the future this feature can be pushed back into the AssociationContext object itself (e.g. no data? no entry), but it's kept like this for transparency for now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5618 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-12 23:00:50 +00:00
rpoplin	30a19a00fe	Fix for when running with EMIT_ALL_SITES but not GENOTYPE_GIVEN_ALLELES. Still want to emit a site even when over the deletion fraction for example. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5617 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-12 20:07:06 +00:00
delangel	488622041d	Further trivial cleanup: Renamed DindelGenotypeLikelihoodsCalculationModel to IndelGenotypeLikelihoodsCalculationModel git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5616 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-12 18:00:48 +00:00
delangel	3b424fd74d	Enable new indel likelihood model by default, cleanup code, remove dead arguments, still more cleanups to follow. This isn't final version but at least it performs better in all cases than previous Dindel-based version, so no reason to keep old one around. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5615 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-12 17:54:46 +00:00
depristo	9c36b0a39b	Refactored read clipping framework into a generic utilities class, independent of ClipReadsWalker, which now uses this framework. Some more cleanup is really needed, as some of the arguments to the classes are really only useful for ClipReads. ReduceReadsWalker -- does consensus-based read compression, v2. Does all of the consensus calculations within the ConsensusReadCompressor per sample, and multi-sample case is handled by MultiSampleConsensusReadCompressor. For deeply covered data sets, this projects a significant reduction in the number of mapped reads. Impact on analysis call quality tbd. Expected to be relatively minor, as the system automatically detects regions without a strong consensus, and expands a window around these so that +/- 10bp of all reads are shown around the unclear sites. Not usable yet -- as it does not yet support streaming output, and actually holds all reads in memory at once. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5610 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-10 13:55:05 +00:00
depristo	13c5f3322d	Added argument to avoid writing 0 over all uncovered contigs, so you can just plot chrX, for example git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5609 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-10 13:50:21 +00:00
chartl	de4eaa455e	Squashing some bugs. Current implementation of AlignmentContextUtils.splitContextBySample() eliminates all sample meta data. Per Mark's request I'm working around this rather than fixing it -- the extender now maintains a mapping from sample id to sample object. Addition of a proportion test for large-insert-size reads, and slight refactoring of code to deal with bad window initialization of subclasses (e.g. chris forgot that constructors aren't inherited) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5608 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-09 21:07:52 +00:00
hanna	b4b52cc0fe	Reduce unnecessary repetitive accesses to the BAM index file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5607 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 19:28:14 +00:00
kshakir	0a58d7aa1a	Marked boolean SAMFileWriterATD arguments as flags so scala generator maps them to Boolean instead of Option[Boolean]. Using the VCFWriterATD isCompressed to check if the VCF index will be auto generated. Tracking BAM and Tribble indexes as @Inputs and @Outputs in generated QFunctions. Updates to the BamGatherFunction to disable the index during merge when disable_bam_indexing = true. Made a shortcut for live-running pipelinetest, pipelinetestrun. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5606 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 18:44:32 +00:00
depristo	866f4fd569	Test version of consensus compressing strategy. Cannot be used, and is being rewritten right now git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5605 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 18:37:03 +00:00
droazen	80d547ae71	Fix for bug GSA-445: Sequence dictionary validation can be very slow with large numbers of contigs. SequenceDictionaryUtils.getCommonContigsByName() was running in O(n^2) time due to poor choice of data structure -- modified it to run in O(n) time. Also removed an unnecessary O(n log n) step at another stage in the sequence dictionary validation process. In tests with a 181,813-entry sequence dictionary, runtime improved from an average of 21.4 minutes to 45.1 seconds. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5604 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 18:33:10 +00:00
ebanks	b6e7b5dace	Updating to reflect my recent Tribble fix git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5601 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 11:48:00 +00:00
ebanks	4f17004590	Allow walkers to enforce the ordering in which ReadFilters are applied (so that they're now done in the order specified in the walker). Useful if you have a computationally expensive filter (like adaptor clipping) that should only be applied to reads passing all other filters. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5600 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 03:34:50 +00:00
hanna	53db7b8faa	Did some refactoring which broke some unit tests, and then failed to run the unit tests. Definitely not my best effort... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5599 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 03:31:52 +00:00
ebanks	74755cfd1c	Adding a ReadFilter to hard-clip out bases from adaptor sequences. This is actually slightly more correct than having it be part of LocusIteratorByState because it allows us to remove reads that are complete garbage (and there are definitely some) based on the insert sizes. However, although conceptually this is great, it doesn't actually work. 'Why?' you may ask. Because when we hard-clip reads it often changes their start positions... which means that reads are no longer passed to LocusIteratorByState in coordinate order... which makes it (understandably) barf all over the place (and makes for some really fascinating SNP calls). This took me forever to find. I'm going to bed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5598 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 03:15:58 +00:00
ebanks	cd61ef7169	Re-enabling multi-threaded integration tests. To make this work, downsampling and annotations are disabled for this test so that we don't have randomization issues for it based on which shards get executed first. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5597 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 03:07:39 +00:00
hanna	fece2167b3	Prototype implementation of protoshard merging when protoshard n and protoshard n+1 completely overlap. Gives a small but consistent performance increase in non-intervaled whole exome traversals (2.79min original, 2.69min revised). Needs a more in depth analysis of optimal shard sizing to determine a true optimum. Also renamed a variable because Khalid disapproved of my naming choices. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5595 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 02:09:14 +00:00
hanna	32d502c122	Enable BAM OTF index writing by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5594 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-07 23:44:25 +00:00
droazen	cb3e8aec5e	Modified the buildfile and help extractor doclet so that help text is only extracted from source files that have been modified since the help resource file was last generated. This significantly speeds up builds where only a few source files have been modified, at the expense of making clean builds take slightly longer. Here's some performance data gathered by testing the old and new versions of extracthelp in isolation and averaging across 10 runs: old extracthelp, 1 modified source file: 20.1 seconds new extracthelp, 1 modified source file: 7.2 seconds <-- woohoo! :) old extracthelp, clean build: 17.8 seconds new extracthelp, clean build: 20.5 seconds git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5590 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-07 18:40:53 +00:00
ebanks	af09170167	As I threatened yesterday, I've moved the various and disparate randomization code out of the walkers. Now they all (except VQSRv1, whose days are numbered anyways) use a static generator available in the engine itself. Please use this from now on. The seed is reset before every individual integration test is run. I think there may still be an issue with the IndelRealigner but I need to confirm with the commit to see what testNG does. Integration tests are already broken anyways, so no big deal. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5589 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-07 17:03:48 +00:00
kshakir	45ebbf725c	Instead of always merging Picard interval files they are optionally merged by Sting Utils. Disabled the MFCP while the FCP gets an update. Minor updates to email messages for upcoming scala 2.9. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5588 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 21:12:05 +00:00
carneiro	89bb21d024	typo in the argument description git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5587 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 19:45:32 +00:00
rpoplin	3f3f35dea0	UnifiedGenotyper now BAQs via ADD_TAG to facilitate using BAQed quals for GL calculations but unBAQed quals for annotation calculations. UnifiedGenotyper now produces SNP and indel calls simultaneously. 40 base mismatch intrinsic filter removed from UG to greatly simplify the code. RankSumTests are now standard annotations but the integration tests are commented out pending changes that will allow random annotations to work. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5585 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 19:06:24 +00:00
ebanks	1aa4083352	Fortunately this code isn't used by anyone right now, but it needs to be fixed before someone unwittingly does: flags were wrong according to the SAM spec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5584 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 17:16:41 +00:00
hanna	b231a40da5	Augment PrintLocusContextWalker with extended event info. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5583 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 13:42:48 +00:00
aaron	ab5c4064ed	quick bug fix for variant context utils: only calculate the max AC if we're using the mergeInfoWithMaxAC flag, and if so deal with sites that have multiple alternate alleles correctly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5582 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 05:36:52 +00:00
rpoplin	cc713f2769	fixing exception text git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5581 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 00:29:13 +00:00
ebanks	4b451314b2	Only store a read in the mate hash if it could possibly be moved. This reduces memory consumption especially when dealing with a case of tons of unmapped reads at the end of the bam; however, it's only mildly helpful for chr1 of the Papuans (there's a truly massive pileup 120Mb into it; more thought needed at a later point). Integration tests changed only because some of the reads in the original bam were busted to begin with (it's an old pilot 1000G bam). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5580 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 22:20:09 +00:00
chartl	79b5fa6cc5	Structural refactoring in advance of dichotomization statistics; generalization of statistical test infrastructure. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5579 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 18:52:32 +00:00
asivache	77ca4eef31	IntelliJ complains that @Override is not allowed when implementing interface methods. Whatever. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5578 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 16:57:59 +00:00
ebanks	f4c06bb4ce	Traversal now says 'done with mapped reads' instead of 'done' so we don't confuse users when there are a lot of unmapped reads left to process. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5577 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 15:11:28 +00:00
fromer	5eccc7e528	Added annotation of INCORRECT SNP-based aa annotations in case of MNPdependentAA:true git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5576 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 02:46:45 +00:00
chartl	bb6a30611c	Forgot to modify the test too. What a bad commit. Sorry guys. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5575 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 02:11:08 +00:00
chartl	a0d096c993	Forgot an import statement git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5574 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 22:55:00 +00:00
chartl	b52c3e7e30	Make the window and slide-by values command-line accessible, and standardize for every context. Move the test classes (which are abstract association context modules) into the proper directory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5573 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 22:37:12 +00:00
droazen	db9908ec02	Small correction to the unit test code from my last commit. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5572 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 18:55:38 +00:00
droazen	a5acb0b7a6	Fix for bug GSA-314: Detect -XL and -L incompatibility. An ArgumentException is now thrown if the combination of -L and -XL intervals specified on the command line results in an empty interval set after set subtraction. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5571 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 18:41:55 +00:00
carneiro	b722ebf244	quick help/comments updates to match the wikipage. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5569 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 12:55:55 +00:00
rpoplin	96f0f0d706	Fixing use of String != String git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5568 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 01:12:00 +00:00
depristo	095125152b	Updated to no longer include 2nd-best base output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5567 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 20:13:10 +00:00
rpoplin	b2a0331e2d	Pushing hard coded arguments into VariantRecalibratorArgumentCollection git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5566 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 19:55:09 +00:00
rpoplin	79c43845ad	Changing Uniform approximation to Normal approximation in rank sum test. n factorial was overflowing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5565 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 18:18:39 +00:00
depristo	b316c9a590	Renamed StratifyAlignmentContext to AlignmentContextUtils, and StatiefyContextType to ReadOrientation. Also, went through the system and deleted all references to second bases. That ship passed long ago. This was the actual commit, the last was an intellij error git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5564 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 15:36:17 +00:00
depristo	5cca100aea	Eliminated the redundant StratifiedAlignmentContext, which previously just held a ReadBackedPileup, and made all of the class methods here just static functions. Far more logical organization, and avoided O(N) endless copying of data for the COMPLETE context. Many tools have been trivially reorganized to take an alignment context now. Everything passes integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5562 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 14:20:43 +00:00
rpoplin	98798eb276	Adding ReadPos rank sum test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5560 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 22:28:41 +00:00
rpoplin	09e89c8c97	Adding ReadPos rank sum test. Transitioned rank sum tests over to using Chris's implementation in order to harmonize the codebase. There isn't any reason to have competing implementations of rank sum. Thanks to Chris for adding the necessary hypothesis testing options. WilcoxonRankSum.java will be deleted soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5559 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 22:26:35 +00:00
depristo	11822da578	Standalone, GATK-dependent tool that reads a list of BAM files and slices all of them into a single merged BAM file containing reads in overlapping chr:start-stop interval. Highly efficient when working with thousands of BAM files. Can merge 1MB of sequence of 1600 4x BAMs in 4g in only 2 hours. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5558 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 13:41:29 +00:00
fromer	27bfec785e	Some walkers for printing FASTA of reference for bed ROD, and "inverting" a bed file (finding regions not covered in bed) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5554 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 21:13:51 +00:00
droazen	0927b7c297	Fix for bug GSA-441: BAM file list with blank lines gives a confusing error message. Lines containing only whitespace in .list files are now ignored. Also added support for comments in .list files: lines whose first non-whitespace character is '#' are now also ignored. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5550 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 15:04:35 +00:00
kshakir	4f8411f4b5	Revved Picard to access new flag to disable mmap for bam indices. Only added a 3% speed boost but the mmap was added to the heap count, making it harder to specify/restrict the total resident memory size in LSF. Specifying -Xmx4g will now stay much closer to 4g resident memory usage versus bumping up to 9g when accessing 900 x ~8Mb bai's. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5549 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 01:40:41 +00:00
asivache	df53351b0f	Get rid of score cutoff at 0 in the alignment matrix (i.e. score[cell] = max(0, score[from_parent_cells])). Use the computed score as is. Technically, it's pretty much NW now, not SW. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5548 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 00:11:04 +00:00
carneiro	0a772688fe	implementation of the Gatherer class for CountCovariates, which makes it now scatter/gatherable. Kudos to the @Gather annotation Khalid just introduced! QuickCCTest is my test script for the gatherer. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5547 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 21:15:21 +00:00
carneiro	dac1309dbd	Added two modes for selecting variants at random (random sampling). -number N -- generates a VCF with exactly N randomly chosen variants with equal probability. -fraction F -- generates a VCF with approximately F (between 0-1) randomly chosen variants with equal probability. (Similar behavior to RandomlySplitVariants walker). The reason for two modes is that the first one may need a lot of memory if your sample size is too large. The wiki is being updated with this information now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5545 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 21:12:40 +00:00
carneiro	8a3b7d88aa	It was returning 1 when it should return 0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5544 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 20:50:38 +00:00
depristo	c7445a6fbd	Now that logging is so standard, only prints messages about logging to DEBUG. Also, found a way to silence the mime.types warning, that doesn't matter at all to us. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5543 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 16:49:39 +00:00
droazen	7b452ea2b9	Fix for bug GSA-430: Can't specify same BAM file twice on the command line. An ArgumentException with an appropriate error message and a list of the duplicate BAMs is now thrown in this case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5542 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 22:23:24 +00:00
hanna	deab9f0aa5	Initial work on proto-shard merger: - create size() method that returns an approximation of the uncompressed size in bytes of BAM span. I'll use this method as a protoshard weighting function until we determine how to normalize the weights across the different data access mechanisms (reads, reference, RODs). - Implementations of basic union/intersection/subtraction mechanisms for BAM spans; should be enough to get an accurate weight for two proto-shards put together. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5541 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 22:03:43 +00:00
chartl	328f89f66a	Minor changes to MannWhitneyU: - Comment fixes to better explain why two-sided test wants to use the LOWER (not higher) value for U - Much more direct testing of MWU functions - Uniform approximation was always using the < cumulant (sometimes the > cumulant should be used instead) - Uniform approximation currently not used (regime in which it was being used was not the right one -- not necessarily bad, but not an improvement over normal) + this particular approximation is for major imbalances of the form m >> n. Code may be altered in the future to use this method for this particular regime, if the method's not too slow. - Hook into one-sided test. RegionalAssociationRecalibrator: NaNs were being caused by presence of Infinity and -Infinity values out of the walker. Currently I'm just re-setting them to arbitrary post-whitened values, but the walker will be changed to prevent output of these values, and the "fix" will be undone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5539 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 17:03:02 +00:00
chartl	fff11a3279	No more pesky NaNs for norms ( HINT::: ((double) x) == Double.NaN is NOT (somehow) the same as Double.compare(x,Double.NaN) == 0). Effectively reverse sorting by changing (rank/size) to ((size-rank)/size). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5538 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 22:43:24 +00:00
carneiro	5d26c66769	Count Covariates is almost scatter-gatherable now! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5537 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 22:25:33 +00:00
rpoplin	5ddc0e464a	Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 21:04:09 +00:00
carneiro	0f4ace0902	fixed a bug when the concordance track doesn't have the sample in the variant track. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5535 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 18:24:19 +00:00
chartl	f6dfdc7f3b	Single-tailed hypothesis testing in MWU git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5533 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 15:53:40 +00:00
hanna	8ae14793f2	Small standalone utility to aggregate BGZF block statistics in a BAM file. Works in the same coordinate space as BAM chunks, so this will be used to calibrate chunk weighting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5531 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-28 22:25:45 +00:00
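
Commits 0927b7c297 (GSA-441) and 7b452ea2b9 (GSA-430) above describe how .list files are read: whitespace-only lines and lines whose first non-whitespace character is '#' are ignored, and a BAM named twice is reported as an error. The rule can be sketched roughly as follows. This is an illustrative Python sketch, not the GATK code (which is Java); the function names `parse_bam_list` and `check_no_duplicates` are hypothetical, and `ValueError` stands in for the ArgumentException the commit mentions.

```python
from collections import Counter


def parse_bam_list(lines):
    """Return the entries of a .list file, skipping blanks and comments.

    Hypothetical helper: mirrors the rule in the commit message only.
    """
    entries = []
    for line in lines:
        entry = line.strip()
        if not entry:
            # Blank or whitespace-only line: ignored.
            continue
        if entry.startswith("#"):
            # First non-whitespace character is '#': treated as a comment.
            continue
        entries.append(entry)
    return entries


def check_no_duplicates(entries):
    """Raise ValueError listing every entry that appears more than once."""
    counts = Counter(entries)
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError("Duplicate BAM files specified: " + ", ".join(duplicates))
    return entries
```

Note that only whole-line comments are skipped here; a '#' appearing after a path is left untouched, since the commit message only mentions lines that begin with '#'.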

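Commit df53351b0f above removes the floor at 0 from the alignment matrix, which is the difference between a local (Smith-Waterman style) and a global (Needleman-Wunsch style) cell update. A minimal sketch of that one-cell difference, assuming a simple linear gap penalty; the function name `cell_score` and its parameters are hypothetical, and the actual aligner may use a different gap model.

```python
def cell_score(diag, up, left, substitution, gap_penalty, clamp_at_zero):
    """Score one cell of an alignment matrix from its three parent cells.

    diag, up, left: scores of the parent cells.
    substitution:   match/mismatch score for the current pair of bases.
    gap_penalty:    positive cost of opening a gap (linear model, an assumption).
    clamp_at_zero:  True gives the local-alignment floor, False uses the score as is.
    """
    best = max(diag + substitution, up - gap_penalty, left - gap_penalty)
    if clamp_at_zero:
        # Local alignment: a cell never scores below 0.
        return max(0, best)
    # Global-style update: keep the computed score, even if negative.
    return best
```

With all parents at 0, a mismatch of -3 and a gap penalty of 2, the clamped update yields 0 while the unclamped one yields -2.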
4747 Commits (ab1de3bfda858629c26aed51b933add2d564f13e)
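
The hint in commit fff11a3279 above reflects IEEE 754 semantics: NaN compares unequal to everything, including itself, so an equality test against NaN never succeeds. In Java the reliable checks are `Double.isNaN(x)` or, as the commit notes, `Double.compare(x, Double.NaN) == 0`. The same behavior can be seen in Python, shown here only as an illustration; the function names are hypothetical.

```python
import math


def is_nan_by_equality(x):
    """Broken check: comparing with == against NaN is always False."""
    return x == float("nan")


def is_nan(x):
    """Reliable check, analogous to Double.isNaN(x) in Java."""
    return math.isnan(x)


def replace_nans(values, fill):
    """Return a copy of values with every NaN replaced by fill."""
    return [fill if is_nan(v) else v for v in values]
```

The equality-based check silently lets NaNs through, which is how they can end up propagating into later computations such as norms.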