gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	17ca543937	More ExactModel cleanup -- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time. -- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live -- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs	2012-10-03 19:55:11 -07:00
Mark DePristo	f8ef4332de	Count the number of evaluations in AFResult; expand unit tests -- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples -- Added unittests for priors (flat and human) -- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)	2012-10-03 19:55:11 -07:00
Mark DePristo	33c7841c4d	Add tests for non-informative samples in ExactAFCalculationModel	2012-10-03 19:55:11 -07:00
Mark DePristo	de941ddbbe	Cleanup Exact model, better unit tests -- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic). More assert statements to ensure quality of the result. -- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts. Made mutation functions all protected -- No longer need to call reset on your AlleleFrequencyCalculationResult -- it's done for you in the calculation function. reset is a protected method now, so it's all cleaner and nicer this way -- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef	2012-10-03 19:55:11 -07:00
Mark DePristo	3e01a76590	Clean up AlleleFrequencyCalculation classes -- Added a true base class that only does truly common tasks (like manage call logging) -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation -- All unit tests pass	2012-10-03 19:55:11 -07:00
Mark DePristo	1c52db4cdd	Add exactCallsLog output file to ExactModel and StandardCallerArgumentCollection -- This allows us to log all of the information about the exact model call (alleles, priors, PLs, result, and runtime) to a file for later debugging / optimization	2012-10-03 19:55:11 -07:00
Christopher Hartl	ca31ddf2a5	Allow VCFs without PLs to be converted to a bed file with genotypes other than no-call (by setting the minimum GQ to <=0). Performance enhancements to GRM suite.	2012-10-03 21:36:35 -04:00
Kristian Cibulskis	dca7c7fa9c	initial cancer pipeline with mutations and partial indel support	2012-10-03 16:25:34 -04:00
Christopher Hartl	1be8a88909	Changes: 1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest. 2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested. 3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test. 4) Updates to PlinkBed-handling python scripts and utilities. 5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.	2012-10-03 16:02:42 -04:00
Guillermo del Angel	9e1592b8ba	Minor tweaks to CMIProcessing Pipeline: a) don't hard-code job mem limit to 4 G since it's too much for most AWS instances, leave it instead as input argument, b) minor doc cleanups	2012-10-03 12:05:57 -04:00
David Roazen	118e974731	GATK Engine: special-case "monolithic" FilePointers, and allow them to represent multiple contigs Sometimes the GATK engine creates a single monolithic FilePointer representing all regions in all BAM files. In such cases, the monolithic FilePointer is the only FilePointer emitted by the BAMScheduler, and it's safe to allow it to contain regions and intervals from multiple contigs. This fixes support for reading unindexed BAM files (since an unindexed BAM is one case in which the engine creates a monolithic FilePointer).	2012-10-02 15:30:03 -04:00
Mauricio Carneiro	7660e9f820	Reimplementation of the BAM processing pipeline using the metadata information file. Pipeline runs end-to-end using example metadata and has been tested only for cases where everything is ideal. Next step is to bring this to the cloud, test all different scenarios (multiple tumors, single ended, missing parameters etc). Parallel next step is to add QC metrics.	2012-10-02 14:05:34 -04:00
David Roazen	a96ed385df	ReadShard.getReadsSpan(): handle case where shard contains only unmapped mates Nasty, nasty bug -- if we were extremely unlucky with shard boundaries, we might end up with a shard containing only unmapped mates of mapped reads. In this case, ReadShard.getReadsSpan() would not behave correctly, since the shard as a whole would be marked "mapped" (since it refers to mapped intervals) yet consist only of unmapped mates of mapped reads located within those intervals.	2012-10-02 13:50:00 -04:00
Mauricio Carneiro	9a8f53e76c	Probably the GATK's most seen typo in the world	2012-10-02 13:34:37 -04:00
David Roazen	ac87ed47bb	BQSR: allow logging recal table updates to a file For testing/debugging purposes only	2012-10-01 14:18:34 -04:00
Christopher Hartl	2508b0f5a7	Merged bug fix from Stable into Unstable	2012-09-29 00:57:43 -04:00
Christopher Hartl	365f1d2429	hmk123's error on the forum came from the reference context occasionally lacking bases needed for validating the reference bases in the variant context. (no @Window for VariantsToBinaryPed). This bugfix addresses this and other minor items: 1) ValidateVariants removed in favor of direct validation VariantContexts. Integration test added to test broken contexts. 2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases. 3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.	2012-09-29 00:55:31 -04:00
Ami Levy Moonshine	11540da98b	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-28 21:01:25 -04:00
Eric Banks	2df5be702c	Added an argument to RR to allow polyploid consensus creation (by default it is turned off). This will eventually be replaced by the known SNPs track trigger.	2012-09-28 11:44:25 -04:00
Ami Levy Moonshine	fb9457d6fe	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-27 22:40:39 -04:00
David Roazen	e740977994	GATK Engine: do not merge FilePointers that span multiple contigs This affects both the non-experimental and experimental engine paths, and so may break tests, but this is a necessary change.	2012-09-27 18:02:25 -04:00
David Roazen	e82946e5c9	ExperimentalReadShardBalancer: create one monolithic FilePointer per contig Merge all FilePointers for each contig into a single, merged, optimized FilePointer representing all regions to visit in all BAM files for a given contig. This helps us in several ways: -It allows us to create a single, persistent set of iterators for each contig, finally and definitively eliminating all Shard/FilePointer boundary issues for the new experimental ReadWalker downsampling -We no longer need to track low-level file positions in the sharding system (which was no longer possible anyway given the new experimental downsampling system) -We no longer revisit BAM file chunks that we've visited in the past -- all BAM file access is purely sequential -We no longer need to constantly recreate our full chain of read iterators There are also potential dangers: -We hold more BAM index data in memory at once. Given that we merge and optimize the index data during the merge, and only hold one contig's worth of data at a time, this does not appear to be a major issue. TODO: confirm this! -With a huge number of samples and intervals, the FilePointer merge operation might become expensive. With the latest implementation, this does not appear to be an issue even with a huge number of intervals (for one sample, at least), but if it turns out to be a problem for > 1 sample there are things we can do. Still TODO: unit tests for the new FilePointer.union() method	2012-09-27 14:47:54 -04:00
Mauricio Carneiro	a640afa995	adding some directories to gitignore	2012-09-27 11:09:41 -04:00
Mauricio Carneiro	3e68fee764	Removed the intellij files from the root and made an example package for new users. This allows users to start at the same page and then change it as they see fit without interfering with the repo (thanks guillermo!)	2012-09-27 11:04:56 -04:00
Christopher Hartl	abbe757907	Merged bug fix from Stable into Unstable	2012-09-27 00:15:35 -04:00
Christopher Hartl	55cdf4f9b7	Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release.	2012-09-27 00:13:32 -04:00
Mauricio Carneiro	b9dab068ee	New version of the pipeline starting from an ALIGNED bam going all the way to reducing using n-way out cleaning	2012-09-26 16:16:53 -04:00
Mauricio Carneiro	f8b954334e	Revised implementation of the RAWBAM => BAM pipeline stripped out all the FQ pipeline and tumor/normal information.	2012-09-26 13:37:15 -04:00
Mark DePristo	33b2f65bbd	Script to evaluate SNP and indel calls for the experimental downsampler -- Calls NA12878 with and without the expt. downsampler on chr1 -- Creates combined vcf, annotating sites as overlapping omni SNPs and Mills indels -- Creates simple combined.table that has chr, pos, set, and type to easily ID missed good sites with the new downsampler	2012-09-26 11:33:06 -04:00
Ryan Poplin	f009424952	Adding Phase2 HC calling qscripts for both the original calls and the project consensus.	2012-09-26 11:26:24 -04:00
Ryan Poplin	e49fe74612	Adding some of the qscripts from my BQSR experiments.	2012-09-26 10:55:34 -04:00
Mauricio Carneiro	c9c2682f86	removing annoying xml from IDEA configuration	2012-09-25 17:18:44 -04:00
Mauricio Carneiro	9486131d17	First implementation of the CMI data processing pipeline, handling both germline and cancer BAM/FQ => BAM. Not ready for prime time yet, need more work!	2012-09-25 17:15:42 -04:00
Mauricio Carneiro	cb8d4c97e1	First implementation of a generic 'bundled' Data Processing Pipeline for germline and cancer. not ready for prime time yet!	2012-09-25 17:13:50 -04:00
Mauricio Carneiro	65b100f9b0	Reverting the DPP to the original version, going to create a new simplified version for CMI in private.	2012-09-25 12:02:34 -04:00
Mauricio Carneiro	4324bd72fd	Updating Intellij environment and adding Scala	2012-09-25 10:51:53 -04:00
Mark DePristo	e1524ebbc8	NA12878 HiSeq b37 20:10-11 mb test files	2012-09-25 08:54:02 -04:00
Eric Banks	caa431c367	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-24 21:46:36 -04:00
Eric Banks	11a71e0390	RR bug: when determining the most common base at a position, break ties by which base has the highest sum of base qualities. Otherwise, sites with 1 Q2 N and 1 Q30 C are ending up as Ns in the consensus. I think perhaps we don't even care about which base has the most observations - it should just be determined by which has the highest sum of base qualities - but I'm not sure that's what users would expect.	2012-09-24 21:46:14 -04:00
Mauricio Carneiro	4aad135f8c	Generic input file name recognition (still need to implement support to FastQ, but it now can at least accept it)	2012-09-24 17:01:17 -04:00
Mauricio Carneiro	ca84586443	Adding default intellij configuration files	2012-09-24 16:15:57 -04:00
David Roazen	3f44b3e019	Update DataProcessingPipelineTest MD5s	2012-09-24 15:38:07 -04:00
David Roazen	0b488cce66	ExperimentalReadShardBalancer: close() exhausted iterators Fixes a truly awful SAMReaders resource leak reported by Eric -- thanks Eric!	2012-09-24 14:52:59 -04:00
Mark DePristo	9fd30d6f1c	When writing the initial commit for nt + nct I realized this class was really just a ThreadGroupOutputTracker -- The code is cleaner and the logic more obvious now.	2012-09-24 14:15:36 -04:00
Mark DePristo	3e8d992828	Remove bad error test from MicroScheduler, as it's no longer applicable.	2012-09-24 14:15:36 -04:00
Mark DePristo	a6b3497eac	Fixes GSA-515 Nanoscheduler GSA-577 -nt and -nct together appear to not close resources properly -- Fixes monster bug in the way that traversal engines interacted with the NanoScheduler via the output tracker. -- ThreadLocalOutputTracker is now a ThreadBasedOutputTracker that associates via a map from a master thread -> the storage map. Lookups occur by walking through threads in the same thread group, not just the thread itself (TBD -- should have a map from ThreadGroup instead) -- Removed unnecessary debug statement in GenomeLocParser -- nt and nct officially work together now	2012-09-24 14:15:35 -04:00
Mark DePristo	4749fc114f	Temp. disable -nt > 1 and -nct > 1 while bugs are worked out	2012-09-24 14:15:35 -04:00
Mark DePristo	847e79247d	Use SNP only model for NCT big exome test	2012-09-24 14:15:35 -04:00
Mark DePristo	f42e55c9df	Add NCT specific performance test for 5K exomes	2012-09-24 14:15:35 -04:00
Mark DePristo	09bbd2c4c3	Include exception in VCFWriter when one is found when rethrowing as ReviewedStingException	2012-09-24 14:15:35 -04:00

11836 Commits (7dcafe8b8194ce8a9d0b8825812fd11c8f9a0612)