gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	44ac49aa34	Removing dependencies in the annotations on extended events. Some refactoring involved in this.	2012-03-30 00:17:02 -04:00
Mauricio Carneiro	cbd21c6339	Nasty, nasty..... VariantEval is overly abusive of the GATKReport (lack of) spec. 1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers. 2. Worse, the stratification modules somehow instead of sending the actual values to the report table, sends a string with the value "unknown" and then abuses the GATKReport spec to convert those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings. Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport. No wonder, the printing needed to be O(n^2).	2012-03-29 17:49:53 -04:00
Eric Banks	c2e27729c7	Renaming PileupElement.isBeforeDeletion() to PileupElement.isBeforeDeletedBase() so that it's more clear that it can still be true while inside a deletion. Added PileupElement.isBeforeDeletionStart() to cover the case that I want where we only trigger before the actual deletion event. Similarly for after a deletion. Updated counting code in ConsensusAlleleCounter accordingly.	2012-03-29 17:08:25 -04:00
Ryan Poplin	6da9571829	resolving merge conflicts.	2012-03-29 16:16:28 -04:00
Ryan Poplin	ca96544ed0	All the zero quality N bases in the solid reads are adding lots of extra paths in the assembly graph. We now require a minimum base quality for every base in the kmer before adding it to the graph. The large number of solid reads with unmapped mates was also triggering the active region traversal at every base. We now ignore that check for solid reads.	2012-03-29 16:14:29 -04:00
Eric Banks	e4469a83ee	First attempt at removing all traces of extended events from UG; integration tests are expected to fail.	2012-03-29 14:59:29 -04:00
Eric Banks	e61e162c81	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-29 12:33:13 -04:00
Mauricio Carneiro	cf364f26a0	Fixing alignment issue with the GATKReportColumn algorithm. Numeric columns were being left-aligned when they should be right-aligned. Fixed it.	2012-03-29 12:28:49 -04:00
Mauricio Carneiro	f80bd4276a	fixed estimated Q reported calculation in the gatherer	2012-03-29 12:28:43 -04:00
Mauricio Carneiro	8a9fb514b6	simplifying GATKReportColumn constructor logic	2012-03-29 12:28:37 -04:00
Eric Banks	e861106398	Accidentally erased important line	2012-03-29 11:08:54 -04:00
Eric Banks	e4a225ed09	Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally.	2012-03-29 11:07:37 -04:00
Guillermo del Angel	c9c3f6b0fc	Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity	2012-03-29 11:05:42 -04:00
Ryan Poplin	9684a2efb0	HaplotypeCaller: Variants found on the same haplotype are now written out with phased genotypes. There are serious eval issues with MNPs so disabling them for now.	2012-03-29 09:41:29 -04:00
Guillermo del Angel	250adca350	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-28 21:01:49 -04:00
Guillermo del Angel	e0ab4e4b30	Refactoring so that ConsensusAlleleCounter can use regular pileups and can operate correctly. This involved adding utility functions to ReadBackedPileup to count # of insertions/deletions right after current position. Added unit test for IndelGenotypeLikelihoods, esp. ConsensusAlleleCounter logic	2012-03-28 21:01:31 -04:00
Mauricio Carneiro	8f0e9d74ce	GATKReportTable output refactor writing out a GATKReportTable was O(n^2)!!!!! New implementation is O(n). What a difference, when N = 2^16...	2012-03-28 17:19:12 -04:00
Guillermo del Angel	62ee31afba	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-28 16:00:38 -04:00
Guillermo del Angel	1eee9d512d	a) Make computeConsensusAlleles protected inside IndelGenotypeLikelihoodsCalculationModel so we can use it in unit tests, b) make ConsensusAlleleCounter work if no extended event pileup is present (necessary for ext. event removal)	2012-03-28 15:41:39 -04:00
Mauricio Carneiro	bb36cd4adf	Quick fixes to BQSRGatherer and GATKReportTable * when gathering, be aware that some keys will be missing from some tables. * when a gatktable has no elements, it should still output the header so we know it had no records	2012-03-28 09:07:54 -04:00
Roger Zurawicki	63cf7ec7ec	Added more primitives to GATK Report Column Type - The Integer column type now accepts byte and shorts - Updated Unit Tests and added a new testParse() test Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-28 09:07:54 -04:00
Guillermo del Angel	08f7d47d7c	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-28 07:42:09 -04:00
Mark DePristo	12aa72f200	Merged bug fix from Stable into Unstable	2012-03-27 22:43:00 -04:00
Mark DePristo	979a84a252	Bugfix for thread unsafe PL cache -- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic -- Solution is to use a fixed cache that's never updated on the fly. My changes limit us to having no more than 500 alleles at a site, which I hope is ok but easy enough to up to a ridiculously large number.	2012-03-27 22:42:30 -04:00
Guillermo del Angel	8f34412fb8	First Pool Caller exact model: silly straightforward math implementation of biallelic pool caller exact likelihood model, no attempt at any smartness or optimization, no support yet for generalized multiallelic form, just hooking up for testing	2012-03-27 20:59:44 -04:00
Guillermo del Angel	ed322bd73f	Fix again merge issues	2012-03-27 15:03:13 -04:00
Guillermo del Angel	b4a7c0d98d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-27 15:01:03 -04:00
Guillermo del Angel	343a061b1c	Fix merge issues when incorporating new AF calculations changes	2012-03-27 15:00:44 -04:00
Mauricio Carneiro	1b75663178	BQSR Gatherer implementation and integration tests * restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers * optimized empirical qual calculation when merging recalibration reports * centralized the quality score quantization functionalities * unified the creating/loading of all the key manager/hash table structures. * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing) * added integration tests for BQSR and on-the-fly recalibration	2012-03-27 13:50:22 -05:00
Ryan Poplin	5dbd3625cd	Initial algorithm for choosing best alternate haplotypes to genotype based on the likelihoods from all samples instead of choosing for each sample independently. Simple tradeoff of penalty for increasing model complexity and likelihood of the data.	2012-03-27 13:38:52 -04:00
Eric Banks	c112e0824a	I was adding verbose output to the Pileup output for a one-off and decided that I might as well commit it as an option. Updated deprecated calls while I was in there.	2012-03-27 11:09:03 -05:00
Mark DePristo	a638996fe2	Cleanup of VariantEval, diatribe about performance problems with StateKey -- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear -- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.	2012-03-27 11:56:24 -04:00
Mark DePristo	679bb03014	Simple utility function for converting an Iterable<T> to Collection<T>	2012-03-27 11:54:58 -04:00
Mark DePristo	1f5f737c8b	Optimizing the GATKReportTable.write -- Better iteration, caching of strings, better printf calls, to improve the writing performance of GATKReportTables	2012-03-27 11:54:35 -04:00
Mark DePristo	913c8b231f	Fix ErrorRatePerCycle to overload equals and hashcode -- Fixes failing integration tests	2012-03-27 10:35:32 -04:00
Eric Banks	c07a577ba3	Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously).	2012-03-27 00:27:44 -05:00
Mark DePristo	34ea443cdb	Better algorithm for choosing which indel alleles are present in samples -- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors. -- This breakdown is producing spurious clustered indels (lots of these!) around real common indels -- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the counting towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted. -- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale. -- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP	2012-03-26 16:28:49 -04:00
Mark DePristo	11b6fd990a	GATKReportColumn optimizations -- Was TreeMap even though the sorting wasn't used. Replaced with LinkedHashMap.	2012-03-26 16:28:49 -04:00
Mark DePristo	6be5e82860	VariantEval scalability optimizations -- StateKey no longer extends TreeMap. It's now a final immutable data structure that caches its toString and hashcode values. TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted tostring function. -- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils -- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects. Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE -- VariantEvaluator base class has a cached getSimpleName() function -- VEUtils: general cleanup due to refactoring of StateKey -- VEWalker: much better iteration of map data structures. If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet. This is far better than iterating over the keys and calling get() on each key.	2012-03-26 16:28:48 -04:00
Guillermo del Angel	1c424c0daf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-26 15:15:50 -04:00
Ryan Poplin	019145175b	Major optimizations to graph construction through better use of built in graph.containsVertex and vertex.equals methods. Minor optimizations to MathUtils.approximateLog10SumLog10 method	2012-03-26 11:32:44 -04:00
Ryan Poplin	1fa66f76c9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-25 23:04:47 -04:00
Guillermo del Angel	ce617b2dfc	Bug fix to previous UnifiedGenotyperEngine refactoring, removed debug code	2012-03-25 10:20:21 -04:00
Guillermo del Angel	db54c2625f	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-25 09:53:35 -04:00
Guillermo del Angel	deb4586559	Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGenotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1	2012-03-24 21:49:43 -04:00
Mark DePristo	b063bcd38d	Removing update0 support in VariantEval -- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations. -- This allows us to avoid calling update0 at every genomic base in 100ks of evaluates when there are a lot of stratifications. -- No need to modify the integration tests, this optimization doesn't change the result of the calculation	2012-03-23 21:02:21 -04:00
Mauricio Carneiro	0509d316d9	More information in the recalibration report * added empirical quality counts to allow quantization during on-the-fly recalibration to any level * added number of observations and errors to all tables to enable plotting of all covariates	2012-03-23 16:15:19 -04:00
Mauricio Carneiro	9f74969e3a	BQSR with GATKReport implementation * restructured BQSR to report recalibrated tables. * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration) * linked quality score quantization to the BQSR stage, outputting a quantization histogram * included the arguments used in BQSR to the GATK Report * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities On-the-fly recalibration with GATK Report * loads all tables from the GATKReport using existing infrastructure (with minor updates) * implemented initialization of the covariates using BQSR's argument list * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key * applied quality quantization to the base recalibration * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions	2012-03-23 15:42:32 -04:00
Mauricio Carneiro	f421062b55	Updated read group covariate to use sample.lane instead of the id Added Unit test.	2012-03-23 15:24:07 -04:00
Mauricio Carneiro	539da9e3e1	Fixing GATKReport exception handling when loading a report * allowing tables with no description to go through * GATKReportTable should be more lenient with the format requirements (added to-dos for roger)	2012-03-23 15:23:13 -04:00
Eric Banks	2511839068	Merged bug fix from Stable into Unstable	2012-03-23 13:51:33 -04:00
Eric Banks	d3f2bc4361	Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles.	2012-03-23 13:51:00 -04:00
Mark DePristo	e4ec90cfce	Merged bug fix from Stable into Unstable	2012-03-23 11:27:34 -04:00
Mark DePristo	ff26f2bf68	HierarchicalMicroScheduler no longer attempts to wrap exceptions -- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors. Now HMS simply checks whether the throwable it received on error was a RuntimeException. If so, it is stored and rethrown without wrapping later. If it isn't, only in this case is the exception wrapped in a ReviewedStingException. -- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line -- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled. -- I believe this will finally put to rest all of these annoying HMS captures.	2012-03-23 11:27:21 -04:00
Ryan Poplin	9d22471b79	Merged bug fix from Stable into Unstable	2012-03-23 10:48:34 -04:00
Ryan Poplin	ab288354e9	Better error message for malformed input recal file.	2012-03-23 10:47:01 -04:00
Mark DePristo	fee8d86f63	VariantEval optimization -- Use a LinkedHashMap not a TreeMap so iteration is faster. -- Note that with a lot of stratifications the update0 is taking up a lot of time. For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call	2012-03-22 22:13:24 -04:00
Mark DePristo	6df96644d9	Unified, standard IndelSummary metrics for VariantEval -- Now you always get SNP and indel metrics with VariantEval! -- Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions -- Updated VE integration tests as appropriate	2012-03-22 21:24:37 -04:00
Mark DePristo	bcf80cc7b3	Cleanup in VariantEval. Example of molten VariantEval output -- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvaluator.java so everyone can share. Code updated to use these routines where appropriate -- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF -- TableType, which used to be an interface, is now an abstract class, allowing us to implement some general functionality and avoid duplication. -- This included creating a getRowName() function that used to be hardcoded as "row" but now can be overridden. -- This allows us to implement molten tables, which are vastly easier to use than multi-row data sets. See IndelHistogram class (in later commit) for example of molten VE output	2012-03-22 21:24:37 -04:00
Mark DePristo	9ddd5aec93	More eval modules being removed from VariantEval -- IndelStatistics is superseded by IndelStatistics	2012-03-22 21:24:36 -04:00
Mark DePristo	bd5b6d1aba	Remove no longer in use Eval modules from VariantEval -- No more IndelLengthHistogram (superseded by IndelSummary in subsequent commit) -- No more SamplePreviousGenotypes or PhaseStats -- No more MultiallelicAFs	2012-03-22 21:24:36 -04:00
Menachem Fromer	7faa9938b1	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-22 17:43:44 -04:00
Menachem Fromer	b9b9219ac7	Added respectPhaseInInput flag to RBP and integration tests	2012-03-22 17:40:21 -04:00
Guillermo del Angel	f198cec5e2	Temp commit: new structure for pool caller, now all work is in the same framework as in UG. There's a new genotype calculation model, PoolGenotypeCalculationModel, that does all the work and plugs into UnifiedGenotyperEngine. A new AF module for pools is upcoming. Old pool caller will be removed once all work is migrated	2012-03-22 15:46:39 -04:00
Menachem Fromer	1dfaacfeb5	Check for consistency of the BAM and VCF sample names, with a command line disable to throw if you know what you are doing	2012-03-22 12:40:15 -04:00
Guillermo del Angel	b02ef95bcf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-22 12:14:12 -04:00
Guillermo del Angel	92676c63ca	Make constructor of IndelGenotypeLikelihoodsCalculationModel public so it can be used in unit tests	2012-03-22 12:13:59 -04:00
Guillermo del Angel	58965d6a6e	Merged bug fix from Stable into Unstable	2012-03-22 11:04:11 -04:00
Guillermo del Angel	b8cd959461	Potential corner condition bug fix: protect against null pointer exceptions when computing consensus indel bases when UG is discovering alt alleles. If an alt allele has non-standard bases, skip allele gracefully instead of adding null object into list	2012-03-22 10:06:22 -04:00
Ryan Poplin	a29fc6311a	New debug option to output the assembly graph in dot format. Merge nodes in assembly graph when possible.	2012-03-21 15:48:55 -04:00
Eric Banks	8c09ff9459	Merged bug fix from Stable into Unstable	2012-03-21 12:44:43 -04:00
Eric Banks	58245bfa2f	Bug fix: check to see whether there's a BasePileup before asking for one.	2012-03-21 12:44:09 -04:00
Eric Banks	07c3bd32b3	Bug fix: merge NO_VARIATION records with those of another type. The sad part is that this WAS covered by integration tests but someone updated the MD5s without actually paying attention...	2012-03-21 12:42:13 -04:00
Eric Banks	dcf2fa361d	Minor cleanup	2012-03-21 12:14:31 -04:00
Eric Banks	ab1c48745b	Need to catch RuntimeExceptions coming out of Picard too so that they show up as UserErrors (some BAM errors are thrown as REs).	2012-03-21 12:13:52 -04:00
Ryan Poplin	9e10779fa7	Caching log calculations cut the non-Map runtime of HaplotypeCaller in half. Moved the qual log cache used in HC and PairHMM into a common place and added unit tests.	2012-03-21 08:45:42 -04:00
Mauricio Carneiro	0e93cf5297	Taking care of bad cigars in the GATK * fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events. * added Unit tests for BadCigarFilter * updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar * added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)	2012-03-20 14:32:57 -04:00
Eric Banks	5e79046c98	Minor change but I realized from Mark's commit that the code I stole it from was flawed	2012-03-20 08:55:56 -04:00
Eric Banks	ade1971581	Since we allow any generic header types, there's no longer any reason to check for supported types	2012-03-20 00:12:17 -04:00
Eric Banks	2324c5a74f	Simplified the interface for simple VCF header lines by making the VCFSimpleHeaderLine not abstract anymore - now any arbitrary header line with an ID (e.g. the contig and ALT lines) can be part of this class without having to define new classes. Also, renamed the 'named' header line to 'id' since that's more accurate.	2012-03-19 21:29:24 -04:00
Roger Zurawicki	7afb333811	GATK Report code cleanup - Updated the documentation on the code - Made the table.write() method private and updated necessary files. - Added a constructor to GATKReport that takes GATKReportTables - Optimized my code Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-19 11:53:57 -04:00
Mauricio Carneiro	0d4ea30d6d	Updating the BQSR Gatherer to the new file format This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.	2012-03-19 09:02:27 -04:00
Eric Banks	9223e451a3	Merged bug fix from Stable into Unstable	2012-03-18 00:54:19 -04:00
Eric Banks	344a938a70	When checking to make sure that we have cached enough data in the PL array, use the converted index value since that's what will be used as an index into the array.	2012-03-18 00:36:30 -04:00
Eric Banks	be9e48ba29	Merged bug fix from Stable into Unstable	2012-03-16 14:33:53 -04:00
Mauricio Carneiro	ec4a870a0f	Added @PG tag to ReduceReads. Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's lives easier if they want to include the PG tag in their walkers.	2012-03-16 14:09:07 -04:00
Mauricio Carneiro	3bfca0ccfd	BitSet implementation of the on-the-fly recalibration using the CSV format file. Infrastructure: * Added static interface to all different clipping algorithms of low quality tail clipping * Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState * Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration) * EventType is now an independent enum with added capabilities. All functionality is now centralized. BQSR and RecalibrateBases: * On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint * Refactored the object creation to take advantage of the compact key structure * Replaced nested hash maps with single hash maps indexed by bitsets * Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm). * Excluded contexts with N's from the output file. * Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!) * Redefined error for indels to look at the previous base in negative strand reads (using new PE functionality) * Added the covariate ID (for optional covariates) to the output for disambiguation purposes * Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum. * Reduced memory usage of the BQSR script to 4 Tests: * Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager * Added tests for keys without optional covariates * Added tests for on-the-fly recalibration (but more tests are necessary)	2012-03-16 13:02:15 -04:00
Mauricio Carneiro	ca11ab39e7	BitSets keys to lower BQSR's memory footprint Infrastructure: * Generic BitSet implementation with any precision (up to long) * Two's complement implementation of the bit set handles negative numbers (cycle covariate) * Memoized implementation of the BitSet utils for better performance. * All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow. * Replace log/sqrt with bitwise logic to get rid of numerical issues BQSR: * All covariates output BitSets and have the functionality to decode them back into Object values. * Covariates are responsible for determining the size of the key they will use (number of bits). * Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type * No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead Tests: * Unit tests added to every method of BitSetUtils * Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager) * Unit tests added to the cycle and context covariates (will add unit tests to all covariates)	2012-03-16 13:01:48 -04:00
Eric Banks	dce6b91f7d	Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appropriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing.	2012-03-16 11:14:37 -04:00
Eric Banks	41068b6985	The commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code.	2012-03-15 16:08:58 -04:00
Ryan Poplin	0c6b34e9df	Fixing a bug identified by the ActivityProfile unit tests	2012-03-15 14:24:30 -04:00
Ryan Poplin	252b830aa8	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-15 11:56:04 -04:00
Ryan Poplin	1429ddcf55	Adding contracts and unit tests for HaplotypeCaller LikelihoodCalculationEngine	2012-03-14 21:25:43 -04:00
Mark DePristo	7c5cdb51c2	UnitTests for ActivityProfile and minor ART cleanup -- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-( -- UnitTesting framework for ActivityProfile -- needs to be expanded -- Minor helper functions for ActiveRegion to help with unit tests	2012-03-14 17:26:37 -04:00
Mark DePristo	e440c9be98	Clean up logic for adding reads to ART cache -- No longer has duplicate code	2012-03-14 17:26:37 -04:00
Mark DePristo	5bcb5c7433	Preliminary refactoring of ART -- Refactored ART into clearer, simpler procedures. Attempted to merge shared code into utility classes. -- Added some docs -- Created a new, testable ActivityProfile that represents as a class the probability of a base being active or inactive -- Separated band-pass filtering from creation of active regions. Now you can band pass filter a profile to make another profile, and then that is explicitly converted to active regions -- Misc. utility functions in ActiveRegionWalker such as hasPresetActiveRegions() -- Many TODOs in ActivityProfile.	2012-03-14 17:26:37 -04:00
Ryan Poplin	1da8928407	HC GenotypingEngine marginalizes over haplotypes when outputting events that were found on a subset of the called haplotypes.	2012-03-14 15:22:21 -04:00
Guillermo del Angel	eca055ccad	Add option in ValidationAmplicons to only output SNPs and INDELs, ignoring complex variants (or SVs, etc.)	2012-03-14 14:26:40 -04:00
Eric Banks	f7c2c818fe	Exact model memory optimization: instead of having a later matrix column pull in data from earlier ones (requiring us to keep them around until all dependencies are hit), the earlier columns push data into their dependents immediately and then are removed. This does trade off speed a little bit (because we need to call approximateLog10Sum each time we add to a dependent instead of once in an array at the end). Note that this commit would normally not get pushed into stable, but I'm about to make a very disruptive push into stable that would make merging this from unstable a nightmare.	2012-03-14 14:02:36 -04:00
Mark DePristo	6a40ca6bec	Merged bug fix from Stable into Unstable	2012-03-14 12:19:33 -04:00
Mark DePristo	bb2c10b785	Capture the class of the exception in GATKRunReport -- As suggested by David.	2012-03-14 12:16:22 -04:00
Ryan Poplin	78a4e7e45e	Major restructuring of HaplotypeCaller's LikelihoodCalculationEngine and GenotypingEngine. We no longer create an ugly event dictionary and genotype events found on haplotypes independently by finding the haplotype with the max likelihood. Lots of code has been rewritten to be much cleaner.	2012-03-14 12:05:05 -04:00
Eric Banks	77243d0df1	Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me	2012-03-13 16:31:51 -04:00
Eric Banks	568a1362f5	Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me	2012-03-13 16:19:15 -04:00
Eric Banks	5d7c761784	Merged bug fix from Stable into Unstable	2012-03-13 11:01:03 -04:00
Eric Banks	5200f7f919	When creating a synthetic VC based on the passed in alleles, set the reference base for indel.	2012-03-13 10:59:58 -04:00
Eric Banks	1675bd4dd7	When creating a synthetic VC based on the passed in alleles, set the length correctly.	2012-03-13 10:55:52 -04:00
Roger Zurawicki	7887a06703	GATKReport v1.0 GATKReport format changes: - All non-data header lines are preceded with a single pound ( #:) - Every report now has a report header containing the version number and number of tables - Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description. - This new format will allow reports in the future to be gatherable. - Changed the header format to include an end-of-line string ":;" Added features: - Simplified GATK Reports: The constructor for a simplified GATK Report. Simplified GATK reports are designed for reports that do not need the advanced functionality of a full GATK Report. A simple GATK Report consists of: - A single table - No primary key ( it is hidden ) Optional: - Only untyped columns. As long as the data is an Object, it will be accepted. - Default column values being empty strings. Limitations: - A simple GATK report cannot contain multiple tables. - It cannot contain typed columns, which prevents arithmetic gathering. - Added a constructor to generate simplified GATK reports. - Added a method to easily add data to simple GATK reports. - Upgraded the input parser to take advantage of the new file format (v1). - Added the GATKReportGatherer, more usability coming in next version of GATK Report. Currently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports, It is very conservative and will only gather if the table columns, as well as everything else matches. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data. - Made some GATKReport methods public, and added more setters and getters. - Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
- The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.) - Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map. - Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered. Test changes: - Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1. - Updated the MD5 hashes in integration tests throughout the GATK. Other changes: - Added the gatherer functions to CoverageByRG - Also added the scatterCount parameter in the Interval Coverage script - Dropped support for reading in legacy GATKReport formats ( v0.) - Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints. - Rewrote the read file method for GATK report files. - Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-12 23:09:19 -04:00
Eric Banks	10995d349e	Fix old error message	2012-03-12 22:56:08 -04:00
Eric Banks	2314787767	Generalizing to avoid JDK 1.7 incompatibilities	2012-03-12 22:50:59 -04:00
Ryan Poplin	03223029e3	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-12 09:42:37 -04:00
Eric Banks	b4749757f8	Fixes for SLOD: 1) didn't work properly for multi-allelics (randomly chose an allele, possibly one that wasn't genotyped in the full context); 2) in cases when there were more alt alleles than the max allowed and the user is calculating SB, we would recompute the best alt alleles(s); 3) for some reason, we were recomputing the LOD for the full context when we'd already done that. Given that this passes integration tests on my end, this should be the last commit before the release.	2012-03-12 01:07:07 -04:00
Ryan Poplin	2836c161ee	Moving trimToVariableRegion out of reduced reads and into a public static ReadClipper function. HaplotypeCaller clips reads to the active region boundaries before passing to the HMM. The philosophy of the HC is moving towards genotyping the entire haplotype sequence contained within the active region as a single allele.	2012-03-11 14:45:59 -04:00
Mark DePristo	1ee46e5c06	Collect only the bare essentials in the GATKRunReport Now looks like: <GATK-run-report> <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id> <start-time>2012/03/10 20.21.19</start-time> <end-time>2012/03/10 20.21.19</end-time> <run-time>0</run-time> <walker-name>CountReads</walker-name> <svn-version>1.4-483-g63ecdb2</svn-version> <total-memory>85000192</total-memory> <max-memory>129957888</max-memory> <user-name>depristo</user-name> <host-name>10.0.1.10</host-name> <java>Apple Inc.-1.6.0_26</java> <machine>Mac OS X-x86_64</machine> <iterations>105</iterations> </GATK-run-report> No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy	2012-03-10 20:27:14 -05:00
Mark DePristo	3ba2e5667c	CalibrateGenotypesLikelihoods include pOfDGivenD now	2012-03-09 16:00:07 -05:00
David Roazen	91d10431d3	BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier. Later on we might want to address in a more general way the fact that we validate user intervals against the reference but not against the merged BAM header produced by the engine at runtime.	2012-03-09 15:20:16 -05:00
David Roazen	bc65f6326f	Detect incomplete reads from BAM schedule file in BAMSchedule before they become buffer underflows This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a ReviewedStingException (since this is our problem, not the user's) rather than allowing a cryptic buffer underflow error to occur. Note that this change does not fix the underlying problem in the engine, if there is one (there may be an as-yet-undetected bug in the code that writes the bam schedule). It will just make it easier for us to identify what's going wrong in the future.	2012-03-09 12:33:48 -05:00
David Roazen	32dee7ed9b	Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be thrown when a BAM index file was truncated or corrupt. Now, a UserException is thrown in this situation instructing the user to re-index the BAM. Added a unit test for this case as well.	2012-03-08 15:30:44 -05:00
Guillermo del Angel	c04853eae6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-08 12:30:04 -05:00
Guillermo del Angel	858acf8616	Hidden mode in ValidationAmplicons to support ILMN output format (same as Sequenom, with just shuffled columns)	2012-03-08 12:29:44 -05:00
Andrey Sivachenko	56f074b520	docs updated	2012-03-07 18:47:15 -05:00
Andrey Sivachenko	117ea605ac	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-07 18:35:07 -05:00
Andrey Sivachenko	497a1b059e	transition to JEXL completed, old parameters setting individual cutoffs now deprecated	2012-03-07 18:34:11 -05:00
Andrey Sivachenko	fbd2f04a04	JEXL support added; intermediate commit, not yet functional	2012-03-07 17:29:42 -05:00
Mark DePristo	0376d73ece	Improved, public version of ErrorRateByCycle -- A cleaner table output (molten). For those interested in seeing how this can be done with GATKReports look here for a nice clean example -- Integration tests -- Minor improvements to GATKReportTable with methods to getPrimaryKeys	2012-03-07 13:10:08 -05:00
Christopher Hartl	a6a8fc0521	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable	2012-03-07 10:05:43 -05:00
Mark DePristo	569be953b9	Bugfix for VariantEval -- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp. These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicuously as incorrect false negatives in the validation report. -- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.	2012-03-06 16:56:59 -05:00
Christopher Hartl	67def6acc8	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable	2012-03-06 14:23:14 -05:00
Christopher Hartl	20c1fbaf0f	Fixing a merge (turning off downsampling on DoC)	2012-03-06 14:22:45 -05:00
David Roazen	0702ee1587	Public-key authorization scheme to restrict use of NO_ET -Running the GATK with the -et NO_ET or -et STDOUT options now requires a key issued by us. Our reasons for doing this, and the procedure for our users to request keys, are documented here: http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home -A GATK user key is an email address plus a cryptographic signature signed using our private key, all wrapped in a GZIP container. User keys are validated using the public key we now distribute with the GATK. Our private key is kept in a secure location. -Keys are cryptographically secure in that valid keys definitely came from us and keys cannot be fabricated, however keys are not "copy-protected" in any way. -Includes private, standalone utilities to create a new GATK user key (GenerateGATKUserKey) and to create a new master public/private key pair (GenerateKeyPair). Usage of these tools will be documented on the internal wiki shortly. -Comprehensive unit/integration tests, including tests to ensure the continued integrity of the GATK master public/private key pair. -Generation of new user keys and the new unit/integration tests both require access to the GATK private key, which can only be read by members of the group "gsagit".	2012-03-06 00:09:43 -05:00
Lechu	027843d791	I've simply added a "library(grid)" call at the beginning of the R script generation since R 2.14.2 doesn't seem to load the "grid" package as default. I haven't tested it on previous R versions (you may edit the R version comment to be more precise if desired), but I'm almost certain that this library call shouldn't do any harm on them. Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>	2012-03-05 21:27:03 -05:00
Ryan Poplin	9b53250bef	Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode.	2012-03-05 21:07:36 -05:00
Ryan Poplin	b37461587d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-05 17:54:59 -05:00
Ryan Poplin	c6ded4d23c	Bug fix for hard clipping reads when base insertion and base deletion qualities are present in the read. Updating HaplotypeCaller integration tests to reflect all the recent changes.	2012-03-05 17:54:42 -05:00
Ryan Poplin	14a77b1e71	Getting rid of redundant methods in MathUtils. Adding unit tests for approximateLog10SumLog10 and normalizeFromLog10. Increasing the precision of the Jacobian approximation used by approximateLog10SumLog which changes the UG+HC integration tests ever so slightly.	2012-03-05 12:28:32 -05:00
Mauricio Carneiro	e9ad382e74	unifying the BQSR argument collection	2012-03-05 10:48:26 -05:00
Ryan Poplin	f879daa7d0	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-05 08:29:08 -05:00
Ryan Poplin	d6871967ae	Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype there is no reason to ensure all haplotypes have the same length which simplifies the code considerably.	2012-03-05 08:28:42 -05:00
Guillermo del Angel	3b5a7c34d7	Added argument to ValidationAmplicons to only output valid sequences - useful for not having to post-filter or grep resulting files before delivering downstream	2012-03-04 10:24:29 -05:00
Mark DePristo	69611af7d3	Workaround for bug in Picard in ReadGroupProperties -- NPE caused when you call getRunDate on a read group without a date.	2012-03-02 18:53:45 -05:00
Mark DePristo	ba71b0aee4	ReadGroupProperties mk3 -- Includes sequencing date	2012-03-02 16:12:42 -05:00
Eric Banks	1e07e97b58	Optimization: create allele list just once, not for each genotype	2012-03-02 13:30:17 -05:00
Ryan Poplin	0ad7d5fbc1	Standalone common Pair HMM utility class with associated unit tests.	2012-03-01 22:41:13 -05:00
Mark DePristo	2f334a57c2	ReadGroupProperties mk2 -- Includes paired end status (T/F) -- Includes count of reads used in calculation -- Includes simple read type (2x76 for example) -- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0	2012-03-01 18:43:53 -05:00
Mauricio Carneiro	486712bfc2	ugly RG encoding	2012-03-01 17:56:45 -05:00
Mark DePristo	aff508e091	ReadGroupProperties walker and associated infrastructure -- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file. -- Median tool that collects up to a given maximum number of elements and returns the median of the elements. -- Unit and integration tests for everything. -- Making name of TestProvider protected so subclasses can override name more easily	2012-03-01 15:01:11 -05:00
Mauricio Carneiro	9e95b10789	Context covariate now operates as a highly compressed bitset * All contexts with 'N' bases are now collapsed as uninformative * Context size is now represented internally as a BitSet but output as a dna string * Temporarily disabled sorted outputs because of null objects	2012-02-29 19:25:21 -05:00
Mauricio Carneiro	d379c3763a	DNA Sequence to BitSet and vice-versa conversion tools * Turns DNA sequences (for context covariates) into bit sets for maximum compression * Allows variable context size representation guaranteeing uniqueness. * Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary). * Unit Tests added	2012-02-29 19:25:20 -05:00
Eric Banks	129b5e7f6b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-28 10:09:34 -05:00
Eric Banks	a4a279ce80	Damn you, Mark	2012-02-28 10:09:09 -05:00
Khalid Shakir	0681bea5a5	Changed DoC from PartitionType.INTERVAL to PartitionType.NONE since it doesn't have a way to gather scattered outputs. Added MultiallelicSummary to HSP eval.	2012-02-28 09:27:27 -05:00
Eric Banks	bd398e30fd	Another quick optimization	2012-02-28 09:25:35 -05:00
Eric Banks	40bdadbda5	Minor optimization as per Mark	2012-02-28 09:24:07 -05:00
Eric Banks	d7928ad669	Drat, missed one: handle null alleles being passed in.	2012-02-27 21:31:54 -05:00
Mark DePristo	24356f11b7	Merged bug fix from Stable into Unstable -- Resolved conflict Conflicts: public/java/src/org/broadinstitute/sting/gatk/datasources/reads/SAMDataSource.java	2012-02-27 17:13:17 -05:00
Mark DePristo	0b29d54937	Changed most BAMSchedule ReviewedStingExceptions to UserExceptions -- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.	2012-02-27 17:08:41 -05:00
Mark DePristo	f9e8e82e33	Removed unused class variable from VCFHeaderLineTranslator	2012-02-27 17:07:19 -05:00
Mark DePristo	100ddef930	Fix typo in VariantContextBuilder	2012-02-27 17:06:45 -05:00
Mark DePristo	5f7ccdcc01	Avoid calling getBasePileup when there's no pileup in NBaseCount annotation	2012-02-27 15:12:25 -05:00
Mark DePristo	729bb954e2	Throws ReviewedStingException for a bug when parent VariantContext argument is null	2012-02-27 15:09:00 -05:00
Eric Banks	998ed8fff3	Bug fix to deal with VCF records that don't have GTs. While in there, optimized a bunch of related functions (including removing a copy of the method calculateChromosomeCounts(); why did we have 2 copies? very dangerous).	2012-02-27 14:56:10 -05:00
Mark DePristo	4d9582de77	More general catching of Exceptions in interval reading to throw MalformedFile exception in all cases -- Now throws UserException no matter what happens during the reading of the intervals file.	2012-02-27 14:02:26 -05:00
Mark DePristo	9712fed7a5	Trap SAMFormatException and rethrow as MalformatedBAM exception -- Trap errors in header and rethrow -- Wrap underlying iterator in MalformatedBAMErrorReformattingIterator	2012-02-27 13:52:50 -05:00
Eric Banks	64754e7870	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-27 11:31:41 -05:00
Eric Banks	850c5d0db2	Enabling Rank Sum Tests for multi-allelics: use ref vs any alt allele.	2012-02-27 09:59:36 -05:00
Eric Banks	dfdf4f989b	Enabling Fisher Strand for multi-allelics: use the alt allele with max AC. Added minor optimization to the method in the VC.	2012-02-27 09:50:09 -05:00
Guillermo del Angel	16122bea8d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-25 13:57:54 -05:00
Guillermo del Angel	dea35943d1	a) Bug fix in calling new functions that give indel bases and length from regular pileup in LocusIteratorByState, b) Added unit test to cover these.	2012-02-25 13:57:28 -05:00
Mark DePristo	c8a06e53c1	DoC now properly handles reference N bases + misc. additional cleanups -- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage. -- Added option --includeRefNSites that will include them in the calculation -- Added integration tests that ensures the per base tables (and so all subsequent calculations) work with and without reference N bases included -- Reorganized command line options, tagging advanced options with @Advanced	2012-02-25 11:32:50 -05:00
Guillermo del Angel	c9a4c74f7a	a) Bug fixes for last commit related to PileupElements (unit tests are forthcoming). b) Changes needed to make pool caller work in GENOTYPE_GIVEN_ALLELES mode c) Bug fix (yet again) for UG when GENOTYPE_GIVEN_ALLELES and EMIT_ALL_SITES are on, when there's no coverage at site and when input vcf has genotypes: output vcf would still inherit genotypes from input vcf. Now, we just build vc from scratch instead of initializing from input vc. We just take location and alleles from vc	2012-02-24 10:27:59 -05:00
Mauricio Carneiro	ee9a56ad27	Fix subtle bug in the ReduceReads stash reported by Adam * The tailSet generated every time we flush the reads stash is still being affected by subsequent clears because it is just a pointer to the parent element in the original TreeSet. This is dangerous, and there is a weird condition where the clear will affect it. * Fix by creating a new set, given the tailSet instead of trying to do magic with just the pointer.	2012-02-23 18:35:25 -05:00
Mark DePristo	e0c189909f	Added support for breakpoint alleles -- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic -- Added integrationtest to ensure that we can parse and write out breakpoint example	2012-02-23 12:14:48 -05:00
Guillermo del Angel	6866a41914	Added functionality in pileups to not only determine whether there's an insertion or deletion following the current position, but to also get the indel length and involved bases - definitely needed for extended event removal, and needed for pool caller indel functionality.	2012-02-23 09:45:47 -05:00
Eric Banks	d34f07dba0	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-22 20:41:03 -05:00
Ryan Poplin	2b6c0939ab	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-22 19:00:38 -05:00
Ryan Poplin	8695738400	Bug fix in HaplotypeCaller's GENOTYPE_GIVEN_ALLELES mode for insertions greater than length 1. The allele being genotyped was off by one base pair.	2012-02-22 19:00:04 -05:00
Christopher Hartl	2c1b14d35e	Mostly small changes to my own scala scripts: .vcf.gz compatibility for output files, smarter beagle generation, simple script to scatter-gather combine variants. Whole genome indel calling now uses the gold standard indel set.	2012-02-22 17:20:04 -05:00
Mauricio Carneiro	75783af6fc	int <-> BitSet conversion utils for MathUtils * added unit tests.	2012-02-21 14:10:36 -05:00
Guillermo del Angel	0f5674b95e	Redid fix for corner case when forming consensus with reads that start/end with insertions and that don't agree with each other in inserted bases: since I can't iterate over the elements of a HashMap because keys might change during iteration, and since I can't use ConcurrentHashMaps, the code now copies structure of (bases, number of times seen) into ArrayList, which can be addressed by element index in order to iterate on it.	2012-02-20 09:12:51 -05:00
Ryan Poplin	3d9eee4942	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-18 10:55:29 -05:00
Ryan Poplin	a8be96f63d	This caching in the BQSR seems to be too slow now that there are so many keys	2012-02-18 10:54:39 -05:00
Ryan Poplin	78718b8d6a	Adding Genotype Given Alleles mode to the HaplotypeCaller. It constructs the possible haplotypes via assembly and then injects the desired allele to be genotyped.	2012-02-18 10:31:26 -05:00
Guillermo del Angel	e724c63f2b	Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a key from a HashMap that's being iterated on	2012-02-17 17:18:43 -05:00
Guillermo del Angel	f2ef8d1d23	Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a key from a HashMap that's being iterated on	2012-02-17 17:15:53 -05:00
Guillermo del Angel	3e031a540f	Solve merge conflict	2012-02-17 10:56:03 -05:00
Guillermo del Angel	cd352f502d	Corner case bug fix: if a read starts with an insertion, when computing the consensus allele for calling the insertion was only added to the last element in the consensus key hash map. Now, an insertion that partially overlaps with several candidate alleles will have their respective count increased for all of them	2012-02-17 10:21:37 -05:00
Eric Banks	2f33c57060	No reason to restrict HaplotypeScore to bi-allelic SNPs when the plumbing for multi-allelic events is already present.	2012-02-16 13:58:00 -05:00
Guillermo del Angel	2f08846d82	Merged bug fix from Stable into Unstable	2012-02-14 21:26:25 -05:00
Guillermo del Angel	7dc6f73399	Bug fix for validation site selector: records with AC=0 in them were always being thrown out if input vcf was sites-only, even when -ignorePolymorphicStatus flag was set	2012-02-14 21:11:24 -05:00
Ryan Poplin	30085781cf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-14 14:01:20 -05:00
Ryan Poplin	ae5b42c884	Put base insertion and base deletions in the SAMRecord as a string of quality scores instead of an array of bytes. Start of a proper genotype given alleles mode in HaplotypeCaller	2012-02-14 14:01:04 -05:00
David Roazen	85d31f80a2	Merged bug fix from Stable into Unstable	2012-02-13 16:37:11 -05:00
David Roazen	03e5184741	Fix serious engine bug that could cause reads to be dropped under certain circumstances When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine file spans when it can. Unfortunately, the method that combines two BAM file spans was seriously flawed, and would produce a truncated union if the file spans overlapped in certain ways. This could cause entire regions of the BAM file containing reads within the requested intervals to be dropped. Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests to verify that the correct union is produced regardless of how the file spans happen to overlap. Thanks to Khalid, who did at least as much work on this bug as I did.	2012-02-13 16:25:21 -05:00
Eric Banks	ad90af94ed	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-13 15:10:10 -05:00
Eric Banks	0920a1921e	Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics.	2012-02-13 15:09:53 -05:00
Eric Banks	14981bed10	Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records.	2012-02-13 14:32:03 -05:00
Ryan Poplin	e9338e2c20	Context covariate needs to look in the reverse direction for negative stranded reads.	2012-02-13 13:40:41 -05:00
Ryan Poplin	41ffd08d53	On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live.	2012-02-13 12:35:09 -05:00
Ryan Poplin	3caa1b83bb	Updating HC integration tests	2012-02-11 11:48:32 -05:00
Ryan Poplin	9b8fd4c2ff	Updating the half of the code that makes use of the recalibration information to work with the new refactoring of the bqsr. Reverting the covariate interface change in the original bqsr because the error model enum was moved to a different class and didn't make sense any more.	2012-02-11 10:57:20 -05:00
Eric Banks	f52f1f659f	Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1.	2012-02-10 14:15:59 -05:00
Mauricio Carneiro	1fb19a0f98	Moving the covariates and shared functionality to public so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.	2012-02-10 11:44:01 -05:00
Eric Banks	5e18020a5f	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-10 11:08:33 -05:00
Eric Banks	f53cd3de1b	Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent.	2012-02-10 11:07:32 -05:00
Mauricio Carneiro	5af373a3a1	BQSR with indels integrated! * added support to base before deletion in the pileup * refactored covariates to operate on mismatches, insertions and deletions at the same time * all code is in private so original BQSR is still working as usual in public * outputs a molten CSV with mismatches, insertions and deletions, time to play! * barely tested, passes my very simple tests... haven't tested edge cases.	2012-02-09 18:46:45 -05:00
Eric Banks	7a937dd1eb	Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly.	2012-02-09 16:14:22 -05:00
Eric Banks	0f728a0604	The Exact model now subsets the VC to the first N alleles when the VC contains more than the maximum number of alleles (instead of throwing it out completely as it did previously). [Perhaps the culling should be done by the UG engine? But theoretically the Exact model can be called outside of the UG and we'd still want the context subsetted.]	2012-02-09 14:02:34 -05:00
Matt Hanna	aa097a83d5	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-09 11:26:48 -05:00
Matt Hanna	b57d4250bf	Documentation request by Eric. At each stage of the GATK where filtering occurs, added documentation suggesting the goal of the filtering along with examples of suggested inputs and outputs.	2012-02-09 11:24:52 -05:00
Mauricio Carneiro	d561914d4f	Revert "First implementation of GATKReportGatherer" premature push from my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport. This reverts commit aea0de314220810c2666055dc75f04f9010436ad.	2012-02-08 23:28:55 -05:00
Eric Banks	2f800b078c	Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again.	2012-02-08 15:27:16 -05:00
Matt Hanna	51ac87b28c	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-08 08:43:55 -05:00
Matt Hanna	5b58fe741a	Retiring Picard customizations for async I/O and cleaning up parts of the code to use common Picard utilities I recently discovered. Also embedded bug fix for issues reading sparse shards and did some cleanup based on comments during BAM reading code transition meetings.	2012-02-08 08:34:37 -05:00
Roger Zurawicki	c0c676590b	First implementation of GATKReportGatherer - Added the GATKReportGatherer - Added private methods in GATKReport to combine Tables and Reports - It is very conservative and it will only gather if the table columns match. - At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data. Added the gatherer functions to CoverageByRG Also added the scatterCount parameter in the Interval Coverage script Made some more GATKReport methods public The UnitTest included shows that the merging methods work Added a getter for the PrimaryKeyName Fixed bugs that prevented the gatherer from working Working GATKReportGatherer Has only the functionality to addLines The input file parser assumes that the first column is the primary key Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-02-07 18:14:47 -05:00
Mauricio Carneiro	e89887cd8e	laying groundwork to have insertions and deletions going through the system.	2012-02-07 18:11:53 -05:00
Mauricio Carneiro	0d3ea0401c	BQSR Parameter cleanup * get rid of 320C argument that nobody uses. * get rid of DEFAULT_READ_GROUP parameter and functionality (later to become an engine argument).	2012-02-07 14:42:11 -05:00
Eric Banks	717cd4b912	Document -L unmapped	2012-02-07 13:30:54 -05:00
Eric Banks	718da7757e	Fixes to ValidateVariants as per GS post: ref base of mixed alleles were sometimes wrong, error print out of bad ACs was throwing a RuntimeException, don't validate ACs if there are no genotypes.	2012-02-07 13:15:58 -05:00
Eric Banks	9d1a19bbaa	Multi-allelic indels were not being printed out correctly in VariantsToTable; fixed.	2012-02-06 22:49:29 -05:00
Mauricio Carneiro	5961868a7f	fixup for BQSR (HC integration tests) In the new BQSR implementation, covariates do depend on the RecalibrationArgumentCollection.	2012-02-06 22:47:27 -05:00
Mauricio Carneiro	6e6f0f10e1	BaseQualityScoreRecalibration walker (bqsr v2) first commit includes * Adding the context covariate standard in both modes (including old CountCovariates) with parameters * Updating all covariates and modules to use GATKSAMRecord throughout the code. * BQSR now processes indels in the pileup (but doesn't do anything with them yet)	2012-02-06 17:38:29 -05:00
Eric Banks	0717c79901	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-06 16:23:36 -05:00
Eric Banks	91897f5fe7	Transpose rows/cols in AF table to make it molten (so I can plot easily in R)	2012-02-06 16:23:32 -05:00
Guillermo del Angel	fb5786385c	Merged bug fix from Stable into Unstable	2012-02-06 13:22:56 -05:00
Guillermo del Angel	6ec686b877	Complement to previous commit: make sure we also don't inherit filter from input VCF when genotyping at an empty site	2012-02-06 13:19:26 -05:00
Guillermo del Angel	93ffca1e3a	Merged bug fix from Stable into Unstable	2012-02-06 11:58:58 -05:00
Guillermo del Angel	827be878b4	Bug fix when running UG in GenotypeGivenAlleles mode: if an input site to genotype had no coverage, the output VCF had AC,AF and AN inherited from input VCF, which could have nothing to do with given BAM so numbers could be non-sensical. Now new vc has clear attributes instead of attributes inherited from input VCF.	2012-02-06 11:58:13 -05:00
Eric Banks	fbbd04621d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-06 11:53:31 -05:00
Eric Banks	edb4edc08f	Commented out unused metrics for now	2012-02-06 11:53:15 -05:00
Ryan Poplin	096c23a473	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-06 11:10:38 -05:00
Ryan Poplin	dc05b71e39	Updating Covariate interface with Mauricio to include an errorModel parameter. On the fly recalibration of base insertion and base deletion quals is live for the HaplotypeCaller	2012-02-06 11:10:24 -05:00
Guillermo del Angel	1e11408f8b	Merged bug fix from Stable into Unstable	2012-02-06 10:34:26 -05:00
Guillermo del Angel	090d87b48b	Bug fix in ValidationSiteSelector: when input vcf had genotypes and was multiallelic, the parsing of the AF/AC fields was wrong. Better logic to unify parsing of field	2012-02-06 10:33:12 -05:00
Eric Banks	9d94f310f1	Break AF histogram into max and min AFs	2012-02-06 09:01:19 -05:00
Ryan Poplin	b7ffd144e8	Cleaning up the covariate classes and removing unused code from the bqsr optimizations in 2009.	2012-02-06 08:54:42 -05:00
Eric Banks	cef550903e	Minor optimization	2012-02-06 00:48:00 -05:00
Ryan Poplin	5343f8ba67	Initial version of on-the-fly, lazy loading base quality score recalibration. It isn't completely hooked up yet but I'm committing so Mauricio and Mark can see how I envision it will fit together. Look it over and give any feedback. With the exception of the Solid specific code we are very very close to being able to remove TableRecalibrationWalker from the code base and just replace it with PrintReads -BQSR recal.csv	2012-02-05 13:09:03 -05:00
Ryan Poplin	f94d547e97	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-03 17:14:20 -05:00
Ryan Poplin	894d3340be	Active Region Traversal should use GATKSAMRecords everywhere instead of SAMRecords. misc cleanup.	2012-02-03 17:13:52 -05:00
Mauricio Carneiro	4a57add6d0	First implementation of DiagnoseTargets * calculates and interprets the coverage of a given interval track * allows to expand intervals by specified number of bases * classifies targets as CALLABLE, LOW_COVERAGE, EXCESSIVE_COVERAGE and POOR_QUALITY. * outputs text file for now (testing purposes only), soon to be VCF. * filters are overly aggressive for now.	2012-02-03 17:12:43 -05:00
Mauricio Carneiro	3dd6a1f962	Adding some generic sum and average functions to MathUtils	2012-02-03 17:12:43 -05:00
Mauricio Carneiro	e1d69e4060	make the size of a GenomeLoc int instead of long it will never be bigger than an int and it's actually useful to be an int so we can use it as parameters to array/list/hash size creation.	2012-02-03 17:12:42 -05:00
Ryan Poplin	0e44430e47	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-03 13:45:11 -05:00
Christopher Hartl	aa3638ecb3	Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-03 13:42:09 -05:00
Eric Banks	3abfbcbcf2	Generalized the TDT for multi-allelic events	2012-02-03 12:23:21 -05:00
Ryan Poplin	601e53d633	Fix when specifying preset active regions with -AR argument	2012-02-02 16:34:26 -05:00
Christopher Hartl	0111505ea9	Terrible. Swapping the paternal and sample ids.	2012-02-02 11:41:16 -05:00
Ryan Poplin	1f50f6970b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-02 10:17:13 -05:00
Ryan Poplin	4ed06801a7	Updating HaplotypeCaller's HMM calc to use GOP as a function of the read instead of a function of the haplotype in preparation for IQSR	2012-02-02 10:17:04 -05:00
Matt Hanna	8adfc79123	Merged bug fix from Stable into Unstable	2012-02-01 16:07:41 -05:00
Matt Hanna	30b937d2af	Fix bug discovered in FGTP branch in which BlockInputStream returns -1 in cases where some data could be read, but not all the data requested by the caller.	2012-02-01 16:06:22 -05:00
Mauricio Carneiro	45da892ecc	Better exceptions to catch malformed reads * throw exceptions in LocusIteratorByState when hitting reads starting or ending with deletions	2012-02-01 11:56:19 -05:00
Christopher Hartl	810996cfca	Introducing: VariantsToPed, the world's most annoying walker! And also a busted QScript to run it that I need Khalid's help debugging ( frownie face ). Note that VariantsToPed and PlinkSeq generate the same binary file (up to strand flips...thanks PlinkSeq), so I know it's working properly. Hooray!	2012-02-01 10:39:03 -05:00
Christopher Hartl	25d943f706	Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-01 10:32:11 -05:00
Ryan Poplin	056b24ccd6	Resolving merge conflicts with LocusIteratorByState	2012-01-31 16:13:32 -05:00
Ryan Poplin	febc634557	Changing PileupElement's isSoftClipped to isNextToSoftClip since soft clipped bases aren't actually added to pileups, oops. Removing the intrinsic clustered variants filter from the HaplotypeCaller	2012-01-31 16:06:14 -05:00
Matt Hanna	7f70612beb	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-31 11:59:25 -05:00
Matt Hanna	a630db1703	Oops...HierarchicalMicroScheduler was transforming any exception from the walker level into a ReviewedStingException. Thanks to Ryan for pointing this out.	2012-01-31 11:58:21 -05:00
Christopher Hartl	faba3dd530	Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-31 10:25:29 -05:00
Mauricio Carneiro	17dbe9a95d	A few cleanups in the LocusIteratorByState * No more N's in the extended event pileups * Only add to the pileup MQ0 counter if the read actually goes into the pileup	2012-01-31 09:40:51 -05:00
Ryan Poplin	f9162ea705	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-30 19:45:19 -05:00
Ryan Poplin	abb91cf26b	Increasing the size of the active regions that are produced by the active probability integrator, more context is needed to call more complex events	2012-01-30 15:36:12 -05:00
Mauricio Carneiro	d5d4fa8a88	Fixed discordance bug reported by Brad Chapman discordance now reports discordance between genotypes as well (just like concordance)	2012-01-30 09:50:45 -05:00
Mark DePristo	3164c8dee5	S3 upload now directly creates the XML report in memory and puts that in S3 -- This is a partial fix for the problem with uploading S3 logs reported by Mauricio. There the problem is that the java.io.tmpdir is not accessible (network just hangs). Because of that the s3 upload fails because the underlying system uses tmpdir for caching, etc. As far as I can tell there's no way around this bug -- you cannot overload the java.io.tmpdir programmatically and even if I could what value would we use? The only solution seems to me is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.	2012-01-29 15:14:58 -05:00
Menachem Fromer	0e17cbbce9	Merged bug fix from Stable into Unstable	2012-01-27 16:03:16 -05:00
Menachem Fromer	a9671b73ca	Fix to permit proper handling of mapping qualities between 128 and 255 (which get converted to byte values of -128 to -1)	2012-01-27 16:01:30 -05:00
Ryan Poplin	f7ac1f4a69	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-27 15:12:55 -05:00
Ryan Poplin	fc08235ff3	Bug fix in active region traversal, locusView.getNext() skips over pileups with zero coverage but still need to count them in the active probability integrator	2012-01-27 15:12:37 -05:00
Mark DePristo	0f2e8400b5	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-27 10:12:50 -05:00
Mauricio Carneiro	ec9920b04f	Updating the SAM TAG for Original Alignment Start to "OP" per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.	2012-01-27 08:51:39 -05:00
Mark DePristo	13d1626f51	Minor improvements in ref QC walker. Unfortunately this doesn't actually catch Chris's error	2012-01-27 08:24:22 -05:00
Mauricio Carneiro	0d4027104f	Reduced reads are now aware of their original alignments * Added annotations for reads that had been soft clipped prior to being reduced so that we can later recuperate their original alignments (start and end). * Tags keep the alignment shifts, not real alignment, for better compression * Tags are defined in the GATKSAMRecord * GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd() * Updated ReduceReads MD5s accordingly	2012-01-26 17:06:36 -05:00
Eric Banks	07f72516ae	Unsupported platform should be a user error	2012-01-26 16:14:25 -05:00
Ryan Poplin	cdff23269d	HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close.	2012-01-26 15:56:33 -05:00
Christopher Hartl	673ceadd11	While this fix worked for the evaluator module, it could potentially have bad effects in the phasing walkers. Special-case nocalls in the PhasingEvaluator and return AllelePair to previous state.	2012-01-26 13:06:36 -05:00
Christopher Hartl	9c6fda7e15	Yup. I was right.	2012-01-26 12:54:11 -05:00
Christopher Hartl	7d059540a4	Allow segments of genome to be excluded in generating a reference panel. Occasionally targets would contain no variation (typically, in the middle of the centromere), which beagle doesn't particularly like, and errors out rather than producing empty output files. The best way to deal with these is to just exclude the regions on a second-pass, and the remaining bits will be gathered with no additional work. AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but i suspect it's a no-call (".") rather than a no call ("./.").	2012-01-26 12:43:52 -05:00
Ryan Poplin	25532bdc37	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-26 11:43:32 -05:00
Ryan Poplin	390d493049	Updating ActiveRegionWalker interface to output a probability of active status instead of a boolean. Integrator runs a band-pass filter over this probability to produce actual active regions. First version of HaplotypeCaller which decides for itself where to trigger and assembles those regions.	2012-01-26 11:37:08 -05:00
Eric Banks	859dd882c9	Don't make it standard for now	2012-01-26 00:38:16 -05:00
Eric Banks	c5e81be978	Adding pairwise AF table. Not polished at all, but usable nonetheless.	2012-01-26 00:37:06 -05:00
Eric Banks	702a2d768f	Initial version of multi-allelic summary module in VariantEval	2012-01-25 19:42:55 -05:00
Eric Banks	9a60887567	Lost an import in the merge	2012-01-25 19:41:41 -05:00
Eric Banks	cba5f1a8b1	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-25 19:19:03 -05:00
Eric Banks	add6918f32	Cleaner, more efficient way of determining the last dependent set in the queue.	2012-01-25 16:21:10 -05:00
Menachem Fromer	db645a94ca	Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES	2012-01-25 16:10:59 -05:00
Eric Banks	ef335a5812	Better implementation of the fix; PL index is now traversed in order.	2012-01-25 15:15:42 -05:00
Eric Banks	8e2d372ab0	Use remove instead of setting the value to null	2012-01-25 14:41:34 -05:00
Eric Banks	05816955aa	It was possible that we'd clean up a matrix column too early when a dependent column aborted early (with not enough probability mass) because we weren't being smart about the order in which we created dependencies. Fixed.	2012-01-25 14:28:21 -05:00
Eric Banks	2799a1b686	Catch exception for bad type and throw as a TribbleException	2012-01-25 12:15:51 -05:00
Eric Banks	96b62daff3	Minor tweak to the warning message.	2012-01-25 11:55:33 -05:00
Eric Banks	fb863dc6a7	Warn user when trying to run with EMIT_ALL_SITES with indels; better docs for that option.	2012-01-25 11:50:12 -05:00
Eric Banks	e349b4b14b	Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod.	2012-01-25 11:35:54 -05:00
Eric Banks	ea3d4d60f2	This annotation requires rods and should be annotated as such	2012-01-25 11:35:13 -05:00
Ryan Poplin	bbefe4a272	Added option to be able to write out the active regions to an interval list file	2012-01-25 09:47:06 -05:00
Ryan Poplin	9818c69df6	Can now specify active regions to process at the command line, mainly for debugging purposes	2012-01-25 09:32:52 -05:00
Mauricio Carneiro	ffd61f4c1c	Refactor the Pileup Element with regards to indels Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion. * Refactored PileupElement to distinguish clearly between deletions and read starting with insertion * Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups * Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements. * Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions. * Every deletion now has an offset (start of the event) * Fixed bug when LocusIteratorByState found a read starting with insertion that happened to be a reduced read. * Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine). * Pileup depth of coverage for a deleted base will now return the average coverage around the deletion. * Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's * The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan! phew!	2012-01-24 16:07:21 -05:00
Matt Hanna	c312bd5960	Weirdly, PicardException inherits from SAMException, which means that our specialty code for reporting malformed BAMs was actually misreporting any error that happened in the Picard layer as a BAM ERROR. Specifically changing PicardException to report as a ReviewedStingException; we might want to change it in the future. I'll followup with the Picard team to make sure they really, really want PicardException to inherit from SAMException.	2012-01-24 15:30:04 -05:00
Mark DePristo	0a3172a9f1	Fix for ref 0 bases for Chris -- Disturbingly, fixing this bug doesn't actually cause any test failures. -- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly. -- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt	2012-01-24 10:55:09 -05:00
Khalid Shakir	c18beadbdb	Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc. Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.	2012-01-23 16:17:04 -05:00
Mark DePristo	02450e4b12	Merged bug fix from Stable into Unstable	2012-01-23 12:08:39 -05:00
Christopher Hartl	798596257b	Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module.	2012-01-23 10:50:16 -05:00
Mark DePristo	80a4ce0edf	Bugfix for incorrect error messages for missing BAMs and VCFs -- Missing BAMs were appearing as StingExceptions -- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions -- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions -- Added path to standard b37 BAM to BaseTest -- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.	2012-01-23 09:52:07 -05:00
Guillermo del Angel	31d2f04368	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-23 09:23:03 -05:00
Guillermo del Angel	966387ca0b	Next intermediate commit in the pool caller. Lots of bug fixes and now we can emit true vcf's with calls in discovery mode (still of unknown quality) - old validation mode is temporarily broken,will be fixed in next refactoring.	2012-01-23 09:22:31 -05:00
Ryan Poplin	4d6312d4ea	HaplotypeCaller is now an ActiveRegionWalker.	2012-01-22 14:31:01 -05:00
Christopher Hartl	3b1aad4f17	After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call.	2012-01-20 23:43:51 -05:00
Christopher Hartl	9b4f6afa21	Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles appear to be set to null, so this should not affect the actual phasing in the output VCF.	2012-01-20 23:07:59 -05:00
Ryan Poplin	4b18786b5d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-19 22:05:20 -05:00
Ryan Poplin	ace9333068	Active region walkers can now see the reads in a buffer around their active regions. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal.	2012-01-19 22:05:08 -05:00
Menachem Fromer	066da80a3d	Added KEEP_UNCONDITIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output	2012-01-19 18:19:58 -05:00
Christopher Hartl	7f3ad25b01	Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF. T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.	2012-01-19 10:54:48 -05:00
Ryan Poplin	7e082c7750	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-19 09:11:23 -05:00
Eric Banks	ab8f499bc3	Annotate with FS even for filtered sites	2012-01-18 22:04:51 -05:00
Guillermo del Angel	b123416c4c	Resolve stale merge changes	2012-01-18 20:56:36 -05:00
Guillermo del Angel	2eb45340e1	Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine optimal alt allele at each computable site and will compute AC distribution on it. Current implementation is not working yet if there's more than one pool and it will only output biallelic sites, no functionality for true multi-allelics yet	2012-01-18 20:54:10 -05:00
Ryan Poplin	0268da7560	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-18 09:53:00 -05:00
Ryan Poplin	11982b5a34	We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN.	2012-01-18 09:42:41 -05:00
Mark DePristo	763c81d520	No longer enforce MAX_ALLELE_SIZE in VCF codec -- Instead issue a warning when a large (>1MB) record is encountered -- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()	2012-01-18 07:35:11 -05:00
Mark DePristo	0c7865fdb5	UnitTest for reverseAlleleClipping -- No code modified yet, just implementing a unit test to ensure correctness of the existing code	2012-01-18 07:35:11 -05:00
Mark DePristo	62801e430a	Bugfix for unnecessary optimization -- don't cache the ref bytes	2012-01-17 16:40:26 -05:00
Mark DePristo	f2b0575dee	Detect unreasonably large allele strings (>2^16) and throw an error -- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places. -- Tribble was updated so we actually could read the line properly (rev. to 51 here). -- Still the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.	2012-01-17 16:40:26 -05:00
Ryan Poplin	8b0ddf0aaf	Adding notes to CountCovariates docs about using interval lists as database of known variation	2012-01-17 16:13:13 -05:00
Matt Hanna	40ebc17437	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-17 14:49:17 -05:00
Matt Hanna	41d70abe4e	At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings.	2012-01-17 14:47:53 -05:00
Ryan Poplin	ae259f81cc	Bug fixing for merging of read fragments when one fragment contained an indel	2012-01-17 14:39:27 -05:00
Christopher Hartl	cde224746f	Bait Redesign supports baits that overlap, by picking only the start of intervals. CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used. Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.	2012-01-17 13:51:05 -05:00
Ryan Poplin	8e23c98dd9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-17 13:46:28 -05:00
Matt Hanna	32ccde374b	Merged bug fix from Stable into Unstable	2012-01-17 11:08:35 -05:00
Matt Hanna	3ba918aff1	Error message cleanup in BAM indexing code.	2012-01-17 11:05:42 -05:00
Mauricio Carneiro	cec7107762	Better location for the downsampling of reads in PrintReads * using the filter() instead of map() makes for a cleaner walker. * renaming the unit tests to make more sense with the other unit and integration tests	2012-01-14 14:06:09 -05:00
Mark DePristo	b06074d6e7	Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe -- Uses PriorityBlockingQueue instead of PriorityQueue -- synchronized keywords added to all key functions that modify internal state Note that this hasn't been tested extensively. Based on report: http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic	2012-01-13 09:33:16 -05:00
Mauricio Carneiro	28aa353501	Added "unbiased" downsampling parameter to PrintReads * also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.	2012-01-12 16:33:55 -05:00
Matt Hanna	2c3176eb80	Merged bug fix from Stable into Unstable	2012-01-12 13:31:10 -05:00
Matt Hanna	cd43f016ce	Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior.	2012-01-12 13:29:11 -05:00
Eric Banks	ed34b4f088	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-12 10:27:26 -05:00
Eric Banks	e7fe9910f7	Create the temp storage for calculating cell values just once as per Mark's TODO	2012-01-12 10:27:10 -05:00
Eric Banks	f5f5ed5dcd	Don't initialize the cell conformation values (use an else in the loop instead) as per Mark's TODO	2012-01-12 08:50:03 -05:00
Eric Banks	410a340ef5	Swapping the iteration order to run over AF conformations and then samples instead of the reverse minimizes calls to HashMap.get; instead of it being O(n) since we called it for each sample it's now O(1). Runtime on T2D GENES test set is reduced by 5-10%. More optimizations to follow.	2012-01-12 02:04:03 -05:00
Mauricio Carneiro	77a03c9709	Patching special case in the adaptor clipping * if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair. * updated md5's accordingly	2012-01-11 17:47:44 -05:00
Eric Banks	25d0d53d88	Moving the approximate summing of log10 vals to MathUtils; keeping the more efficient implementation of fast rounding.	2012-01-10 12:38:47 -05:00
Eric Banks	589397d611	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-10 12:36:48 -05:00
Matt Hanna	e923a2e512	Revving Picard to incorporate final version of ReadWalker performance improvements.	2012-01-10 12:12:33 -05:00
Eric Banks	0f36f6947e	Resolving merge conflicts	2012-01-10 11:44:16 -05:00
Eric Banks	f2cecce10f	Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously).	2012-01-10 11:34:23 -05:00
Matt Hanna	509c3d87b0	Merged bug fix from Stable into Unstable	2012-01-09 23:08:46 -05:00
Matt Hanna	dc60757b68	Eliminate unnecessary strong references (and therefore memory held) by tree reduce entries that have already been processed. Thanks to Tim Fennell for the bug report.	2012-01-09 23:04:53 -05:00
Matt Hanna	fda1795791	Merged bug fix from Stable into Unstable	2012-01-08 22:04:44 -05:00
Matt Hanna	1f1233b669	Fix for a rare but insidious bug in position tracking during async BAM file reading. Thanks to Khalid for spotting and reporting the issue.	2012-01-08 22:03:35 -05:00
Khalid Shakir	5793625592	No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice. QScript accessor to QSettings to specify a default runName and other default function settings. Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs. Gathered log files concatenate all log files together into the stdout. InProcessFunctions now have PrintStreams for stdout and stderr. Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml. During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file. In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope. Added more detailed output when running with -l DEBUG. Simplified graphviz visualization for additional debugging. Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List) Minor cleanup to build including sending ant gsalib to R's default libloc.	2012-01-08 12:11:55 -05:00
Guillermo del Angel	d4e7655d14	Added ability to call multiallelic indels, if -multiallelic is included in UG arguments. Simple idea: we genotype all alleles with count >= minIndelCnt. To support this, refactored code that computes consensus alleles. To ease merging of multiple alt alleles, we create a single vc for each alt allele and then use VariantContextUtils.simpleMerge to carry out merging, which takes care of handling all corner conditions already. In order to use this, interface to GenotypeLikelihoodsCalculationModel changed to pass in a GenomeLocParser object (why are these objects so hard to handle??). More testing is required and feature turned off by default.	2012-01-06 11:24:38 -05:00
Ryan Poplin	616ff8ea01	fixed typo in help text	2012-01-06 10:36:11 -05:00
Mark DePristo	dd80ffbbbe	Merged bug fix from Stable into Unstable	2012-01-05 21:51:48 -05:00
Mark DePristo	c96fee477c	Bug fix for VariantSummary -- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional. Fixed. Has the annoying side effect that indels > 50bp in size are not counted as indels, and so are subtracted from both the novel and known counts for indels. C'est la vie -- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels. Using this more recent and representative file probably a good idea for more future tests in VE and other tools. File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data	2012-01-05 21:51:06 -05:00
Eric Banks	f5e10e9879	Merged bug fix from Stable into Unstable	2012-01-05 15:35:09 -05:00
Eric Banks	18ed954741	Compute Ti/Tv only if bi-allelic	2012-01-05 15:33:26 -05:00
Ryan Poplin	a6886a4cc0	Initial commit of the Active Region Traversal. Not ready to be used by anyone yet.	2012-01-04 17:03:21 -05:00
Guillermo del Angel	58d4539304	Enabled banded indel computation by default. Reversed logic in input UG argument so that we can still disable it if required. Minor changes to integration tests due to minor differences in GL's and in annotations	2012-01-04 15:28:26 -05:00
Mauricio Carneiro	9ff8a01da2	Merged bug fix from Stable into Unstable	2012-01-03 18:10:39 -05:00
Mauricio Carneiro	9b55505c03	Fixing PairHMMIndelErrorModel array out of bounds This error was due to the ReadClipper change of contract. Before the read utils would return null if a read was entirely clipped, now it returns an empty (safe) GATKSAMRecord.	2012-01-03 18:08:46 -05:00
Christopher Hartl	2c3a9ce02f	Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable	2012-01-03 17:25:56 -05:00
David Roazen	621ee2b613	Merged bug fix from Stable into Unstable	2012-01-03 16:56:49 -05:00
Christopher Hartl	9093de1132	Cleanup: remove code to calculate the MLE AC in the UGE.	2012-01-03 15:58:51 -05:00
Christopher Hartl	2d093828a4	Final changes to Junky (been frozen for a while, but uncommitted) and the qscript for it. A first cursory implementation of the trellis-based Exact AC-constrained genotyping algorithm in UGE. Nothing calls into it, so this should be entirely safe (and, no surprise, it passes UG integration tests).	2012-01-03 15:33:04 -05:00
David Roazen	ea6e718cb8	SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline. For now, we recommend only running with the GRCh37.64 database.	2012-01-03 15:18:36 -05:00
Christopher Hartl	93e1417b6e	Update to the VSS GATK documentation.	2012-01-03 13:39:31 -05:00
Eric Banks	ab8d47d9a5	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-03 09:38:49 -05:00
Mauricio Carneiro	3d4bf273de	Added getPileupForReadGroups to ReadBackPileup * returns a pileup for all the read groups provided. * saves us from multiple calls to getPileup (which is very inefficient)	2012-01-03 09:35:11 -05:00
Mauricio Carneiro	4a208c7c06	Refactor of the downsampling machinery to accept different strategies * Implemented Adaptive downsampler * Added integration test * Added option to RRead scala script to choose downsampling strategy	2012-01-03 09:29:47 -05:00
Mauricio Carneiro	21ae3ef5f9	Added downsampling support to ReduceReads * Downsampling is now a parameter to the walker with default value of 0 (no downsampling) * Downsampling selects reads at random at the variant region window and strives to achieve uniform coverage if possible around the desired downsampling value. * Added integration test	2012-01-03 09:29:46 -05:00
Mauricio Carneiro	cd68cc239b	Added knuth-shuffle (KS) and randomSubset using KS to MathUtils * Knuth-shuffle is a simple, yet effective array permutator (hope this is good english). * added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation. * added unit tests to both functions	2012-01-03 09:29:46 -05:00
Mauricio Carneiro	94791a2a75	Add support for reads starting with insertion * Modified cleanCigarShift to allow insertions in the beginning and end of the read * Allowed cigars starting/ending in insertions in the systematic ReadClipper tests * Updated all ReadClipper unit tests * ReduceReads does not hard clip leading insertions by default anymore * SlidingWindow adjusts start location if read starts with insertion * SlidingWindow creates an empty element with insertions to the right * Fixed all potential divide by zero with totalCount() (from BaseCounts) * Updated all Integration tests * Added new integration test for multiple interval reducing	2012-01-03 09:29:45 -05:00
Mark DePristo	d05f0c2318	GATKPerformanceOverTime script update -- Automatic detection of most recent version of GATK release (just tell the script now to use 1.2, 1.3, and 1.4) -- Uses 1.4 now -- By default we do 9 runs of each non-parallel test -- In PathUtils added convenience utility to find most recent release GATK jar with a specific release number	2012-01-02 09:58:46 -05:00
Eric Banks	393993e0c7	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-31 20:42:46 -05:00
Mauricio Carneiro	55cfa76cf3	Updated integration tests for the new adaptor clipping fix.	2011-12-30 18:47:14 -05:00
Mauricio Carneiro	c7d0a9ebee	Forgot to test for inter-chromosomal mates in the adaptor clipping * Fixing bug caught by Eric (and Kristian)	2011-12-30 00:19:53 -05:00
Matt Hanna	a259bfefd4	First commit addressing problems running RTC in parallel. Turns out that because the RTC is the first walker to 'correctly' tree reduce according to functional programming standards, the RTC has revealed a few problems with the tree reducer holding on to too much data. This is the first and smaller of two commits to reduce memory consumption. The second commit will likely be pushed after GATK1.4 is released.	2011-12-29 16:22:14 -05:00
Eric Banks	1a45ea5a05	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-29 11:37:15 -05:00
Mauricio Carneiro	f692911903	GATKSAMRecord emptyRead static constructor * Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking. * All ReadClipper utilities now emit empty reads for fully clipped reads	2011-12-27 17:01:17 -05:00
Mauricio Carneiro	8259c748f2	No more Filtered Reads tag. All synthetic reads are marked with the reduced read tag.	2011-12-27 17:01:17 -05:00
Eric Banks	d20a25d681	A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently).	2011-12-27 16:50:38 -05:00
Eric Banks	adff40ff58	Minor optimizations to avoid extra processing (esp. for reduced reads)	2011-12-27 13:16:25 -05:00
Mauricio Carneiro	17bfe48d5e	Made all class methods private in the ReadClipper * ReadClipperUnitTest now uses static methods * Haplotype caller now uses static methods * Exon Junction Genotyper now uses static methods	2011-12-27 02:11:32 -05:00
Eric Banks	dd990061f6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-26 14:45:35 -05:00
Eric Banks	2130b39f33	Found the bug in the engine: RodLocusView was using the wrong seek method so that it would only move to the first locus of a shard (and with multi-locus shards, this meant that we never processed RODs from the other positions). In fact, because the seek(Shard) method is extremely misleading and now no longer used, I think it's safer to delete it and make everyone use the much more transparent seek(GenomeLoc). Note that I have not re-enabled my improvements to the intervals accumulation of ReferenceDataSource because that inefficiency is still present downstream in RodLocusView; need to discuss those changes with Matt.	2011-12-26 14:45:19 -05:00
Mauricio Carneiro	35c41409a1	Better contracts and docs for the ReadClipper * Described the ReadClipper contract in the top of the class * Added contracts where applicable * Added descriptive information to all tools in the read clipper * Organized public members and static methods together with the same javadoc	2011-12-23 19:36:57 -05:00
David Roazen	506c0e9c97	Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released and we've verified the quality of its annotations.	2011-12-23 19:12:57 -05:00
Eric Banks	24c84da60d	'Fixing' the changes in ReferenceDataSource so that a shard properly contains a list of GenomeLocs instead of a single merged one. However, that uncovered a probable bug in the engine, so instead of letting this code fester unfixed in the build (affecting everyone in the group) I've decided to revert the previous (slow, but working) version and fix the engine in my own branch.	2011-12-23 15:39:12 -05:00
Eric Banks	8762313a0d	Better TODO message	2011-12-22 20:54:35 -05:00
Eric Banks	a815e875a8	Removing debugging output	2011-12-22 15:49:11 -05:00
Eric Banks	deef542a38	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-22 15:44:58 -05:00
Eric Banks	6d260ec6ae	Start printing traversal stats after 30 seconds. I can't stand waiting 2 minutes.	2011-12-22 15:40:59 -05:00
Mauricio Carneiro	cadff40247	getRefCoordSoftUnclippedStart and End refactor These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips. * Removed from ReadUtils * Added to GATKSAMRecord * Changed name to getSoftStart() and getSoftEnd * Updated third party code accordingly.	2011-12-20 17:48:51 -05:00
Mauricio Carneiro	07128a2ad2	ReadUtils cleanup * Removed all clipping functionality from ReadUtils (it should all be done using the ReadClipper now) * Cleaned up functionality that wasn't being used or had been superseded by other code (in an effort to reduce multiple unsupported implementations) * Made all meaningful functions public and added better comments/explanation to the headers	2011-12-20 17:48:40 -05:00
Mauricio Carneiro	1c4774c475	Static versions of the hard clipping utilities For simplified access to the hard clipping utilities. No need to create a ReadClipper object if you are not doing multiple complicated clipping operations, just use the static methods. examples: ReadClipper.hardClipLowQualEnds(2); ReadClipper.hardClipAdaptorSequence();	2011-12-20 17:48:39 -05:00
Mauricio Carneiro	f73ad1c2e2	Bugfix/Rewrite: Algorithm to determine adaptor boundaries The algorithm wasn't accounting for the case where the read is the reverse strand and the insert size is negative. * Fixed and rewrote for more clarity (with Ryan, Mark and Eric). * Restructured the code to handle GATKSAMRecords only * Cleaned up the other structures and functions around it to minimize clutter and potential for error. * Added unit tests for all 4 cases of adaptor boundaries.	2011-12-20 17:48:39 -05:00
Mark DePristo	0cc5c3d799	General improvements to Queue -- Support for collecting resources info from DRMAA runners -- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4 -- NCoresRequest is a testing queue script for this. -- Added two command line arguments: -- multiCoreJerk: don't request multiple cores for jobs with nt > 1. This was the old behavior but it's really not the best way to run parallel jobs. Now with queue if you run nt = 4 the system requests 4 cores on your host. If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk -- job_parallel_env: parallel environment name used with SGE to request multicore jobs. Equivalent to -pe job_parallel_env NT for NT > 1 jobs	2011-12-20 14:05:09 -05:00
Eric Banks	7204fcc2c3	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-20 12:59:11 -05:00
Eric Banks	8ade2d6ac2	max_alternate_alleles also ready to be made public	2011-12-20 12:59:02 -05:00
Eric Banks	6f52bd580b	--multiallelic mode is not hidden anymore (but it is annotated as advanced); added docs	2011-12-20 12:47:38 -05:00
Mauricio Carneiro	37e0044c48	Removing unclipSoftClipBases from ReadUtils * it was buggy and dangerous. * Updated Chris' code to use the ReadClipper.	2011-12-20 00:11:26 -05:00
Mauricio Carneiro	78d9bf7196	Added REVERT_SOFTCLIPPED_BASES capability to ReadClipper * New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches. * Added functionality to clipping op to revert all soft clip bases in a read into matches * Added revertSoftClipBases function to the ReadClipper for public use * Wrote systematic unit tests	2011-12-20 00:04:30 -05:00
Christopher Hartl	24585062f8	Merge branch 'incoming'	2011-12-19 23:16:36 -05:00
Christopher Hartl	67298f8a11	AFCR made public (for use in VSS) Minor changes to ValidationSiteSelector logic (SampleSelectors determine whether a site is valid for output, no actual subset context need be operated on beyond that determination). Implementation of GL-based site selection. Minor changes to EJG.	2011-12-19 23:14:26 -05:00
Eric Banks	06d385e619	Simplifying the interface a bit	2011-12-19 15:29:46 -05:00
Christopher Hartl	339ef92eac	Goodbye SW by default. Now aligned reads that overlap intron-exon junctions are scored where they are by default, but warns the user (and flags the record in the VCF) if there's evidence to suggest that there is an indel throwing off the scoring (e.g. if the best score of a realigned unmapped read is >5 log orders better than the best score of a scored mapped read). Unmapped reads are still SW-aligned to the junction-junction sequence. This should result in a rather massive speedup, so far untested. UGBoundAF has to go in at some point. In the process of rewriting the math for bounding the allele frequency (it was assuming uniform tails, which is silly since i derived the posterior distribution in closed form sometime back, just need to find it)	2011-12-19 12:18:18 -05:00
Christopher Hartl	418d22b67e	Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable Conflicts: private/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IntronLossGenotyperV2.java	2011-12-19 10:59:18 -05:00
Christopher Hartl	69661da37d	Moving ValidationSiteSelector to validation package in public under my ownership. JunctionGenotyper added and modified several times, this commit is due to merge conflict fixes.	2011-12-19 10:57:28 -05:00
Laurent Francioli	16cc2b864e	- Corrected bug causing cases where both parents are HET to be accounted twice in the TDT calculation - Adapted TDT Integration test to corrected version of TDT Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>	2011-12-19 10:30:59 -05:00
Eric Banks	5fd19ae734	Commented exactly how the results are represented from the exact model so developers can know how to use them.	2011-12-19 10:19:00 -05:00
Eric Banks	3069a689fe	Bug fix: if there are multiple records at a given position, it turns out that SelectVariants would drop all variants that follow after one that fails filters (instead of dropping just the failing one). Added an integration test to cover this case.	2011-12-19 10:04:33 -05:00
Mauricio Carneiro	5b678e3b94	Remove ClippingOp UnitTests * all testing functionality is in the ReadClipperUnitTest, no need to double test. * class and package naming cleanup	2011-12-19 07:49:26 -05:00
Matt Hanna	1ead00cac5	New fork of SamFileHeaderMerger should be cached at the thread level to enable fast (and valid) thread lookups.	2011-12-18 19:04:26 -05:00
Ryan Poplin	bc842ab3a5	Adding option to VariantAnnotator to do strict allele matching when annotating with comp track concordance.	2011-12-18 15:27:23 -05:00
Ryan Poplin	953998dcd0	Now that getSampleDB is public in the walker base class this override in VariantAnnotator isn't necessary.	2011-12-18 14:38:59 -05:00
Eric Banks	07f9d14d9f	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-18 00:43:15 -05:00
Eric Banks	c5ffe0ab04	No reason to sum the normalized posteriors array to get Pr(AF>0) given that we can just compute 1.0 - array[0]. Integration tests change only because of trivial precision artifacts for reference calls using EMIT_ALL_SITES.	2011-12-18 00:31:47 -05:00
Eric Banks	6dc52d42bf	Implemented the proper QUAL calculation for multi-allelic calls. Integration tests pass except for the ones making multi-allelic calls (duh) and one of the SLOD tests (which used to print 0 when one of the LODs was NaN but now we just don't print the SB annotation for that record).	2011-12-18 00:01:42 -05:00
Khalid Shakir	7486696c07	When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter. Creating a single temporary directory per ant test run instead of putting temp files across all runs in the same directory. Updated various tests for above items and other small fixes.	2011-12-16 18:09:25 -05:00
Mauricio Carneiro	fcc21180e8	Added hardClipLeadingInsertions UnitTest for the ReadClipper fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case. note: test is turned off until contract changes to allow hanging insertions (left/right).	2011-12-16 18:02:47 -05:00
Mauricio Carneiro	5bba44d693	Added hardClipByReferenceCoordinates UnitTest for the ReadClipper * fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail. * fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail. * fixed AlignmentStart of a clipped read that results in only hard clips and soft clips note: added tests to all these beautiful cases...	2011-12-16 18:01:33 -05:00
Mark DePristo	1994c3e3bc	Only print warning about allele incompatibility when there are genotypes in the file in CombineVariants	2011-12-16 16:50:51 -05:00
Mark DePristo	b6067be952	Support for selecting only variants with specific IDs from a file in SelectVariants -- Cleaned up unused variables as well	2011-12-16 16:50:39 -05:00
Mark DePristo	d6d2f49c88	Don't print log if there are no BAMs	2011-12-16 16:50:36 -05:00
Mark DePristo	78e0950a77	Minor bug fix for printing in SAMDataSource	2011-12-16 11:45:40 -05:00
Mark DePristo	7bc0d18418	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-16 11:42:42 -05:00
Ryan Poplin	5aa79dacfc	Changing hidden optimization argument to advanced.	2011-12-16 10:29:20 -05:00
Matt Hanna	3642a73c07	Performance improvements for dynamically merging BAMs in read walkers. This change and my previous change have dropped runtime when dynamically merging 2k BAM files from 72.6min/1M reads to 46.8sec/1M reads. Note that many of these changes are stopgaps -- the real problem is the way ReadWalkers interface with Picard, and I'll have to work with Tim&Co to produce a more maintainable patch.	2011-12-16 09:37:44 -05:00
Mark DePristo	3414ecfe2e	Restored serial version of reader initialization. Serial mode is default, as the performance gains aren't so huge. -- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version -- Comparison of serial and parallel reader with cached and uncached files: Initialization time: serial with 500 fully cached BAMs: 8.20 seconds Initialization time: serial with 500 uncached BAMs : 197.02 seconds Initialization time: parallel with 500 fully cached BAMs: 30.12 seconds Initialization time: parallel with 500 uncached BAMs : 75.47 seconds	2011-12-16 09:22:10 -05:00
Mark DePristo	fb1c9d2abc	Restored serial version of reader initialization. Parallel mode is default. -- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version	2011-12-16 09:05:28 -05:00
Mauricio Carneiro	e61e5c7589	Refactor of ReadClipper unit tests * expanded the systematic cigar string space test framework Roger wrote to all tests * moved utility functions into Utils and ReadUtils * cleaned up unused classes	2011-12-15 19:05:43 -05:00
Mauricio Carneiro	4748ae0a14	Bugfix: Softclips before Hardclips weren't being accounted for caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it. * updated unit test for hardClipSoftClippedBases with corresponding test-case.	2011-12-15 12:17:25 -05:00
Mauricio Carneiro	62a2e335bc	Changing HardClipper contract to allow UNMAPPED reads shifted the contract to functions that operate on reference based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.	2011-12-15 11:08:19 -05:00
Matt Hanna	9333b678b5	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-14 18:05:44 -05:00
Matt Hanna	6fb4be1a09	Cache header merger.	2011-12-14 18:05:31 -05:00
Mauricio Carneiro	128bdf9c09	Create artificial reads with "default" parameters * added functions to create synthetic reads for unit testing with reasonable default parameters * added more functions to create synthetic reads based on cigar string + bases and quals.	2011-12-14 16:58:14 -05:00
Mauricio Carneiro	c85100ce9c	Fix ClippingOp bug when performing multiple hardclip ops bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds. fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).	2011-12-14 16:57:47 -05:00
Eric Banks	de5928ac5a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-14 16:24:56 -05:00
Eric Banks	4fddac9f22	Updating busted integration tests	2011-12-14 16:24:43 -05:00
Mark DePristo	01e547eed3	Parallel SAMDataSource initialization -- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount -- Added logger.info message noting progress and cost of reading low-level BAM data.	2011-12-14 16:14:26 -05:00
Mark DePristo	71b4bb12b7	Bug fix for incorrect logic in subsetSamples -- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list) -- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples. -- Unit tests added to handle these cases	2011-12-14 16:14:26 -05:00
Eric Banks	35fc2e13c3	Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all.	2011-12-14 15:31:09 -05:00
Eric Banks	1e90d602a4	Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.	2011-12-14 13:38:20 -05:00
Eric Banks	988d60091f	Forgot to add in the new result class	2011-12-14 13:37:15 -05:00
Eric Banks	106bf13056	Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays.	2011-12-14 12:05:50 -05:00
Eric Banks	7648521718	Add check for mixed genotype so that we don't exception out for a valid record	2011-12-14 11:26:43 -05:00
Eric Banks	9497e9492c	Bug fix for complex records: do not ever reverse clip out a complete allele.	2011-12-14 11:21:28 -05:00
Eric Banks	09a5a9eac0	Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number.	2011-12-14 10:43:52 -05:00
Eric Banks	d3f4a5a901	Fail gracefully when encountering malformed VCFs without enough data columns	2011-12-14 10:37:38 -05:00
Eric Banks	079932ba2a	The log10cache needs to be larger if we want to handle 10K samples in the UG.	2011-12-13 23:36:10 -05:00
Ryan Poplin	7fa1ab1bae	Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test	2011-12-13 17:19:40 -05:00
Eric Banks	e47a113c9f	Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right?	2011-12-12 23:02:45 -05:00
Mauricio Carneiro	5cc1e72fdb	Parallelized SelectVariants * can now use -nt with SelectVariants for significant speedup in large files * added parallelization integration tests for SelectVariants	2011-12-12 18:41:14 -05:00
Mauricio Carneiro	a70a0f25fb	Better debug output for SAMDataSource output the name and number of the files being loaded by the GATK instead of "coordinate sorted".	2011-12-12 17:57:29 -05:00
Mark DePristo	d03425df2f	TODO optimization targets	2011-12-12 17:39:51 -05:00
Laurent Francioli	025bdfe2cc	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-12 12:19:44 +01:00
Eric Banks	7b6338c742	Merge branch 'master' into trialleles	2011-12-11 00:28:46 -05:00
Eric Banks	7c4b9338ad	The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now.	2011-12-11 00:23:33 -05:00
Eric Banks	044f211a30	Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly.	2011-12-10 23:57:14 -05:00
Eric Banks	364f1a030b	Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone.	2011-12-09 14:25:28 -05:00
Eric Banks	64dad13e2d	Don't carry around an extra copy of the code for the Haplotype Caller	2011-12-09 11:09:40 -05:00
Eric Banks	442ceb6ad9	The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors.	2011-12-09 10:16:44 -05:00
Laurent Francioli	a79144f7db	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-09 15:57:24 +01:00
Laurent Francioli	5a06170804	Corrected bug causing getChildrenWithParents() to not take the last family member into consideration.	2011-12-09 14:51:34 +01:00
Eric Banks	aa4a8c5303	No dynamic programming solution for assigning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes).	2011-12-09 02:25:06 -05:00
Eric Banks	8777288a9f	Don't throw a UserException if too many alt alleles are trying to be genotyped. Instead, I've added an argument that allows the user to set the max number of alt alleles to genotype and the UG warns and skips any sites with more than that number.	2011-12-09 00:00:20 -05:00
Eric Banks	3e7714629f	Scrapped the whole idea of an int/long as an index into the ACset: with lots of alternate alleles we run into overflow issues. Instead, simply use the ACcounts array as the hash key since it is unique for each AC conformation. To do this, it needed to be wrapped inside an object so hashcode() would work.	2011-12-08 23:50:54 -05:00
Eric Banks	4aebe99445	Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation.	2011-12-08 15:31:02 -05:00
Eric Banks	7750bafb12	Fixed bug where last dependent set index wasn't properly being transferred for sites with many alleles. Adding debugging output.	2011-12-08 13:50:50 -05:00
Guillermo del Angel	252e0f3d0a	Merged bug fix from Stable into Unstable	2011-12-08 13:11:39 -05:00
Guillermo del Angel	1bfe28067f	Don't try to genotype an indel even bigger than the reference window size, or else we'll be out of bounds. Necessary to handle Phase 1 integrated callset with large deletions. Better error indication when validating a GenomeLoc.	2011-12-08 12:54:08 -05:00
Mark DePristo	9def841275	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-07 13:36:16 -05:00
Mark DePristo	4055877708	Prints 0.0 TiTv not NaN when there are no variants -- Updated md5	2011-12-07 12:07:54 -05:00
Matt Hanna	15533e08df	Fixed issue with RODWalker parallelization. Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making each ROD shard larger than the size of chr20. Why didn't we see this in Stable? Because the ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification. When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part of the async I/O project, I inadvertently reenabled this specifier.	2011-12-07 11:55:42 -05:00
Mark DePristo	5d2212bc8e	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-07 09:03:17 -05:00
Mark DePristo	6bf18899df	Fix for variant summary -- now treats all 50 bp deletions or insertions as CNVs	2011-12-07 09:02:49 -05:00
Matt Hanna	c9b2cd8ba5	Fix for chartl's stale null representation issue.	2011-12-06 18:05:17 -05:00
Eric Banks	79d18dc078	Fixing indexing bug on the ACsets. Added unit tests for the Exact model code.	2011-12-06 16:17:18 -05:00
Matt Hanna	f5b977fc88	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-06 10:11:35 -05:00
Matt Hanna	4001c22a11	Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup.	2011-12-06 10:10:38 -05:00
Khalid Shakir	677bea0abd	Right aligning GATKReport numeric columns and updated MD5s in tests. PreQC parses file with spaces in sample names by using tabs only. PostQC allows passing the file names for the evals so that flanks can be evaled. BaseTest's network temp dir now adds the user name to the path so files aren't created in the root. HybridSelectionPipeline: - Updated to latest versions of reference data. - Refactored Picard parsing code replacing YAML.	2011-12-05 23:22:15 -05:00
Eric Banks	7a0f6feda4	Make sure that too many alternate alleles aren't being passed to the genotyper (10 for now) and exit with a UserError if there are.	2011-12-05 16:18:52 -05:00
Eric Banks	7fac4afab3	Fixed priors (now initialized upon engine startup in a multi-dimensional array) and cell coefficients (properly handles the generalized closed form representation for multiple alleles).	2011-12-05 15:57:25 -05:00
Eric Banks	a7cb941417	The posteriors vector is now 2 dimensional so that it supports multiple alleles (although the UG is still hard-coded to use only array[0] for now); the exact model now collapses probabilities for all conformations over a given AC into the posteriors array (in the appropriate dimension). Fixed a bug where the priors and posteriors were being passed in swapped.	2011-12-04 13:02:53 -05:00
Eric Banks	eab2b76c9b	Added loads of comments for future reference	2011-12-03 23:54:42 -05:00
Eric Banks	29662be3d7	Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them).	2011-12-03 23:12:04 -05:00
Eric Banks	71f793b71b	First partially working version of the multi-allelic version of the Exact AF calculation	2011-12-02 14:13:14 -05:00
David Roazen	d014c7faf9	Queue now properly escapes all shell arguments in generated shell scripts. This has implications for both Qscript authors and CommandLineFunction authors. Qscript authors: You no longer need to (and in fact must not) manually escape String values to avoid interpretation by the shell when setting up Walker parameters. Queue will safely escape all of your Strings for you so that they'll be interpreted literally. Eg., Old way: filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"") New way: filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0") CommandLineFunction authors: If you're writing a one-off CommandLineFunction in a Qscript and don't really care about quoting issues, just keep doing things the direct, simple way: def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out) If you're writing a CommandLineFunction that will become part of Queue and will be used by other QScripts, however, it's advisable to do things the newer, safer way, ie.: When you construct your commandLine, you should do so ONLY using the API methods required(), optional(), conditional(), and repeat(). These will manage quoting and whitespace separation for you, so you shouldn't insert quotes/extraneous whitespace in your Strings. By default you get both (quoting and whitespace separation), but you can disable either of these via parameters. Eg., override def commandLine = super.commandLine + required("eff") + conditional(verbose, "-v") + optional("-c", config) + required("-i", "vcf") + required("-o", "vcf") + required(genomeVersion) + required(inVcf) + required(">", escape=false) + // This will be shell-interpreted required(outVcf) I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new system, so you'll get free shell escaping when you use those in Qscripts just like with walkers.	2011-12-01 18:13:44 -05:00
Mark DePristo	3060a4a15e	Support for list of known CNVs in VariantEval -- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument -- Genericizes loading for intervals into interval tree by chromosome -- GenomeLoc methods for reciprocal overlap detection, with unit tests	2011-11-30 17:05:16 -05:00
Matt Hanna	b65db6a854	First draft of a test script for I/O performance with the new asynchronous I/O processing. Also includes convenience parameters for specifying the IO/CPU threading balance outside of a tag. Will be killed when Queue gets better support for tagged arguments (hopefully soon).	2011-11-30 13:13:16 -05:00
Laurent Francioli	1d5d200790	Cleaned up unused import statements	2011-11-30 15:30:30 +01:00
Mark DePristo	28b286ad39	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-30 09:11:53 -05:00
Laurent Francioli	20bffe0430	Adapted for the new version of MendelianViolation	2011-11-30 14:46:38 +01:00
Laurent Francioli	1cb5e9e149	Removed outdated (and unused) -familyStr commandline argument	2011-11-30 14:45:04 +01:00
Laurent Francioli	f49dc5c067	Added functionality to get all children that have both parents (useful when trios are needed)	2011-11-30 14:43:37 +01:00
Laurent Francioli	a4606f9cfe	Merge branch 'MendelianViolation' Conflicts: public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java	2011-11-30 11:13:15 +01:00
Laurent Francioli	b279ae4ead	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-30 10:10:21 +01:00
Ryan Poplin	91413cf0d9	Merged bug fix from Stable into Unstable	2011-11-29 14:01:23 -05:00
Ryan Poplin	cb284eebde	Further updating VQSR tutorial wiki docs to reflect the bundle	2011-11-29 14:00:57 -05:00
Ryan Poplin	dcb889665d	Merged bug fix from Stable into Unstable	2011-11-29 09:58:49 -05:00
Ryan Poplin	447e9bff9e	Updating VQSR tutorial wiki docs to reflect the bundle	2011-11-29 09:57:45 -05:00
Ryan Poplin	110298322c	Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it.	2011-11-29 09:29:18 -05:00
Laurent Francioli	ab67011791	Corrected bug introduced in the last update and causing no families to be returned by getFamilies in case the samples were not specified	2011-11-29 11:18:15 +01:00
Eric Banks	d7d8b8e380	Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now.	2011-11-28 14:18:28 -05:00
Laurent Francioli	a09c01fcec	Removed walker argument FamilyStructure as this is now supported by the engine (ped file)	2011-11-28 17:18:11 +01:00
Laurent Francioli	795c99d693	Adapted MendelianViolation to the new ped family representation. Adapted all classes using MendelianViolation too. Added a number of useful metrics on allele transmission and MVs to MendelianViolationEvaluator.	2011-11-28 17:13:14 +01:00
Laurent Francioli	e877db8f42	Changed visibility of getSampleDB from protected to public as the sampleDB needs to be accessible from Annotators and Evaluators too.	2011-11-28 17:11:30 +01:00
Laurent Francioli	5c2595701c	Added a function to get families only for a given list of samples.	2011-11-28 17:10:33 +01:00
Mark DePristo	3c36428a20	Bug fix for TiTv calculation -- shouldn't be rounding	2011-11-28 10:20:34 -05:00
Eric Banks	436b4dc855	Updated docs	2011-11-28 08:59:48 -05:00
Laurent Francioli	b1dd632d5d	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable Conflicts: public/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java	2011-11-25 16:16:44 +01:00
Mark DePristo	e319079c32	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-23 13:02:11 -05:00
Mark DePristo	4107636144	VariantEval updates -- Performance optimizations -- Tables now are cleanly formatted (floats are %.2f printed) -- VariantSummary is a standard report now -- Removed CompEvalGenotypes (it didn't do anything) -- Deleted unused classes in GenotypeConcordance -- Updates integration tests as appropriate	2011-11-23 13:02:07 -05:00
David Roazen	e5b85f0a78	A toString() method for IntervalBindings Necessary since we're currently writing things like this to our VCF headers: intervals=[org.broadinstitute.sting.commandline.IntervalBinding@4ce66f56]	2011-11-23 11:56:12 -05:00
Mark DePristo	5a4856b82e	GATKReports now support a format field per column -- You can tell the table to format your object with "%.2f" for example.	2011-11-23 11:31:04 -05:00
Mark DePristo	c8bf7d2099	Check for null comment	2011-11-23 10:47:21 -05:00
Mark DePristo	6c2555885c	Caching getSimpleName() in VariantEval is a big performance improvement -- Removed the SimpleMetricsByAC table, as one should just use the AlleleCount Stratification and the upcoming VariantSummary table	2011-11-23 08:34:05 -05:00
Guillermo del Angel	32adbd614f	Solve merge conflict	2011-11-22 22:48:46 -05:00
Guillermo del Angel	941f3784dc	Solve merge conflict	2011-11-22 22:48:03 -05:00
Guillermo del Angel	75d93e6335	Another corner condition fix: skip likelihood computation in case we cut so many bases there's no haplotype or read left	2011-11-22 22:46:12 -05:00
Mark DePristo	a3aef8fa53	Final performance optimization for GenotypesContext	2011-11-22 17:19:30 -05:00
Mark DePristo	990c02e4de	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-22 17:19:11 -05:00
Guillermo del Angel	38a90da92c	Fixed merge conflict to Unstable	2011-11-22 14:39:45 -05:00
Guillermo del Angel	32a77a8a56	Prevent out of bound error in case read span > reference context + indel length. Can happen in RNAseq reads with long N CIGAR operators in the middle.	2011-11-22 13:57:24 -05:00
Eric Banks	5821c11fad	For BAM and Reviewed errors we now check the error message to see if it's actually a 'too many open files' problem and, if so, we generate a User Error instead.	2011-11-22 10:50:22 -05:00
Mark DePristo	7087310373	Embarrassing bug fixed	2011-11-22 10:16:36 -05:00
Mark DePristo	e484625594	GenotypesContext now updates cached data for add, set, replace operations when possible -- Involved separately managing the sample -> offset and sample sorted list operations. This should improve performance throughout the system	2011-11-22 08:40:48 -05:00
Mark DePristo	2b51c01df4	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-21 19:16:06 -05:00
Mark DePristo	5443d3634a	Again, fixing the add call when we really mean replace -- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./. Eric will fix long-standing bug in QD observed from this change -- VFW MD5s restored to their old correct values. There was a bug in my implementation that caused the genotypes to not be parsed from the lazy output even though the header was incorrect.	2011-11-21 19:15:56 -05:00
Mauricio Carneiro	5ad3dfcd62	BugFix: byte overflow in SyntheticRead compressed base counts * fixed and added unit test	2011-11-21 17:11:50 -05:00
Mark DePristo	9ea7b70a02	Added decode method to LazyGenotypesContext -- AbstractVCFCodec calls this if the samples are not sorted. Previously called getGenotypes() which didn't actually trigger the decode	2011-11-21 16:21:23 -05:00
Mark DePristo	ab2efe3bd3	Reverting bad exact model changes	2011-11-21 16:14:40 -05:00
Eric Banks	44554b2bfd	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-21 15:01:45 -05:00
Eric Banks	022832bd74	Very bad use of the == operator with Strings was ensuring that validating GenomeLocs was very inefficient. This fix resulted in a significant speedup for a simple RodWalker.	2011-11-21 14:49:47 -05:00
Mark DePristo	1561af22af	Exact model code cleanup -- Fixed up code when fixing a bug detected by aggressive contracts in GenotypesContext.	2011-11-21 14:35:15 -05:00
Mark DePristo	2c501364b8	GenotypesContext no longer have immutability in constructor -- additional bug fixes throughout VariantContext and GenotypesContext objects	2011-11-21 14:34:31 -05:00
David Roazen	1296dd41be	Removing the legacy -L "interval1;interval2" syntax This syntax predates the ability to have multiple -L arguments, is inconsistent with the syntax of all other GATK arguments, requires quoting to avoid interpretation by the shell, and was causing problems in Queue. A UserException is now thrown if someone tries to use this syntax.	2011-11-21 13:18:53 -05:00
Mark DePristo	e467b8e1ae	More contracts on LazyGenotypesContext	2011-11-21 09:34:57 -05:00
Mark DePristo	2e9ecf639e	Generalized interface to LazyGenotypesContext -- Now you provide a LazyParsing object -- LazyGenotypesContext now knows nothing about the VCF parser itself. The parser holds all of the necessary data to parse the VCF genotypes when necessary, and the LGC only has a pointer to this object -- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version -- Deleted VCFParser interface, as it was no longer necessary	2011-11-21 09:30:40 -05:00
Mark DePristo	bc44f6fd9e	Utility function Collection<Genotype> -> Collection<String>	2011-11-20 18:26:56 -05:00
Mark DePristo	9445326c6c	Genotype is Comparable via sampleName	2011-11-20 18:26:27 -05:00
Mark DePristo	f9e25081ab	Completely documented LazyGenotypesContext	2011-11-20 08:35:52 -05:00
Mark DePristo	9cb3fe3a59	Vastly better way of doing on-demand genotype loading -- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading. -- This new class replaced all of the old, complex functionality -- Better still, there were many cases where the genotypes were being loaded unnecessarily, resulting in inefficiency. This was detected because some of the integration tests changed as the genotypes were no longer being parsed unnecessarily -- Misc. bug fixes throughout the system -- Bug fixes for PhaseByTransmission with new GenotypesContext	2011-11-20 08:23:09 -05:00
Mark DePristo	f392d330c3	Proper use of builder. Previous conversion attempt was flawed	2011-11-19 22:09:56 -05:00
Mark DePristo	7d09c0064b	Bug fixes and code cleanup throughout -- chromosomeCounts now takes builder as well, cleaning up a lot of code throughout the codebase.	2011-11-19 18:40:15 -05:00
Mark DePristo	8f7eebbaaf	Bugfix for pError not being checked correctly in CommonInfo -- UnitTests to ensure correct behavior -- UnitTests to ensure correct behavior for pass filters vs. failed filters vs. unfiltered	2011-11-19 15:58:59 -05:00
Mark DePristo	b7b57ef39a	Updating MD5 to reflect canonical ordering of calculation -- We should no longer have md5s changing because of hashmaps changing their sort order on us -- Added GenotypeLikelihoodsUnitTests -- Refactored ExactAFCaclculation to put the PL -> QUAL calculation in the GenotypeLikelihoods class to avoid the code copy.	2011-11-19 15:57:33 -05:00
Mark DePristo	73119c8e3c	Merge with master -- A few bug fixes	2011-11-19 09:56:06 -05:00
Mark DePristo	f685fff79b	Killing the final versions of old new VariantContext interface	2011-11-18 21:32:43 -05:00
Mark DePristo	6cf315e17b	Change interface to getNegLog10PError to getLog10PError	2011-11-18 21:07:30 -05:00
Mark DePristo	c7f2d5c7c7	Final minor fix to contract	2011-11-18 19:40:05 -05:00
Mauricio Carneiro	b5de182014	isEmpty now checks if mReadBases is null Since newly created reads have mReadBases == null. This is an effort to centralize the place to check for empty GATKSAMRecords.	2011-11-18 18:34:05 -05:00
Mauricio Carneiro	8ab3ee9c65	Merge remote-tracking branch 'unstable/master' into rr	2011-11-18 16:50:25 -05:00
Mauricio Carneiro	333e5de812	returning read instead of GATKSAMRecord Do not create new GATKSAMRecord when read has been fully clipped, because it is essentially the same as returning the currently fully clipped read.	2011-11-18 16:49:59 -05:00
Matt Hanna	8bb4d4dca3	First pass of the asynchronous block loader. Block loads are only triggered on queue empty at this point. Disabled by default (enable with nt:io=?).	2011-11-18 15:02:59 -05:00
Mark DePristo	a2e79fbe8a	Fixes to contracts	2011-11-18 14:18:53 -05:00
Mark DePristo	660d6009a2	Documentation and contracts for GenotypesContext and VariantContextBuilder	2011-11-18 13:59:30 -05:00
Mark DePristo	f54afc19b4	VariantContextBuilder -- New approach to making VariantContexts modeled on StringBuilder -- No more modify routines -- use VariantContextBuilder -- Renamed isPolymorphic to isPolymorphicInSamples. Same for mono -- getChromosomeCount -> getCalledChrCount -- Walkers changed to use new VariantContext. Some deprecated new VariantContext calls remain -- VCFCodec now uses optimized cached information to create GenotypesContext.	2011-11-18 12:39:10 -05:00
Eric Banks	6459784351	Merged bug fix from Stable into Unstable	2011-11-18 12:34:57 -05:00
Eric Banks	c62082ba1b	Making this class public again as per request from Cancer folks	2011-11-18 12:34:27 -05:00
Eric Banks	8710673a97	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-18 12:29:33 -05:00
Eric Banks	768b27322b	I figured out why we were getting tons of hom var genotype calls with Mauricio's low quality (synthetic) reduced reads: the RR implementation in the UG was not capping the base quality by the mapping quality, so all the low quality reads were used to generate GLs. Fixed.	2011-11-18 12:29:15 -05:00
Mark DePristo	7490dbb6eb	First version of VariantContextBuilder	2011-11-18 11:06:15 -05:00
Roger Zurawicki	f48d4cfa79	Bug fix: fully clipping GATKSAMRecords and flushing ops Reads that are emptied after clipping become new GATKSAMRecords. When applying ClippingOps, the ops are cleared after the clipping	2011-11-18 00:24:39 -05:00
Mark DePristo	fa454c88bb	UnitTests for VariantContext for chrCount, getSampleNames, Order function -- Major change to how chromosomeCounts is computed. Now NO_CALL alleles are always excluded. So ChromosomeCounts(A/.) is 1, the previous result would have been 2. -- Naming changes for getSamplesNameInOrder()	2011-11-17 20:37:22 -05:00
Mark DePristo	23359d1c6c	Bugfix for pruneVariantContext, which was dropping the ref base for padding	2011-11-17 15:32:52 -05:00
Mark DePristo	473b860312	Major determinism fix for UG and RankSumTest -- Now these routines all iterate in sample name order (genotypes.iterateInSampleNameOrder) so that the results of UG and the annotator do not depend on the particular order of samples we see for the exact model and the RankSumTest	2011-11-17 15:31:45 -05:00
Khalid Shakir	c50274e02e	During flanking interval creation, merging overlapping flanks so that on scatter the list doesn't accidentally genotype the same site twice. Moved flanking interval utilities to IntervalUtils with UnitTests.	2011-11-17 13:56:42 -05:00
Eric Banks	16a021992b	Updated header description for the INFO and FORMAT DP fields to be more accurate.	2011-11-17 13:17:53 -05:00
Eric Banks	e7d41d8d33	Minor cleanup	2011-11-17 12:00:28 -05:00
Mark DePristo	7e66677769	Expanded UnitTests for VariantContext Tests for -- getGenotype and getGenotypes -- subContextBySample -- modify routines	2011-11-16 20:45:15 -05:00
Mark DePristo	aa0610ea92	GenotypeCollection renamed to GenotypesContext	2011-11-16 16:24:05 -05:00
Mark DePristo	caf6080402	Better algorithm for merging genotypes in CombineVariants	2011-11-16 15:17:33 -05:00
Mark DePristo	e56d52006a	Continuing bugfixes to get new VC working	2011-11-16 10:39:17 -05:00
Matt Hanna	eb8e031f75	Merged bug fix from Stable into Unstable	2011-11-16 09:57:37 -05:00
Matt Hanna	6a5d5e7ac9	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/stable	2011-11-16 09:57:13 -05:00
Matt Hanna	7ac5cf8430	Getting rid of unsupported CountReadPairs walker in stable. Removal of remainder of pairs processing framework to follow in unstable.	2011-11-16 09:53:59 -05:00
Eric Banks	c2ebe58712	Merge remote-tracking branch 'Laurent/master'	2011-11-16 09:34:47 -05:00
Laurent Francioli	0dc3d20d58	Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type	2011-11-16 09:33:13 +01:00
Laurent Francioli	7d77fc51f5	Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type	2011-11-16 03:32:43 -05:00
David Roazen	0d163e3f52	SnpEff 2.0.4 support -Modified the SnpEff parser to work with the SnpEff 2.0.4 VCF output format -Assigning functional classes and effect impacts now handled directly by SnpEff rather than the GATK -Removed support for SnpEff 2.0.2, as we no longer trust the output of that version since it doesn't exclude effects associated with certain nonsensical transcripts. These effects are excluded as of 2.0.4. -Updated unit and integration tests This support is based on a release-candidate of SnpEff 2.0.4, and so is subject to change between now and the next GATK release.	2011-11-15 18:36:22 -05:00
Mark DePristo	df415da4ab	More bug fixes on the way to passing all tests	2011-11-15 17:38:12 -05:00
Mark DePristo	0be23aae4e	Bugfixes on way to a working refactored VariantContext	2011-11-15 17:20:14 -05:00
Mark DePristo	231c47c039	Bugfixes on way to a working refactored VariantContext	2011-11-15 16:42:50 -05:00
Laurent Francioli	fb685f88ec	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-15 16:23:53 -05:00
Mark DePristo	2b2514dad2	Moved many unused phasing walkers and utilities to archive	2011-11-15 16:14:50 -05:00
Mark DePristo	460a51f473	ID field now stored in the VariantContext itself, not the attributes	2011-11-15 14:56:33 -05:00
Eric Banks	b45d10e6f1	The DP in the FORMAT field (per sample) must also use the representative count or else it's always 1 for reduced reads.	2011-11-15 10:23:59 -05:00
Mark DePristo	233e581828	Merging in Master	2011-11-15 09:28:24 -05:00
Eric Banks	b66556f4a0	Update error message so that it's clear ReadPair Walkers are exceptions	2011-11-15 09:22:57 -05:00
Mark DePristo	6e1a86bc3e	Bug fixes to VariantContext and GenotypeCollection	2011-11-15 09:21:30 -05:00
Mauricio Carneiro	cde829899d	compress Reduce Read counts bytes by offset compressed the representation of the reduce reads counts by offset results in 17% average compression in final BAM file size. Example compression --> from : 10, 10, 11, 11, 12, 12, 12, 11, 10 to: 10, 0, 1, 1,2, 2, 2, 1, 0	2011-11-14 18:30:24 -05:00
Mark DePristo	f0234ab67f	GenotypeMap -> GenotypeCollection part 2 -- Code actually builds	2011-11-14 17:42:55 -05:00
David Roazen	ab0ee9b847	Perform only necessary validation in VariantContext modify methods	2011-11-14 16:49:59 -05:00
Mark DePristo	2e9d5363e7	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-14 15:32:06 -05:00
Mark DePristo	1fbdcb4f43	GenotypeMap -> GenotypeCollection	2011-11-14 15:32:03 -05:00
Eric Banks	4dc9dbe890	One quick fix to previous commit	2011-11-14 14:42:12 -05:00
Eric Banks	7b2a7cfbe7	Transfer headers from the resource VCF when possible when using expressions. While there, VA was modified so that it didn't assume that the ID field was present in the VC's info map in preparation for Mark's upcoming changes.	2011-11-14 14:31:27 -05:00
Mark DePristo	9b5c79b49d	Renamed InferredGeneticContext to CommonInfo -- I have no idea why I named this InferredGeneticContext, a totally meaningless term -- Renamed to CommonInfo. -- Made package protected, as no one should use this outside of VariantContext and Genotype -- UGEngine was using IGC constant, but it's now using the public one in VariantContext.	2011-11-14 14:28:52 -05:00
Mark DePristo	077397cb4b	Deleted MutableVariantContext -- All methods that used this capability now use VariantContext directly instead	2011-11-14 14:19:06 -05:00
Mark DePristo	b11c535527	Deleted MutableGenotype -- This class wasn't really used anywhere, and so removed to control code bloat.	2011-11-14 13:16:36 -05:00
Mark DePristo	79987d685c	GenotypeMap contains a Map, not extends it -- On path to replacing it with GenotypeCollection	2011-11-14 12:55:03 -05:00
Eric Banks	7aee80cd3b	Fix to deal with reduced reads containing a deletion	2011-11-14 12:23:46 -05:00
Eric Banks	3d2970453b	Misc minor cleanup	2011-11-14 09:41:54 -05:00
Laurent Francioli	1347beef40	Merge branch 'PhaseByTransmission'	2011-11-14 11:31:28 +01:00
Eric Banks	b7c33116af	Minor docs update	2011-11-12 23:21:07 -05:00
Eric Banks	76d357be40	Updating docs example to use -L since that's best practice	2011-11-12 23:20:05 -05:00
Mark DePristo	fee9b367e4	VariantContext genotypes are now stored as GenotypeMap objects -- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples -- All instances in the gatk that used Map<String, Genotype> now use GenotypeMap type. -- Amazingly, there were many places where HashMap<String, Genotype> is used, so that the order of the genotypes is technically undefined and could be dangerous. Now everything uses GenotypeMap with a specific ordering of samples (by name) -- Integrationtests updated and all pass	2011-11-11 15:00:35 -05:00
Guillermo del Angel	cd3146f4cf	Add hidden option to ValidationAmplicons to output slightly modified format to make file work with downstream SQNM tools more seamlessly at request of GAP: one line per record, keep probe identifier to 20 characters, no * in ref allele.	2011-11-11 14:07:07 -05:00
Ryan Poplin	40fbeafa37	VQSR will now detect if the negative model failed to converge properly because of having too few data points and automatically retry with more appropriate clustering parameters.	2011-11-11 11:52:30 -05:00
Mark DePristo	ef9f8b5d46	Added subContextOfSamples to VariantContext -- This is a more convenient accessor than subContextOfGenotypes, represents nearly all of the use cases of the former function, and potentially can be implemented more efficiently.	2011-11-11 10:07:11 -05:00
Mark DePristo	ee40791776	Attributes are now Map<String,Object> not Map<String,?> -- Allows us to avoid an unnecessary copy when creating InferredGeneticContext (whose name really needs to change).	2011-11-11 09:55:42 -05:00
Mark DePristo	dc9b351b5e	Meaningful error message when an IntervalArg file fails to parse correctly	2011-11-10 17:10:26 -05:00
Mark DePristo	bb7bf74aa8	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-10 16:05:43 -05:00
Mauricio Carneiro	060c7ce8ae	It wouldn't harm integrationtests if we had our logic right... :-)	2011-11-10 14:03:22 -05:00
Eric Banks	39678b6a20	Check for reads with missing read groups and throw a UserException when encountered. Mauricio said this wouldn't break integration tests.	2011-11-10 13:34:45 -05:00
Mark DePristo	dd1810140f	-stratIntervals is optional	2011-11-10 13:27:32 -05:00
Mark DePristo	67b022c34b	Cleanup for new SampleUtils function -- getVCFHeadersFromRods(rods) is now available so that you don't have getVCFHeadersFromRods(rods, null) throughout the codebase	2011-11-10 13:27:13 -05:00
Mark DePristo	35fe9c8a06	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-10 11:11:33 -05:00
Mark DePristo	dc4932f93d	VariantEval module to stratify the variants by whether they overlap an interval set The primary use of this stratification is to provide a mechanism to divide assessment of a call set up by whether a variant overlaps an interval or not. I use this to differentiate between variants occurring in CCDS exons vs. those in non-coding regions, in the 1000G call set, using a command line that looks like: -T VariantEval -R human_g1k_v37.fasta -eval 1000G.vcf -stratIntervals:BED ccds.bed -ST IntervalStratification Note that the overlap algorithm properly handles symbolic alleles with an INFO field END value. In order to safely use this module you should provide entire contigs worth of variants, and let the interval strat decide overlap, as opposed to using -L which will not properly work with symbolic variants. Minor improvements to create() interval in GenomeLocParser.	2011-11-10 10:58:40 -05:00
Mauricio Carneiro	0d8983feee	outputting the RG information setReadGroup now sets the read group attribute for the GATKSAMRecord	2011-11-09 23:35:00 -05:00
Eric Banks	315ac68b0b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-09 22:37:36 -05:00
Eric Banks	6313aae2c4	Adding checks for hasBasePileup() before calling getBasePileup() as per GS thread	2011-11-09 22:37:26 -05:00
Ryan Poplin	74a18d3de8	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-09 22:29:40 -05:00
Ryan Poplin	24712c0221	Merged bug fix from Stable into Unstable	2011-11-09 22:28:27 -05:00
Ryan Poplin	8942406aa2	Use MathUtils to compare doubles instead of testing for equality	2011-11-09 22:05:21 -05:00
Ryan Poplin	348f2db7fd	Fix for HMM optimization. If the two penalty arrays match exactly the function should return the end of the array instead of 0.	2011-11-09 22:00:52 -05:00
Eric Banks	82bf09edf3	Mark Standard Annotations with an asterisk	2011-11-09 20:42:31 -05:00
Eric Banks	04b122be29	Fix for bug reported on GetSatisfaction	2011-11-09 20:33:36 -05:00
Mauricio Carneiro	d00b2c6599	Adding a synthetic read for filtered data * Generalized the concept of a synthetic read to create both running consensus and synthetic reads of filtered data. * Synthetic reads can now have deletions (but not insertions) * New reduced read tag for filtered data synthetic reads (RF) * Sliding window header now keeps information of consensus and filtered data * Synthetic reads are created simultaneously, new functionality is controlled internally by addToSyntheticReads	2011-11-09 20:16:22 -05:00
Eric Banks	21bf43f3bb	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-09 15:34:40 -05:00
Christopher Hartl	85bffe1dca	Merged bug fix from Stable into Unstable	2011-11-09 15:29:14 -05:00
Christopher Hartl	d828eba7f4	Allow comments in a table-formatted file to precede the header line.	2011-11-09 15:27:38 -05:00
Eric Banks	8205efbb29	Merge branch 'master' into intervals	2011-11-09 15:27:15 -05:00
Eric Banks	d64f8a89a9	Instead of the SelfScopingFeatureCodec interface, pushed this functionality into Tribble itself. Now we can e.g. determine that a file can be parsed by the BedCodec on the fly.	2011-11-09 15:24:29 -05:00
Mauricio Carneiro	f080f64f99	Preserve RG information on new GATKSAMRecord from SAMRecord	2011-11-09 14:39:20 -05:00
Mauricio Carneiro	f9530e0768	Clean unnecessary attributes from the read this gives on average 40% file size reduction.	2011-11-09 14:39:20 -05:00
Mauricio Carneiro	9427ada498	Fixing no cigar bug empty GATKSAMRecords will have a null cigar. Treat them accordingly.	2011-11-09 14:39:20 -05:00
Mark DePristo	e639f0798e	mergeEvals allows you to treat -eval 1.vcf -eval 2.vcf as a single call set -- A bit of code cleanup in VCFUtils -- VariantEval table to create 1000G Phase I variant summary table -- First version of 1000G Phase I summary table Qscript	2011-11-09 14:35:50 -05:00
Christopher Hartl	149b79eaad	Merged bug fix from Stable into Unstable	2011-11-09 11:26:30 -05:00
Christopher Hartl	11abb4f9d1	Better error message.	2011-11-09 11:25:28 -05:00
Christopher Hartl	d3a533b82e	Revert "a" This reverts commit 1175f50ddbf389f5da74d27dc725596582ae15af.	2011-11-09 11:22:26 -05:00
Christopher Hartl	5eaf800281	a	2011-11-09 11:22:20 -05:00
Christopher Hartl	5451fbc2b2	Merged bug fix from Stable into Unstable	2011-11-09 11:06:15 -05:00
Christopher Hartl	091229e4db	MVLikelihoodRatio now checks if the family string is provided before attempting to instantiate. Also check that variant contexts have both genotypes and genotype likelihoods. Table codec now yells at users for not providing a HEADER with the table - parsing tables without a header line was causing the first line of the file to be eaten. Table feature now has a toString method. These are minor bug fixes.	2011-11-09 11:03:29 -05:00
Mauricio Carneiro	e1b4c3968f	Fixing GATKSAMRecord bug when constructing a GATKSAMRecord from scratch, we should set "mRestOfBinaryData" to null so the BAMRecord doesn't try to retrieve missing information from the non-existent bam file.	2011-11-08 16:50:36 -05:00
Ryan Poplin	e973ca2010	fixing merge conflict.	2011-11-08 14:55:05 -05:00
Ryan Poplin	b0e6afec48	Bug fix for HMM optimization. Need to also check the gap continuation penalty array for the index with the first discrepancy.	2011-11-08 14:51:25 -05:00
Laurent Francioli	571c724cfd	Added reporting of the number of genotypes updated.	2011-11-08 15:15:51 +01:00
Ryan Poplin	94dc447a70	Merged bug fix from Stable into Unstable	2011-11-07 15:26:35 -05:00
Ryan Poplin	0b181be61f	Bug fix in SelectVariants when using a discordance track but no sample specifications. Added integration test to test this.	2011-11-07 15:25:16 -05:00
Ryan Poplin	0534149708	Merged bug fix from Stable into Unstable	2011-11-07 14:07:08 -05:00
Ryan Poplin	2d1e385ca4	Adding note to VQSR docs about Rscript being needed in the environment PATH.	2011-11-07 14:04:13 -05:00
Eric Banks	759f4fe6b8	Moving unclaimed walker with bad integration test to archive	2011-11-07 13:16:38 -05:00
Eric Banks	c1986b6335	Add notes to the GATKdocs as to when a particular annotation can/cannot be calculated.	2011-11-07 11:06:19 -05:00
Eric Banks	724e3f3b0d	Merged bug fix from Stable into Unstable	2011-11-06 22:23:22 -05:00
Eric Banks	cdd40d1222	Removing contracts for the SimpleTimer	2011-11-06 22:22:49 -05:00
Ryan Poplin	5c565d28b9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-06 10:26:19 -05:00
Eric Banks	1c4e429a1c	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-06 00:05:56 -04:00
Eric Banks	a12bc63e5c	Get rid of support for bams without sample information in the read groups. This hidden option wasn't being used anyways because it wasn't hooked up properly in the AlignmentContext.	2011-11-05 23:54:28 -04:00
Eric Banks	90a053ea93	Don't change the mapping quality of MQ=255 reads in IR	2011-11-05 22:40:45 -04:00
Ryan Poplin	611a395783	Now properly extending candidate haplotypes with bases from the reference context instead of filling with padding bases. Functionality in the private Haplotype class is no longer necessary so removing it. No need to have four different Haplotype classes in the GATK.	2011-11-05 12:18:56 -04:00
Mark DePristo	e99871f587	Bug fix for decode loc -- decodeLoc() wasn't skipping input header lines, so the system blew up when there was an = line being split.	2011-11-04 13:20:54 -04:00
Mark DePristo	a340a1aeac	Bug fix. decodeLoc() should update lineNo so you get meaningful line no when indexing due to malformed VCF files.	2011-11-04 11:44:24 -04:00
Mark DePristo	9f260c0dc1	Zero byte index bug fix for RandomlySplitVariants + cleanup -- vcfWriter2 was never being closed in onTraversalDone(), so the on-the-fly index file was being created but never actually properly written to the file. -- This bug is ultimately due to the inability of the GATK to allow multiple VCF output writers as @Output arguments, though -- Removed the unnecessary local variable iFraction, = 1000 * the input fraction argument. Now the system just uses a double random number and compares to the input fraction at all. Is there some subtle reason I don't appreciate for this programming construct?	2011-11-04 09:45:20 -04:00
Mauricio Carneiro	e89ff063fc	GATKSAMRecord refactor The GATK engine will now provide a GATKSAMRecord to all tools which incorporates the functionality used by the GATK to the bam file (ReadGroups, Reduced Reads, ...). * No tools should create SAMRecord anymore, use GATKSAMRecord instead *	2011-11-03 15:43:26 -04:00
Laurent Francioli	385a6abec1	Fixed a bug that wrongly swapped the mother and father genotypes in case the child genotype is missing.	2011-11-03 13:04:53 +01:00
Laurent Francioli	893787de53	Functions getAsMap and getNegLog10GQ now handle missing genotype case.	2011-11-03 13:04:11 +01:00
Eric Banks	e8bceb1eaa	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-02 21:13:54 -04:00
Eric Banks	52b16bf739	Must check whether there's a normal vs. extended pileup before asking for it.	2011-11-02 20:45:24 -04:00
Eric Banks	e1edd6bd12	Removing the min mapping quality argument since it wasn't being used in the normal processing of the pileups in UG - only for indel pileups. Instead, we apply the min base quality to the reads in the pileup for indels and define it to be the min 'confidence' of the base. Docs are updated but I didn't rename the argument as I don't want people to complain.	2011-11-02 20:32:58 -04:00
Ryan Poplin	e94fcf537b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-02 16:29:19 -04:00
Ryan Poplin	4d35272916	Bug fixes with Mauricio to functions in ReadUtils used by reduced reads and the haplotype caller.	2011-11-02 16:29:10 -04:00
Mark DePristo	8a2929c1dd	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-02 16:21:00 -04:00
Laurent Francioli	19ad5b635a	- Calculation of parent/child pairs corrected - Separated the reporting of single and double mendelian violations in trios	2011-11-02 18:35:31 +01:00
Eric Banks	967ff647b8	Reduced reads shouldn't contribute to Fisher Strand calculations	2011-11-02 13:07:20 -04:00
Eric Banks	cf0e699226	QualByDepth was inefficiently iterating over the pileup 2 times for some reason. Removed non-useful annotation classes.	2011-11-02 12:58:38 -04:00
Eric Banks	4501dce58d	Fixing merge conflict	2011-11-02 12:50:32 -04:00
Eric Banks	54331b44e9	New way of looking at the size of a pileup: there's a physical number of elements in the data structure and there's a representative depth of coverage (since a reduced read represents depth >= 1). The size() method has been removed because its meaning is ambiguous. Updated several annotations and the UG engine to make use of the representative depths.	2011-11-02 12:47:30 -04:00
Mark DePristo	c2b97030a4	IntervalUtils for completely balanced locus-based scatter/gather -- scatterLocusIntervals master utility -- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc -- Util function for reversing a list (List<T> -> List<T>, unlike Collections version) -- DoC is PartitionType.INTERVAL -- Significant unit tests on new functionality (all passing) -- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work	2011-11-02 10:49:40 -04:00
Laurent Francioli	119ca7d742	Fixed a bug in parent/child pairs reporting causing a crash in case the -mvf option was used and mother was not provided	2011-11-02 08:22:33 +01:00
Laurent Francioli	b91a9c4711	- Fixed parent/child pairs handling (was crashing before) - Added parent/child pair reporting	2011-11-02 08:04:01 +01:00
Mark DePristo	5fc613f972	Better default partition types for walkers -- Added PartitionType.READ, and associated ReadScatterFunction. ReadScatterFunction is literally just ContigScatterFunction until someone wants to implement something better -- LocusWalkers (and subclasses RodWalkers and RefWalkers) are by default PartitionType.LOCUS.	2011-11-01 19:47:10 -04:00
Mauricio Carneiro	36600fd8e9	added MQ of low MQ/BQ to consensus RMS Bases that were excluded for MQ and BQ filters are now contributing to the MQ RMS (but not to consensus base counts and variant/not variant region triggers).	2011-11-01 17:46:12 -04:00
Mauricio Carneiro	b004489c6d	Moving ReduceRead TAG to GATKSAMRecord ReduceReads are now a feature of a GATKSAMRecord, so the tag and the special methods needed to use it will now be housed by the GATKSAMRecord.	2011-11-01 17:12:09 -04:00
Mauricio Carneiro	17cc484dbd	Revert "ReduceReads ref bases are now output as '=' Reducing the reference bases to '=' results in an extra compression of 13% on average. The GATK is not ready to handle files with '=' bases, and the decision was to implement this as engine support, not as a part of ReduceReads.	2011-11-01 16:35:07 -04:00
Eric Banks	0839c75c8d	More minor fixes to docs	2011-10-31 21:49:27 -04:00
Eric Banks	74b018a1f3	Minor fixes to docs	2011-10-31 21:41:43 -04:00
Eric Banks	31ee5432c5	Merged bug fix from Stable into Unstable	2011-10-31 14:56:59 -04:00
David Roazen	cdde32acbd	Merged bug fix from Stable into Unstable	2011-10-31 14:21:15 -04:00
Eric Banks	f62af0291b	Check for invalid VCF records (not enough tokens) instead of assuming they are there.	2011-10-31 14:09:51 -04:00
Andrey Sivachenko	bed0acaed4	nWayOut now adds PG tag to the header as it should. Also, additional hidden option added: keepPGTags. If invoked, IndelRealigner PG tags from previous runs (if any) are kept in the header and the new PG tag is simply added, instead of overriding them	2011-10-31 12:28:28 -04:00
Mauricio Carneiro	389380a590	ReduceReads ref bases are now output as '=' to save space Restructured the sliding window framework to manipulate a wrapped version of the SAMRecord that contains information about the reference.	2011-10-30 12:04:39 -04:00
Eric Banks	0ca7428e76	Allow processing of empty intervals, but warn user when this case is encountered.	2011-10-28 12:12:14 -04:00
Eric Banks	649dfe98f0	Add VCF header for any expressions that are requested	2011-10-28 10:22:19 -04:00
Eric Banks	057a79f598	This argument should be annotated as @Input	2011-10-28 09:44:49 -04:00
Eric Banks	4ba7c0cecd	Moving to private	2011-10-28 09:29:28 -04:00
Eric Banks	1bdd76c2f2	These tools now use the IntervalBinding system to handle intervals instead of doing it all manually	2011-10-28 09:28:12 -04:00
Eric Banks	6ba08a103d	Empty ROD files should generate an exception when used for creating intervals. Moved some now obsolete files to the archive as the realigner will now read all target intervals into memory.	2011-10-28 09:23:25 -04:00
Eric Banks	3d04bb5608	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-27 23:55:18 -04:00
Eric Banks	19e27d4568	Removing all instances of -BTI (in tests and in GATKdocs) and replacing them with the appropriate alternative.	2011-10-27 23:55:11 -04:00
Eric Banks	cafc245a43	For some reason, a class of Codecs (including TableCodec) require that a GenomeLocParser be passed in to do the position processing. Why can't they just return a Feature with chr, start, stop? Isn't that the right thing?	2011-10-27 23:54:28 -04:00
Guillermo del Angel	cbc43683ee	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-27 20:54:18 -04:00
Guillermo del Angel	8907e42007	First fully functional implementation of ValidationSiteSelectorWalker. User gives a) a set of input variants, b) a desired number of output variants, c) optionally, a set of samples which will restrict sites to be polymorphic in those samples, d) a frequency selection mode: either uniform (no AF matching), or matching AF so that output sites mirror the input AF spectrum as closely as possible. More testing is needed and docs need improving but so far all functionality seems up and running	2011-10-27 20:53:48 -04:00
Eric Banks	ccfd853b34	Added further integration tests for rod-based intervals that deal with more complex cases. Good call by Mark to test the empty VCF example because we were failing on it; fixed.	2011-10-27 20:43:50 -04:00
Eric Banks	c2f343773e	Oops, working too quickly last time. This is the proper fix for the potential NPE in the equals() test.	2011-10-27 15:32:08 -04:00
Khalid Shakir	b80d407dc7	No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path. Other minor cleanup.	2011-10-27 14:17:07 -04:00
Eric Banks	8c4dbce6d8	Don't serialize the GATKArgumentCollection for the GATKRunReports (which would have meant dealing with the new IntervalBindings). Also, forgot to remove a test that's no longer relevant to BED parsing.	2011-10-27 13:58:19 -04:00
Eric Banks	4a7e6fee3f	Remove support for BED file interval parsing in the GATK; it should all go through Tribble now. IndelRealigner no longer supports unordered interval input (which shouldn't have been used anyways). Temporarily commenting out serialization of arguments so that tests pass; this whole piece will be deleted soon anyways.	2011-10-27 13:38:08 -04:00
Matt Hanna	f7df8bdecc	Merged bug fix from Stable into Unstable	2011-10-27 11:31:17 -04:00
Matt Hanna	41ddc7bce7	Make sure we output a full stack trace when we encounter Tribble error messages on VCF header merge.	2011-10-27 11:30:04 -04:00
Eric Banks	44f905b5e5	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-26 23:31:11 -04:00
Eric Banks	68283b1651	Fixing docs and adding GATKdocs for the new interval functionality	2011-10-26 22:14:43 -04:00
Mark DePristo	c9978316a3	Merge branch 'FragmentUtils'	2011-10-26 19:51:49 -04:00
Mauricio Carneiro	add9ad97ec	No scatter gather for VQSR or ApplyVQSR. These walkers should not be scatter gatherable. Annotating them accordingly so that Queue doesn't allow a less than knowledgeable user to try and scatter/gather VQSR.	2011-10-26 16:35:44 -04:00
Ryan Poplin	74aeb22eeb	Merged bug fix from Stable into Unstable	2011-10-26 15:57:30 -04:00
Ryan Poplin	86871bd1e3	Throw a UserException in the BQSR when there is no data instead of creating an empty csv file	2011-10-26 15:56:41 -04:00
Mark DePristo	034a997d07	Generalized Reads -> Fragment calculation -- Supports ReadBackedPileup -> FragmentCollection as before -- Added support for List<SAMRecord> -> FragmentCollection for Ryan's haplotype caller -- General cleanup, renaming, move to separate package, more extensive unit tests, etc. -- Added toFragment() function to ReadBackedPileup interface	2011-10-26 15:54:38 -04:00
Eric Banks	2f21b6ecfb	Removed debugging output	2011-10-26 15:50:20 -04:00
Eric Banks	b39fcb1bea	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-26 15:44:25 -04:00
Eric Banks	b6ce6ed3f8	Go around the ROD system for now so that we can just call decodeLoc() for efficiency. Noted that we should go through the ROD system once it gets cleaned up. This means that currently gzipped files are not supported with -L.	2011-10-26 15:42:53 -04:00
Eric Banks	9424e8b2ca	Initial working version of new interval system in which the argument for -L (and -XL) is allowed to be a rod file (e.g. VCF). Old samtools-style intervals still behave as before. BTI is no longer supported. The merging (union or intersection) of intervals is now consistently applied to all -L (or -XL) intervals, which is nice. More testing needed.	2011-10-26 14:11:49 -04:00
Mark DePristo	7fa943aef1	Renamed FragmentPileup to FragmentUtils	2011-10-26 14:01:45 -04:00
Laurent Francioli	1f044faedd	- Genotype assignment in case of equally likely combinations is now random - Genotype combinations with 0 confidence are now left unphased	2011-10-26 19:57:09 +02:00
Laurent Francioli	81b163ff4d	Indentation	2011-10-26 14:49:12 +02:00
Laurent Francioli	62cff266d4	GQ calculation corrected for most likely genotype	2011-10-26 14:40:04 +02:00
Mark DePristo	af3613cc5f	GATKSAMRecord commit branch summary First, I'm sure there's a better way to do this, but I wanted to create a single commit summarizing the changes from my branch SamRecordFactory. What's the best way to do this? Rebase? Now, on to the changes here: -- Picard added a SamRecordFactory that is used to create instances of the subclass SamRecord or BAMRecord. This factory allows us to have low-level picard readers (SamFileReader) create objects of type GATKSamRecord. The abomination of the extends and contains GATKSamRecord is now gone. GATKSamRecords are now produced by this factory, the GATK provides this factory to our SamFileReaders, and everything works with GATKSamRecord just extending BAMRecord. This results in up to a 2x performance improvement in writing BAMs and a ~10% improvement when reading BAM files. -- As a consequence of this, we no longer officially support SAM records. Attempting to create SAMRecord objects with the factory will throw a user exception. -- Created a standard NGSPlatform enum, and GATKSamRecords support efficiently obtaining this value. The real BQSR (not the copy indel version) got the efficient code to use this. Please add all future platforms to this enum. -- GATKSamRecord no longer supports using the OQ or defaultBaseQuality. This is performed in a wrapper iterator that's only added when these command line options are used. -- ReducedRead code has been moved from ReadUtils into efficient caching accessors in GATKSamRecord. -- ArtificialSamUtils creates GATKSamRecords now, not just SAMRecords. Added code here to create artificial pairs and using that code to create artificial ReadBackedPileups with specific properties -- New smarter algorithm for FragmentPileup. This new code is up to 3x faster than the previous version, and is lazy so is more efficient when no overlapping pairs are actually in the pileup. Created extensive DataProvider driven UnitTest. Added Caliper-based benchmarking system to characterize the performance differences between the old and new algorithms. TODO still remains to make an efficient version that works for non-pileups for the HaplotypeCaller	2011-10-25 20:52:56 -04:00
Mark DePristo	2822f0dc27	Merge branch 'SamRecordFactory'	2011-10-25 20:34:47 -04:00
Mark DePristo	1b722c21cf	merge master	2011-10-25 16:08:39 -04:00
Ryan Poplin	56fdf0b865	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-25 15:58:56 -04:00
Ryan Poplin	4a34c1862e	misc cleanup. We now filter out haplotypes when it is obvious that the assembly has failed to find a parsimonious event rather than use haplotypes with large numbers of SNPs and small indels on them.	2011-10-25 15:22:28 -04:00
Guillermo del Angel	b559936b7a	a) New variant eval stratification module for indel size. b) Next iteration on indel caller runtime optimization: when computing likelihood of each haplotype for a given read, many computations will be redundant since pieces of haplotypes will be common to both REF and ALT haplotypes. So, we keep HMM matrices from one haplotype to the next one and recompute starting at the part where either haplotype is different or GOP/GCP are different.	2011-10-25 09:56:43 -04:00
Khalid Shakir	fac9932938	Embedding gsalib source and queueJobReport R scripts in the dist and package jars. Moved gsalib and queueJobReport.R to embeddable namespaced locations. Updated packager dependencies/dir to add an @includes which filters the embedded fileset. RScriptExecutor can now JIT compile the gsalib. RScriptExecutor uses ProcessController and sends the Rscript output to java's stdout when run under -l DEBUG. Refactored ProcessController and IOUtils from Queue to Sting Utils. Added more unit tests to ProcessController along with a utility class to hard stop OutputStreams at a specified byte count. Replaced uses of some IOUtils with Apache Commons IO. ShellJobRunner refactored to use direct ProcessController and now kills jobs on shutdown. Better QGraph responsiveness on shutdown by using Object.wait() instead of Thread.sleep().	2011-10-24 15:58:34 -04:00
Khalid Shakir	89a581a66f	Added ability to specify arguments in files via -args/--arg_file Pushing back downsample and read filter args so they show up in getApproximateCommandLineArgs()	2011-10-24 15:58:34 -04:00
Mark DePristo	502592671d	Cleanup FragmentPileup before main repo commit -- removed intermediate functions. Now only original version and best optimized new version remain -- Moved general artificial read backed pileup creation code into ArtificialSamUtils	2011-10-24 14:40:05 -04:00
Mark DePristo	166174a551	Google caliper example execution script -- FragmentPileup with final performance testing	2011-10-24 14:04:53 -04:00
Laurent Francioli	62477a0810	Added documentation and comments	2011-10-24 13:45:21 +02:00
Laurent Francioli	38ebf3141a	- Now supports parent/child pairs - Sites with missing genotypes in pairs/trios are handled as follows: -- Missing child -> Homozygous parents are phased, no transmission probability is emitted -- Two individuals missing -> Phase if homozygous, no transmission probability is emitted -- One parent missing -> Phased / transmission probability emitted - Mutation prior set as argument	2011-10-24 12:30:04 +02:00
Laurent Francioli	7312e35c71	Now makes use of standard Allele and Genotype classes. This allowed quite some code cleaning.	2011-10-24 10:25:53 +02:00
Laurent Francioli	01b16abc8d	Genotype quality calculation modified to handle all genotypes the same way. This is inconsistent with GQ output by the UG but is correct even for cases of poor quality genotypes.	2011-10-24 10:24:41 +02:00
Mark DePristo	f6ccac889b	Merged bug fix from Stable into Unstable	2011-10-23 16:37:12 -04:00
Mark DePristo	585a45b7a3	Bug fix for ClipReadsWalker when stats output isn't provided -- See http://getsatisfaction.com/gsa/topics/clipreadswalker?utm_content=topic_link&utm_medium=email&utm_source=reply_notification	2011-10-23 16:36:48 -04:00
Ryan Poplin	f5d910b8a5	Haplotype caller now sends genotype likelihoods to the exact model to genotype the events found in the best haplotypes.	2011-10-23 13:29:08 -04:00
Mark DePristo	42bf9adede	Initial version of "fast" FragmentPileup code -- Uses mayOverlapRoutine in ReadUtils -- Attempts to be smart when doing overlap calculation, to avoid unnecessary allocations -- PileupElement now comparable (sorts on offset then on start) -- Caliper microbenchmark to assess performance	2011-10-22 21:36:37 -04:00
Mauricio Carneiro	4913f8a60f	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-21 17:45:07 -04:00
Mauricio Carneiro	102dafdcbc	Validation of GATKSamRecord in read filters Moved the validation of the GATKSamRecord to the MalformedReadFilter with the intent to make the read filter the ultimate validation location for sam records. This way we can opt to filter out malformed reads if we know what we are doing or blow up otherwise.	2011-10-21 17:40:43 -04:00
Guillermo del Angel	f4b409fa0d	CombineVariants bug fix: when merging records with disparate alleles we were leaving AC,AF fields intact. This had as a consequence that we could end up with a record with 3 alt alleles but only 2 values in AC,AF fields. Now, if alleles in combined vc are different from original, and if AC,AF fields can't be recomputed from genotypes, we remove attributes from vc map since they'll be invalid anyway. Integration test md5 changed since there were several badly merged records in result	2011-10-21 14:07:20 -04:00
Mark DePristo	b863390cb1	Moving reduced read functionality into GATKSAMRecord -- More functions take / produce GATKSAMRecords instead of SAMRecord	2011-10-21 13:28:05 -04:00
Mark DePristo	2403e96062	Renamed GATKSamRecord -> GATKSAMRecord for consistency. Better docs.	2011-10-21 09:59:24 -04:00
Mark DePristo	110e13bc1e	Merge branch 'master' into SamRecordFactory	2011-10-21 09:43:52 -04:00
Mark DePristo	be797a8a1f	Recalibrator now uses the much more efficient NGSPlatform in the cycle covariates system	2011-10-21 09:39:21 -04:00
Mark DePristo	ed74ebcfa1	GATKSamRecords with efficient NGSPlatform method	2011-10-21 09:38:41 -04:00
Mark DePristo	94e1898d8f	A canonical set of NGS platforms as enums with convenient manipulation methods	2011-10-21 09:37:45 -04:00
Laurent Francioli	edea90786a	Genotype quality is now recalculated for each of the phased Genotypes. Small problem is that we unnecessarily lose a little precision on the genotypes that do not change after assignment.	2011-10-20 17:04:19 +02:00
Laurent Francioli	1c61a57329	Original rewrite of PhaseByTransmission: - Adapted to get the trio information from the SampleDB (i.e. from Pedigree file (ped)) => Multiple trios can be passed as argument - Mendelian violations and trio phasing possibilities are pre-calculated and stored in Maps. => Runtime is ~3x faster - Genotype combinations possible only given two MVs are now given a squared MV prior (e.g. 0/0+0/0=>1/1 is given 10^-16 prior if the MV prior is 10^-8) - Corrected bug: In case the best genotype combination is Het/Het/Het, the genotypes are now set appropriately (before original genotypes were left even if they weren't Het/Het/Het) - Basic reporting added: -- mvf argument let the user specify a file to report remaining MVs -- When the walker ends, some basic stats about the genotype reconfiguration and phasing are output Known problems: - GQ is not recalculated even if the genotype changes Possible improvements: - Phase partially typed trios - Use standard Allele/Genotype Classes for the storage of the pre-calculated phase	2011-10-20 13:06:44 +02:00
Laurent Francioli	ef6a6fdfe4	Added getAsMap -> returns the likelihoods as an EnumMap with Genotypes as keys and likelihoods as values.	2011-10-20 12:49:18 +02:00
Laurent Francioli	76dd816e70	Added getParents() -> returns an arrayList containing the sample's parent(s) if available	2011-10-20 12:47:27 +02:00
Mark DePristo	999a8998ae	Constructor for GATKSamRecord with header only, for unit testing	2011-10-19 17:51:48 -04:00
Mark DePristo	bba69701b5	Now creates GATKSamRecords, not SamRecords	2011-10-19 17:49:17 -04:00
Christopher Hartl	cd8a6d62bb	You know how the wiki has a big section on committing local changes to BRANCHES of the repository you clone it from? Yeah. It sucks if you don't do that. This commit contains: - IntronLossGenotyper is brought into its current incarnation - A couple of simple new filters (ReadName is super useful for debugging, MateUnmapped is useful for selecting out reads that may have a relevant unaligned mate) - RFA now matches my current local repository. It's in flux since I'm transitioning to the new traversal type. + the triggering read stash pilot required me to change the scope of some of the variables in the ReadClipping code, private -> protected. Those are all the changes there. - MendelianViolation restored to its former glory (and an annotator module that uses the likelihood calculation has been added) + use this rather than a hard GQ threshold if you're doing MV analyses. - Some miscellaneous QScripts	2011-10-19 17:42:37 -04:00
Mark DePristo	52345f0aec	Meaningful documentation string	2011-10-19 15:47:36 -04:00
Mark DePristo	1b38aa1a7e	Cleaning up reduced read code accessors	2011-10-19 15:46:44 -04:00
Eric Banks	d8d73fe4f2	Treat ./X genotypes as MIXED so that isHet, isHom, etc. still return the expected and correct values. Added docs to these accessors with contracts explicitly mentioned. Fixed case where NPE could be thrown.	2011-10-19 15:11:13 -04:00
Mark DePristo	7928b287fc	GATKSamRecord now produced by SAMFileReaders by default -- Removed all of the unnecessary caching operations in GATKSAMRecord -- GATKSAMRecord renamed to GATKSamRecord for consistency	2011-10-19 13:15:27 -04:00
Eric Banks	5a6468c11e	Allowing ./X genotypes and adding a unit test to ensure that this case is covered from now on (especially given that we may want to revert in the future). Reverting this change is really easy and entails uncommenting a few lines of code. But for now, despite Mark's objections, this case is allowed in the VCF spec and we are wrong not to allow it.	2011-10-19 11:52:05 -04:00
Eric Banks	48c4a8cb33	Make error messages clearer (even I was confused)	2011-10-19 11:49:16 -04:00
Eric Banks	6cadaa84c9	Just use validate() from super class since it does the same thing	2011-10-19 11:48:23 -04:00
Mark DePristo	df3e4e1abd	First working code to use SamRecordFactory to produce objects of our own design in SAMFileReader	2011-10-19 11:22:35 -04:00
Mauricio Carneiro	c27e2fb676	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-18 15:23:05 -04:00
Mark DePristo	f77f2eeb7d	Fix for new ID structure	2011-10-18 13:04:43 -04:00
Mark DePristo	1a92ee3593	No longer adds a binding of ID -> . when the ID field is dot in the VCF -- Really we should make ID a primary key in VariantContext. Putting it into the attributes is just annoying now	2011-10-18 10:57:02 -04:00
Ryan Poplin	e45fcb66eb	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-17 15:56:19 -04:00
Ryan Poplin	1e6794c539	fixing typo in VariantsToTable docs	2011-10-17 15:56:02 -04:00
Mark DePristo	0de8550f17	Merged bug fix from Stable into Unstable	2011-10-17 15:29:53 -04:00
Mark DePristo	c1329c4dde	Fixing a binary to logical or	2011-10-17 15:29:45 -04:00
Mark DePristo	9e4963efc8	Merged bug fix from Stable into Unstable	2011-10-17 15:27:38 -04:00
Mark DePristo	ec911ce5bb	Even better error messages	2011-10-17 15:27:22 -04:00
Mark DePristo	d065bf1715	Merged bug fix from Stable into Unstable	2011-10-17 15:25:47 -04:00
Mark DePristo	a7cf9cdc67	Fixing error message typo	2011-10-17 15:25:35 -04:00
Ryan Poplin	589df6b7cf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-17 14:35:14 -04:00
Ryan Poplin	6b02354d84	Adding a new getter in VariantsToTable to extract the indel event length.	2011-10-17 14:34:52 -04:00
Mark DePristo	3550798c4c	Merged bug fix from Stable into Unstable	2011-10-17 13:58:56 -04:00
Mark DePristo	4108a294f7	Better error message when a RodBinding file doesn't exist	2011-10-17 13:58:46 -04:00
Mark DePristo	cc76826f78	Merged bug fix from Stable into Unstable	2011-10-17 13:38:11 -04:00
Mark DePristo	fd4540cd32	Fixed extraordinarily subtle race condition with contracts invariant -- all of the methods in the class must be synchronized or the internal state can be inconsistent with the contract invariant when entering the class in a non-synchronized method, even when that method doesn't care about the object's internal state	2011-10-17 13:37:55 -04:00
Mark DePristo	5a881360df	Merged bug fix from Stable into Unstable	2011-10-13 15:54:43 -04:00
Mark DePristo	7cab6f6bb0	Bug fixes for thread unsafe simple timer and bad Ns treatment in AlignmentUtils -- SimpleTimer is now threadsafe using synchronized method keywords -- Bug fix for alignmentToByteArray() where the N case was refPos++ not the now correct refPos += elementLength	2011-10-13 15:53:12 -04:00
Mauricio Carneiro	e12ffb6547	Updating docs for GCContentByInterval This walker does not take any BAMs. It only walks over the reference.	2011-10-13 13:27:00 -04:00
Eric Banks	9aecd50473	Adding ability to exclude annotations from the VA and UG lists. As described in the docs, this argument trumps all others (including -all) so that we can get around the SnpEff issue brought up by Menachem. Added integration test for it.	2011-10-12 15:44:54 -04:00
Mauricio Carneiro	e53a952aeb	Added ION Torrent support to CountCovariates.	2011-10-12 01:57:02 -04:00
Mauricio Carneiro	a2733a451f	Added NotCalled feature to GAV Added "not called" and "no status" to the truth table. Very useful.	2011-10-11 19:31:45 -04:00
David Roazen	ae83420637	Merged bug fix from Stable into Unstable	2011-10-11 12:26:08 -04:00
David Roazen	794f275871	SnpEff is now marked as a RodRequiringAnnotation instead of an ExperimentalAnnotation. Having SnpEff grouped with the Experimental annotations was proving problematic, since it requires a rod. Placing it in its own group should improve the situation somewhat, making it easier to request "all annotations except for SnpEff".	2011-10-11 12:08:56 -04:00
David Roazen	cfd0ac8410	Merged bug fix from Stable into Unstable Conflicts: public/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java	2011-10-11 12:03:51 -04:00
David Roazen	24b72334b3	UnifiedGenotyper now correctly initializes the VariantAnnotator engine. This allows the annotation classes to perform any necessary initialization/validation. For example, it allows the SnpEff annotator to (among other things) validate its rod binding. This will prevent a NullPointerException when SnpEff annotation is requested but no rod binding is present. Added an integration test to cover this case so that it doesn't break again.	2011-10-11 12:02:05 -04:00
Guillermo del Angel	0429b38021	Merged bug fix from Stable into Unstable	2011-10-11 11:19:38 -04:00
Guillermo del Angel	1c485d8b5e	Forgot that no matter how trivial a change it's a good idea to compile first	2011-10-11 11:18:41 -04:00
Guillermo del Angel	6418f4d69b	Merged bug fix from Stable into Unstable	2011-10-11 11:13:18 -04:00
Guillermo del Angel	1975de1b32	Second try: hide --do_indel_quality in AnalyzeCovariates	2011-10-11 11:11:29 -04:00
Guillermo del Angel	6506ea83e8	Revert "Hide --do_indel_quality argument in AnalyzeCovariates. This shouldn't be documented nor used by external users"... a hidden passenger change made it through. This reverts commit 70e10ccb1be90dcff8f4485ae6ee036db2d1ac86.	2011-10-11 11:03:12 -04:00
Guillermo del Angel	4c1d8c8d44	Hide --do_indel_quality argument in AnalyzeCovariates. This shouldn't be documented nor used by external users	2011-10-11 11:01:06 -04:00
Eric Banks	77c983c5b5	No one claimed this walker and it doesn't have integration tests or GATKdocs so it doesn't belong in public.	2011-10-10 15:17:54 -04:00
Mark DePristo	fb72bcf732	DiffObjects no longer prints out the file name in the status so MD5 are stable	2011-10-10 15:10:57 -04:00
Mark DePristo	46e7370128	this.allele, getAlleles(), and getAltAlleles() now return List not set -- Changes associated code throughout the codebase -- Updated necessary (but minimal) UnitTests to reflect new behavior -- Much better makealleles() function in VC.java that enforces a lot of key constraints in VC	2011-10-09 11:45:55 -07:00
Mark DePristo	c67f6c076b	simpleMerge now preserves allele order -- UnitTests for dangerous PL merging cases in the multi-allelic case. The new behavior is correct	2011-10-08 17:39:53 -07:00
Mark DePristo	ec14a4a606	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-07 08:38:50 -07:00
Eric Banks	ca9cd9b688	Minor fix for merging intervals which hadn't been necessary when only merging from the left to right. Added integration tests to cover the parallelization of RTC.	2011-10-06 22:38:44 -04:00
Mark DePristo	c7864c7256	Filter application order is now deterministic, in the order defined by the walker -- For no apparent reason we were using a HashSet to store the ReadFilters, so the order of operations was really arbitrarily applied. The order now is (1) the order of the walker intrinsic filters (2) read group black list (if provided) (3) command line filters (if provided)	2011-10-06 18:51:40 -07:00
Mark DePristo	0b88af4af9	Counts of records failing filters are displayed sorted -- Stops random ordering of the output, as the counts are returned sorted by string name of the class -- Deleted now unused sh*tty accessors in Utils	2011-10-06 18:42:26 -07:00
Mark DePristo	d1e70d6ec2	Removed Nx counting of reads in metrics with -nt > 1	2011-10-06 18:29:26 -07:00
Eric Banks	c61804a450	Rename the long version of the argument name to more accurately reflect its purpose.	2011-10-06 16:14:04 -04:00
Eric Banks	61a3dfae24	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-06 15:58:04 -04:00
Eric Banks	6eb87bf58a	RTC now caches all intervals as GenomeLocs (which is expected to take < 1Gb whole genome based on back of the envelope calculations with Matt) so that 1) we don't have to worry about emitting outside of the leaves in the hierarchical reductions and 2) we can emit the intervals in sorted order which is a big performance plus for the realigner. Integration tests change only because intervals whose start=stop are now printed as chr:start instead of chr:start-stop.	2011-10-06 15:57:49 -04:00
Eric Banks	1b0735f0a3	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-06 13:41:45 -04:00
Eric Banks	c4dfc1fb8b	Temporary commit of parallelization support for RealignerTargetCreator. Tim begged us for this and I got assurances from Khalid/Matt that this would also be extremely helpful for the whole genome calling pipeline, so I spent a while working on this. Needs to be fixed up though because apparently only the leaves in the hierarchical reduce get their output aggregated. Worked out a better solution with Matt.	2011-10-06 13:41:36 -04:00
Mark DePristo	73f9d1f217	GATK read group requirement iron hand -- The GATK will now throw a user exception if it opens a SAM/BAM file that doesn't have at least one RG defined -- LIBS again throws an error if the complete list of samples isn't provided -- Updating ExmpleCountLociPipeline test to use the well-formatted versions of the exampleBAM and exampleFASTA files in testdata, instead of the old broken ones in validation_data. -- Convenience constructors for UserExceptions.MalformedBAM	2011-10-06 08:40:35 -07:00
Mark DePristo	23845ac798	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-06 08:17:08 -07:00
Mark DePristo	daa5999489	Fixed typo in argument description	2011-10-06 08:16:25 -07:00
Guillermo del Angel	8a474e38ff	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-06 10:08:39 -04:00
Guillermo del Angel	93f7e632bd	Minor fix/enhancement for VariantEval: if a vcf has symbolic alleles, program would crash ungracefully - now we'll just skip record without processing. This is a big issue since we can't process 1000G integration files with code as is.	2011-10-06 10:07:46 -04:00
Mark DePristo	190be4d0d1	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-05 21:27:11 -07:00
Mark DePristo	8e6845806a	Allowing empty samples list in LIBS -- Right now we cannot process BAM files without read groups because we enforce the samples list to not be empty when there's a SAM record. Now if there are reads and there are no samples we add the "null" sample so that LIBS walks the reads properly	2011-10-05 21:26:21 -07:00
Matt Hanna	180c8f286f	Merged bug fix from Stable into Unstable	2011-10-05 20:37:43 -04:00
Matt Hanna	55b9f06527	Ensure that IndelRealigner n-way out option supports MD5 generation.	2011-10-05 20:36:28 -04:00
Mark DePristo	be2d29ce69	Final PED documentation	2011-10-05 15:17:41 -07:00
Mark DePristo	3226d5dc0d	Merge branch 'master' into ped	2011-10-05 15:03:09 -07:00
Mark DePristo	6a573437af	Details documentation arguments for -ped	2011-10-05 15:00:58 -07:00
Mark DePristo	e7c80f7c45	Renaming quantitative trait to OtherPhenotype which is now a String not a double -- we can now use PED file to represent population data or other arbitrary phenotype data, not just doubles	2011-10-05 12:26:33 -07:00
Mark DePristo	51ecc20867	getFamily() and associated methods implemented and tested -- Sample no longer serializable -- Sample now implements Comparable	2011-10-05 09:55:05 -07:00
Mark DePristo	a45d985818	TODO method stubs	2011-10-04 15:54:09 -07:00
Mark DePristo	fee89e47ff	Only throws an error when there are no samples but there are reads -- Handles the case when you are running a ROD traversal and yet the LIBS is still used to return null everywhere.	2011-10-04 06:50:54 -07:00
Mark DePristo	f552aede42	Only provide the sample names in the BAM file for efficiency	2011-10-04 06:50:12 -07:00
Mark DePristo	a27641e1fc	Cleaned up imports	2011-10-04 06:28:36 -07:00
Mark DePristo	b20689ff55	No longer supports extraProperties -- the underlying data structure is still present, but until I decide what to do for the extensible system I've completely disabled the subsystem -- Added code to merge Samples, so that a mostly full record can be merged with a consistent empty record. If the two records are inconsistent, an error is thrown -- addSample() in Sample.class now invokes mergeSample() when appropriate -- Validation types are now only STRICT or SILENT -- Validation code implemented in SampleDBBuilder -- Extensive unit tests for SampleDBBuilder	2011-10-03 19:20:33 -07:00
Mauricio Carneiro	3837aa45b4	Fixing conflicts Conflicts: public/java/test/org/broadinstitute/sting/utils/clipreads/ReadClipperUnitTest.java	2011-10-03 19:07:59 -07:00
Mark DePristo	2e3dc52088	Minor function renaming	2011-10-03 14:41:13 -07:00
Mark DePristo	dd71884b0c	On path to SampleDB engine integration -- PedReader tag parser -- Separation of SampleDBBuilder from SampleDB (now immutable) -- Removed old sample engine arguments	2011-10-03 12:08:07 -07:00
Eric Banks	c3eff7451a	Found a small inefficiency while profiling: we were still using String.split instead of ParsingUtils.split to break up array values in the INFO field. There was a noticeable (albeit not big) difference in the change when reading sites only files.	2011-10-03 14:20:39 -04:00
Mark DePristo	8ee0f91904	Remove residual processing tracker arguments	2011-10-03 09:50:01 -07:00
Mark DePristo	89ac50e86e	SampleDataSource -> SampleDB	2011-10-03 09:33:30 -07:00
Mark DePristo	93fba06cb5	Support for whitespace only lines	2011-10-03 09:30:10 -07:00
Mark DePristo	0604ce55d1	PedReader support for ; separated lines, not only newline	2011-10-03 09:19:58 -07:00
Mark DePristo	52f670c8b8	100% version of PedReader -- Passes all unit tests -- Added unit tests for missing fields	2011-10-03 06:12:58 -07:00
Mark DePristo	dd75ad9f49	95% PedReader -- Passes significant unit tests -- Implicit sample creation for mom / dad when you create single samples -- Continuing cleanup of Sample and SampleDataSource	2011-09-30 18:03:34 -04:00
Andrey Sivachenko	c7898a9be7	inconsequential change in string constants printed into the vcf which no one uses anyway...	2011-09-30 16:40:21 -04:00
Mark DePristo	010899f886	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-30 15:51:09 -04:00
Mark DePristo	84160bd83f	Reorganization of Sample -- Moved Gender and Afflication to separate public enums -- PedReader 90% implemented -- Improve interface cleanup to XReadLines and UserException	2011-09-30 15:50:54 -04:00
Mauricio Carneiro	05fba6f23a	Clipping ends inside deletion and before insertion fixed.	2011-09-30 15:44:43 -04:00
Mark DePristo	c1cf6bc45a	PEDReader should be in samples	2011-09-30 14:22:19 -04:00
Mark DePristo	56f10b40a8	Fixing test bugs for WindowMaker that required empty sample list	2011-09-30 14:18:27 -04:00
Ryan Poplin	af6c053435	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-30 13:33:31 -04:00
Mark DePristo	810e8ad011	Removed getXByReaders() function from the engine -- These could be simplified in their downstream uses -- Or they could be replaced with a generic getSAMFileHeaders() function and then apply the getSamples(header) as desired downstream	2011-09-30 10:43:51 -04:00
Mark DePristo	178ba24c27	Move getSamplesForSamFile to SampleUtils -- A nearly identical piece of code already lived in SampleUtils. Now there are two functions, one taking a regular header and another grabbing the merged header from the GATK engine itself. Much cleaner	2011-09-30 10:28:18 -04:00
Mark DePristo	30d23942b1	Renamed ReadBackedPileup getXSampleName() functions to getXSample -- now that we don't have Sample objects floating around we don't have to have all of the Name extensions on our functions	2011-09-30 10:02:57 -04:00
Mark DePristo	3289a325fc	Removed final use of Sample in RBP	2011-09-30 09:57:39 -04:00
Mark DePristo	a69a4dda2f	SamplesDB no longer has null sample -- Updated getSamples().size() == 2 test in CallableLociWalker that really ensured there was one sample in the system	2011-09-30 09:56:23 -04:00
Mark DePristo	e055a78f6e	LIBS now requires at least one sample be present -- UnitTest provides a "null" sample for matching the reads without read groups	2011-09-30 09:49:35 -04:00
Mark DePristo	9860a2c989	Merge branch 'master' into ped	2011-09-30 09:28:18 -04:00
Mark DePristo	d901fed617	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-30 08:41:44 -04:00
Mauricio Carneiro	cabacf028d	Intermediate commit to fix interval skipping may need additional testing.	2011-09-29 18:45:12 -04:00
Mark DePristo	1765fbeb6b	Merge branch 'master' into ped	2011-09-29 17:18:51 -04:00
Mark DePristo	98ecaf8aa0	Support for ReducedReads with reduced counts and average quals -- ReadUtils and UnitTest updated to support new byte[] style -- Removed unnecessary read transformer in PairHMM	2011-09-29 17:18:39 -04:00
Mauricio Carneiro	9508220157	Fixed hard clipping both ends inside deletion. If both ends of the interval fall within a deletion in the read then hardClipBothEnds would cut the right tail first including the entire deletion, then fail to cut the left tail because there would not be any bases there anymore. Fixed.	2011-09-29 15:36:49 -04:00
Mark DePristo	625ffb6a07	LocusIteratorByState and ReadBackedPileups no longer use Sample	2011-09-29 14:52:11 -04:00
Mark DePristo	b3a2371925	Merge branch 'master' into ped	2011-09-29 14:32:17 -04:00
Mark DePristo	68761a6e28	Removed sample from header	2011-09-29 14:13:05 -04:00
Mauricio Carneiro	a5e75cd14c	Outputting both consensus base qualities and counts. The base qualities of a consensus read are now the average quality of the bases forming the consensus base (most common base), and the consensus quality tag now carries an array with the counts of each base in the consensus. This should increase file size but improve calling sensitivity/specificity.	2011-09-29 12:54:41 -04:00
Mark DePristo	505416b6c0	Merge branch 'master' into ped	2011-09-29 12:22:39 -04:00
Mark DePristo	9536845e35	Cleaning up unused code in MV	2011-09-29 12:20:07 -04:00
Mark DePristo	5043d76c3d	Removing more bad uses of SampleDataSource creation	2011-09-29 12:16:34 -04:00
Mark DePristo	5c9227cf5e	Further cleanup of Sample database -- Removing more and more unnecessary code -- Partial removal of type safe Sample usage. On the road to SampleDB only	2011-09-29 11:50:05 -04:00
Mark DePristo	2a0cd556d3	Further cleanup of Sample -- Cleaned up interface functions in GAE -- Added Walker.getSampleDB() function which is an easier option for tools to get the samples db	2011-09-29 10:34:51 -04:00
Mark DePristo	e76f381628	Moved sample package from DataSources to gatk, and renamed it samples -- All associated changes to the codebase are just header updates	2011-09-29 09:57:15 -04:00
Mark DePristo	e197dcd1f3	Pre-cleanup commit of Sample and SampleDataSource -- SampleDataSource has all reader functionality disabled	2011-09-29 09:44:18 -04:00
Mark DePristo	4d31673cc5	No longer supporting YAML file allows us to delete 75% of the sample's codebase	2011-09-29 09:43:31 -04:00
Ryan Poplin	e366ee18bc	Adding ability to read in and make use of kmer quality tables during HMM evaluation	2011-09-29 07:46:19 -04:00
Mauricio Carneiro	fc86cd6fd8	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/carneiro/gatk/RR into rr	2011-09-29 00:12:15 -04:00
Roger Zurawicki	4fd5630f6a	Added ReadClipper Unit Test * Includes tests that include HardClip to Read and Reference Coords. * Changed ReadUtils.HardClipByReferenceCoordinates from private to protected to allow for testing	2011-09-28 23:13:50 -04:00
Matt Hanna	9272ed03b5	Merged bug fix from Stable into Unstable	2011-09-28 21:26:43 -04:00
Matt Hanna	0acaf2df65	Fix an embarrassing issue where a specific configuration of minimal coverage over small intervals could cause reads to be dropped from the pileup. Nothing to see here...	2011-09-28 21:23:01 -04:00
Guillermo del Angel	c8d3a720f9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-28 18:17:34 -04:00
Guillermo del Angel	7e3cb45093	Further performance optim in banded hmm, about 60% speed improvement over current implementation now	2011-09-28 16:27:28 -04:00
Ryan Poplin	1b1ca80df2	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-28 16:17:39 -04:00
Ryan Poplin	3b73dc89fe	Making several esoteric arguments in the BQSR @Hidden. Adding basic support for Complete Genomics machine cycle.	2011-09-28 16:17:31 -04:00
Mauricio Carneiro	ff2f4df043	Fixed hardclipping inside indel (right tail) when hard clipping the right tail of a read falls inside a deletion, clipping should fall back to the last base before the deletion to follow the ReadClipper's contract.	2011-09-28 16:07:34 -04:00
Mauricio Carneiro	3c7b7f74ef	Optimized interval iteration. Using a TreeSet to manipulate getToolkit.getIntervals() and being smart about which intervals to test makes interval clipping O(1) instead of O(n).	2011-09-28 16:07:34 -04:00
Mauricio Carneiro	5c9b659c02	clipping both ends of the reads was modifying the original read This goes against the ReadClipper contract, and was affecting the second part of the read that spans over multiple intervals. Fixed.	2011-09-28 16:07:34 -04:00
Guillermo del Angel	fe23e4d10c	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-28 15:53:11 -04:00
Guillermo del Angel	e2b9030e93	First mostly fully functional implementation of banded pair HMM likelihood computation for indel caller. More experimentation to follow but it right now works in small data sets and at least it doesn't break existing things. Disabled by default at this point	2011-09-28 15:51:48 -04:00
Eric Banks	1b45f21774	Removing this command-line tool. Purposely not doing this in stable so that users who may still use it have time to find other options. But the docs are no longer on the wiki.	2011-09-28 13:18:32 -04:00
Eric Banks	1f0e354fae	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-28 13:13:21 -04:00
Eric Banks	bb619a9a3c	Fixing docs	2011-09-28 13:13:03 -04:00
Mark DePristo	5812004e06	Merge branch 'stable'	2011-09-28 11:36:40 -04:00
Mark DePristo	a5006831d7	Shows "" not empty space when default string value is ""	2011-09-28 11:35:52 -04:00
Mark DePristo	1e32281a15	Fix to not show -null when missing short name argument	2011-09-28 11:31:20 -04:00
Mauricio Carneiro	89544c209c	Fixing contracts changed return type to Pair, changing contracts accordingly.	2011-09-28 11:19:17 -04:00
Eric Banks	eacbee3fe5	Merged bug fix from Stable into Unstable	2011-09-27 20:35:18 -04:00
Eric Banks	43b0c98298	Fix docs	2011-09-27 20:34:46 -04:00
Eric Banks	232a6df11c	Add longhand form to the error message.	2011-09-27 20:29:31 -04:00
Eric Banks	1d6fcb6eb1	Revert "Add longhand form to the error message to prevent users from posting borderline dumb posts to GS." This reverts commit 75b2600527cfce05ae683cb394290ff2a80e8552.	2011-09-27 20:27:00 -04:00
Eric Banks	269b9826b6	Add longhand form to the error message to prevent users from posting borderline dumb posts to GS.	2011-09-27 20:26:36 -04:00
Mauricio Carneiro	3b6e43b7c4	Use reads that span multiple intervals * RR will now compress reads that span across multiple intervals correctly and output them in the correct order. * Fixed bug in getReadCoordinateForReferenceCoordinate where if the requested reference coordinate fell inside a deletion in the read the read would be clipped up to one element past the deletion.	2011-09-27 18:39:06 -04:00
Khalid Shakir	84bd355690	Merged bug fix from Stable into Unstable	2011-09-27 14:34:39 -04:00
Khalid Shakir	b090751f62	Fixed Ant / PluginManager issue where reflections was picking up all class files under current working directory due to "." in jar manifest classpaths. Updates to HybridSelectionPipeline: - Added annotations back via snpEff - Minor updates to VQSR paths and lowered memory	2011-09-27 14:33:57 -04:00
Eric Banks	26e71f6688	The Omni files have multiple records (with the same ALT) at a particular location, with one PASSing and the other(s) filtered. Chris, this is why using this file as both eval and comp leads to ref/no-call cells in the GenotypeConcordance table. However, this led to non-determinism in VE because the VCs were placed in a HashSet; we use a LinkedHashMap instead to bring back determinism.	2011-09-27 11:03:17 -04:00
Guillermo del Angel	ceffefa6a6	Intermediate version with banded pair HMM	2011-09-27 10:18:58 -04:00
Mark DePristo	e99ff3caae	Removed lots of old, and not to be used, HMM options -- resulted in massive code cleanup -- GdA will integrate his new banded algorithm here -- Removed: DO_CONTEXT_DEPENDENT_PENALTIES, GET_GAP_PENALTIES_FROM_DATA, INDEL_RECAL_FILE, dovit, GSA_PRODUCTION_ONLY	2011-09-27 10:08:40 -04:00
Mark DePristo	fa0efbc4ca	Refactoring of PairHMM to support reduced reads	2011-09-26 13:28:56 -04:00
Mark DePristo	a6b65d6347	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-26 13:26:21 -04:00
Mark DePristo	4f09453470	Refactored reduced read utilities -- UnitTests for key functions on reduced reads -- PileupElement calls static functions in ReadUtils -- Simple routine that takes a reduced read and fills in its quals with its reduced qual	2011-09-26 12:58:31 -04:00
Eric Banks	234b74dd05	Merged bug fix from Stable into Unstable	2011-09-26 11:47:23 -05:00
Eric Banks	317b95fa57	Fixing some annotator docs	2011-09-26 11:46:45 -05:00
Mauricio Carneiro	b76dbc72f0	Fixed interval navigation bug. If a read was hard clipped away from the current interval, all subsequent reads within that interval (not hardclipped) would be filtered out. Fixed.	2011-09-26 08:13:44 -04:00
Guillermo del Angel	9afccd11b1	Minor refactoring: add ability to MathUtils.normalizeFromLog10 to not go to linear domain but just subtract max value from log values and return. Use this function in snp and indel GL computation.	2011-09-25 21:18:56 -04:00
Guillermo del Angel	3eef800889	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-24 21:20:11 -04:00
Guillermo del Angel	203517fbb7	a) Cleanups/bug fixes to previous commit to CombineVariants. b) Change md5 to reflect records that are now merged correctly. c) Change unit merge alleles test to reflect the fact that a null non-variant vc object is not valid and not supported because there's no way to codify such object in a vcf. The code correctly converts this to a non-variant single-base event with whatever the reference is at that location.	2011-09-24 19:08:00 -04:00
Mauricio Carneiro	c31f4cb2f6	Cleaning leading insertions With the current implementation, a read cannot start with a deletion or an insertion. Maybe this will change in the future, but for now, chop the leading insertion off.	2011-09-24 14:33:32 -04:00
Guillermo del Angel	cd058dd10f	a) Fixed md5 for legit change in UG output that now also no-calls genotypes w/0,0,0 in PL's in SNP case. b) First reimplementation of new vc merger of different types. Previous version did it in two steps, first merging all vc's per type and then trying to see if resulting vc's would be merged if alleles of one type were a subset of another, but this won't work when uniquifying genotypes since sample names would be messed up and GT sample names wouldn't match VC sample names. Now, it's actually simpler: when splitting vc's by type before merging, we check for alleles of one vc being a subset of alleles of vc of another type and if so we put them together in same list.	2011-09-24 13:40:11 -04:00
Mark DePristo	bb11951255	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-24 09:26:45 -04:00
Mark DePristo	8d9e136bba	Merge branch 'stable'	2011-09-24 09:26:28 -04:00
Mark DePristo	6804ab6d2f	Bug fix for NPE in very short GATK runs -- Was already in unstable, but not stable...	2011-09-24 09:25:29 -04:00
Mark DePristo	92acff46e5	Moved Haplotype into Utils root	2011-09-24 09:14:05 -04:00
Mark DePristo	f792353dcd	Framework for genotype unit test	2011-09-24 08:56:45 -04:00
Mark DePristo	c0bb0cb465	Make DiploidGenotype enum private to walkers.genotyper	2011-09-24 08:48:33 -04:00
Guillermo del Angel	3a4469a236	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-23 21:58:34 -04:00
Guillermo del Angel	0e74cc3c74	a) Treat SNP genotype likelihoods just as indels, in the sense that they're always normalized as PL's so one of them will always be zero. This creates minor numerical differences in Qual and annotations due to numerical approximations in AF computation. b) Intermediate CombineVariants fixes, not ready yet	2011-09-23 21:58:20 -04:00
Mauricio Carneiro	7cac75ae1d	Merged bug fix from Stable into Unstable	2011-09-23 19:00:43 -04:00
Mauricio Carneiro	fbe3c1e0b3	Adding warning on HardClipping Hard Clipping is still under heavy development and should not be used by anyone less prepared than MacGyver.	2011-09-23 19:00:19 -04:00
Mark DePristo	b66841f179	Static cache for binomial probability -- Very low level performance optimization	2011-09-23 17:29:34 -04:00
Mauricio Carneiro	1a45c331b2	bringing the latest bug fixes to Reduce Reads	2011-09-23 16:40:06 -04:00
Mauricio Carneiro	9ea40f2e41	Deletions/Insertions in hard clip and bug fixes * Deletions now count as hard clipped bases in order to recover the original alignment start of a clipped read. * Insertions do not count as hard clipped bases for the same reason. * This created a bug in the previous cigar cleaning function. Fixed.	2011-09-23 16:37:08 -04:00
David Roazen	40202c85e0	Merged bug fix from Stable into Unstable	2011-09-23 16:35:55 -04:00
David Roazen	e1cb5f6459	SnpEff annotator now assigns a functional class to each effect and distinguishes between actual effects and mere modifiers. -We now assign a functional class (nonsense, missense, silent, or none) to each SnpEff effect, and add a SNPEFF_FUNCTIONAL_CLASS annotation to the INFO field of the output VCF. -Effects are now prioritized according to both biological impact and functional class, instead of impact only. -Many of SnpEff's "low-impact" effects are now classified as "modifiers" with lower priority than every other effect. This includes such "effects" as DOWNSTREAM, UPSTREAM, INTRON, GENE, EXON, and others that really describe the location of the variant rather than its biological effect. This code will be short-lived (likely 1.2-only), as the next version of SnpEff will include most of these features directly. Checking this change into Stable+Unstable instead of Unstable because the current functional class stratification in VariantEval is basically broken and urgently needs to be fixed for production purposes.	2011-09-23 16:06:52 -04:00
Matt Hanna	e388c357ca	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-23 14:53:28 -04:00
Matt Hanna	cc23b0b8a9	Fix for recent change modelling unmapped shards: don't invoke optimization to combine mapped and unmapped shards.	2011-09-23 14:52:31 -04:00
Mark DePristo	e3d4efb283	Remove N2 EXACT model code, which should never be used	2011-09-23 11:55:21 -04:00
Mark DePristo	27ce3c822e	Merge branch 'stable'	2011-09-23 09:04:52 -04:00
Mark DePristo	2bb77a7978	Docs for all VariantAnnotator annotations	2011-09-23 09:04:16 -04:00
Mark DePristo	dd65ba5bae	@Hidden for DocumentationTest and GATKDocsExample	2011-09-23 09:03:37 -04:00
Mark DePristo	dfce301beb	Looks for @Hidden annotation on all classes and excludes them from the docs	2011-09-23 09:03:04 -04:00
Mark DePristo	4397ce8653	Moved removePLs to VariantContextUtils	2011-09-23 08:24:20 -04:00
Mark DePristo	c49cc623de	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-22 17:26:21 -04:00
Mark DePristo	5cf82f9236	simpleMerge UnitTest tests filtered VC merging	2011-09-22 17:05:12 -04:00
Mauricio Carneiro	96c875399c	Merging many bug fixes to reduce reads	2011-09-22 17:04:11 -04:00
Mauricio Carneiro	39b54211d0	Fixed hard clipping soft clipped bases after hard clips if soft clipped bases were after a hard clipped section of the read, the hard clip was clipping the left soft clip tail as if it were a right tail. Mayhem.	2011-09-22 15:46:55 -04:00
Mauricio Carneiro	1acf7945c5	Fixed hard clipped cigar and alignment start * Hard clipped Cigar now includes all insertions that were hard clipped and not the deletions. * The alignment start is now recalculated according to the new hard clipped cigar representation	2011-09-22 14:51:14 -04:00
Mauricio Carneiro	4e9020c9f7	Fixed alignment start for hard clipping insertions	2011-09-22 13:28:25 -04:00
Mark DePristo	ba5f83fee2	start of VariantContextUtils UnitTest -- tests rsID merging	2011-09-22 12:10:39 -04:00
Mark DePristo	93dd1faa5f	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-22 11:20:10 -04:00
Mark DePristo	a05c959e5a	Empty unit tests for VariantContextUtils -- will be expanded over the day	2011-09-22 11:20:07 -04:00
Christopher Hartl	4f4a0fc38a	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/chartl/dev/git	2011-09-22 11:01:58 -04:00
Christopher Hartl	982c47bfa7	Remove duplicate effort in ReadUtils (with apologies to Mauricio) Big (but not major) cleanup of code in ILG - mostly excising the old likelihood model Activated the early-abort check for ILG. I think it should be better this way.	2011-09-22 10:58:26 -04:00
Eric Banks	8f8b59a932	My interpretation of the VCF spec is that the FORMAT field should only be present if there is genotype/sample data. So the VCFCodec now throws an exception when it encounters such a case. I had to fix one of the integration test VCFs.	2011-09-21 22:23:28 -04:00
Christopher Hartl	dc96f6da79	Merge branch 'master' of ssh://chartl@gsa2/humgen/gsa-scr1/chartl/dev/git	2011-09-21 18:18:41 -04:00
Christopher Hartl	f9cdc119af	Added a method to ReadUtils that converts reads of the form 10S20M10S to 40M (just unclips the soft-clips). Be careful when using this - if you're writing a bam file it will be potentially written out of order (since the previous alignment start was at the M, not the S).	2011-09-21 18:16:42 -04:00
Christopher Hartl	faff6e4019	Failed to commit changes to the GATKReport required for more easy access when using the files as data sources (read: histograms) for walkers	2011-09-21 18:15:23 -04:00
Mauricio Carneiro	96768c8a18	Sending latest bug fixes to Reduce Reads to the main repository	2011-09-21 17:43:11 -04:00
Mauricio Carneiro	70335b2b0a	Hard clipping soft clipped reads to fix misalignments. Pre-softclipped reads (with high qual) are a complicated event to deal with in the Reduced Reads environment. I chose to hard clip them out for now and added a todo item to bring them back on in the future, perhaps as a variant region.	2011-09-21 17:12:01 -04:00
Christopher Hartl	ef05827c7b	Merge branch 'master' of ssh://chartl@tin.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-21 16:40:47 -04:00
Christopher Hartl	3b51d9106a	Adding in likelihood calculations for mendelian violations. Also fixing a minor and rare bug in SelectVariants when specifying family structure on the command line.	2011-09-21 16:40:29 -04:00
Mark DePristo	04968c88b3	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-21 15:43:25 -04:00
Mark DePristo	6bcfce225f	Fix for dynamic type determination for bgzip files -- GZipInputStream handles bgzip files under linux, but not mac -- Added BlockCompressedInputStream test as well, which works properly on bgzip files	2011-09-21 15:39:19 -04:00
Mark DePristo	9f6f0c443c	Marginally cleaner isVCFStream() function -- cleanup trying to debug minor bug. Failed to fix the bug, but the code is nicer now	2011-09-21 15:25:01 -04:00
Ryan Poplin	5fef6dc5d0	Merged bug fix from Stable into Unstable	2011-09-21 15:23:06 -04:00
Ryan Poplin	2585fc3d6c	Updating Rscript path doc text for Broad users	2011-09-21 15:22:26 -04:00
Mark DePristo	74f9ccf6dd	Merge	2011-09-21 11:30:11 -04:00
Mark DePristo	6592972f82	Putative fix for BAQ array out of bounds -- Old code required qual to be <64, which isn't strictly necessary. Now uses the Picard SAMUtils.MAX_PHRED_SCORE constant -- Unittest to enforce this behavior	2011-09-21 11:25:08 -04:00
Eric Banks	174859fc68	Don't allow whitespace in the INFO field	2011-09-21 11:14:54 -04:00
Mark DePristo	ecc7f34774	Putative fix for BAQ problem.	2011-09-21 11:09:54 -04:00
Mark DePristo	7d11f93b82	Final bugfix for CombineVariants -- Now handles multiple records at a site, so that you don't see records like set=dbsnp-dbsnp-dbsnp when combining something with dbsnp -- Proper handling of ids. If you are merging files with multiple ids for the same record, the ids are merged into a comma separated list	2011-09-21 10:58:32 -04:00
Mark DePristo	a91ac0c5db	Intermediate commit of bugfixes to CombineVariants	2011-09-21 10:15:05 -04:00
David Roazen	b04d8eab55	Merged bug fix from Stable into Unstable	2011-09-20 17:24:14 -04:00
Mauricio Carneiro	758ecf2d43	Bringing latest updates of ReduceReads to the master repository	2011-09-20 16:35:09 -04:00
David Roazen	d9ea764611	SnpEff annotator now adds OriginalSnpEffVersion and OriginalSnpEffCmd lines to the header of the VCF output file. This change is urgently required for production, which is why it's going into Stable+Unstable instead of just Unstable. The keys for the SnpEff version and command header lines in the VCF file output by VariantAnnotator (OriginalSnpEffVersion and OriginalSnpEffCmd) are intentionally different from the keys for those same lines in the SnpEff output file (SnpEffVersion and SnpEffCmd), so that output files from VariantAnnotator won't be confused with output files from SnpEff itself.	2011-09-20 16:30:55 -04:00
Mark DePristo	bffd3cca6f	Bug fix for reduced read; only adds regular bases for calculation -- No longer passes on deletions for genotyping	2011-09-20 15:07:06 -04:00
Mark DePristo	a1b4cafe7a	Bug fix for NPE when timer wasn't initialized	2011-09-20 13:59:59 -04:00
Mark DePristo	b7511c5ff3	Fixed long-standing bug in tribble index creation -- Previously, on the fly indices didn't have dictionary set on the fly, so the GATK would read, add dictionary, and rewrite the index. This is now fixed, so that the on the fly index contains the reference dictionary when first written, avoiding the unnecessary read and write -- Added a GenomeAnalysisEngine and Walker function called getMasterSequenceDictionary() that fetches the reference sequence dictionary. This can be used conveniently everywhere, and is what's written into the Tribble index -- Refactored tribble index utilities from RMDTrackBuilder into IndexDictionaryUtils -- VCFWriter now requires the master sequence dictionary -- Updated walkers that create VCFWriters to provide the master sequence dictionary	2011-09-20 10:53:18 -04:00
Mark DePristo	230e16d7c0	Merge branch 'master' into rodrewrite	2011-09-20 06:54:18 -04:00
Mark DePristo	aa8afa3899	Merge	2011-09-19 21:16:47 -04:00
Mauricio Carneiro	56106d54ed	Changing ReadUtils behavior to comply with GenomeLocParser. Now the functions getRefCoordSoftUnclippedStart and getRefCoordSoftUnclippedEnd will return getUnclippedStart if the read is all contained within an insertion. Updated the contracts accordingly. This should give the same behavior as the GenomeLocParser now.	2011-09-19 14:00:00 -04:00
Mauricio Carneiro	080c957547	Fixing contracts for SoftUnclippedEnd utils. Now accepts reads that are entirely contained inside an insertion.	2011-09-19 13:53:53 -04:00
Mauricio Carneiro	5e832254a4	Fixing ReadAndInterval overlap comments.	2011-09-19 13:28:41 -04:00
Christopher Hartl	ecb8466662	Merged bug fix from Stable into Unstable	2011-09-19 12:32:08 -04:00
Christopher Hartl	8143def292	Fix the -T argument in the DepthOfCoverage docs. Add documentation for the RefSeqCodec, pointing users to the wiki page describing how to create the file	2011-09-19 12:31:47 -04:00
Christopher Hartl	034b868588	Revert "Fix the -T argument in the DepthOfCoverage docs" This reverts commit 0994efda998cf3a41b1a43696dbc852a441d5316.	2011-09-19 12:16:07 -04:00
Mark DePristo	cfde0e674b	Merge branch 'sgintervals'	2011-09-19 12:02:41 -04:00
Mark DePristo	3e93f246f7	Support for sample sets in AssignSomaticStatus -- Also cleaned up SampleUtils.getSamplesFromCommandLine() to return a set, not a list, and trim the sample names.	2011-09-19 11:40:45 -04:00
Mark DePristo	41ffb25b74	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-19 10:55:18 -04:00
Christopher Hartl	ca1b30e4a4	Fix the -T argument in the DepthOfCoverage docs. Add documentation for the RefSeqCodec, pointing users to the wiki page describing how to create the file	2011-09-19 10:29:06 -04:00
Mark DePristo	4ad330008d	Final intervals cleanup -- No functional changes (my algorithm wouldn't work) -- Major structural cleanup (returning more basic data structures that allow us to develop a new algorithm) -- Unit tests for the efficiency of interval partitioning	2011-09-19 10:19:10 -04:00
Mark DePristo	6ea57bf036	Merge branch 'master' into sgintervals	2011-09-19 09:50:19 -04:00
Mark DePristo	6bd42c053d	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-18 20:18:39 -04:00
Roger Zurawicki	091c7197cd	Fixed memory leak and bug with deletions in clipping. The ClippingOp clip cigar function would run into an endless loop if the parameters were out of the read's range; I stopped the bug. * There is no check to make sure the read coordinates are covered by the read, though. When hard clipping to interval, I added a check for deletions. NOTE: method works for NA12878 WEx but needs to be more thoroughly tested/optimized	2011-09-18 19:21:51 -04:00
Guillermo del Angel	e7b9a009b7	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-16 12:48:30 -04:00
Menachem Fromer	b2e8e11128	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-16 00:52:27 -04:00
Christopher Hartl	57b3efa2e2	Merge branch 'master' of ssh://chartl@tin.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-15 21:06:38 -04:00
Christopher Hartl	939babc820	Updating formatting for ValidationAmplicons GATK docs	2011-09-15 21:05:51 -04:00
Christopher Hartl	9fdf1f8eb6	Fix some doc formatting for Depth of Coverage	2011-09-15 21:05:22 -04:00
Menachem Fromer	e6e9b08c9a	Must provide alleles VCF to UGCallVariants	2011-09-15 18:51:09 -04:00
David Roazen	d78e00e5b2	Renaming VariantAnnotator SnpEff keys. This is to head off potential confusion with the output from the SnpEff tool itself, which also uses a key named EFF.	2011-09-15 17:42:15 -04:00

2539 Commits (32ee2c7dffde3210e2c3b183f5f2fefd3a49af23)