gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	45603f58cd	Refactoring and unit testing GenomeLocParser -- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage -- Fully document all functions in GenomeLocParser -- Unit tests for things like parsePosition (shocking it wasn't tested!) -- Removed function to specifically create GenomeLocs for VariantContexts. The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly -- Deprecated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc -- Unit test coverage at like 80%, moving to 100% with next commit	2013-01-30 09:47:47 -05:00
Mark DePristo	8562bfaae1	Optimize GenomeLocParser.createGenomeLoc -- The new version is roughly 2x faster than the previous version. The key here was to cleanup the workflow for validateGenomeLoc and remove the now unnecessary synchronization blocks from the CachingSequencingDictionary, since these are now thread local variables -- #resolves https://jira.broadinstitute.org/browse/GSA-724	2013-01-30 09:47:47 -05:00
Mark DePristo	92c5635e19	Cleanup, document, and unit test ActiveRegion -- All functions tested. In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads. New version should be correct -- Enforce correct ordering of supporting states in constructor -- Enforce read ordering when adding reads to an active region in add -- Fix bug in HaplotypeCaller map with new updating read spans. Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence -- Encapsulate isActive field with an accessor function -- Make sure that all state lists are unmodifiable, and that the docs are clear about this -- ActiveRegion equalsExceptReads is for testing only, so make it package protected -- ActiveRegion.hardClipToRegion must resort reads as they can become out of order -- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region. The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed. Has a minor impact on PLs and RankSumTest values -- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy	2013-01-30 09:47:12 -05:00
Mauricio Carneiro	29fd536c28	Updating licenses manually Please check that your commit hook is properly pointing at ../../private/shell/pre-commit Conflicts: public/java/test/org/broadinstitute/variant/VariantBaseTest.java	2013-01-29 17:27:53 -05:00
David Roazen	a536e1da84	Move some VCF/VariantContext methods back to the GATK based on feedback -Moved some of the more specialized / complex VariantContext and VCF utility methods back to the GATK. -Due to this re-shuffling, was able to return things like the Pair class back to the GATK as well.	2013-01-29 16:56:55 -05:00
Ami Levy-Moonshine	a1908a0eca	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-29 16:33:20 -05:00
Ami Levy-Moonshine	4aaef495c6	correct the help message	2013-01-29 16:33:12 -05:00
Ryan Poplin	bf25196a0b	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-28 22:33:13 -05:00
Ryan Poplin	e9c3a0acdf	fix typo	2013-01-28 22:18:58 -05:00
Ami Levy-Moonshine	a8a68697f1	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-28 20:18:51 -05:00
Guillermo del Angel	5995f01a01	Big intermediate commit (mostly so that I don't have to go again through merge/rebase hell) in expanding BQSR capabilities. Far from done yet: a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push. b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value. c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress. d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing	2013-01-28 19:55:46 -05:00
Ami Levy-Moonshine	3f5c2e4989	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-28 19:04:52 -05:00
Ami Levy-Moonshine	c103623cf6	bug fix in my new function at SampleUtils.java	2013-01-28 19:04:39 -05:00
Ryan Poplin	d665a8ba0c	The Bayesian calculation of Qemp in the BQSR is now hierarchical. This fixes issues in which the covariate bins were very sparse and the prior estimate being used was the original quality score. This resulted in large correction factors for each covariate which breaks the equation. There is also now a new option, globalQScorePrior, which can be used to ignore the given (very high) quality scores and instead use this value as the prior.	2013-01-28 15:56:33 -05:00
Tad Jordan	8777e02aa5	R issue in Queue fixed. GSA-721	2013-01-28 14:42:20 -05:00
David Roazen	f63f27aa13	org.broadinstitute.variant refactor, part 2 -removed sting dependencies from test classes -removed org.apache.log4j dependency -misc cleanup	2013-01-28 09:03:46 -05:00
David Roazen	3744d1a596	Collapse the downsampling fork in the GATK engine With LegacyLocusIteratorByState deleted, the legacy downsampling implementation was already non-functional. This commit removes all remaining code in the engine belonging to the legacy implementation.	2013-01-28 01:50:30 -05:00
Mark DePristo	14d8afe413	Remove startSearchAt state variable from ActivityProfile -- New algorithm will only try to create an active region if there's at least maxRegionSize + propagation distance states in the list. When that's true, we are guaranteed to actually find a region. So this algorithm is not only truly correct but also super fast, as we only ever do the search for the end of the region when we will certainly find one, and actually generate a region.	2013-01-27 14:10:08 -05:00
Mark DePristo	c97a361b5d	Added realistic BandPassFilterUnitTest that ensures quality results for 1000G phase I VCF and NA12878 VCF -- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions. This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined. This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile -- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with	2013-01-27 14:10:08 -05:00
Mark DePristo	72b2e77eed	Linearize the findEndOfRegion algorithm in ActivityProfile, radically improving its performance -- Previous algorithm was O(N^2) -- #resolve GSA-723 https://jira.broadinstitute.org/browse/GSA-723	2013-01-27 14:10:06 -05:00
Mark DePristo	0fb238b61e	TraverseActiveRegions Optimizations and Bugfixes: make sure to record position of current locus to discharge active regions when there's no data -- Now records the position of the current locus, as well as that of the last read. Necessary when passing through regions with no reads. The previous version would keep accumulating empty active regions, and never discharge them until end of traversal (if there were no reads in the future) or until a read was finally found -- Protected a call to logger.debug with if ( logger.isDebugEnabled()) to avoid a lot of overhead in writing unseen debugger logging information	2013-01-27 14:10:06 -05:00
Mark DePristo	93d88cdc68	Optimization: LocusReferenceView now passes along the contig index to createGenomeLoc, speeding up their creation -- Also cleaned up some unused methods	2013-01-27 14:10:06 -05:00
Mark DePristo	52a28968a9	ART optimization: BandPassActivityProfile only applies the gaussian filter if the state probability > 0	2013-01-27 14:10:06 -05:00
Mauricio Carneiro	705cccaf63	Making SplitReads output FastQ's instead of BAM - eliminates one step in my pipeline - BAM is too finicky and maintaining parameters that wouldn't be useful was becoming a headache, better avoided.	2013-01-27 02:36:31 -05:00
Mauricio Carneiro	6ea7133d95	Updating licenses of latest moved files	2013-01-26 13:46:52 -05:00
Ami Levy-Moonshine	99cb8d68e9	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-25 16:07:38 -05:00
Mark DePristo	b8c0b05785	Add contract to ensure that getAdapterBoundary returns the right result -- Also renamed the function to getAdaptorBoundary for consistency across the codebase	2013-01-25 16:05:17 -05:00
Mark DePristo	e445c71161	LIBS optimization for adapter clipping -- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS -- Added unittests to cover adapter clipping	2013-01-25 16:05:17 -05:00
Ami Levy-Moonshine	b4447cdca2	In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the sample names are unique in VariantContextUtils.simpleMerge for each set of VCs. This caused a bug that was reported on the forum (when a set of VCs had 2 VCs from the same sample). Now we will check it only in CombineVariants.init using the headers. A new function was added to SampleUtils with unitTests in CVunitTest.java.	2013-01-25 15:49:51 -05:00
Ami Levy-Moonshine	fc22a5c71c	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-25 11:47:38 -05:00
Ami Levy-Moonshine	eaf6279d48	adding RBP to the general calling pipeline and few other small changes to it (to make it run with the current bundle file names)	2013-01-25 11:47:30 -05:00
Mark DePristo	008b617577	Cleanup the getLIBS function in LocusIterator -- Now throws an UnsupportedOperationException in the base class. Only LocusView implements this function and actually returns the LIBS	2013-01-25 11:07:28 -05:00
Eric Banks	6dd0e1ddd6	Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants. This allows us to move SelectVariants into the public suite of tools now.	2013-01-25 09:42:04 -05:00
Mark DePristo	c7a29b1d39	Fixed NPE in ActiveRegionUnitTest by allowing null supporting states in ActiveRegion	2013-01-24 13:48:00 -05:00
Mark DePristo	592f90aaef	ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region -- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visual inspection of both running on 1000G and NA12878 -- Misc. commenting of the code -- Updated ActiveRegionExtension to include a min active region size -- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now	2013-01-24 13:48:00 -05:00
Mark DePristo	c96b64973a	Soft clip probability propagation is capped by the MAX_PROB_PROPAGATION_DISTANCE, which is 50 bp	2013-01-24 13:48:00 -05:00
Mark DePristo	0c94e3d96e	Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp -- Previously we allowed band pass filter size to be specified along with the sigma. But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel. This means that if you use a smaller kernel you get a small band size and therefore faster ART -- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp	2013-01-24 13:47:59 -05:00
Mark DePristo	9e43a2028d	Making band pass filter size, sigma, active region max size and extension all accessible from the command line	2013-01-24 13:47:59 -05:00
Mark DePristo	cd91e365f4	Optimize getCurrentContigLength and getLocForOffset in ActivityProfile	2013-01-24 13:47:59 -05:00
Eric Banks	6790e103e0	Moving lots of walkers back from protected to public (along with several of the VA annotations). Let's see whether Mauricio's automatic git hook really works!	2013-01-24 11:42:49 -05:00
Mark DePristo	09edc6baeb	TraverseActiveRegions now writes out very nice active region and activity profile IGV formatted files	2013-01-23 13:46:01 -05:00
Mark DePristo	8e8126506b	Renaming IncrementalActivityProfile to ActivityProfile -- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV	2013-01-23 13:46:01 -05:00
Mark DePristo	e917f56df8	Remove old ActivityProfile and old BandPassActivityProfile	2013-01-23 13:46:01 -05:00
Mark DePristo	7fd27a5167	Add band pass filtering activity profile -- Based on the new incremental activity profile -- Unit Tested! Fixed a few bugs with the old band pass filter -- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties -- Add new UnitTest for BandPassIncrementalActivityProfile -- Added normalizeFromRealSpace to MathUtils -- Cleanup unused code in new activity profiles	2013-01-23 13:46:01 -05:00
Mark DePristo	eb60235dcd	Working version of incremental active region traversals -- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version. This means that ART walkers will now take much less memory than previously. On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory. Over the whole chr20 only 5K reads were ever held in ART at one time. -- Fixed bug in the way active regions worked with shard boundaries. The new implementation no longer sees shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries. This behavior was actually encoded in the unit tests, so those needed to be updated as well. -- Changed the way that preset regions work in ART. The new contract ensures that you get exactly the regions you requested. The isActive function is still called, but its result has no impact on the regions. With this functionality it should be possible to use the HC as a generic assembly by forcing it to operate over very large regions -- Added a few misc. useful functions to IncrementalActivityProfile	2013-01-23 13:46:00 -05:00
Mark DePristo	ce160931d5	Optimize creation of reads in ArtificialBAMBuilder -- Now caches the reads so subsequent calls to makeReads() don't reallocate the reads from scratch each time	2013-01-23 13:46:00 -05:00
Mark DePristo	e050f649fd	IncrementalActivityProfile, complete with extensive unit tests -- This is an activity profile compatible with fetching its implied active regions incrementally, as activity profile states are added	2013-01-23 13:45:21 -05:00
Mark DePristo	8d9b0f1bd5	Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile -- Required before I jump in and redo the entire activity profile so it can be run incrementally -- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class. -- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality. Almost all of the misc. walker changes are due to this name update -- Code cleanup and docs for TraverseActiveRegions -- Expanded unit tests for ActivityProfile and ActivityProfileState	2013-01-23 13:45:21 -05:00
Mauricio Carneiro	7b8b064165	Last manual license update (hopefully) if everyone updates their git hook accordingly, this will be the last time I have to manually run the script. GSATDG-5	2013-01-18 16:13:07 -05:00
Ami Levy-Moonshine	0fb7b73107	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-18 15:03:42 -05:00
Ami Levy-Moonshine	826c29827b	change the default VCFs gatherer of the GATK (not just the UG)	2013-01-18 15:03:12 -05:00
Eric Banks	6a903f2c23	I finally gave up on trying to get the Haplotype/Allele merging to work in the HaplotypeCaller. I've resigned myself instead to create a mapping from Allele to Haplotype. It's cheap so not a big deal, but really shouldn't be necessary. Ryan and I are talking about refactoring for GATK2.5.	2013-01-18 01:21:08 -05:00
Eric Banks	ded659232b	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-16 22:49:56 -05:00
Eric Banks	a623cca89a	Bug fix for HaplotypeCaller, as reported on the forum: when reduced reads didn't completely overlap a deletion call, we were incorrectly trying to find the reference position of a base on the read that didn't exist. Added integration test to cover this case.	2013-01-16 22:47:58 -05:00
Mark DePristo	738c24a3b1	Add tests to ensure that all insertion reads appear in the active region traversal	2013-01-16 16:25:36 -05:00
Eric Banks	79bc818022	Bug fix for VariantsToVCF: old dbSNP files can have '-' as reference base and those records always need to be padded.	2013-01-16 16:15:58 -05:00
Mark DePristo	2a42b47e4a	Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART -- UnitTests now include combinational tiling of reads within and spanning shard boundaries -- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads -- Updating HC and CountReadsInActiveRegions integration tests	2013-01-16 15:30:00 -05:00
Mark DePristo	ddcb33fcf8	Cache result of getLocation() in Shard so we don't perform the expensive calculation over and over	2013-01-16 15:30:00 -05:00
Mark DePristo	4d0e7b50ec	ArtificialBAMBuilder utility class for creating streams of GATKSAMRecords with a variety of properties -- Allows us to make a stream of reads or an indexed BAM file with reads having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start) -- This stream can be handed back to the caller immediately, or written to an indexed BAM file -- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)	2013-01-16 15:29:59 -05:00
Eric Banks	e47a389b26	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-16 14:59:11 -05:00
Eric Banks	d18dbcbac1	Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base. Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0? When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...	2013-01-16 14:55:33 -05:00
Khalid Shakir	4ffb43079f	Re-committing the following changes from Dec 18: Refactored interval specific arguments out of GATKArgumentCollection into IntervalArgumentCollection such that it can be used in other CommandLinePrograms. Updated SelectHeaders to print out full interval arguments. Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.	2013-01-16 12:43:15 -05:00
Eric Banks	445735a4a5	There was no reason to be sharing the Haplotype infrastructure between the HaplotypeCaller and the HaplotypeScore annotation since they were really looking for different things. Separated them out, adding efficiencies for the HaplotypeScore version.	2013-01-16 11:10:13 -05:00
Eric Banks	392b5cbcdf	The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases. This way walkers won't see anything except the standard bases plus Ns in the reference. Added option to turn off this feature (to maintain backwards compatibility). As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for each of the bases. This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests. Fortunately that walker is being deprecated soon...	2013-01-16 10:22:43 -05:00
Eric Banks	4fb3e48099	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-16 00:13:38 -05:00
Eric Banks	0d282a7750	Bam writing from HaplotypeCaller seems to be working on all my test cases. Note that it's a hidden debugging option for now. Please let me know if you notice any bad behavior with it.	2013-01-16 00:12:02 -05:00
Eric Banks	d3baa4b8ca	Have Haplotype extend the Allele class. This way, we don't need to create a new Allele for every read/Haplotype pair to be placed in the PerReadAlleleLikelihoodMap (very inefficient). Also, now we can easily get the Haplotype associated with the best allele for a given read.	2013-01-15 11:36:20 -05:00
Mark DePristo	3c37ea014b	Retire original TraverseActiveRegion, leaving only the new optimized version -- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests	2013-01-15 10:24:45 -05:00
Eric Banks	94800771e3	1. Initial implementation of bam writing for the HaplotypeCaller with -bam argument; currently only assembled haplotypes are emitted. 2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs. Until the HC uses meta data though, this won't work.	2013-01-15 10:19:18 -05:00
Mark DePristo	b8b2b9b2de	ManagingReferenceOrderedView optimization: don't allow a fresh RefMetaDataTracker in the frequent case where there's no reference meta data	2013-01-14 16:30:16 -05:00
Mark DePristo	7eea6b8f92	ReservoirDownsampler optimizations -- Add an option to not always allocate ArrayLists of targetSampleSize, but rather the previous size + MARGIN. This helps for LIBS as most of the time we don't need nearly so much space as we allow -- consumeFinalizedItems returns an empty list if the reservoir is empty, which is often true for our BAM files with low coverage -- Allow empty sample lists for SamplePartitioner as these are used by the RefTraversals and other non-read based traversals -- Make the reservoir downsampler use a linked list, rather than a fixed sized array list, in the expectFewOverflows case	2013-01-14 16:30:16 -05:00
Mark DePristo	c7f0ca8ac5	Optimization for LIBS: PerSampleReadStateManager now uses a simple LinkedList of AlignmentStateMachine -- Instead of storing a list of list of alignment starts, which is expensive to manipulate, we instead store a linear list of alignment starts. Not grouped as previously. This enables us to simplify iteration and update operations, making them much faster -- Critically, the downsampler still requires this list of list. We convert back and forth between these two representations as required, which is very rarely for normal data sets (WGS NA12878 on chr20 is 0.2%, 4x WGS is even less).	2013-01-14 16:30:16 -05:00
Mark DePristo	5a5422e4f8	Refactor PerSampleReadStates into a separate class -- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager -- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager -- Make LocusIteratorByState final	2013-01-14 16:30:16 -05:00
Mark DePristo	5c2799554a	Refactor updateReadStates into PerSampleReadStateManager, add tracking of downsampling rate	2013-01-14 16:30:16 -05:00
Mark DePristo	a4334a67e0	SamplePartitioner optimizations and bugfixes -- Use a linked hash map instead of a hash map since we want to iterate through the map fairly often -- Ensure that we call doneSubmittingReads before getting reads for samples. This function call fell out before and since it wasn't enforced I only noticed the problem while writing comments -- Don't make unnecessary calls to contains for map. Just use get() and check that the result is null -- Use a LinkedList in PassThroughDownsampler, since this is faster for add() than the existing ArrayList, and we weren't using random access to any resulting	2013-01-14 16:30:16 -05:00
Mark DePristo	19288b007d	LIBS bugfix: kept reads now only (correctly) includes reads that at least passed the reservoir -- Added unit tests to ensure this behavior is correct	2013-01-14 16:30:16 -05:00
Mark DePristo	83fcc06e28	LIBS optimizations and performance tools -- Made LIBSPerformance a full featured CommandLineProgram, and it can be used to assess the LIBS performance by reading a provided BAM -- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls -- Moved updateReadStates to ReadStateManager -- Removed the unnecessary wrapping of an iterator in ReadStateManager -- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get iterating over samples. Now those are just map native iterations -- Restructured collectPendingReads for simplicity, removing redundant and consolidating common range checks. The new piece of code is much clearer and avoids several unnecessary function calls	2013-01-14 16:30:15 -05:00
Mark DePristo	ec05ecef60	getAdaptorBoundary returns an int, not an Integer, as this was taking 30% of the allocation effort for LIBS	2013-01-14 16:30:15 -05:00
Mark DePristo	3a6b4b43b7	Backporting LIBSPerformance improvements to original commit	2013-01-13 09:53:10 -05:00
Mark DePristo	f204908a94	Add some todos for future optimization to LIBS	2013-01-11 15:17:18 -05:00
Mark DePristo	e88dae2758	LocusIteratorByState operates natively on GATKSAMRecords now -- Updated code to reflect this new typing	2013-01-11 15:17:18 -05:00
Mark DePristo	94cb50d3d6	Retire LegacyLocusIteratorByState -- Left in the remaining infrastructure for David to remove, but the legacy downsampler is no longer a functional option in the GATK	2013-01-11 15:17:18 -05:00
Mark DePristo	cc0c1b752a	Delete old LocusIteratorByState, leaving only new LIBS and legacy	2013-01-11 15:17:18 -05:00
Mark DePristo	9e23c592e6	ReadBackedPileup cleanup -- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl -- ReadBackedPileupImpl was literally a shell class after we removed extended events. A few bits of code cleanup and we reduced a bunch of class complexity in the gatk -- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed -- Created optimized calculation routines to iterate over all of the reads in the pileup in whatever order is most efficient as well. -- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups. -- Added commons-collections for IteratorChain	2013-01-11 15:17:18 -05:00
Mark DePristo	e3e3ae29b2	Final documentation for LocusIteratorByState	2013-01-11 15:17:18 -05:00
Mark DePristo	6a91902aa2	Fix final merge conflicts	2013-01-11 15:17:18 -05:00
Mark DePristo	b9a33d3c66	Split original and optimized ART into largely independent pieces -- Allows us to cleanly run old and new art, which now have different traversal behavior (on purpose). Split unit tests as well.	2013-01-11 15:17:18 -05:00
Mark DePristo	02130dfde7	Cleanup ART -- Initialize routine captures essential information for running the traversal	2013-01-11 15:17:17 -05:00
Mark DePristo	9b2be795a7	Initial working version of new ActiveRegionTraversal based on the LocusIteratorByState read stream -- Implemented as a subclass of TraverseActiveRegions -- Passes all unit tests -- Will be very slow -- needs logical fixes	2013-01-11 15:17:17 -05:00
Mark DePristo	8b83f4d6c7	Near final cleanup of PileupElement -- All functions documented and unit tested -- New constructor interface -- Cleanup some uses of old / removed functionality	2013-01-11 15:17:17 -05:00
Mark DePristo	fb9eb3d4ee	PileupElement and LIBS cleanup -- function to create pileup elements in AlignmentStateMachine and LIBS -- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing	2013-01-11 15:17:17 -05:00
Mark DePristo	2f2a592c8e	Contracts and documentation for AlignmentStateMachine and LocusIteratorByState -- Add more unit tests for both as well	2013-01-11 15:17:17 -05:00
Mark DePristo	cc1d259cac	Implement get Length and Bases of OfImmediatelyFollowingIndel in PileupElement -- Added unit tests for this behavior. Updated users of this code	2013-01-11 15:17:17 -05:00
Mark DePristo	2c38310868	Create LIBS using new AlignmentStateMachine infrastructure -- Optimizations to AlignmentStateMachine -- Properly count deletions. Added unit test for counting routines -- AlignmentStateMachine.java is no longer recursive -- Traversals now use new LIBS, not the old one	2013-01-11 15:17:17 -05:00
Mark DePristo	80d9b7011c	Complete rewrite of low-level machinery of LIBS, not hooked up -- AlignmentStateMachine does what SAMRecordAlignmentState should really do. It's correct in that it's more accurate than the LIB_position tests themselves. This is a non-broken, correct implementation. Needs cleanup, contracts, etc. -- This version is like 6x slower than the original implementation (according to the google caliper benchmark here). Obvious optimizations for future commit	2013-01-11 15:17:16 -05:00
Mark DePristo	0ac4352614	LIBS can now (optionally) track the unique reads it uses from the underlying read iterator -- This capability is essential to provide an ordered set of used reads to downstream users of LIBS, such as ART, who want an efficient way to get the reads used in LIBS -- Vastly expanded the multi-read, multi-sample LIBS unit tests to make sure this capability is working -- Added createReadStream to ArtificialSAMUtils that makes it relatively easy to create multi-read, multi-sample read streams for testing	2013-01-11 15:17:16 -05:00
Mark DePristo	b3ecfbfce8	Refactor LIBS into component parts, expand unit tests, some code cleanup -- Split out all of the inner classes of LIBS into separate independent classes -- Split / add unit tests for many of these components. -- Radically expand unit tests for SAMRecordAlignmentState (the lowest level piece of code) making sure at least some of it works -- No need to change unit tests or integration tests. No change in functionality. -- Added (currently disabled) code to track all submitted reads to LIBS, but this isn't accessible or tested	2013-01-11 15:17:16 -05:00
Mark DePristo	b2990497e2	Refactor LIBS into utils.locusiterator before refactoring	2013-01-11 15:17:16 -05:00
Mauricio Carneiro	9ed922d562	Updating licenses to Eric's last commit - for now we're still running the script by hand, soon automated solution will be in place. GSATDG-5	2013-01-11 14:33:00 -05:00
Eric Banks	e7906713d9	Moving some random walkers back to public as requested by Mark. Mauricio, will the licenses get updated automatically?	2013-01-11 02:03:43 -05:00
Ami Levy-Moonshine	352cb831d0	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-10 21:27:06 -05:00
Ami Levy-Moonshine	fac0bce916	add RunCoveredByNSamplesSites; changes in CoveredByNSamplesSites so it can work in parallel; also, move it to diagnostics	2013-01-10 21:26:49 -05:00
Mauricio Carneiro	2a4ccfe6fd	Updated all JAVA file licenses accordingly GSATDG-5	2013-01-10 17:06:41 -05:00
Ryan Poplin	487fb2afb4	Bug fix for the case of overlapping assembled and partially-assembled events created by the HC. Unfortunately the symbolic allele can't be combined with the indel allele because the reference basis will change.	2013-01-09 15:30:46 -05:00
Eric Banks	4fa439d89e	Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependencies now	2013-01-09 11:06:10 -05:00
Eric Banks	676e79542a	Bring CombineVariants back to public since it's used for SG. I needed to break ChromosomeCountConstants out of ChromosomeCounts to make this work.	2013-01-09 10:39:48 -05:00
Ryan Poplin	c87ad8c0ef	Bug fixes related to HC's GGA mode. Tracking just the artificial allele isn't sufficient when there are multiple GGA records that change the reference basis. Also, duplicated records screw up the tracking of merged alleles.	2013-01-09 10:00:46 -05:00
Ami Levy-Moonshine	15ca5015cd	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-08 21:53:36 -05:00
Ami Levy-Moonshine	d6071728e8	add new walker to find sites with good coverage	2013-01-08 17:10:38 -05:00
Eric Banks	264cc9e78d	Resolve protected->public dependencies for BQSR by wrapping the BQSR-specific arguments in a new class. Instead of the GATK Engine creating a new BaseRecalibrator (not clean), it just keeps track of the arguments (clean). There are still some dependency issues, but it looks like they are related to Ami's code. Need to look into it further.	2013-01-08 16:23:29 -05:00
Eric Banks	f0bd1b5ae5	Okay, all public->protected dependencies are gone except for the BQSR arguments. I'll need to think through this but should be able to make that work too.	2013-01-08 15:46:32 -05:00
Eric Banks	245fcc8bb5	Merged bug fix from Stable into Unstable	2013-01-08 12:59:15 -05:00
Eric Banks	d6146d369a	Remove all of the references to ProgramElementDoc	2013-01-08 12:58:31 -05:00
Eric Banks	47d030a52d	Oops, move the covariates over too	2013-01-07 15:47:25 -05:00
Eric Banks	35699a8376	Move bqsr utils to protected	2013-01-07 15:41:21 -05:00
Eric Banks	5371613ad1	Tests seem to pass (can't be positive though because I ran before Tad's recent push), so I'm going to push now (this push touches so many files that I don't want to keep it around much longer). Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-07 15:27:43 -05:00
Ami Levy-Moonshine	d4b4f95e12	move CatVariants to public	2013-01-07 15:07:16 -05:00
Eric Banks	a0219acfaa	Collapse the PerReadAlleleLikelihoodMap classes into 1 now that Lite is gone	2013-01-07 14:55:21 -05:00
Eric Banks	35d9bd377c	Moved (nearly) all Walkers from public to protected and removed GATKLite utils	2013-01-07 14:42:40 -05:00
Eric Banks	b4e7b3d691	Fixed precision problem in the Bayesian calculation of Qemp: we need to cap below max integer because the MathUtils code adds +1. Added unit tests for handling large number of observations.	2013-01-07 13:07:36 -05:00
Tad Jordan	04e3978b04	Fixed VariantEval tests -Added sorting by rows to VariantEval	2013-01-07 12:45:32 -05:00
Ryan Poplin	4f95f850b3	Bug fix in the HC's allele mapping for multi-allelic events. Using the allele alone as a key isn't sufficient because alleles change when the reference allele changes during VariantContextUtils.simpleMerge for multi-allelic events.	2013-01-07 11:05:44 -05:00
Ami Levy-Moonshine	d3c2c97fb2	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-06 23:35:47 -05:00
Ami Levy-Moonshine	c554d9db25	add TODO	2013-01-06 23:04:38 -05:00
Ami Levy-Moonshine	81eef3aa37	merge development branches of log-less HMM and FastGatherer to master	2013-01-06 23:01:58 -05:00
Eric Banks	0249e1f497	Resolving merge conflicts from VCF move	2013-01-06 14:32:31 -05:00
Eric Banks	8822b8e7c8	Moving HelpConstants out of HelpUtils so that we stop getting these ProgramElementDoc errors when com.sun.javadoc cannot load on a user's system.	2013-01-06 14:30:45 -05:00
Eric Banks	ea21dc9cfb	I just committed this - why didn't it work before? Trying again...	2013-01-06 12:44:13 -05:00
Eric Banks	52067f0549	Handle merge conflicts	2013-01-06 12:29:12 -05:00
Eric Banks	bf25e151ff	Handle long->int precision in Bayesian estimate	2013-01-06 12:26:32 -05:00
Eric Banks	b73d72fe94	update docs for LeftAlignVariants	2013-01-06 01:56:57 -05:00
Mark DePristo	2ab55e4ee7	Fixing bug in TraverseDuplicates.printProgress call: only passes in single location of genome loc	2013-01-05 12:50:27 -05:00
Mark DePristo	69bf70c42e	Cleanup and more unit tests for RecalibrationTables in BQSR -- Added unit tests for combining RecalibrationTables. As a side effect now has serious tests for incrementDatumOrPutIfNecessary -- Removed unnecessary enum.index system from RecalibrationTables. -- Moved what were really static utility methods out of RecalibrationEngine and into RecalUtils.	2013-01-05 12:50:27 -05:00
Chris Hartl	9df30880cb	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-04 17:15:22 -05:00
Joel Thibault	01738e70c3	Archive the experimental Active Region Traversals	2013-01-04 17:05:31 -05:00
Chris Hartl	7b7efa0fff	Add in the AAL as an experimental covariate, in case it's wanted.	2013-01-04 16:47:26 -05:00
Chris Hartl	41bc416b65	Remove AAL and update MD5s.	2013-01-04 16:46:14 -05:00
Eric Banks	bce6fce58d	Resolving merge conflicts after Mark's latest push	2013-01-04 14:46:39 -05:00
Eric Banks	dd7f5e2be7	Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests.	2013-01-04 14:43:11 -05:00
Ami Levy-Moonshine	80b531f695	emit all sites where more than 90% of the samples have good coverage	2013-01-04 14:27:50 -05:00
Tad Jordan	fe06912a87	Removed sorting by row from walkers	2013-01-04 11:52:33 -05:00
Mark DePristo	810e2da1d4	Cleanup and unit tests for EventType and ReadRecalibrationInfo in BQSR -- Added unit tests for EventType and ReadRecalibrationInfo -- Simplified interface of EventType. Previously this enum carried an index with it, but this is redundant with the enum.ordinal function. Now just using that function instead.	2013-01-04 11:39:25 -05:00
Mark DePristo	a5901cdd20	Bugfix for printProgress in TraverseReadsNano -- Must provide a single bp position (1:10) not the range of the read (1:1-50). ProgressMeter now checks at runtime for this problem as well.	2013-01-04 11:39:24 -05:00
Mark DePristo	bbdf9ee91b	BQSR cleanup: merge Advanced and Standard recalibration engine into just the RecalibrationEngine -- As we are no longer maintaining a public/protected system we need only have one RecalibrationEngine. -- Misc. code cleanup and docs along the way	2013-01-04 11:39:24 -05:00
Mark DePristo	7df47418d8	BQSR optimization: make RecalibrationTables thread-local, and merge results in onTraversalDone -- With the newer, faster BQSR, scaling was limited by the NestedIntegerArray. The solution to this is to make the entire table thread-local, so that each nct thread has its own data and doesn't have any collisions. -- Removed the previous partial solution of having a thread-local quality score table -- Added a new argument -lowMemory	2013-01-04 11:39:24 -05:00
Mark DePristo	1ba8d47a81	Unit tests for ProgressMeterDaemon	2013-01-04 11:39:24 -05:00
Joel Thibault	319d651e4a	Initial updates for ActiveRegionShard	2013-01-03 17:00:13 -05:00
Joel Thibault	e7553545ef	Initial updates for ReadShard	2013-01-03 17:00:13 -05:00
Joel Thibault	14a3ac0e3c	Enable the use of alternate shards	2013-01-03 17:00:13 -05:00
Joel Thibault	4cc372f53b	LocusShardDataProvider doesn't need its own GenomeLocParser	2013-01-03 17:00:13 -05:00
Joel Thibault	ffbd4d85f2	No need to pass fields as parameters	2013-01-03 17:00:12 -05:00
Tad Jordan	c1ba12d71a	Added unit test for outputting sorted GATKReport Tables - Made a few small modifications to code - Replaced the two arguments in GATKReportTable constructor with an enum used to specify way of sorting the table	2013-01-03 16:53:59 -05:00
Ami Levy-Moonshine	10a705b27f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-03 13:42:31 -05:00
Ami Levy-Moonshine	2018285a39	better error message	2013-01-03 13:41:03 -05:00
Eric Banks	c7039a9b71	Pushing in implementation of the Bayesian estimate of Qemp for the BQSR. This isn't hooked up yet with BQSR; it's just a static method used in my testing walker. I'll hook this into BQSR after more testing and the addition of unit tests. Most of the changes in this commit are actually documentation-related.	2013-01-02 15:21:44 -05:00
Joel Thibault	c515175313	Ensure that active region extensions stay on contig	2013-01-02 14:46:24 -05:00
Chris Hartl	e1d09ab0db	QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests should all pass.	2013-01-02 14:41:29 -05:00
Mark DePristo	12f4c6307e	AutoFormattingTime cleanup and complete unittests -- Underlying system now uses long nano times to be more consistent with standard java practice -- Updated a few places in the code that were converting from nanoseconds to double seconds to use the new nanoseconds interface directly -- Bringing us to 100% test coverage with clover with AutoFormattingTimeUnitTest	2013-01-02 11:29:25 -05:00
Mark DePristo	5558a6b8f7	Deleting / archiving no longer used classes -- AminoAcidTable and AminoAcid go to the archive -- Removing two unused SAMRecord classes	2012-12-29 14:34:17 -05:00
Mark DePristo	38cc496de8	Move SomaticIndelDetector and associated tools and libraries into private/andrey package -- Intermediate commit on the way to archiving SomaticIndelDetector and other tools. -- SomaticIndelDetector, PairMaker and RemapAlignments tools have been refactored into the private andrey package. All utility classes refactored into here as well. At this point, the SomaticIndelDetector builds in this version of the GATK. -- Subsequent commit will put this code into the archive so it no longer builds in the GATK	2012-12-29 14:34:08 -05:00
Ami Levy-Moonshine	f450cbc1a3	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-27 21:23:59 -05:00
Eric Banks	75d5b88a3d	Enabling the Recal Report unit test (which looks like it was never ever enabled)	2012-12-26 15:35:50 -05:00
Eric Banks	efceb0d48c	Check for well-encoded reads while fixing mis-encoded ones	2012-12-26 14:30:51 -05:00
Mark DePristo	af9746af52	Fix merge failure	2012-12-24 13:43:04 -05:00
Mark DePristo	04cc75aaec	Minor cleanup and expansion of the RecalDatum unit tests	2012-12-24 13:35:58 -05:00
Mark DePristo	7bf1f67273	BQSR optimization: read group x quality score calibration table is thread-local -- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table. Radically reduces the contention for RecalDatum in this very highly used table -- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables. Updated combine in RecalibrationReport to use this table combiner function -- Made several core functions in RecalDatum into final methods for performance -- Added RecalibrationTestUtils, a home for recalibration testing utilities	2012-12-24 13:35:58 -05:00
Mark DePristo	295455eee2	NanoScheduler optimizations and simplification -- The previous model was to enqueue individual map jobs (with a resolution of 1 map job per map call), to track the number of map calls submitted via a counter and a semaphore, and to use this information in each map job and reduce to control the number of map jobs, when reduce was complete, etc. All hideously complex. -- This new model is vastly simpler. The reducer basically knows nothing about the control mechanisms in the NanoScheduler. It just supports multi-threaded reduce. The NanoScheduler enqueues exactly nThread jobs to be run, which continually loop reading, mapping, and reducing until they run out of material to read, when they shut down. The master thread of the NS just holds a CountDownLatch, initialized to nThreads, and when each thread exits it reduces the latch by 1. The master thread gets the final reduce result when it's freed by the latch reaching 0. It's all super super simple. -- Because this model uses vastly fewer synchronization primitives within the NS itself, it's naturally much faster at getting things done, without any of the overhead obvious in profiles of BQSR -nct 2.	2012-12-24 13:35:57 -05:00
Mark DePristo	aa3ee29929	Handle case where the ReadGroup is null in GATKSAMRecord	2012-12-24 13:35:57 -05:00
Mark DePristo	bf81db40f7	NanoScheduler reducer optimizations -- reduceAsMuchAsPossible no longer blocks threads via synchronization, but instead uses an explicit lock to manage access. If the lock is already held (because some thread is doing reduce) then the thread attempting to reduce immediately exits the call and continues doing productive work. This removes one major source of blocking contention in the NanoScheduler	2012-12-24 13:35:57 -05:00
Mark DePristo	940816f16a	GATKSamRecord now checks that the read group is a GATKReadGroupRecord, and if not makes one	2012-12-24 13:35:57 -05:00
Mark DePristo	14944b5d73	Incorporating clover into build.xml -- See http://gatkforums.broadinstitute.org/discussion/2002/clover-coverage-analysis-with-ant for use docs -- Fix for artificial reads not having proper read groups, causing NPE in some tests -- Added clover itself to private/resources	2012-12-24 13:35:57 -05:00
Mark DePristo	7796ba7601	Minor optimizations for NanoScheduler -- Reducer.maybeReleaseLatch is no longer synchronized -- NanoScheduler only prints progress every 100 or so map calls	2012-12-24 13:35:56 -05:00
Mark DePristo	0f04485c24	NanoScheduler optimization: don't use a PriorityBlockingQueue for the MapResultsQueue -- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue. -- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N. This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step -- Had to update ReducerUnitTest because the test itself was brittle, and broken when I changed the underlying code. -- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc) -- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs	2012-12-24 13:35:56 -05:00
Mark DePristo	b92f563d06	NanoScheduler optimization for TraverseReadsNano -- Pre-read MapData into a list, which is actually faster than dealing with future lock contention issues with lots of map threads -- Increase the ReadShard default size to 100K reads by default	2012-12-24 13:35:56 -05:00
Mark DePristo	f849910c4e	BQSR optimization: only compute BAQ when there's at least one error to delocalize -- Saves something like 2/3 of the compute cost of BQSR	2012-12-24 13:35:56 -05:00
Mark DePristo	0f0188ddb1	Optimization of BQSR -- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine. This object has a restrictive interface to just get information about specific qual and error values at offset and for event type. This restriction allows us to avoid creating a vector of byte 45 for each read to represent BI and BD values not in the reads. Shaves 5% of the runtime off the entire code. -- Cleaned up code and added lots more docs -- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR. 95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.	2012-12-24 13:35:09 -05:00
Mark DePristo	f6d5499582	The GATK engine now ensures that incoming GATKSAMRecords have GATKSAMReadGroupRecord objects in their header -- Update SAMDataSource so that the merged header contains GATKSAMReadGroupRecord -- Now getting the NGSPlatform for a GATKSAMRecord is actually efficient, instead of computing the NGS platform over and over from the PL string -- Updated a few places in the code where the input argument is actually a GATKSAMRecord, not a SAMRecord for type safety	2012-12-24 13:35:09 -05:00
Ami Levy-Moonshine	8be01af145	add the new gather tool to GATKExtensionsGenerator	2012-12-21 15:09:00 -05:00
Ami Levy-Moonshine	3ca3fd4b3e	keep working on loglessHMM in UG	2012-12-21 11:06:12 -05:00
Ami Levy-Moonshine	6590039bc3	add fast gather to UG; change UG to work with log-lessHMM (work in progress)	2012-12-20 14:58:57 -05:00
Tad Jordan	b491c177ff	Added functionality of outputting sorted GATKReport Tables - Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables - Modified BQSR Integration Tests to include the optional argument. Tests now produce sorted tables	2012-12-20 14:02:21 -05:00
Eric Banks	6c3f5eefe9	Merged bug fix from Stable into Unstable	2012-12-19 22:29:21 -05:00
xingwei2012	22d13ccdab	Bug fix for Queue LSF v8.3 the function ls_getLicenseUsage() is not supported by LSF v8.x, comment the line: public static native lsfLicUsage.ByReference ls_getLicenseUsage() Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-12-19 22:28:53 -05:00
Ryan Poplin	54e5c84018	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-19 11:31:40 -05:00
David Roazen	07b369ca7e	Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package This is an intermediate commit so that there is a record of these changes in our commit history. Next step is to isolate the test classes as well, and then move the entire package to the Picard repository and replace it with a jar in our repo. -Removed all dependencies on org.broadinstitute.sting (still need to do the test classes, though) -Had to split some of the utility classes into "GATK-specific" vs generic methods (eg., GATKVCFUtils vs. VCFUtils) -Placement of some methods and choice of exception classes to replace the StingExceptions and UserExceptions may need to be tweaked until everyone is happy, but this can be done after the move.	2012-12-19 10:25:22 -05:00
Ryan Poplin	cda0c48570	auto-merge	2012-12-19 10:12:49 -05:00
Mark DePristo	1ca13f9581	Fundamentally better model for the NanoScheduler -- Now each map job reads a value, performs map, and does as much reducing as possible. This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc. All of this is accomplished using exactly NCT% of the CPU of the machine. -- Has the additional value of actually simplifying the code -- Resolves a long-standing annoyance with the nano scheduler.	2012-12-19 09:31:31 -05:00
David Roazen	d0cd29cb36	Merged bug fix from Stable into Unstable	2012-12-19 02:20:28 -05:00
David Roazen	0d93330ab9	Fix bug in the PerSampleDownsamplingReadsIterator that could lead to excessive memory usage at traversal startup This is a MUST-HAVE update for GATK 2.3 users who want to try out the new ability to use -dcov with ReadWalkers.	2012-12-19 02:05:36 -05:00
Yossi Farjoun	6ed9eb3da9	GATKBAMIndex now passes unit test! Problem was that SeekableBufferedStream seems to have a bug: it will read beyond the end of a file if asked to.	2012-12-18 17:32:26 -05:00
Ryan Poplin	902ca7ea70	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-18 15:45:33 -05:00
Ryan Poplin	3950f7b3e3	Increasing the INFORMATIVE_LIKELIHOOD_THRESHOLD value to 0.2	2012-12-18 15:45:12 -05:00
Ryan Poplin	b5d590ba92	Based on NA12878 knowledge base experiments updating HC to allow for a much smaller minimum kmer length in the assembly graph.	2012-12-18 15:43:56 -05:00
eitanbanks	002ce9c1d5	Merge pull request #8 from yfarjoun/master Huge speedup in initial traversal of BAM index files (x20 speed!)	2012-12-18 10:16:53 -08:00
Mark DePristo	16eb1c5436	Optimization to TraverseReadsNano -- Don't just read all inputs into a list, and then provide an iterator to that list, actually make a real iterator so NanoScheduler input thread can contribute meaningfully to the work load -- Use NanoScheduler progress function, instead of home-grown updater	2012-12-18 10:14:47 -05:00
Mark DePristo	b33f804cdc	Inline increment function in RecalDatum to avoid minor duplication of work and multiple synchronized method calls	2012-12-17 16:47:27 -05:00
Mark DePristo	66d32f646b	Minor cleanup of BAQ calculation (final variables, etc)	2012-12-17 16:47:27 -05:00
Mark DePristo	67fe81391c	ProgressMeter optimization: don't do genome loc formatting, but instead create an object that only formats when printing is actually needed	2012-12-17 16:47:27 -05:00
Mark DePristo	1de2f527b9	Optimization of recalibrateRead -- Refactor calculation so that upfront constant values are pre-computed, and cached, and their values just looked up during application -- Trivial comment on how we might use BAQ better in BaseRecalibrator	2012-12-17 16:47:27 -05:00
Mark DePristo	bd6cda7542	Trivial optimization of TraverseReadsNano -- don't format the shard toString if logger isn't debug enabled	2012-12-17 16:47:27 -05:00
Mark DePristo	a481d006f0	Optimizations for applying BQSR table with PrintReads -- Cleaned up code in updateDataForRead so that constant values were not computed in inner loops -- BaseRecalibrator doesn't create its own fasta index reader, it just piggy backs on the GATK one -- ReadCovariates <init> now uses a thread local cache for its int[][][] keys member variable. This stops us from recreating an expensive array over and over. In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values it's skipping because of low quality base clipping. Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.	2012-12-17 16:47:27 -05:00
Mark DePristo	5ec25797b3	Optimizations for BaseRecalibrator -- No longer computes at each update the overall read group table. Now computes this derived table only at the end of the computation, using the ByQual table as input. Reduces BQSR runtime by 1/3 in my test	2012-12-17 16:47:27 -05:00
Eric Banks	e6f468b647	Refactored the quasi-useful IndelType annotation into the more useful VariantType. The indels are still annotated as before, but now all other variant types are annotated too. I'm doing this because of requests on the forum but am not making it standard. If we find it to be useful we can turn it on by default later.	2012-12-17 11:54:47 -05:00
Eric Banks	762f184262	Bug fix for strict validation: rsID checking wasn't working if there were multiple IDs	2012-12-17 10:32:41 -05:00
Yossi Farjoun	ea704d688f	chose smaller buffer size for the bufferedStream	2012-12-15 13:01:38 -05:00
Yossi Farjoun	6da2338ea7	removed comments and unneeded imports	2012-12-15 12:31:37 -05:00
Yossi Farjoun	19dd2d628a	some changes. some changes.	2012-12-14 17:21:32 -05:00
Mauricio Carneiro	74344a3871	Bringing in the changes from the CMI repo	2012-12-13 21:59:37 -05:00
Eric Banks	696bf95fba	Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer).	2012-12-13 23:28:30 +00:00
Mark DePristo	aeab932c63	Actual working version of unflushing VCFWriter -- Uses high-performance local writer backed by byte array that writes the entire VCF line in a single write operation to the underlying output stream. -- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping -- Same (or better) IO performance as previous implementation -- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed -- Updated compressed VCF output file	2012-12-13 16:15:08 -05:00
Yossi Farjoun	5e66109268	Replaced a useless getInt with a skipInt to remove 1/4 of the initial seek time in the BAM Index.	2012-12-12 17:08:11 -05:00
Eric Banks	62eaffdf0a	Fix docs for ReadBackedPhasing	2012-12-12 20:28:04 +00:00
Eric Banks	bba63a3b0e	Fix for GSA-615: UnifiedGenotyperEngine.getGLModelsToUse takes 5% of the runtime of UG, should be optimized away.	2012-12-12 20:25:45 +00:00
Mauricio Carneiro	a52e3c7e15	Revert "Bug fix for RR: don't let the softclip start position be less than 1" this introduced a bug in reduce reads by de-activating its hard clipping of the out of bounds soft-clips (especially in the MT). DEV-322 #resolve #time 4m This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.	2012-12-12 13:09:39 -05:00
Mark DePristo	5632c13bf2	Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush(). -- Now compressed output VCFs are properly blocked compressed (i.e., they are actually smaller than the uncompressed VCF)	2012-12-12 10:27:07 -05:00
Mark DePristo	dd52a70d45	Fix AFCalcResult unit test -- I was simply passing in the wrong values into the function. Fixed the calls, and expanded the docs on what needs to be passed in.	2012-12-11 10:40:12 -05:00
Ami Levy-Moonshine	6bf31065e3	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-11 10:34:50 -05:00
Ami Levy-Moonshine	2e3284f306	Continue to fix the case where PRIORITIZE is used but no priority list is given. While fixing that case I also removed unnecessary sorting, when the priority list is not provided. When the priority list is not provided, it will continue to be null. Thus, the number of original Variant Contexts should be given as a new parameter to simpleMerge (since priority might be null). This new parameter is used for checking if there are filtered VC, when annotationOrigin is true.	2012-12-10 22:23:58 -05:00
Mauricio Carneiro	8a115edbaf	ReduceReads is now scattered by contig It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.	2012-12-10 15:25:27 -05:00
Ami Levy-Moonshine	573ace4403	restore the right version of VariantContextUtils.java in my unstable dir	2012-12-10 10:28:56 -05:00
David Roazen	46edab6d6a	Use the new downsampling implementation by default -Switch back to the old implementation, if needed, with --use_legacy_downsampler -LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and the original LocusIteratorByState becomes LegacyLocusIteratorByState -Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer, with the old one renamed to LegacyReadShardBalancer -Performance improvements: locus traversals used to be 20% slower in the new downsampling implementation, now they are roughly the same speed. -Tests show a very high level of concordance with UG calls from the previous implementation, with some new calls and edge cases that still require more examination. -With the new implementation, can now use -dcov with ReadWalkers to set a limit on the max # of reads per alignment start position per sample. Appropriate value for ReadWalker dcov may be in the single digits for some tools, but this too requires more investigation.	2012-12-10 09:44:50 -05:00
Ami Levy-Moonshine	5460c96137	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-09 23:43:57 -05:00
Ami Levy-Moonshine	3a420d163e	(1) changes in catVariants (work still under development) (2) changes to CV to throw an error when GenotypeMergeType is PRIORITIZE but no priority (rod_priority_list) is given. Reported by TechnicalVault on the forum on Nov 14 2012	2012-12-09 23:40:03 -05:00
Eric Banks	574d5b467f	Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype.	2012-12-09 02:09:34 -05:00
Ami Levy-Moonshine	5d78a61f7a	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-05 15:07:12 -05:00
Mark DePristo	465694078e	Major performance improvement to the GATK engine -- The NanoScheduler timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers. Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests. -- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x. For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement). -- Took this opportunity to improve the GATK ProgressMeter. NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress. This removes all inner loop calls to the GATK timers. -- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402	2012-12-05 14:49:22 -05:00
Mark DePristo	2b601571e7	Better error handling in NanoScheduler -- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown. Errors, like out of memory, would cause the whole system to die. This bugfix resolves that issue	2012-12-05 14:49:22 -05:00
Eric Banks	0c925856cb	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-05 02:00:39 -05:00
Eric Banks	ef87b18e09	In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then).	2012-12-05 02:00:35 -05:00
Mauricio Carneiro	30f013aeb0	Added a copy() method for ReadBackedPileups necessary to create new alignment contexts with hard-copies of the pileup.	2012-12-05 01:32:18 -05:00
Mauricio Carneiro	6feda540a4	Better error message for SimpleGATKReports	2012-12-05 01:32:18 -05:00
Randal Moore	8d2d0253a2	introduce a level of indirection for the forum URLs - this new function will allow me a place to morph the URL into something that is supported by Confluence Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-12-03 22:33:02 -05:00
Eric Banks	67932b357d	Bug fix for RR: don't let the softclip start position be less than 1	2012-12-03 15:59:14 -05:00
Ryan Poplin	a47da9bb2f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-03 14:30:14 -05:00
Eric Banks	5fed9df295	Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh).	2012-12-03 12:18:20 -05:00
Eric Banks	b6839b3049	Added checking in the GATK for mis-encoded quality scores. The check is performed by a Read Transformer that samples (currently set to once every 1000 reads so that we don't hurt overall GATK performance) from the input reads and checks to make sure that none of the base quals is too high (> Q60). If we encounter such a base then we fail with a User Error. * Can be over-ridden with --allow_potentially_misencoded_quality_scores. * Also, the user can choose to fix his quals on the fly (presumably using PrintReads to write out a fixed bam) with the --fix_misencoded_quality_scores argument. Added unit tests.	2012-12-03 11:18:41 -05:00
Ryan Poplin	18b002c99c	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-03 10:08:56 -05:00
Ryan Poplin	1bdf17ef53	Reworking of how the likelihood calculation is organized in the HaplotypeCaller to facilitate the inclusion of per allele downsampling. We now use the downsampling for both the GL calculations and the annotation calculations.	2012-12-02 11:58:32 -05:00
Ami Levy-Moonshine	d0b8cc7773	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-01 00:08:25 -05:00
Ami Levy-Moonshine	969c995298	work under development - catVariants. Changes to AssessRRQuals based on Eric todo comments. bug fix in CombineVariants	2012-12-01 00:08:19 -05:00
Mark DePristo	8020ba14db	Minor cleanup of SAMDataSource as part of my system review -- Changed a few function from public to protected, as they are only used by the package contents, to simplify the SAMDataSource interface	2012-11-30 15:04:41 -05:00
Mauricio Carneiro	fc7fab5f3b	Fixed ReadBackedPileup downsampling Downsampling in the PerSampleReadBackedPileup was broken, it didn't downsample anything, always returning a copy of the original pileup.	2012-11-30 00:42:05 -05:00
Joel Thibault	97d29f203e	Add walltime changes to LSF - Check whether the specified attribute is available - Add pipeline test (disabled due to missing attribute)	2012-11-29 15:23:37 -05:00
Joel Thibault	198923b597	Add ActiveRegionReadState handling	2012-11-28 13:59:57 -05:00
Ryan Poplin	f0395b457a	Adding the work-in-progress, experimental RepeatLengthCovariate to the BQSR so Chris can continue the development.	2012-11-28 13:56:32 -05:00
Eric Banks	3463774f2a	Merged bug fix from Stable into Unstable	2012-11-28 13:26:52 -05:00
Eric Banks	6030605242	Added quick check for creation of bad BAQ values associated with badly encoded base qualities; hopefully this can help us debug the non-reproducible issue seen by many users.	2012-11-28 13:26:31 -05:00
Mark DePristo	c676853731	Merged bug fix from Stable into Unstable. Updating md5s Conflicts: protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java	2012-11-28 12:54:36 -05:00
Mark DePristo	a1d6461121	Critical bugfix to AFCalcResult affecting UG/HC quality score emission thresholds As reported by Menachem Fromer: a critical bug in AFCalcResult: Specifically, the implementation: public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) { return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef; } seems incorrect and should probably be: getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef The issue here is that the 30 represents a Phred-scaled probability of error and it's currently being compared to a log probability of non-error. Instead, we need to require that our probability of error be less than the error threshold. This bug has only a minor impact on the calls -- hardly any sites change -- which is good. But the inverted logic affects multi-allelic sites significantly. Basically you only hit this logic with multiple alleles, and in that case it's including extra alt alleles incorrectly, and throwing out good ones. Change was to create a new function that properly handles thresholds that are PhredScaled quality scores: /** * Same as #isPolymorphic but takes a phred-scaled quality score as input */ public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) { if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 "); final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual)); return isPolymorphic(allele, log10Threshold); }	2012-11-28 12:08:02 -05:00
Menachem Fromer	79bc878e6a	Allow debugging to be set from the command line	2012-11-27 22:37:41 -05:00
Eric Banks	b40d3eb8aa	Merged bug fix from Stable into Unstable	2012-11-27 14:41:07 -05:00
Eric Banks	01abcc3e0f	Tests didn't like my note to Geraldine in the output logs; apparently it's tested in integration tests	2012-11-27 14:40:49 -05:00
Joel Thibault	d83ad906ef	Add profile range contract	2012-11-27 13:03:13 -05:00
Eric Banks	9531e58445	Merged bug fix from Stable into Unstable	2012-11-27 11:00:50 -05:00
Eric Banks	4543ece088	Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case.	2012-11-27 11:00:33 -05:00
Eric Banks	a82ec7ad80	Merged bug fix from Stable into Unstable	2012-11-27 10:27:08 -05:00
Eric Banks	e199562c25	I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever).	2012-11-27 10:26:17 -05:00
Mauricio Carneiro	97fd5de260	Merging latest CMI updates with UNSTABLE	2012-11-27 09:08:00 -05:00
Eric Banks	b1969a66bd	Update docs	2012-11-27 08:24:41 -05:00
Eric Banks	cc72aaefeb	Minor efficiency: use >= instead of > in test	2012-11-27 01:11:23 -05:00
Eric Banks	405f3c675d	Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc.	2012-11-27 01:07:00 -05:00
Ryan Poplin	e27d677c13	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-26 12:20:32 -05:00
Ryan Poplin	c3b7dd1374	Misc cleanup in the HaplotypeCaller. Cleaning up unused arguments after recent changes to HC-GenotypingEngine	2012-11-26 12:19:11 -05:00
Eric Banks	4f7fa3009a	I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA.	2012-11-26 11:34:59 -05:00
Mauricio Carneiro	a3f5932501	Fixed null pointer exception in Integration Tests When running Utils.setupWriter with NO_PG_TAG set, the writer was attempting to create a program record with the null pointer. Fixed.	2012-11-26 11:12:27 -05:00
Ryan Poplin	fedc4fde6c	Merged bug fix from Stable into Unstable	2012-11-25 21:55:55 -05:00
Ryan Poplin	d978cfe835	Soft clipped bases shouldn't be counted in the delocalized BQSR.	2012-11-25 21:55:29 -05:00
Eric Banks	9719ba7adc	Remove -number example from the docs since it's no longer supported.	2012-11-22 21:53:42 -05:00
Menachem Fromer	2306518ab6	Fix to deal with 'proper' options of casting	2012-11-22 01:45:18 -05:00
Menachem Fromer	d33a412b5f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-22 01:42:29 -05:00
Mark DePristo	48f271c5bd	Adding 80% support for multi-allelic variants -- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here. It's not perfect and needs some discussion on how to handle het/alt variants -- Adding splitInBiallelic function to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)	2012-11-21 17:24:59 -05:00
Joel Thibault	c08b782743	Count isActive calls directly	2012-11-21 17:16:45 -05:00
Eric Banks	4f2229d399	As per the TODO message, I removed a check that was no longer necessary. Now ID is an allowable INFO field key.	2012-11-21 16:01:26 -05:00
Menachem Fromer	06261b58c2	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-21 15:57:08 -05:00
Eric Banks	ed50814ccb	Finally found a case where user errors were being masked behind other errors and could debug. It turns out that the checkForMaskedUserErrors() method needs to run recursively over all levels (calling exception.getCause()) to check for the original cause.	2012-11-21 15:57:05 -05:00
Menachem Fromer	c8be7c3102	Keep SNPs and indels separately for batch merging; Add options to DepthOfCoverage to count fragments (to not double-count overlapping reads of same fragment); DepthOfCoverage should now support ReducedReads; Replace recursion with loop in DoC/package.scala (for lists longer than 5000 elements)	2012-11-21 15:56:53 -05:00
Ami Levy-Moonshine	4714ccc284	change the way CombineVariants check the priority arguments in order to throw error when the genotypeMergeOption argument is set to PRIORITIZE but PRIORITY_STRING is not provided	2012-11-21 10:47:35 -05:00
Eric Banks	2e1a055aca	Merged bug fix from Stable into Unstable	2012-11-20 23:20:33 -05:00
Eric Banks	c54fc94505	Protect against features that start off the end of the read (otherwise, Arrays.fill fails)	2012-11-20 23:19:59 -05:00
Eric Banks	c2efb04657	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-20 22:43:15 -05:00
Eric Banks	72e2d569c5	The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality.	2012-11-20 22:41:57 -05:00
Mark DePristo	cc7680e601	NA12878 knowledge base backed by MongoDB -- Idea is simply to create a persistent database of all TP/FP sites on chr20 in NA12878. Individual callsets can be imported, and a consensus algorithm is run over all callsets in the database to create a consensus collection, which can be used to assess NA12878 callsets for GATK and methods development -- Framework for representing simple VariantContexts and Genotypes in MongoDB, querying for records, and iterating over them in the GATK -- Not hooked up to Tribble, but could be done reasonably easily now (future TODO) -- Tools to import callsets, create consensus callsets, import and export reviews -- Scripts to reset the knowledge base and repopulate it with the standard data files (Eric will expand) -- Actually scales to all of chr20, includes AssessNA12878 that reads a VCF and itemizes it against the truth data set -- ImportCallset can load OMNI, HM3, CEU best practices, mills/devine sites and genotypes, properly marking sites as poly/mono/unk as well as TP/FP/UNK based on command line parameters -- Added shell scripts that start up a local mongo db, that connect to a local or BI hosted mongo for NA12878.db for debugging, and a setupNA12878db script that can load OMNI, HM3, CEU best practices, Mills/Devine into the db and then update the consensus. -- Reviewed sites can be exported to a VCF, and imported again, as a mechanism to safely store the only non-recoverable data from the Mongo DB. -- Created a NA12878DBWalker that manages the outer DB interaction, and that all MongoDB interacting walkers inherit from. Added a NA12878DBArgumentCollection.java consolidating all of the common command line arguments (though strictly not necessary as all of this occurs in the root walker) UnitTests -- Can connect to a test knowledge base for development and unit testing -- PolymorphicStatus, TruthStatus, SiteIterator -- NA12878KBUnitTestBase provides simple utilities for connecting to the test mongo db, getting calls, etc -- MongoVariantContext tests creation, matching, and encoding -> writing -> read -> decoding from the mongodb AssessNA12878 -- Generic tool for comparing a NA12878 callset against the knowledge base. See http://gatkforums.broadinstitute.org/discussion/1848/using-the-na12878-knowledge-base for detailed documentation -- Performs trivial filtering on FS, MQ, QD for SNPs and non-SNPs to separate out variants likely to be filtered from those that are honest-to-goodness FPs Misc -- Ability to provide Description for Simplified GATK report	2012-11-20 18:50:52 -05:00
Eric Banks	937ac7290f	Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results.	2012-11-20 16:13:29 -05:00
Eric Banks	4f243acaa6	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-19 10:34:44 -05:00
Eric Banks	f0b8a0228f	Quick fix for HC refactoring: when copying over Haplotype objects, make sure to copy over the artificial allele used to create it too.	2012-11-19 09:57:55 -05:00
Eric Banks	ff180a8e02	Significant refactoring of the Haplotype Caller to handle problems with GGA. The main fix is that we now maintain a mapping from 'original' allele to 'Smith-Waterman-based' allele so that we no longer need to do a (buggy) matching throughout the calling process.	2012-11-19 09:09:57 -05:00
Eric Banks	78ce822b6f	Protect against NPE when using non-GATK reports for inputs expecting valid GATK reports	2012-11-19 09:07:04 -05:00
Joel Thibault	b70fd4a242	Initial testing of the Active Region Traversal contract - TODO: many more tests and test cases	2012-11-15 10:08:00 -05:00
Guillermo del Angel	a68e6810c9	Back off experimental code that escaped last commit, not for general use yet	2012-11-14 14:45:15 -05:00
Guillermo del Angel	89bbe73a43	Commenting out CMI pipeline test that wasn't meant to be in GATK repository (why was this merged??)	2012-11-14 14:39:04 -05:00
Guillermo del Angel	3771d074dc	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-11-14 14:37:43 -05:00
Mauricio Carneiro	e35fd1c717	Merging CMI-0.5.0 and GATK-2.2 together.	2012-11-14 10:42:03 -05:00
Mauricio Carneiro	a079d8d0d1	Breaking the utility to write @PG tags for SAMFileWriters and StingSAMFileWriters	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	dba31018f4	Implementation of BySampleSAMFileWriter ReduceReads now works with the n-way-out capability, splitting by sample. DEV-27 #resolve #time 3m	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	a17cd54b68	Co-Reduction implementation in ReduceReads ReduceReads now co-reduces bams if they're passed in together with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples. Also: * Added integrationtest for co-reduction * Fixed bug with new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5's accordingly) DEV-200 #resolve #time 8m	2012-11-14 10:33:21 -05:00
kshakir	6d59dd3455	Scala classes were only returning direct subclasses (confirmed when inspected in debugger) so changed PluginManager to allow specifying the explicit subclass. Removed some generics from PluginManager for now until able to figure out syntax for requesting explicit subclass. QStatusMessenger uses a slightly more primitive Map[String, Seq[RemoteFile]] instead of Map[ArgumentSource, Seq[RemoteFile]]. Added an QCommandPlugin.initScript utility method for handling specialized script types.	2012-11-14 10:33:20 -05:00
Eric Banks	42ddf51156	Merged bug fix from Stable into Unstable	2012-11-14 10:29:09 -05:00
Eric Banks	ba41f65759	Protect against NPEs in SelectVariants by checking for missing Genotypes	2012-11-13 11:53:39 -05:00
Eric Banks	c7335c9902	Having a malformed GATK report is a User Error	2012-11-13 11:53:12 -05:00
Eric Banks	525cf331f4	Don't catch a User Error and re-throw as a Reviewed Exception. That makes Eric unhappy.	2012-11-13 11:52:47 -05:00
Eric Banks	ee776e996a	Merged bug fix from Stable into Unstable	2012-11-09 08:35:51 -05:00
Eric Banks	66cbaaee31	Fixed nasty bug in BQSR csv file creation: numbers larger than 999 in the Errors column were printed out with commas (which looks like a separate column). This wasn't caught earlier because there are no integration tests covering the csv. I'll add one into unstable in a sec.	2012-11-09 08:33:55 -05:00
Eric Banks	e9183d9fe0	Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec.	2012-11-08 15:07:47 -05:00
Eric Banks	17ab3a39d5	Make the --intermediate_csv_file argument un-hidden.	2012-11-08 14:35:23 -05:00
Eric Banks	f4d4846435	Merged bug fix from Stable into Unstable	2012-11-06 20:53:54 -08:00
Eric Banks	15b8c08132	Apparently CIGAR elements can have 0 length according to the spec, but 0Ms were causing left alignment of indels to fail. Fixed.	2012-11-06 20:53:33 -08:00
Mark DePristo	f8a0a947e3	Critical bugfix for GSA-652 / Multi-threaded VCF -> BCF writing produces invalid intermediate file that fails on merging -- New tribble library now uses 64 bit sizes. The 26K VCF has so much data that low-level tribble block indices were overflowing their int size values. This includes a to-be-committed tribble jar that fixes this problem -- See https://jira.broadinstitute.org/browse/GSA-652 -- Minor cleanup of error messages that were useful on the way to solving this monster problem	2012-11-02 09:09:59 -04:00
Ryan Poplin	386b45e94d	This VE eval module isn't useful anymore.	2012-11-01 15:44:41 -04:00
Mark DePristo	1444cd753b	Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome -- The logic for determining active regions was a bit broken in the HC when intervals were used in the system -- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR -- Non-overlapping intervals are always treated as separate objects for determining active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive -- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future. -- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC -- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions -- Remove excessively expensive calls to check bases are upper cased in ReferenceContext -- Update md5s after a lot of manual review and discussion with Ryan	2012-11-01 15:34:04 -04:00
Mark DePristo	9cd04c335c	Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data -- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases. This is now done by default, unless requested explicitly to preserve the original bases. -- This is really the correct place to do this for a variety of reasons. First, you don't need to worry about upper casing bases throughout the code. Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves. Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase() -- Added unit tests to ensure this functionality works correctly. -- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarantees that the reference bases are all upper case. -- Added contracts to ensure this is the case. -- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code	2012-11-01 15:34:03 -04:00
Guillermo del Angel	24e6da25cc	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-31 14:17:41 -04:00
Eric Banks	96344c6b62	Add note to realigner docs	2012-10-31 12:35:45 -04:00
Guillermo del Angel	4580e99c0c	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-31 10:50:54 -04:00
Guillermo del Angel	02b790c8db	Merge fix	2012-10-31 10:50:36 -04:00
Guillermo del Angel	51a9ce28e1	Merge remote-tracking branch 'unstable/master' into develop	2012-10-31 10:29:48 -04:00
Eric Banks	e1e480a0b9	Bug fix: don't add no-call alleles to the list of ALT alleles being validated.	2012-10-30 14:54:29 -04:00
Guillermo del Angel	c8e17a7adf	totally experimental UG feature, to be removed	2012-10-30 13:57:54 -04:00
Eric Banks	c95e893920	Better error message for unused ALT alleles	2012-10-29 21:51:35 -04:00
Eric Banks	b6a1967f12	Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it.	2012-10-29 21:47:09 -04:00
Eric Banks	be902375ac	'Bug' fix: fix the error message from the vcf validator so people realize that the file fails strict validation but still adheres to the spec.	2012-10-29 16:29:27 -04:00
Ryan Poplin	4e661847b2	DelocalizedBaseRecalibrator becomes the BaseRecalibrator.	2012-10-29 12:53:39 -04:00
Eric Banks	ac99437eec	Bug fixes to hapmap conversion in VariantsToVCF	2012-10-29 01:45:33 -04:00
Andrey Sivachenko	f3ac5d404d	updating vcf header attribute descriptions in order to reflect correctly what's actually being written...	2012-10-26 23:52:21 -04:00
Andrey Sivachenko	b4fbf6280a	fixing missing sample genotype bug, missing AD/DP bug, and putting annotations in more natural order (Ref/Alt)	2012-10-26 23:48:40 -04:00
Mark DePristo	ac5e58a265	Bugfix for GSA-540 / Update metadata maps when adding lines to VCFHeader -- https://jira.broadinstitute.org/browse/GSA-540 -- http://gatkforums.broadinstitute.org/discussion/1433/possible-bug-and-fix-in-java-code-of-vcfheader-org-broadinstitute-sting-utils-codecs-vcf-vcfheader	2012-10-26 16:34:16 -04:00
Mark DePristo	fa9b2a91d0	Bugfix for GSA-552 -- https://jira.broadinstitute.org/browse/GSA-552 -- User reports a null exception while using VariantsToVCF: http://gatkforums.broadinstitute.org/discussion/1461/nullpointerexception-converting-vcf3-to-vcf-using-variantstovcf The problem is that he left out an input VCF file for the --variant argument and the command-line argument parsing code didn't catch this, so we NPE out later on.	2012-10-26 16:34:16 -04:00
Eric Banks	f66d812778	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 13:20:41 -04:00
Eric Banks	a8704ca73f	Adding TODO notes for Ami	2012-10-26 13:20:27 -04:00
Mark DePristo	251983b8fb	Add GATK-wide command line argument to control the maximum runtime allowed for the GATK -- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker as though the end of processing had been reached. This means that all output files are closed properly, etc. -- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update -- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling? -- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.	2012-10-26 13:18:34 -04:00
Eric Banks	b06f689d4b	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 02:13:26 -04:00
Eric Banks	bf3d61ce82	The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it.	2012-10-26 01:04:51 -04:00
Eric Banks	91f2c847a3	Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test.	2012-10-26 00:57:40 -04:00
Mark DePristo	6b8b7df651	Queue now understands -nct and requests the appropriate number of cores from LSF, SGE, etc -- NCT wasn't previously recognized by Queue as needing more processors per machine. This commit fixes this. Also a potential cause of poor GATKPerformanceOverTime, in that runs with -nct could flood a node and cause it to have hundreds of cores in contention.	2012-10-25 17:26:58 -04:00
David Roazen	422e16c62e	BaseRecalibration: don't cache instances of ReadCovariates across reads Caching and reusing ReadCovariates instances across reads sounds good in theory, but: -it doesn't work unless you zero out the internal arrays before each read -the internal arrays must be sized proportionally to the maximum POSSIBLE recalibrated read length (5000!!!), instead of the ACTUAL read lengths By contrast, creating a new instance per read is basically equivalent to doing an efficient low-level memset-style clear on a much smaller array (since we use the actual rather than the maximum read length to create it). So this should be faster than caching instances and calling clear() but slower than caching instances and not calling clear(). Credit to Ryan for proposing this approach.	2012-10-25 17:02:55 -04:00
Guillermo del Angel	92fa7e953a	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 16:33:14 -04:00
Ami Levy Moonshine	dde3060bb8	add the CEUtrio best practices results (UG + PBT) to the bundle	2012-10-25 15:36:17 -04:00
Ami Levy Moonshine	90b9971033	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 15:32:29 -04:00
David Roazen	884d031e72	NestedIntegerArray: Pre-allocate only the first two dimensions It turns out that pre-allocating the entire tree was too expensive in terms of memory when using large values for the -mcs and -ics parameters. Pre-allocating the first two dimensions prevents us from ever locking the root node during a put(). Contention between threads over lower levels of the tree should be minimal given that puts are rare compared to gets. Also output dimensions and pre-allocation info at startup. If pre-allocation takes longer than usual this gives the user a sense of what is causing the delay.	2012-10-25 15:17:42 -04:00
Mark DePristo	cc8c12b954	Committing a broken version of BaseRecalibration -- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA. -- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails. But that's because of this incorrect calculation -- Updating GATKPerformanceOverTime with the new @ClassType annotation	2012-10-25 14:46:35 -04:00
Eric Banks	e93ff3ea6e	Let's go back to having the SB/SLOD NOT computed by default. If you recall, it was only enabled by default because we thought we were going to use it when we made VQSR use random forests. But since we decided not to change VQSR, there's no reason to triple the computation for every variant site anymore.	2012-10-25 12:45:23 -04:00
Guillermo del Angel	a838653822	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 10:35:58 -04:00
Guillermo del Angel	596c1723ae	Hidden, unsupported ability of VariantEval to run AlleleCount stratification on sites-only VCFs. I'll expose it/add tests on it if people think this is generally useful. User needs to specify total # of samples as command line argument since genotypes are not available. Also, fixes to large-scale validation script: lower -minIndelFrac threshold or else we'll kill most indels since default 0.25 is too high for pools, fix also VE stratifications and add one VE run where eval=1KG, comp=pool data and AC stratification based on 1KG annotation	2012-10-25 10:35:43 -04:00
Eric Banks	6dc7d872ec	Fix GenotypeAndValidate to handle SNPs and indels as reported on the forum. Recent changes to the UnifiedArgumentCollection made this stop working. Adding in JIRA to create integration tests for this tool.	2012-10-25 10:06:13 -04:00
Eric Banks	df9e0b7045	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 02:49:54 -04:00
Eric Banks	72714ee43e	Minor patches to get the contamination down-sampling working for indels. Adding @Hidden logging output for easy debugging.	2012-10-25 02:47:42 -04:00
Eric Banks	c6b57fffda	Added allele biased down-sampling capabilities to the PerReadAlleleLikelihoodMap object, which means that both the UG and HC can use this functionality. Note that it's only available in protected, so GATK-lite users won't be allowed to enable it. Needs more testing.	2012-10-24 22:52:25 -04:00
Ami Levy Moonshine	bcf3582095	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-24 21:50:41 -04:00
Eric Banks	9da7bbf689	Refactoring the PerReadAlleleLikelihoodMap in preparation for adding contamination downsampling into protected only.	2012-10-24 15:49:07 -04:00
David Roazen	d9aa9855f8	Better comments in NestedIntegerArray	2012-10-24 15:29:13 -04:00
David Roazen	02018ca764	Legacy BaseRecalibrator walker is neither TreeReducible nor NanoSchedulable The old BaseRecalibrator walker is not and never will be thread-safe, since it's a LocusWalker that uses read attributes to track state. ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1 for testing purposes, but wait for further testing to be done before using it for production purposes in multithreaded mode.	2012-10-24 15:22:50 -04:00

3232 Commits (0f5bb706ffd3e02a2fec45f32317f29e4e45cb6d)