11699 Commits (5003deafb69831c1f40eb7c83fd0ea496e1bd181)

Author SHA1 Message Date
Mauricio Carneiro 5003deafb6 Fixing split-reads unit tests
The new implementation calls for the number of bases to chop, not the chop index anymore, so 0 is no longer appropriate.
2013-01-27 23:38:46 -05:00
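A minimal sketch of the count-based interface this commit describes, assuming a read is just a string of bases. The name `chop` and its signature are hypothetical and are not taken from SplitReads itself.

```python
def chop(bases: str, n_bases_to_chop: int) -> str:
    """Remove n_bases_to_chop bases from the right end of a read.

    Hypothetical helper: the argument is a count of bases to remove,
    not the index at which the read is cut.
    """
    if n_bases_to_chop < 0 or n_bases_to_chop > len(bases):
        raise ValueError("n_bases_to_chop must be between 0 and the read length")
    # Keep everything except the last n_bases_to_chop bases.
    return bases[: len(bases) - n_bases_to_chop]
```

Under this reading, `chop("ACGTACGT", 3)` keeps the first five bases, while an argument of 0 leaves the read untouched and so would not exercise the chop at all.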
Mauricio Carneiro 1aee8f205e Tool to calculate per base coverage distribution
GSATDG-29 #resolve
2013-01-27 23:38:46 -05:00
Mark DePristo 63913d516f Add join call to Progress meter unit test so we actually know the daemon thread has finished 2013-01-27 16:52:45 -05:00
Mark DePristo f5473285d5 Update CountReadsInActiveRegions md5 2013-01-27 14:35:55 -05:00
Mark DePristo 14d8afe413 Remove startSearchAt state variable from ActivityProfile
-- New algorithm will only try to create an active region if there are at least maxRegionSize + propagation distance states in the list. When that's true, we are guaranteed to actually find a region. So this algorithm is not only truly correct but also super fast, as we only ever do the search for the end of the region when we will certainly find one, and actually generate a region.
2013-01-27 14:10:08 -05:00
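The guard described in this commit can be sketched as a single comparison. The function and parameter names below are hypothetical, and the sketch ignores any special handling at the end of a contig.

```python
def can_pop_region(states, max_region_size, max_prob_propagation_distance):
    """Return True when enough states are buffered to guarantee a region.

    Hypothetical sketch: the search for the end of a region is attempted
    only once at least max_region_size + max_prob_propagation_distance
    states have been examined.
    """
    return len(states) >= max_region_size + max_prob_propagation_distance
```

With a maximum region size of 300 and a propagation distance of 50 (illustrative values), 349 buffered states are not enough and 350 are.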
Mark DePristo c97a361b5d Added realistic BandPassFilterUnitTest that ensures quality results for 1000G phase I VCF and NA12878 VCF
-- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions.  This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined.  This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile
-- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with
2013-01-27 14:10:08 -05:00
Mark DePristo 72b2e77eed Linearize the findEndOfRegion algorithm in ActivityProfile, radically improving its performance
-- Previous algorithm was O(N^2)
-- #resolve GSA-723 https://jira.broadinstitute.org/browse/GSA-723
2013-01-27 14:10:06 -05:00
Mark DePristo 0fb238b61e TraverseActiveRegions Optimizations and Bugfixes: make sure to record position of current locus to discharge active regions when there's no data
-- Now records the position of the current locus, as well as that of the last read. Necessary when passing through regions with no reads. The previous version would keep accumulating empty active regions, and never discharge them until end of traversal (if there were no reads in the future) or until a read was finally found
-- Protected a call to logger.debug with if ( logger.isDebugEnabled()) to avoid a lot of overhead in writing unseen debug logging information
2013-01-27 14:10:06 -05:00
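The logging guard in this commit is the log4j idiom `logger.isDebugEnabled()`. A rough Python analogue uses `Logger.isEnabledFor`. The logger name and helper functions here are hypothetical, and the `counter` argument exists only to make the saved work visible.

```python
import logging

logger = logging.getLogger("active_region_sketch")  # hypothetical logger name


def describe_regions(regions, counter):
    """Stand-in for an expensive message builder; counts how often it runs."""
    counter["calls"] += 1
    return ", ".join(f"{start}-{stop}" for start, stop in regions)


def log_pending_regions(regions, counter):
    # Build the message only when DEBUG output would actually be emitted.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("pending regions: %s", describe_regions(regions, counter))
```

When the logger level is above DEBUG, `describe_regions` is never called, which is the overhead the guard avoids.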
Mark DePristo 804caf7a45 HaplotypeCaller Optimization: return an inactive (p = 0.0) activity if the context has no bases in the pileup
-- Allows us to avoid doing a lot of misc. work to set up the genotype when we don't have any data to genotype.  Valuable in the case where we are passing through large regions without any data
2013-01-27 14:10:06 -05:00
Mark DePristo 93d88cdc68 Optimization: LocusReferenceView now passes along the contig index to createGenomeLoc, speeding up their creation
-- Also cleaned up some unused methods
2013-01-27 14:10:06 -05:00
Mark DePristo 52a28968a9 ART optimization: BandPassActivityProfile only applies the gaussian filter if the state probability > 0 2013-01-27 14:10:06 -05:00
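A sketch of the shortcut in this commit, assuming the profile is a plain list of probabilities and the kernel is a symmetric list of weights. `add_state` is a hypothetical name, not the BandPassActivityProfile API.

```python
def add_state(profile, position, prob, kernel):
    """Spread prob over the neighbours of position using kernel weights.

    Hypothetical sketch: returns the number of profile entries updated.
    A state with probability 0 contributes nothing, so the kernel loop
    is skipped entirely.
    """
    if prob <= 0.0:
        return 0
    half = len(kernel) // 2
    touched = 0
    for offset, weight in enumerate(kernel):
        target = position + offset - half
        if 0 <= target < len(profile):
            profile[target] += prob * weight
            touched += 1
    return touched
```

In regions without data most states have probability 0, so skipping them removes one full kernel pass per state.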
Mauricio Carneiro 705cccaf63 Making SplitReads output FastQ's instead of BAM
- eliminates one step in my pipeline
   - BAM is too finicky and maintaining parameters that wouldn't be useful was becoming a headache, better avoided.
2013-01-27 02:36:31 -05:00
Mauricio Carneiro ae38cf3f72 Adding read directionality to SplitReads
- directionality only influences 'chop' operation (since split will maintain all bases of the original read)
   - added directional unit test

GSATDG-25 #resolved
2013-01-26 22:25:56 -05:00
Mauricio Carneiro 6ea7133d95 Updating licenses of latest moved files 2013-01-26 13:46:52 -05:00
Mauricio Carneiro ef4cc742e5 Fixing the licensing scripts
- Fixed shell glob limitation that was failing license updates on big commits
	- Hook will now force user to re-commit after updating the licenses (pre-commit hook can't update and commit in the same process)
	- Moved all scripts to bash/zsh
	- Separated the license utilities in a separate python module to avoid copying code

GSATDG-28 #resolve
2013-01-26 13:42:49 -05:00
Mauricio Carneiro e7c9e3639e Making metrics a required parameter in MarkDuplicates
As requested by user (forum)
2013-01-25 17:49:49 -05:00
Ami Levy-Moonshine 99cb8d68e9 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-25 16:07:38 -05:00
Mark DePristo b8c0b05785 Add contract to ensure that getAdapterBoundary returns the right result
-- Also renamed the function to getAdaptorBoundary for consistency across the codebase
2013-01-25 16:05:17 -05:00
Mark DePristo e445c71161 LIBS optimization for adapter clipping
-- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS
-- Added unittests to cover adapter clipping
2013-01-25 16:05:17 -05:00
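The caching described in this commit can be sketched as a lazily computed field. `ReadSketch` and its constructor argument are hypothetical stand-ins for GATKSAMRecord and the real boundary computation.

```python
class ReadSketch:
    """Hypothetical read record that caches its adaptor boundary."""

    _UNSET = object()  # sentinel, so a computed value of None is cached too

    def __init__(self, compute_boundary):
        self._compute_boundary = compute_boundary
        self._adaptor_boundary = ReadSketch._UNSET

    def get_adaptor_boundary(self):
        # Compute on first use, then reuse the stored result.
        if self._adaptor_boundary is ReadSketch._UNSET:
            self._adaptor_boundary = self._compute_boundary()
        return self._adaptor_boundary
```

A cache like this is only safe if it is reset whenever the fields the boundary depends on change. The sketch assumes the read is not modified after the first call.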
Ami Levy-Moonshine f50db01742 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-25 15:55:56 -05:00
Ami Levy-Moonshine b4447cdca2 In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the sample names are unique in VariantContextUtils.simpleMerge for each set of VCs. This caused a bug that was reported on the forum (when a set of VCs had 2 VCs from the same sample).
Now we will check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unitTests in CVunitTest.java.
2013-01-25 15:49:51 -05:00
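A sketch of a header-level uniqueness check of the kind this commit moves into CombineVariants.init, assuming each input header is reduced to a list of sample names. `duplicated_samples` is a hypothetical name and is not the function added to SamplesUtils.

```python
from collections import Counter


def duplicated_samples(headers):
    """Return the sample names that occur more than once across headers.

    Hypothetical sketch: headers is a list of per-input sample-name lists.
    """
    counts = Counter(name for samples in headers for name in samples)
    return sorted(name for name, n in counts.items() if n > 1)
```

Running the check once on the headers, rather than on every merged record, means that several records from one sample within a single input no longer look like a uniqueness violation.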
Khalid Shakir c58e02a3bd Added a QFunction.jobLocalDir for optionally tracking a node local directory that may have faster intermediate storage, with SGF ensuring that, if the directory happens to be on the same machine, it gets a clone-specific sub-directory to avoid collisions. 2013-01-25 14:28:04 -05:00
Ami Levy-Moonshine fc22a5c71c Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-25 11:47:38 -05:00
Ami Levy-Moonshine eaf6279d48 adding RBP to the general calling pipeline and a few other small changes to it (to make it run with the current bundle file names) 2013-01-25 11:47:30 -05:00
Mark DePristo 3f95f39be3 Updating HC md5s for new cutting algorithm and default band pass filter parameters 2013-01-25 11:07:29 -05:00
Mark DePristo 008b617577 Cleanup the getLIBS function in LocusIterator
-- Now throws an UnsupportedOperationException in the base class.  Only LocusView implements this function and actually returns the LIBS
2013-01-25 11:07:28 -05:00
Eric Banks f7b80116d6 Don't let users play with the different exact model implementations. 2013-01-25 10:52:02 -05:00
Eric Banks 6dd0e1ddd6 Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants.
This allows us to move SelectVariants into the public suite of tools now.
2013-01-25 09:42:04 -05:00
Mark DePristo c7a29b1d39 Fixed NPE in ActiveRegionUnitTest by allowing null supporting states in ActiveRegion 2013-01-24 13:48:00 -05:00
Mark DePristo 592f90aaef ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
2013-01-24 13:48:00 -05:00
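One way to read "cut at the best local minimum" is sketched below, assuming the profile is a list of per-position probabilities for an over-long active stretch. The function name, the tie handling, and the fallback to the maximum size are all assumptions, not details confirmed by the commit.

```python
def find_best_cut_site(probs, min_region_size, max_region_size):
    """Pick a cut index inside an over-long active stretch.

    Hypothetical sketch: among positions from min_region_size up to the
    maximum size, choose the local minimum with the lowest probability.
    If no local minimum exists, fall back to cutting at the maximum size.
    """
    end = min(max_region_size, len(probs)) - 1
    best_index, best_prob = None, float("inf")
    # Visit interior positions only, so both neighbours always exist.
    for i in range(max(1, min_region_size), end):
        is_local_min = probs[i] <= probs[i - 1] and probs[i] <= probs[i + 1]
        if is_local_min and probs[i] < best_prob:
            best_index, best_prob = i, probs[i]
    return best_index if best_index is not None else end
```

Cutting at a low point of the profile keeps dense clusters of variant events together instead of splitting them at an arbitrary maximum-size boundary.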
Mark DePristo c96b64973a Soft clip probability propagation is capped by the MAX_PROB_PROPAGATION_DISTANCE, which is 50 bp 2013-01-24 13:48:00 -05:00
Mark DePristo 0c94e3d96e Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp
-- Previously we allowed band pass filter size to be specified along with the sigma.  But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel.  This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
2013-01-24 13:47:59 -05:00
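The adaptive sizing described in this commit can be sketched as follows. The 50 bp cap and the 1e-5 threshold come from the commit message. Using the unnormalized Gaussian density as the kernel value, and all function names, are assumptions.

```python
import math

MAX_FILTER_SIZE = 50     # bp, maximum band size from the commit message
MIN_PROB_TO_KEEP = 1e-5  # kernel points must exceed this to be kept


def gaussian_density(x, sigma):
    """Density of a zero-mean normal distribution at x."""
    return math.exp(-(x * x) / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def band_size_for_sigma(sigma, max_size=MAX_FILTER_SIZE, min_prob=MIN_PROB_TO_KEEP):
    """Number of positions kept on each side of the kernel centre.

    Hypothetical sketch: grow the band while the next kernel point is
    still above min_prob, stopping at max_size.
    """
    size = 0
    while size < max_size and gaussian_density(size + 1, sigma) > min_prob:
        size += 1
    return size
```

Under these assumptions a sigma of 17 reaches the 50 bp cap, while a sigma of 1 gives a band of only 4 positions per side, which matches the commit's point that a smaller kernel yields a smaller band and faster traversal.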
Mark DePristo 9e43a2028d Making band pass filter size, sigma, active region max size and extension all accessible from the command line 2013-01-24 13:47:59 -05:00
Mark DePristo cd91e365f4 Optimize getCurrentContigLength and getLocForOffset in ActivityProfile 2013-01-24 13:47:59 -05:00
Eric Banks 26ef400f85 More reviews 2013-01-24 13:20:12 -05:00
Eric Banks 6790e103e0 Moving lots of walkers back from protected to public (along with several of the VA annotations).
Let's see whether Mauricio's automatic git hook really works!
2013-01-24 11:42:49 -05:00
Mauricio Carneiro 9e003b3296 more updates to the licensing scripts 2013-01-24 00:04:27 -07:00
Mauricio Carneiro e1c1a4de4c Moving licensing scripts to bash instead of tcsh 2013-01-23 22:59:44 -07:00
Mauricio Carneiro 42b056e8ea Forgot the unit test. 2013-01-23 21:18:27 -07:00
Mauricio Carneiro 36c7c418e6 Adding the licenses to the files 2013-01-23 21:15:06 -07:00
Mauricio Carneiro 243fcde840 Adding license to SplitReads
I got caught !
2013-01-23 21:12:36 -07:00
Mauricio Carneiro 643a508564 Added atlassian intellij plugin file to .gitignore 2013-01-23 20:55:28 -07:00
Mauricio Carneiro a4fbf9df1e SplitReads walker implementation (for AGBT talk)
- walker simulates sequencing with different lengths to evaluate mapping/alignment biases relative to read length
   - split : splits reads n-ways generating 2^n reads for each read of the same length.
   - chop : chops the right end tail of the read creating 1 smaller read as if the sequencer stopped short.
   - mate information is preserved for chopped reads, and re-indexed for split reads so that each split still points at the corresponding split on the mate.
   - added systematic unit tests

GSATDG-23
2013-01-23 20:55:28 -07:00
Chris Hartl a3b98daf1a Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-23 14:49:34 -05:00
Chris Hartl 7fcfa4668c Since GenotypeConcordance is now a standalone walker, remove the old GenotypeConcordance evaluation module and the associated integration tests. 2013-01-23 14:47:23 -05:00
Mauricio Carneiro fc54a5da55 Adding the new bash script
GSATDG-9
2013-01-23 12:14:34 -07:00
Mauricio Carneiro 6588b4bacd tcsh -> bash
David is convinced that the error is because I'm using tcsh instead of bash. Let's see if he's right :-)

GSATDG-9
2013-01-23 12:10:34 -07:00
Mauricio Carneiro 8e8993da27 oops... forgot to change sys.argv to filename
GSATDG-9
2013-01-23 12:01:06 -07:00
Mauricio Carneiro 820bec5572 Dropping xargs
- continuing the effort to reduce blob size

GSATDG-9
2013-01-23 11:54:20 -07:00
Mark DePristo ee8039bf25 Fix trivial call in unit test 2013-01-23 13:51:58 -05:00