gatk-3.8

Commit Graph

Author	SHA1	Message	Date
carneiro	89bb21d024	typo in the argument description git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5587 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 19:45:32 +00:00
rpoplin	3f3f35dea0	UnifiedGenotyper now BAQs via ADD_TAG to facilitate using BAQed quals for GL calculations but unBAQed quals for annotation calculations. UnifiedGenotyper now produces SNP and indel calls simultaneously. 40 base mismatch intrinsic filter removed from UG to greatly simplify the code. RankSumTests are now standard annotations but the integration tests are commented out pending changes that will allow random annotations to work. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5585 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 19:06:24 +00:00
ebanks	1aa4083352	Fortunately this code isn't used by anyone right now, but it needs to be fixed before someone unwittingly does: flags were wrong according to the SAM spec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5584 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 17:16:41 +00:00
hanna	b231a40da5	Augment PrintLocusContextWalker with extended event info. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5583 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 13:42:48 +00:00
aaron	ab5c4064ed	quick bug fix for variant context utils: only calculate the max AC if we're using the mergeInfoWithMaxAC flag, and if so deal with sites that have multiple alternate alleles correctly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5582 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 05:36:52 +00:00
rpoplin	cc713f2769	fixing exception text git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5581 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 00:29:13 +00:00
ebanks	4b451314b2	Only store a read in the mate hash if it could possibly be moved. This reduces memory consumption especially when dealing with a case of tons of unmapped reads at the end of the bam; however, it's only mildly helpful for chr1 of the Papuans (there's a truly massive pileup 120Mb into it; more thought needed at a later point). Integration tests changed only because some of the reads in the original bam were busted to begin with (it's an old pilot 1000G bam). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5580 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 22:20:09 +00:00
chartl	79b5fa6cc5	Structural refactoring in advance of dichotomization statistics; generalization of statistical test infrastructure. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5579 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 18:52:32 +00:00
asivache	77ca4eef31	IntelliJ complains that @Override is not allowed when implementing interface methods. Whatever. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5578 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 16:57:59 +00:00
ebanks	f4c06bb4ce	Traversal now says 'done with mapped reads' instead of 'done' so we don't confuse users when there are a lot of unmapped reads left to process. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5577 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 15:11:28 +00:00
fromer	5eccc7e528	Added annotation of INCORRECT SNP-based aa annotations in case of MNPdependentAA:true git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5576 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 02:46:45 +00:00
chartl	a0d096c993	Forgot an import statement git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5574 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 22:55:00 +00:00
chartl	b52c3e7e30	Make the window and slide-by values command-line accessible, and standardize for every context. Move the test classes (which are abstract association context modules) into the proper directory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5573 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 22:37:12 +00:00
droazen	a5acb0b7a6	Fix for bug GSA-314: Detect -XL and -L incompatibility. An ArgumentException is now thrown if the combination of -L and -XL intervals specified on the command line results in an empty interval set after set subtraction. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5571 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 18:41:55 +00:00
carneiro	b722ebf244	quick help/comments updates to match the wikipage. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5569 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 12:55:55 +00:00
rpoplin	96f0f0d706	Fixing use of String != String git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5568 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 01:12:00 +00:00
rpoplin	b2a0331e2d	Pushing hard coded arguments into VariantRecalibratorArgumentCollection git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5566 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 19:55:09 +00:00
rpoplin	79c43845ad	Changing Uniform approximation to Normal approximation in rank sum test. n factorial was overflowing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5565 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 18:18:39 +00:00
depristo	b316c9a590	Renamed StratifyAlignmentContext to AlignmentContextUtils, and StatiefyContextType to ReadOrientation. Also, went through the system and deleted all references to second bases. That ship passed long ago. This was the actual commit, the last was an intellij error git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5564 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 15:36:17 +00:00
depristo	5cca100aea	Eliminated the redundant StratifiedAlignmentContext, which previously just held a ReadBackedPileup, and made all of the class methods here just static functions. Far more logical organization, and avoided O(N) endless copying of data for the COMPLETE context. Many tools have been trivially reorganized to take an alignment context now. Everything passes integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5562 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 14:20:43 +00:00
rpoplin	98798eb276	Adding ReadPos rank sum test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5560 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 22:28:41 +00:00
rpoplin	09e89c8c97	Adding ReadPos rank sum test. Transitioned rank sum tests over to using Chris's implementation in order to harmonize the codebase. There isn't any reason to have competing implementations of rank sum. Thanks to Chris for adding the necessary hypothesis testing options. WilcoxonRankSum.java will be deleted soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5559 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 22:26:35 +00:00
depristo	11822da578	Stand alone, GATK dependent tool that Reads a list of BAM files and slices all of them into a single merged BAM file containing reads in overlapping chr:start-stop interval. Highly efficient when working with thousands of BAM files. Can merge 1MB of sequence of 1600 4x BAMs in 4g in only 2 hours. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5558 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 13:41:29 +00:00
fromer	27bfec785e	Some walkers for printing FASTA of reference for bed ROD, and "inverting" a bed file (finding regions not covered in bed) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5554 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 21:13:51 +00:00
droazen	0927b7c297	Fix for bug GSA-441: BAM file list with blank lines gives a confusing error message. Lines containing only whitespace in .list files are now ignored. Also added support for comments in .list files: lines whose first non-whitespace character is '#' are now also ignored. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5550 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 15:04:35 +00:00
kshakir	4f8411f4b5	Revved Picard to access new flag to disable mmap for bam indices. Only added a 3% speed boost but the mmap was added to the heap count, making it harder to specify/restrict the total resident memory size in LSF. Specifying -Xmx4g will now stay much closer to 4g resident memory usage versus bumping up to 9g when accessing 900 x ~8Mb bai's. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5549 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 01:40:41 +00:00
asivache	df53351b0f	Get rid of score cutoff at 0 in the alignment matrix (i.e. score[cell] = max(0, score[from_parent_cells]). Use the computed score as is. Technically, it's pretty much NW now, not SW. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5548 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 00:11:04 +00:00
carneiro	0a772688fe	implementation of the Gatherer class for CountCovariates, which makes it now scatter/gatherable. Kudos to the @Gather annotation Khalid just introduced! QuickCCTest is my test script for the gatherer. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5547 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 21:15:21 +00:00
carneiro	dac1309dbd	Added two modes for selecting variants at random (random sampling). -number N -- generates a VCF with exactly N randomly chosen variants with equal probability. -fraction F -- generates a VCF with approximately F (between 0-1) randomly chosen variants with equal probability. (Similar behavior to RandomlySplitVariants walker). The reason for two modes is that the first one may need a lot of memory if your sample size is too large. The wiki is being updated with this information now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5545 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 21:12:40 +00:00
carneiro	8a3b7d88aa	It was returning 1 when it should return 0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5544 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 20:50:38 +00:00
depristo	c7445a6fbd	Now that logging is so standard, only prints messages about logging to DEBUG. Also, found a way to silence the mime.types warning, that doesn't matter at all to us. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5543 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 16:49:39 +00:00
droazen	7b452ea2b9	Fix for bug GSA-430: Can't specify same BAM file twice on the command line. An ArgumentException with an appropriate error message and a list of the duplicate BAMs is now thrown in this case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5542 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 22:23:24 +00:00
hanna	deab9f0aa5	Initial work on proto-shard merger: - create size() method that returns an approximation of the uncompressed size in bytes of BAM span. I'll use this method as a protoshard weighting function until we determine how to normalize the weights across the different data access mechanisms (reads, reference, RODs). - Implementations of basic union/intersection/subtraction mechanisms for BAM spans; should be enough to get an accurate weight for two proto-shards put together. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5541 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 22:03:43 +00:00
chartl	328f89f66a	Minor changes to MannWhitneyU: - Comment fixes to better explain why two-sided test wants to use the LOWER (not higher) value for U - Much more direct testing of MWU functions - Uniform approximation was always using the < cumulant (sometimes the > cumulant should be used instead) - Uniform approximation currently not used (regime in which it was being used was not the right one -- not necessarily bad, but not an improvement over normal) + this particular approximation is for major imbalances of the form m >> n. Code may be altered in the future to use this method for this particular regime, if the method's not too slow. - Hook into one-sided test. RegionalAssociationRecalibrator: NaNs were being caused by presence of Infinity and -Infinity values out of the walker. Currently I'm just re-setting them to arbitrary post-whitened values, but the walker will be changed to prevent output of these values, and the "fix" will be undone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5539 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 17:03:02 +00:00
chartl	fff11a3279	No more pesky NaNs for norms ( HINT::: ((double) x) == Double.NaN is NOT (somehow) the same as Double.compare(x,Double.NaN) == 0). Effectively reverse sorting by changing (rank/size) to ((size-rank)/size). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5538 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 22:43:24 +00:00
carneiro	5d26c66769	Count Covariates is almost scatter-gatherable now! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5537 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 22:25:33 +00:00
rpoplin	5ddc0e464a	Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 21:04:09 +00:00
carneiro	0f4ace0902	fixed a bug when the concordance track doesn't have the sample in the variant track. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5535 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 18:24:19 +00:00
chartl	f6dfdc7f3b	Single-tailed hypothesis testing in MWU git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5533 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 15:53:40 +00:00
hanna	8ae14793f2	Small standalone utility to aggregate BGZF block statistics in a BAM file. Works in the same coordinate space as BAM chunks, so this will be used to calibrate chunk weighting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5531 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-28 22:25:45 +00:00
chartl	f3e4c24f63	Framework works properly now, but whitening still has a kink which is that the covariance matrix gets re-sorted automatically by the eigendecomposition, so somehow the association between eigenvalue and dimension (e.g. association track) needs to be maintained throughout. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5530 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-28 22:22:37 +00:00
chartl	4c04c5a47a	Addition of a BedTableCodec to allow for parsing of Bed-formatted tables (e.g. bedGraphs). Fixes for the recalibrator. Implementation of the data whitening input. Some TODOs in the RAW. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5529 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-28 21:35:09 +00:00
corin	f2d84bf746	Changes the validity declaration from a true to false to a five point scale git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5527 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-28 18:31:53 +00:00
depristo	cd8321cdc9	Removed the completely unused generic but extremely expensive infrastructure for dynamic LocusIteratorFilters. Now the one, and probably only useful one, is called directly in the LocusIteratorByState itself to filter adaptor bases from reads. This shaves 10% off the runtime of all walkers, apparently. Has the additional benefit of eliminating a lot of complex infrastructure that resulted ultimately in only a single function call. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5525 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-27 20:48:24 +00:00
depristo	231d095316	A clean, fast way to compute fragment pileups. Now consumes no CPU time at all. Ready for general use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5524 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-27 14:26:29 +00:00
depristo	6a1d12cf7b	Intermediate commit refactoring FragmentPileup to (1) make it more accessible (now in utils.pileup) as well as (2) improve performance. Passes all integration tests now. Upcoming refactoring will change further how the system can be accessed, and further improve performance. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5522 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-27 12:42:22 +00:00
depristo	3bcd4c5d75	--simplifyBAM is now in the SAMFileWriterArgumentTypeDescriptor, as suggested by map. PrintReads has an integrationtest now that writes out a 1 MB bit of HiSeq normally, with compress 0, and with simplifyBAM on. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5521 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 14:57:18 +00:00
hanna	28ae53d796	Merging the best parts of Mark's fix for the O(n^2) algorithm and my concurrently-written fix for the same. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5520 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 13:32:23 +00:00
depristo	d8fbda17ab	O(N^2) bug found and removed -- very subtle and hard to find. ArrayLists underlying read backed pileups were being initialized with size() from the entire pileup up all samples, not the sample-specific sizes. So in 1000 samples at 4x, we were creating 1000 x 4000 element array lists, instead of 1000 x 4x element array lists. This fix results in a 2-3x speedup for 900 sample calling, and moves UG.map() back into the main CPU cost of UG with many samples. 900 samples in a single BAM: Release: 64.29 With sample-specific size: 24s - 35s git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5519 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 12:38:19 +00:00
depristo	27c8fb1e4d	Added support for a general GATK option --simplifyBAM to automatically remove and simplify kept reads in an output BAM file. Specifically, duplicate, non-PF, and unmapped reads are removed, and all extended tags in the retained SAM records are removed except the RG:Z tag. This option is very useful when creating temporary BAM files (merged per-population or multi-sample cleaned) for future calling (as in the 1000G processing pipeline). Results in a significant reduction in space of the resulting BAM, faster reading of the BAM, and surprisingly even faster UG performance: 1-10mb of chromosome one, from NA12878 HiSeq 64x data set on hg18: Full BAM Write time: 8.6 m Size: 866M CountReads time: 2.9 m UG time: 11.3 m Simplified BAM: Write time: 6.2 Size: 458M CountReads time: 85.7 s UG time: 10.1 m git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5517 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 01:21:35 +00:00
kshakir	fc8acd503e	Enabled the parameterize option for debugging PipelineTest MD5s. Fixed escaping expressions that have more than one space between arguments. Updated example to match the wiki. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5516 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 00:41:47 +00:00
chartl	fe7f45ee2e	First pass at recalibrating associations, with optional data whitening. Modification to the TableCodec so it can natively read bedgraph files (just needed to add an extra header marker: "track"). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5515 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 19:35:39 +00:00
hanna	ac39f5532e	Turn off index caching. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5514 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 18:48:23 +00:00
hanna	8d8aed6a67	Fix correctness issue when dynamically merging many files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5512 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 16:35:43 +00:00
delangel	c9283e6bc5	Refinement to previous commit: no need to duplicate code to annotate rsID since variantAnnotatorEngine is called from UG anyways. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5511 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 15:00:32 +00:00
delangel	3383733379	Same commit as previous one for VariantAnnotator. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5510 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 12:07:18 +00:00
delangel	8701dfe8d3	Hideous, horrible, hairy mutant bug: when we annotate ID field in indels, we were looking for SNP records matching the position, instead of indel records. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5509 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 12:04:08 +00:00
kshakir	3e3ff4a9e7	Bam gathering passes on the compression_level and the create_index flag to MergeSamFiles. VCF gathering passes on the no_header and sites_only flags to CombineVariants. Fixed deletion of gathered log files. Although they are intermediate and do not need to be re-run if not present, they should not be deleted. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5508 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 03:58:38 +00:00
carneiro	47279ee56e	Added --concordance option that outputs the intersection between two VCF files. Useful to see what calls were made in both technologies/algorithms. Wiki has been updated accordingly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5507 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-24 21:27:16 +00:00
kshakir	e47513f043	Minor updates to match the wiki documentation. Upper cased the PartitionType enum values. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5506 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-24 20:22:23 +00:00
kshakir	f3e94ef2be	Walkers can now specify a class extending from Gatherer to merge custom output formats. Add @Gather(MyGatherer.class) to the walker @Output. JavaCommandLineFunctions can now specify the classpath+mainclass as an alternative to specifying a path to an executable jar. JCLF by default pass on the current classpath and only require the mainclass be specified by the developer extending the JCLF, relieving the QScript author from having to explicitly specify the jar. Like the Picard MergeSamFiles, GATK engine by default is now run from the current classpath. The GATK can still be overridden via .jarFile or .javaClasspath. Walkers from the GATK package are now also embedded into the Queue package. Updated AnalyzeCovariates to make it easier to guess the main class, AnalyzeCovariates instead of AnalyzeCovariatesCLP. Removed the GATK jar argument from the example QScripts. Removed one of the most FAQ when getting started with Scala/Queue, the use of Option[_] in QScripts: 1) Fixed mistaken assumption with java enums. In java enums can be null so they don't need nullable wrappers. 2) Added syntactic sugar for Nullable primitives to the QScript trait. Any variable defined as Option[Int] can just be assigned an Int value or None, ex: myFunc.memoryLimit = 3 Removed other unused code. Re-fixed dry run function ordering. Re-ordered the QCommandline companion object so that IntelliJ doesn't complain about missing main methods. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5504 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-24 14:03:51 +00:00
ebanks	18271aa1f4	It never fails to amaze me that aligners can find so many different ways to place indels off the ends of contigs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5503 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-24 04:17:23 +00:00
ebanks	48b15d42e0	More fixes and improvements. We no longer use any bases under Q20 because random ~Q5s were cluttering the graphs; instead we grab any contiguous segments of size at least MIN_SEQUENCE_LENGTH where all bases are above Q20. Also, I implemented a quick algorithm to traverse the graph (using DFS) to choose the two best scoring paths (haplotypes). Used it successfully at NA12878 HM3 SNP sites to determine whether they are homozygous (no distinction yet between ref and alt) or heterozygous! Indels are the next target. Still have some issues to work out. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5502 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-24 03:51:19 +00:00
hanna	26e3bea76e	Fix for == used to test object equality. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5499 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-23 18:15:19 +00:00
ebanks	401d1cb97f	Bug fixes plus some debugging code added. Broke out DeBruijnVertex into its own class so that the interface is now cleaner. Still very much a work in progress. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5498 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-23 17:35:34 +00:00
hanna	37fbf17da8	Finally restored code after accidentally removing three days worth of work: schedule file infrastructure has been restored, and is now a single file. Only the exact bins required for the traversal are stored in the schedule. Very close to being able to merge schedule entries. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5497 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-23 05:52:40 +00:00
ebanks	ded80e0c57	Trivial change to remove space at the end of the description git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5495 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-23 01:47:46 +00:00
carneiro	3414bccb46	documentation changes to agree with the wiki git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5494 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 21:48:49 +00:00
carneiro	28149e5c5e	GenotypeAndValidate version 2, ready to be used. - now it differentiates between confident REF calls and not confident calls. - you can now use a BAM file as the truth set. - output is much clearer now dataProcessingPipeline version 2, ready to be used. - All the processing is now done at the sample level - Reads the input bam file headers to combine all lanes of the same sample. - Cleaning is now scattered/gathered. Intelligently breaks down in as many intervals as possible, given the dataset. - Outputs one processed bam file per sample (and a .list file with all processed files listed) - Much faster, low pass (read Papuans) can run in the hour queue. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 20:18:02 +00:00
chartl	687b2e51b4	Switch from togglable wiggle output to togglable bedgraph format. Can be pulled directly into IGV to show the statistics values. I'll need to bug jim to allow value-toggling in a bedgraph, currently 2nd and 3rd columns are just ignored. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5492 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 17:58:53 +00:00
chartl	5a79f16ea4	Fixed an edge case where an exception was thrown if either of the sets was empty for the MWU test. Also altered the output format so U itself is not printed (which though interesting, isn't so useful for recalibration), but rather a value I call V (really the deviation of U from its expectation). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5490 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 16:28:44 +00:00
ebanks	af7f78e8ba	Minor debugging output change. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5488 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 12:59:26 +00:00
ebanks	b463faad92	Fixing typo git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5487 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 03:57:11 +00:00
ebanks	1a9e65bcd4	Updating other walkers now that VCC extends from VC git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5486 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 03:10:40 +00:00
ebanks	0ee687e49d	For Mauricio: now, even in GENOTYPE_GIVEN_ALLELES mode, the VariantCallContext (which now inherits directly from VC) will report reference calls as confidently called if they pass the threshold even if the QUAL of the record itself is low because we were forced to have an ALT allele. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5485 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 02:42:28 +00:00
ebanks	ab6a815184	As per the comments in the commit itself: when reads get mapped to the junction of two chromosomes (e.g. MT since it is actually circular DNA), their unmapped bit is set, but they are given legitimate coordinates. The Picard code will come in and move the read all the way back to its mate - which can be arbitrarily far away and cause records to be written out of order. Very evil. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5484 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-21 20:30:24 +00:00
ebanks	d9202f2764	Don't try to create a GenomeLoc from an unmapped read git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5480 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-21 13:46:55 +00:00
ebanks	1c95208e26	Finally found the bug that everyone is reporting on GS. Iterators on PriorityQueues aren't guaranteed to return elements in sorted order (a pretty stupid contract) - so we were passing items to the constrained writer out of order. Just do a Collections.sort instead (1 line of code). Happy father's day! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5476 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 21:28:19 +00:00
ebanks	9568c84af9	Don't output these messages in INFO mode because they are scaring people unnecessarily git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5475 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 19:55:22 +00:00
depristo	22ff2573d5	Removed MAG entirely git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5474 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 19:43:23 +00:00
kiran	55897631ad	Initial attempt at identifying potentially interesting variants in a Mendelian disease context when the called genotypes are uncertain. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5473 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 19:41:35 +00:00
kshakir	b2b8a4f19f	Re-un-final'ed BAQ.MAG as it was pre r5469. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5472 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 19:40:31 +00:00
asivache	1d5326ff0c	Minor fixes to the cmd-line help messages git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5470 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 18:18:04 +00:00
depristo	7857cb5a22	Waiting to go to the hospital -- fixed a bug in the BAQ calculation where the BAQ would NPE if a read had no usable bases (all clipped, for example) but didn't fail the PF filter git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5469 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 17:45:21 +00:00
fromer	e84a27ceea	OverlapWithBedInIntervalWalker calculates the average per-input-interval coverage by the BED intervals track git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5468 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 17:44:46 +00:00
depristo	abc7d1aef9	BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site. ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation. Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field. This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples). Added a new VariantsToBeagleUnphased walker that writes out a marker drive hard-call unphased genotypes file suitable for imputing missing genotypes with a reference panel with beagle. Can optionally keep back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power. The bootstrap sites can be written to a separate VCF for assessment as well. Finally, my general Queue script for creating and evaluating reference panels from VCF files. Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel. Lots of options for exploring the impact of the VQS likelihoods, multiple VCFs for constructing the reference panel, as well as fraction of sites left out in assessing the panel's power. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 03:08:38 +00:00
depristo	9b8d41160b	GENOTYPE_GIVEN_ALLELES now respects the filter status of the incoming alleles file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5466 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 02:59:28 +00:00
depristo	6281c1db6f	A nicer error (UserException now) for malformed genome locs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5465 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 02:58:29 +00:00
delangel	b45afe5ba8	Several major fixes and changes to new indel likelihood model: a) Scrapped the way in which we constructed candidate haplotypes because it wasnt fully correct and yielded corner conditions with incorrect genotyping and likelihood computation. Ideally, a haplotype should "cover" the read and the most likely alignments should be such that the ends of the read are inside the ends of the haplotype. This wasn't happening, and if you have a "dangling read off a haplotype" the probabilistic alignment model may prefer to shift a read instead of scoring it correctly - this is especially bad with tandem repeat insertions. So now, we build haplotypes based on the reference context and adaptively change them based on read alignment positions, plus some padding and uncertainty in the alignment. b) Changed the way soft clipped based are dealt with. Instead of either ignoring them or using them, we only use them if the read start or end position (after soft clipping) are within eventDistance of the current location. This is done because it's very common that BWA's strictly local SW implementation will soft clip every single read at an insertion position because it couldn't place that end of the read without too many mismatches, but the read is legit and the bases are good quality. If we don't take these bases into consideration, reads which are informative of an insertion event are essentially discarded because the informative part is clipped away. c) Several cleanups and fixes to the context-dependent gap penalty model based on length of HRun. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5464 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 18:39:31 +00:00
depristo	cd38dfb4ef	Now with a clearer, grammatically correct message git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5462 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 18:06:05 +00:00
depristo	10466dc7d1	I finally broke down and added a default documentation string to @Input for use in Queue scripts. It's not ideal, but I couldn't take any more queue scripts with doc="x" all over the place. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5461 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 18:05:25 +00:00
depristo	c1798a7dbc	Whitespace cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5460 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 18:04:08 +00:00
corin	30237e6824	Updated the walker to specify the build based on the user's input file name if the user does not specify the build. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5459 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 17:49:17 +00:00
carneiro	3de300e504	A walker that moves annotations from the filter field to the info field of truth annotated vcfs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5458 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 17:11:28 +00:00
ebanks	481750cbf9	Probable patch to Jerry Glenn's GetSatisfaction report. I'm having him test it out. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5456 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 16:00:50 +00:00
ebanks	3eea6e92b7	An extremely basic implementation of a deBruijn-based local assembler, using the jgrapht graph library. This is not at all optimized and has only been tested on my very simple 3-read test bams. I'm sure there are bugs in there - more testing coming soon. Insertions and deletions confirmed to generate identical graphs (except for the multiplicity of edges of course). Not worth using yet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5455 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 14:03:07 +00:00
hanna	28a5a177ce	Very crude implementation of writing BAM 'schedules' to disk rather than 'meta-indexes'. Not yet elegant, but proves that it circumvents the performance issues associated with the meta-index. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5454 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-16 21:48:47 +00:00
rpoplin	8d0880d33e	Misc cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5453 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-16 17:33:19 +00:00
rpoplin	c6ef6ee8b7	Recal file is an input to ApplyRecalibration, not an output. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5452 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-16 12:08:58 +00:00
rpoplin	8e89ff170e	Can't check substitution type of tri-allelic SNPs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5451 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-16 03:06:03 +00:00

4232 Commits (50e86cfee979ae7763aee0fc543f05bb653f8205)