gatk-3.8

Commit Graph

Author	SHA1	Message	Date
hanna	b231a40da5	Augment PrintLocusContextWalker with extended event info. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5583 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 13:42:48 +00:00
aaron	ab5c4064ed	quick bug fix for variant context utils: only calculate the max AC if we're using the mergeInfoWithMaxAC flag, and if so deal with sites that have multiple alternate alleles correctly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5582 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 05:36:52 +00:00
rpoplin	cc713f2769	fixing exception text git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5581 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 00:29:13 +00:00
ebanks	4b451314b2	Only store a read in the mate hash if it could possibly be moved. This reduces memory consumption especially when dealing with a case of tons of unmapped reads at the end of the bam; however, it's only mildly helpful for chr1 of the Papuans (there's a truly massive pileup 120Mb into it; more thought needed at a later point). Integration tests changed only because some of the reads in the original bam were busted to begin with (it's an old pilot 1000G bam). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5580 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 22:20:09 +00:00
asivache	77ca4eef31	IntelliJ complains that @Override is not allowed when implementing interface methods. Whatever. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5578 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 16:57:59 +00:00
ebanks	f4c06bb4ce	Traversal now says 'done with mapped reads' instead of 'done' so we don't confuse users when there are a lot of unmapped reads left to process. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5577 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 15:11:28 +00:00
fromer	5eccc7e528	Added annotation of INCORRECT SNP-based aa annotations in case of MNPdependentAA:true git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5576 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-05 02:46:45 +00:00
chartl	b52c3e7e30	Make the window and slide-by values command-line accessible, and standardize for every context. Move the test classes (which are abstract association context modules) into the proper directory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5573 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 22:37:12 +00:00
droazen	a5acb0b7a6	Fix for bug GSA-314: Detect -XL and -L incompatibility. An ArgumentException is now thrown if the combination of -L and -XL intervals specified on the command line results in an empty interval set after set subtraction. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5571 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 18:41:55 +00:00
carneiro	b722ebf244	quick help/comments updates to match the wikipage. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5569 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-04 12:55:55 +00:00
depristo	b316c9a590	Renamed StratifyAlignmentContext to AlignmentContextUtils, and StatiefyContextType to ReadOrientation. Also, went through the system and deleted all references to second bases. That ship passed long ago. This was the actual commit, the last was an intellij error git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5564 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 15:36:17 +00:00
depristo	5cca100aea	Eliminated the redundant StratifiedAlignmentContext, which previously just held a ReadBackedPileup, and made all of the class methods here just static functions. Far more logical organization, and avoided O(N) endless copying of data for the COMPLETE context. Many tools have been trivially reorganized to take an alignment context now. Everything passes integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5562 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-03 14:20:43 +00:00
rpoplin	98798eb276	Adding ReadPos rank sum test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5560 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 22:28:41 +00:00
rpoplin	09e89c8c97	Adding ReadPos rank sum test. Transitioned rank sum tests over to using Chris's implementation in order to harmonize the codebase. There isn't any reason to have competing implementations of rank sum. Thanks to Chris for adding the necessary hypothesis testing options. WilcoxonRankSum.java will be deleted soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5559 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-02 22:26:35 +00:00
fromer	27bfec785e	Some walkers for printing FASTA of reference for bed ROD, and "inverting" a bed file (finding regions not covered in bed) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5554 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 21:13:51 +00:00
droazen	0927b7c297	Fix for bug GSA-441: BAM file list with blank lines gives a confusing error message. Lines containing only whitespace in .list files are now ignored. Also added support for comments in .list files: lines whose first non-whitespace character is '#' are now also ignored. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5550 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 15:04:35 +00:00
kshakir	4f8411f4b5	Revved Picard to access new flag to disable mmap for bam indices. Only added a 3% speed boost but the mmap was added to the heap count, making it harder to specify/restrict the total resident memory size in LSF. Specifying -Xmx4g will now stay much closer to 4g resident memory usage versus bumping up to 9g when accessing 900 x ~8Mb bai's. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5549 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-01 01:40:41 +00:00
carneiro	0a772688fe	implementation of the Gatherer class for CountCovariates, which makes it now scatter/gatherable. Kudos to the @Gather annotation Khalid just introduced! QuickCCTest is my test script for the gatherer. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5547 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 21:15:21 +00:00
carneiro	dac1309dbd	Added two modes for selecting variants at random (random sampling). -number N -- generates a VCF with exactly N randomly chosen variants with equal probability. -fraction F -- generates a VCF with approximately F (between 0-1) randomly chosen variants with equal probability. (Similar behavior to RandomlySplitVariants walker). The reason for two modes is that the first one may need a lot of memory if your sample size is too large. The wiki is being updated with this information now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5545 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 21:12:40 +00:00
depristo	c7445a6fbd	Now that logging is so standard, only prints messages about logging to DEBUG. Also, found a way to silence the mime.types warning, that doesn't matter at all to us. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5543 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-31 16:49:39 +00:00
droazen	7b452ea2b9	Fix for bug GSA-430: Can't specify same BAM file twice on the command line. An ArgumentException with an appropriate error message and a list of the duplicate BAMs is now thrown in this case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5542 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 22:23:24 +00:00
hanna	deab9f0aa5	Initial work on proto-shard merger: - create size() method that returns an approximation of the uncompressed size in bytes of BAM span. I'll use this method as a protoshard weighting function until we determine how to normalize the weights across the different data access mechanisms (reads, reference, RODs). - Implementations of basic union/intersection/subtraction mechanisms for BAM spans; should be enough to get an accurate weight for two proto-shards put together. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5541 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 22:03:43 +00:00
carneiro	5d26c66769	Count Covariates is almost scatter-gatherable now! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5537 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 22:25:33 +00:00
rpoplin	5ddc0e464a	Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 21:04:09 +00:00
carneiro	0f4ace0902	fixed a bug when the concordance track doesn't have the sample in the variant track. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5535 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-29 18:24:19 +00:00
hanna	8ae14793f2	Small standalone utility to aggregate BGZF block statistics in a BAM file. Works in the same coordinate space as BAM chunks, so this will be used to calibrate chunk weighting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5531 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-28 22:25:45 +00:00
chartl	4c04c5a47a	Addition of a BedTableCodec to allow for parsing of Bed-formatted tables (e.g. bedGraphs). Fixes for the recalibrator. Implementation of the data whitening input. Some TODOs in the RAW. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5529 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-28 21:35:09 +00:00
depristo	cd8321cdc9	Removed the completely unused generic but extremely expensive infrastructure for dynamic LocusIteratorFilters. Now the one, and probably only useful one, is called directly in the LocusIteratorByState itself to filter adaptor bases from reads. This shaves 10% off the runtime of all walkers, apparently. Has the additional benefit of eliminating a lot of complex infrastructure that resulted ultimately in only a single function call. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5525 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-27 20:48:24 +00:00
depristo	231d095316	A clean, fast way to compute fragment pileups. Now consumes no CPU time at all. Ready for general use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5524 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-27 14:26:29 +00:00
depristo	6a1d12cf7b	Intermediate commit refactoring FragmentPileup to (1) make it more accessible (now in utils.pileup) as well as (2) improve performance. Passes all integration tests now. Upcoming refactoring will change further how the system can be accessed, and further improve performance. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5522 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-27 12:42:22 +00:00
depristo	3bcd4c5d75	--simplifyBAM is now in the SAMFileWriterArgumentTypeDescriptor, as suggested by map. PrintReads has an integrationtest now that writes out a 1 MB bit of HiSeq normally, with compress 0, and with simplifyBAM on. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5521 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 14:57:18 +00:00
hanna	28ae53d796	Merging the best parts of Mark's fix for the O(n^2) algorithm and my concurrently-written fix for the same. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5520 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 13:32:23 +00:00
depristo	d8fbda17ab	O(N^2) bug found and removed -- very subtle and hard to find. ArrayLists underlying read backed pileups were being initialized with size() from the entire pileup across all samples, not the sample-specific sizes. So in 1000 samples at 4x, we were creating 1000 x 4000 element array lists, instead of 1000 x 4x element array lists. This fix results in a 2-3x speedup for 900 sample calling, and moves UG.map() back into the main CPU cost of UG with many samples. 900 samples in a single BAM: Release: 64.29; with sample-specific size: 24s - 35s. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5519 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 12:38:19 +00:00
depristo	27c8fb1e4d	Added support for a general GATK option --simplifyBAM to automatically remove and simplify kept reads in an output BAM file. Specifically, duplicate, non-PF, and unmapped reads are removed, and all extended tags in the retained SAM records are removed except the RG:Z tag. This option is very useful when creating temporary BAM files (merged per-population or multi-sample cleaned) for future calling (as in the 1000G processing pipeline). Results in a significant reduction in space of the resulting BAM, faster reading of the BAM, and surprisingly even faster UG performance. On 1-10mb of chromosome one, from the NA12878 HiSeq 64x data set on hg18: Full BAM: write time 8.6 m, size 866M, CountReads time 2.9 m, UG time 11.3 m. Simplified BAM: write time 6.2, size 458M, CountReads time 85.7 s, UG time 10.1 m. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5517 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 01:21:35 +00:00
chartl	fe7f45ee2e	First pass at recalibrating associations, with optional data whitening. Modification to the TableCodec so it can natively read bedgraph files (just needed to add an extra header marker: "track"). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5515 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 19:35:39 +00:00
hanna	ac39f5532e	Turn off index caching. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5514 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 18:48:23 +00:00
hanna	8d8aed6a67	Fix correctness issue when dynamically merging many files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5512 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 16:35:43 +00:00
delangel	c9283e6bc5	Refinement to previous commit: no need to duplicate code to annotate rsID since variantAnnotatorEngine is called from UG anyways. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5511 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 15:00:32 +00:00
delangel	3383733379	Same commit as previous one for VariantAnnotator. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5510 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 12:07:18 +00:00
delangel	8701dfe8d3	Hideous, horrible, hairy mutant bug: when we annotate ID field in indels, we were looking for SNP records matching the position, instead of indel records. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5509 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 12:04:08 +00:00
kshakir	3e3ff4a9e7	Bam gathering passes on the compression_level and the create_index flag to MergeSamFiles. VCF gathering passes on the no_header and sites_only flags to CombineVariants. Fixed deletion of gathered log files. Although they are intermediate and do not need to be re-run if not present, they should not be deleted. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5508 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-25 03:58:38 +00:00
carneiro	47279ee56e	Added --concordance option that outputs the intersection between two VCF files. Useful to see what calls were made in both technologies/algorithms. Wiki has been updated accordingly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5507 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-24 21:27:16 +00:00
kshakir	e47513f043	Minor updates to match the wiki documentation. Upper cased the PartitionType enum values. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5506 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-24 20:22:23 +00:00
hanna	26e3bea76e	Fix for == used to test object equality. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5499 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-23 18:15:19 +00:00
hanna	37fbf17da8	Finally restored code after accidentally removing three days worth of work: schedule file infrastructure has been restored, and is now a single file. Only the exact bins required for the traversal are stored in the schedule. Very close to being able to merge schedule entries. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5497 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-23 05:52:40 +00:00
ebanks	ded80e0c57	Trivial change to remove space at the end of the description git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5495 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-23 01:47:46 +00:00
carneiro	28149e5c5e	GenotypeAndValidate version 2, ready to be used: it now differentiates between confident REF calls and not confident calls; you can now use a BAM file as the truth set; output is much clearer now. dataProcessingPipeline version 2, ready to be used: all the processing is now done at the sample level; reads the input bam file headers to combine all lanes of the same sample; cleaning is now scattered/gathered, and intelligently breaks down into as many intervals as possible, given the dataset; outputs one processed bam file per sample (and a .list file with all processed files listed); much faster, low pass (read Papuans) can run in the hour queue. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 20:18:02 +00:00
ebanks	af7f78e8ba	Minor debugging output change. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5488 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 12:59:26 +00:00
ebanks	1a9e65bcd4	Updating other walkers now that VCC extends from VC git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5486 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 03:10:40 +00:00
ebanks	0ee687e49d	For Mauricio: now, even in GENOTYPE_GIVEN_ALLELES mode, the VariantCallContext (which now inherits directly from VC) will report reference calls as confidently called if they pass the threshold even if the QUAL of the record itself is low because we were forced to have an ALT allele. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5485 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-22 02:42:28 +00:00
ebanks	ab6a815184	As per the comments in the commit itself: when reads get mapped to the junction of two chromosomes (e.g. MT since it is actually circular DNA), their unmapped bit is set, but they are given legitimate coordinates. The Picard code will come in and move the read all the way back to its mate - which can be arbitrarily far away and cause records to be written out of order. Very evil. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5484 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-21 20:30:24 +00:00
ebanks	d9202f2764	Don't try to create a GenomeLoc from an unmapped read git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5480 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-21 13:46:55 +00:00
ebanks	1c95208e26	Finally found the bug that everyone is reporting on GS. Iterators on PriorityQueues aren't guaranteed to return elements in sorted order (a pretty stupid contract) - so we were passing items to the constrained writer out of order. Just do a Collections.sort instead (1 line of code). Happy father's day! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5476 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 21:28:19 +00:00
ebanks	9568c84af9	Don't output these messages in INFO mode because they are scaring people unnecessarily git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5475 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 19:55:22 +00:00
depristo	22ff2573d5	Removed MAG entirely git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5474 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 19:43:23 +00:00
depristo	abc7d1aef9	BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site. ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation. Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field. This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples). Added a new VariantsToBeagleUnphased walker that writes out a marker drive hard-call unphased genotypes file suitable for imputing missing genotypes with a reference panel with beagle. Can optionally keep back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power. The bootstrap sites can be written to a separate VCF for assessment as well. Finally, my general Queue script for creating and evaluating reference panels from VCF files. Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel. Lots of options for exploring the impact of the VQS likelihoods, multiple VCFs for constructing the reference panel, as well as fraction of sites left out in assessing the panel's power. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 03:08:38 +00:00
depristo	9b8d41160b	GENOTYPE_GIVEN_ALLELES now respects the filter status of the incoming alleles file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5466 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-18 02:59:28 +00:00
delangel	b45afe5ba8	Several major fixes and changes to new indel likelihood model: a) Scrapped the way in which we constructed candidate haplotypes because it wasn't fully correct and yielded corner conditions with incorrect genotyping and likelihood computation. Ideally, a haplotype should "cover" the read and the most likely alignments should be such that the ends of the read are inside the ends of the haplotype. This wasn't happening, and if you have a "dangling read off a haplotype" the probabilistic alignment model may prefer to shift a read instead of scoring it correctly - this is especially bad with tandem repeat insertions. So now, we build haplotypes based on the reference context and adaptively change them based on read alignment positions, plus some padding and uncertainty in the alignment. b) Changed the way soft clipped bases are dealt with. Instead of either ignoring them or using them, we only use them if the read start or end position (after soft clipping) are within eventDistance of the current location. This is done because it's very common that BWA's strictly local SW implementation will soft clip every single read at an insertion position because it couldn't place that end of the read without too many mismatches, but the read is legit and the bases are good quality. If we don't take these bases into consideration, reads which are informative of an insertion event are essentially discarded because the informative part is clipped away. c) Several cleanups and fixes to the context-dependent gap penalty model based on length of HRun. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5464 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 18:39:31 +00:00
depristo	cd38dfb4ef	Now with a clearer, grammatically correct message git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5462 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 18:06:05 +00:00
ebanks	481750cbf9	Probable patch to Jerry Glenn's GetSatisfaction report. I'm having him test it out. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5456 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 16:00:50 +00:00
hanna	28a5a177ce	Very crude implementation of writing BAM 'schedules' to disk rather than 'meta-indexes'. Not yet elegant, but proves that it circumvents the performance issues associated with the meta-index. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5454 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-16 21:48:47 +00:00
rpoplin	8d0880d33e	Misc cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5453 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-16 17:33:19 +00:00
rpoplin	d98503ca50	Removing some debug code from VQSRv2. VariantEval can now be stratified by contig with -ST Contig. New hidden option in CombineVariants for overlapping records to take the info fields from the record with the highest AC (while still updating AC/AN/AF correctly) instead of dropping info fields which aren't exactly the same. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5448 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-15 21:28:10 +00:00
carneiro	4b9b767eb1	SelectVariants: now keeps the YAML stuff internal... it's there if you wanna use it, but won't be published anymore. Official parameter is the string for now. VariantEval: now sports the new MendelianViolation utility class. MendelianViolationClassifier: I noticed I had broken chartl's walker by changing VariantEval, so I took the liberty to modify it to use the new library too, though I kept modifications to a minimum, could have gone into full integration if this is a useful tool, but since it's in oneoffs, I decided not to go all out. MendelianViolation: Some getter methods were added for chartl and VariantEval. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5447 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-15 18:36:55 +00:00
delangel	653fb09bb7	a) Next iteration of context-dependent gap penalty model for new probabilistic alignment indel model. Actual model is now implemented, computes homopolymer run profile for candidate haplotypes and looks up in table gap penalties based on hrun length at each position. Initial penalty model is a very naive affine penalty model with each extra hrun increment decreasing Q2 the gap open penalty, until a minimum is reached. Still needs to be tuned and ideally get data from recalibration. b) small bug fix when setting debug arguments git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5446 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-15 16:46:28 +00:00
carneiro	33c7593218	YAML integrated mendelian violation utility class, integrated and tested through select variants. Wiki is updated. ps: I moved it out of tribble. If you think it should reside in a different place, just yell at me. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5436 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-14 16:43:37 +00:00
hanna	5406e779d2	Ryan noticed that I accidentally killed a public interface method for getting tag information. Reinstated. Proper unit test to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5434 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-14 15:51:19 +00:00
depristo	3e3ec85807	Checked for consistency with the previous integration tests, and updated the walker and test to use the new I/O system (always prints 4 digits on floats). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5433 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-13 15:24:22 +00:00
depristo	b99e27bf9b	In the process of optimizing ProduceBeagleInputWalker, discovered that the GenotypeLikelihoods, the UG, and Genotype objects were using old-style GL tags internally, and then converting from Likelihoods -> GL String -> Likelihoods -> PL String throughout the GATK. It was both painful and led to convoluted code throughout the system. Removed everything but GL conversion -> PL in the GenotypeLikelihoods objects, and now all of the code in UG immediately provides GenotypeLikelihoods to the Genotype objects, which are converted straight to PL. Resulted in a 30% speed up in ProduceBeagleLikelihoods, passes integration tests without any modifications, and likely speeds up writing any VCFs with likelihoods. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5432 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-13 00:07:51 +00:00
depristo	d01d4fdeb5	Optimized version of produce beagle tool, along with experimental (hidden) support for combining likelihoods depending on estimated false positive rate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5430 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-12 02:06:28 +00:00
depristo	ee8f2871f7	A better output for Genotype Concordance summary. Now does only % comp hom-ref called hom-ref, het called het, and hom-var called hom-var, which are the quantities we typically show in slides. Updated integration tests to reflect this change. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5429 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-12 02:03:48 +00:00
kshakir	93de326066	Added a new @PartitionBy for walkers to specify how to cut up their inputs. Now building all javadoc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5428 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-12 01:33:08 +00:00
delangel	8ca3390ee0	Low level plumbing work required to have a context dependent error model with the new indel probabilistic alignment model. This just adds an extra input argument and does some refactoring so that when an actual model is ready it will be easy to plug in. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5427 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-12 00:00:55 +00:00
carneiro	e35a67b3cc	changed the name of the parameter to make the wiki more uniform. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5426 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-11 17:54:53 +00:00
carneiro	4a84a81d17	SelectVariants: added parameters for mendelian violation. Given a trio vcf, it will generate a VCF with the sites that are mendelian violations. GenotypeAndValidate: now annotates the validations with callStatus. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5425 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-11 17:47:53 +00:00
delangel	b03055099a	a) Changed the way we classify and log indel events (e.g. in IndelClasses table inside IndelStatistics VE module). Made names clearer, and split logging of event length with number of repetitions of event. b) Add an experimental annotation to log indel type string inside the INFO field, just for debugging/temp analysis purposes (will consider making it standard if it proves useful). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5424 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-11 17:37:41 +00:00
depristo	ccc773d175	Refactoring, cleanup, and performance improvements to ProduceBeagleInput. It's really a shame that there's no integration tests... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5418 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-11 13:55:30 +00:00
ebanks	4baeb5979f	It turns out that Math.log10() can return 0, which leads to QUALs being set to -0, which is off-spec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5415 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-10 03:08:56 +00:00
ebanks	3596c56602	New attempt at the constrained movement version of the indel realigner (I've kept around the old writer for now). The new contract is that the realigner must ask permission before trying to clean an area; permission will be denied by the CM-Manager if it was required to flush its cache of reads because of too much depth within a distance of maxInsertSizeForMovingReadPairs. Added integration tests to cover different max cache sizes, including an expected exception when too small a value is chosen. The actual logic changes were fairly minor - much of this commit is really just some cleanup. I'd like to throw 1000G Phase I at it, but will respectfully wait for Ryan to hit his deadline before doing so. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5414 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-10 02:48:29 +00:00
rpoplin	ff7edc4493	Minor bug fix in empiricalMu prior calculation in VQSR. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5412 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-10 00:42:38 +00:00
rpoplin	509daac9f7	Minor bug fix in k-means implementation. Updating VQSR integration tests in preparation for VQSRv2 by removing some unused features such as VariantDatum.weight and ti/tv cutting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5410 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-09 00:26:28 +00:00
carneiro	b733cba7c7	re-fixing for a different approach suggested by eric! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5402 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-08 04:54:49 +00:00
carneiro	02006954bc	UG: small bug fix when creating empty variant contexts in UG for the -EMIT_ALL_SITES to allow indels. GAV: First version of the walker that validates reads from a BAM file based on an annotated VCF with TP/FP annotations. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5396 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-07 22:51:04 +00:00
hanna	9384b2ff65	A few quick fixes to temporarily make the LowMemorySharder return exactly the same shards as the previous sharder, so that I can directly compare filespans to see where some performance bugs lie. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5395 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-07 22:43:14 +00:00
depristo	0b4e51317b	Now includes project consensus high sensitivity data set git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5394 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-07 20:52:11 +00:00
carneiro	73e43d8d2c	Added functionality: -disc (--discordance) parameter together with a ROD track will output a VCF with the variants in the ROD track that are not present in the 'variants' VCF. Useful tool to list the variants from hapmap (for example) that weren't called in a dataset. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5392 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-07 19:18:15 +00:00
delangel	8c262eb605	Initial commit of new likelihood model to evaluate indel quality. Principle is simple, a plain Pair HMM with affine gap penalties (in log space) that does quasi-local alignment between reads and candidate haplotypes and which in theory should be more solid and more reliable than the older Dindel-based model. It also allows to be easily extensible in the future if we decide to introduce either context-dependent and/or read-dependent gap penalties. Model is disabled by default and we're still using the old Dindel model until I'm more confident that new model is a definitive improvement, so right now this is enabled by hidden command line arguments, and it's not to be used yet. In detail: a) Several refactorings to share softMax() available to other modules, so its now part of MathUtils. b) Refactored a couple of read utilities and moved from BAQ to ReadUtils. c) New PairHMMIndelErrorModel class implementing new likelihood model d) Several new hidden debug arguments in UAC. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5389 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-07 15:31:58 +00:00
depristo	5c979633f0	Due to a problem in the way that dynamic type selection works, I've added an explicit (temporary) ability to restrict VE to specific variant types (SNPs, INDELs, etc), so that calculations will work when a site has a SNP in dbSNP but is called as an indel, causing the SNP site to mysteriously disappear from the comp track, a huge problem for validation report. VEU updated to allow both dynamic type (old) and just returning everything in the track. Also, created a standard Queue script that calculates a suite of standard indel and SNP assessment results. Will be the basis for a general evaluation Queue script with standardized data files for SNPs and Indels. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5385 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-06 19:31:12 +00:00
depristo	2f1e249aed	A proper validation report, calculating TP, FP, FN, sensitivity, FDR, PPV. Treats comp as a set of sites that have been either filtered (failed in assay), validated (polymorphic among samples), or invalidated (AC=0 or all genotypes = hom-ref). Very useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5384 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-06 19:27:40 +00:00
depristo	af71576a07	CalculateChromosomeCounts() now only calculates AC, AF, and AN when there are genotypes. Can now combine variants with headers that differ in only whether a field is an integer or a float. Updated CombineVariants integrationtest, as incorrect AC values were being calculated in the previous GS outputs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5383 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-06 19:25:52 +00:00
depristo	5b8fdc5b1f	Slightly optimized calculation for ~linear exact model, as well as totally incorrect banded calculation, for future development, if this proves useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5382 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-06 18:47:08 +00:00
depristo	9a8356892a	Cleaner error (really now just warnings) if you can't reach the S3 for logging git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5377 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-05 06:08:35 +00:00
hanna	10516f5de4	Fixed one low-memory sharder performance culprit: regions with no BAM data whatsoever were misusing the Picard MergingIterator, triggering a re-traversal through the entire contig. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5376 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-04 21:26:22 +00:00
ebanks	337b54136f	2 fixes. For Mark: when insertions can be partially left-aligned, we were reading off the wrong bases. For GS post: the stored VariantContext.REFERENCE_BASE_FOR_INDEL_KEY needs to be updated when left-aligning because it can change. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5375 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-04 21:00:08 +00:00
kiran	b42005e7d7	Fixed issue where comp tracks with genotypes that didn't exactly overlap the eval track were getting dropped. Fixed issue where the 'row' column wasn't being output for things implementing TableType. This is an urgent patch for Mark - it'll break tests until I go back and update the md5s. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5374 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-04 16:51:12 +00:00
kiran	1861ca90fc	A change to the definition of CpG sites (is now, from 5' to 3' a CG dinucleotide in the reference, and the CpG site is at the C, rather than either at the C or a G). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5373 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-04 15:36:07 +00:00
chartl	9ca1dd5d62	Miscellaneous changes: - RefMetaDataTracker: grabbing variant contexts given a prefix (not sure where else this was implemented, if someone can show me I'll remove it) - VCFUtils: grabbing VCF headers given a prefix - MathUtils: Useful functions for calculating statistics on collections of Numbers - VariantAnnotator: Made isUniqueHeaderLine a public static method -- maybe this should go into a different class. Not sure. - Associations: PluginManager now used to propagate classes, implementations for Z,T,U tests, slight alteration to format to make the objects stored in the window optionally different from those returned by whatever statistic is run across the window Added: - MannWhitneyU. Started to fix up WilcoxonRankSum but there are comments in there questioning the validity of some of the code, and I'm sure that it's actually doing a U test. This implementation includes the direct calculation of p-values for small sample sizes, and a uniform approximation for when one of the sample sets is small, and the other large. Unit tests to follow. - BootstrapCallsMerger: takes n VCFs which have been called on the same samples; merges them together while averaging the annotations - BootstrapCalls.q: qscript for testing the effectiveness of bootstrap low-pass calling on the exome git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5372 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-03 22:43:36 +00:00
rpoplin	f7ef35b8f5	Removing untrue comments in the GaussianMixtureModel git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5369 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-03 18:18:47 +00:00
hanna	5d4bbf41fb	Behave intelligently in the deepest levels of GATK record filtration when we find a read flagged as 'mapped' in the unmapped region at the end of the file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5365 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-03 04:52:55 +00:00
hanna	7a22f19366	More descriptive error when VerifyingSamIterator hits an inconsistent alignment. Also updated case UserException.MalformedBAM to match case of UserException.MissortedBAM for consistency and ease-of-use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5364 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-03 03:55:24 +00:00
depristo	0181d95fe4	Intermediate optimization checkin. LinearExact model now about 10-20% faster than previous commit, by reorganizing and optimizing the if statements and genotype likelihood calculations. Next commit will include a banded implementation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5362 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 22:01:35 +00:00
ebanks	f0f4bc3363	This was busted because it assumed 1 (and only 1) record at each position. However it's possible to have 0 (which generated a NullPointer) or 2+ records (which dropped records). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5361 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 21:35:50 +00:00
depristo	c152ef4339	Better error message for unknown reference file extension. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5359 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 17:52:16 +00:00
hanna	bef83b8b09	Bug fix: was tracking state across BAMs that should've been tracked per-BAM. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5358 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 17:32:06 +00:00
depristo	bafa61c1fe	LINEAR_EXACT now the default model. Passes all integration tests. 2-3x faster in low-pass data. Tests on exome data ongoing, but potentially vastly faster there. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5357 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 17:14:36 +00:00
rpoplin	8e1aa6059a	New mode for CombineVariants to assume the incoming VCFs have the same samples and disjoint calls. Drastically reduces the runtime for routine combining operations. Very useful with Queue. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5356 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 15:52:17 +00:00
hanna	5e4b321f86	Add hidden command-line argument for low-memory sharding. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5355 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 15:13:16 +00:00
ebanks	ae42c0c7da	Bug fix based on GATK run report git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5354 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 14:18:12 +00:00
hanna	880c607d79	Disable validation of linear index against original linear index process. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5352 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 01:51:26 +00:00
hanna	dc62685a2f	For Ryan: force creation of BAM index when no reads are present in the BAM file. Temporary fix until Picard changes the behavior of indexing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5351 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 01:50:42 +00:00
hanna	43567b7fe3	Load the linear index without forcing the index for the entire contig to be loaded into memory. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5349 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 00:08:39 +00:00
ebanks	a20ce1436d	A temporary @hidden hack to get indel calling done for Phase I: don't try to call if there's too much coverage. Do not use this unless your last name rhymes with Shmoplin. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5348 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-01 19:22:27 +00:00
hanna	3c7ae0d1a6	Special case handling of unmapped region in low memory sharder. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5346 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-01 17:38:30 +00:00
hanna	dd30ad751a	Fix bug in low memory sharder's interval accumulator. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5345 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-01 17:11:22 +00:00
hanna	d6145de970	More comprehensive tracking of position when bin trees are sparse. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5344 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-01 15:53:43 +00:00
ebanks	bb969cd3a2	EMIT_ALL_SITES now does exactly that - even when there's no coverage or too many deletions git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5343 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-01 05:05:00 +00:00
rpoplin	ce34a8a918	New hidden option in VQSR to not parse the genotypes of the incoming training data. Updated VQSR training in methods development pipeline to be more in line with best practices. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5340 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 23:19:51 +00:00
hanna	e7089f9870	Fix for particularly small, isolated intervals: make sure the bounds of the bin tree are dictated by the lowest bin level, whether it exists or not. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5339 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 22:35:53 +00:00
hanna	c869d1c9cf	Fix misc issues in new protosharder regarding proper iterator termination when an unexpectedly small amount of data is present. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5338 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 21:14:18 +00:00
hanna	e75366f738	Fixed performance issue in protosharding code -- turns out that the index optimizer was mutating the data stored in the indices. Protosharding still disabled by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5334 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 17:32:12 +00:00
ebanks	8de83725f9	Simple walker to randomly break VCF files into (potentially unequal) subsets. Useful for e.g. cutting hapmap into training and evaluation sets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5333 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 16:51:46 +00:00
delangel	d059d89a9d	Fixes and cleanups for indel eval module. Also outputs AT/CG ratio in dedicated column in IndelStatistics. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5332 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 12:07:50 +00:00
ebanks	05fac8583d	Following up Mark's recent commit: hooking up the --maxPositionalMoveAllowed argument into the indel realigner and through to the SAM writer. We now ensure that no read is realigned more than N bases (200 by default, which is nowhere close to realistically possible). If anyone ever sees a warning message about this with the default value then please let me know because I need to see it for myself. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5331 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 04:40:54 +00:00
depristo	874406352c	Accidentally committed the N2 comparing test as well... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5330 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 04:15:30 +00:00
depristo	1dedfdb11b	Fixes for constrained movement Indel Realigner. Now sorts all of the reads in the interval before handing them to ConstrainedMateFixingSAMFileWriter to maintain correct contract between the two pieces of software git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5329 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 03:52:18 +00:00
depristo	d216830b92	Experimental linear version of the exact model. In testing, but gives identical results to N2 gold standard version, and passes integration tests. Performance optimizations still ongoing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5328 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 03:48:11 +00:00
ebanks	54facb2c51	Small change for Mauricio so that the correct metrics get output when running in GENOTYPE_GIVEN_ALLELES mode. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5327 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-27 06:08:32 +00:00
depristo	7ff8d23c64	Don't do genotype concordance on comp tracks without genotypes, even if they have an AC git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5321 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-25 21:11:50 +00:00
hanna	600f73cbd6	A checkpoint commit of two BAM reading projects going on simultaneously. These two projects are works in progress, and this checkin will provide a baseline against which to gauge improvements to both projects. Low-memory BAM protoshards (disabled by default): - Currently passing ValidatingPileupIntegrationTest. - Gets progressively slower throughout the traversal, but should run at least as fast as original implementation. - Uses 10+ file handles per BAM, but should use 3. BAM performance microbenchmark test system: - Currently tests performance of BAM reading using SAM-JDK vs. GATK - Tests do not hit all GATK performance hotspots. - New tests that require input data in a slightly different form are hard to implement. - Output of test results is not easily parseable (investigating Google Caliper for possible improvements). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5317 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-25 17:50:32 +00:00
asivache	abf3fcbb72	Little changes in recognized annotation terms; columns in annotated maf are now prioritized and multiple alternatives do not cause 'i don't know what to do' crash: e.g. if Chromosome and chr columns are both present, then Chromosome is taken (has a priority). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5302 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-24 16:19:06 +00:00
rpoplin	255cc246a2	Change in Methods development pipeline: dbsnp130 can't be used for anything, changed it to dbsnp129. Optimization for HaplotypeScore and the to-be-committed ReadRosRankSumTest in AlignmentUtils git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5301 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-24 16:09:03 +00:00
chartl	97e1a5262e	-ct x no longer includes coverage in the previous bin BatchMerge - additional support for indels (can't just test the alternate allele when it's an extended event, must also specify that you want to use the dindel model when you actually test the allele) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5300 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-24 15:52:04 +00:00
ebanks	ee6f112556	Phase 3: constrained movement is now the only option available in the realigner (so I guess technically it's not really an option). Several command-line options are deprecated. Code cleaned up. Wiki updated. Release coming. One phase left... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5299 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-24 14:59:48 +00:00
ebanks	93888e570b	Phase 2: after hours of testing, confirming that constrained mode looks good so moving the integration tests over to use it. Some cleanup. More cleanup coming in Phase 3. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5298 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-24 06:23:41 +00:00
carneiro	75bd0129e7	quick bug fix. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5296 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-23 19:16:20 +00:00
ebanks	9357bee921	Don't skip tri-allelic alleles passed in - just choose the first one. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5293 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-23 17:25:50 +00:00
carneiro	a2301383bb	quick walker to find out where the reads mapped to huref were mapped in the consensus reference. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5292 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-23 17:00:17 +00:00
ebanks	318035c147	Fixing up the output system of the Unified Genotyper. Deprecating the -all_bases and -genotype arguments. Adding instead the --output_mode (EMIT_VARIANTS_ONLY, EMIT_ALL_CONFIDENT_SITES, EMIT_ALL_SITES) and --genotyping_mode (DISCOVERY, GENOTYPE_GIVEN_ALLELES) arguments. UG now does the correct thing when passed alleles (bound to the 'alleles' rod) to use for genotyping; added several integration tests to cover this case. This commit will break the batched calls merging script, but Chris knows this and is ready for it... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5288 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-22 06:07:18 +00:00
ebanks	d7f98ccd9c	Adding --doNotWriteOriginalQuals argument to BQ recalibrator git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5286 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-22 04:00:00 +00:00
depristo	1a5d296737	ReplaceReadGroups. Fixes BAM files without read group info. MissingReadGroup points people to this tool now. Please point users on the forum to this tool now. Will migrate to Picard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5284 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-21 14:02:41 +00:00
kiran	cb95e68fc0	CpG is no longer a standard stratification. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5273 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-18 07:17:35 +00:00
kiran	9ddee96f93	When subsetting by sample, need to take extra care that hom-ref sites don't accidentally get treated as variant sites in CompOverlap. Renamed convenience method for creating command-lines in integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5272 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-18 06:26:38 +00:00
delangel	1bc5c7e99b	boneheaded mistake, mixed up my min and max git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5271 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-18 04:02:14 +00:00
kiran	92c82200c9	Fixed an issue where an eval module with TableType objects would get an extra, empty table in the output, screwing up the parse in R. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5267 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-17 23:03:46 +00:00
asivache	7ffcade3c3	Added MNP to recognized and counted event types git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5266 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-17 22:37:38 +00:00
depristo	57c66b5602	Supports GQ now git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5265 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-17 22:30:25 +00:00
delangel	f1d708f4d4	Fixes for HRun annotation in case of indels: a) In case of a deletion value was completely broken, we'd report 0 or -1. b) For indels, we report maximum of forward and backward values - I've seen empirically many sites which are not strand biased but which seem to be artifacts and the homopolymer run is always to the right only (because we left align by convention). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5260 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-17 18:57:21 +00:00
asivache	0e04e95245	Bug fix: when extracting reference sequence for the event from the reference genome, the tool was treating Deletions and MNPs of length N in exactly the same way: ref_bases[current_pos+1,...,current_pos+N]. This is correct for Deletions but not for MNPs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5258 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-17 16:15:42 +00:00
depristo	5a51c9a815	AWS_S3 logging is now enabled by default. It first tries to log internally at the Broad, and if it can't goes to AWS_S3. DEV option is removed git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5249 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-15 20:20:14 +00:00
asivache	7a11b4f35d	Another change in variant classification values git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5237 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-14 17:47:58 +00:00

2545 Commits (0095aa2627bbfca377925de71df9dbf16fa6dd6b)