gatk-3.8

Commit Graph

Author	SHA1	Message	Date
kshakir	b954a5a4d5	- After removing special code for intervals, instead of being of type File they are generated as List[File]. Changed previous checkin that was appending to this list and instead assigning a singleton list. - More cleanup including removing the temporary classes and intermediate error files. Quieting any errors using Apache Commons IO 2.0. - Counting the contigs during the QScript generation instead of the end user having to pass a separate contig interval list. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4539 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-21 06:37:28 +00:00
kshakir	b88cfd2939	Updated MD5s of VCFs, since the approximate command line arguments injected into the VCF headers now have a little more order to them thanks to changes in the ParsingEngine. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4538 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-21 03:07:40 +00:00
ebanks	8f38ebf98e	Throw a user exception when using the clustered SNP filter in the presence of ref calls. It's unfortunate, but until we get a windowed ROD context this is just too much of a headache to support. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4537 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-21 02:44:10 +00:00
kshakir	88a0d77433	Changed parsing engine to store the order the argument bindings based on their definition in the class, moving "-T" to the front of Queue command lines. Queue GATK generated .intervals is now a List(File) again removing special case handling in the generator. Instead of using @Scatter annotation, using ScatterFunction instance to determine if a job can be scattered. Implemented special VcfGatherFunction which only uses the header from the first file, even if the other files differ in their headers. Added a -deleteIntermediates to Queue to delete the outputs from intermediate commands after a successful run. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4536 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 21:43:52 +00:00
ebanks	91049269c2	Optimizations across the board, with help from Guillermo, Matt, and JProfiler. Too tired to give details now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4535 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 20:47:41 +00:00
fromer	f76865abbc	ReadBackedPhasing now uses a SortedVCFWriter to simplify, and has the ability to merge phased SNPs into MNPs on the fly [turned off by default]; MergeSegregatingPolymorphismsWalker can also do this as a post-processing step; Integration tests for MergeSegregatingPolymorphismsWalker were also added git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4534 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 20:27:10 +00:00
fromer	e8079399ac	Added flush() method to VCFWriters git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4533 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 20:23:22 +00:00
fromer	00726b6c4b	Added mergeIntoMNPs to merge successive VCF records into a single MNP VCF [if possible] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4532 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 19:40:26 +00:00
fromer	55230ce5f3	Added startsBefore, startsAfter, and minDistance [calculates distance between any pair of bases in the two GenomeLocs] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4531 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 19:12:34 +00:00
ebanks	4f77581087	More optimizations for HaplotypeScore: pulling final constants out of loops git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4530 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 17:40:57 +00:00
hanna	20fac43521	Add extra logging to the GATK run report at the start of metrics aggregation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4529 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 17:32:51 +00:00
ebanks	a205900eff	Naughty use of Strings in HaplotypeScore literally double the runtime of Unified Genotyper. Moved over to bytes and no longer allow Strings in the Haplotype util class. New round of profiling on tap for tomorrow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4528 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 03:32:21 +00:00
depristo	f9541b78d3	Timing of traversal now starts at the start of the traversal, so the rate is reasonable right off the bat. For example, we now see: INFO 22:45:02,476 TraversalEngine - [TRAVERSAL STARTING]; INFO 22:45:32,484 TraversalEngine - [PROGRESS] Traversed to 2:50850686, processing 18,646 sites in 30.05 secs (1611.50 secs per 1M sites) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4527 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 02:47:34 +00:00
ebanks	c305b41da4	Patch for James git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4526 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 02:39:16 +00:00
depristo	f7ce18553e	GenotypeConcordance now prints interesting sites more nicely. RMDTrackBuilder now uses the root class FeatureSource, not BasicFeatureSource. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4525 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 00:29:02 +00:00
ebanks	7a291a8ff3	First pass at a VCF validator. Will test more tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4524 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-19 19:55:49 +00:00
chartl	341e93ee12	The reference fixer seems to have munged the OMNI rather than making it better. Looks like some sites need to only have the ref and alt bases swapped, and others need to have the genotypes swapped as well? E.g. some subset need A C 1/1 --> C A 0/0 while another subset need A C 1/1 --> C A 1/1 it's unclear how big these subsets are (or even if one is empty). What I do know is, doing the first one totally screws up concordance metrics for the 421-sample chip. So either something else needs to be done, or there's a bug in this walker. Until I know for sure, I've added an initialize exception to disable this thing... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4523 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-19 12:50:24 +00:00
ebanks	5251f49a90	Including Marian Thieme's BaseCounts class (with some modifications) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4522 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-19 03:07:30 +00:00
hanna	c5f105d050	Fix boneheaded mistake in the new interval filtering code I added on Sunday. Sorry everyone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4521 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-19 01:20:12 +00:00
kshakir	81479229e1	QScript authors can now tag functions as intermediate. Functions tagged as intermediate will be skipped unless another function in the graph needs their output. Re-logging the failed jobs and the path to their log files at the end of a run. Added a parameter -bigMemQueue for the fullCallingPipeline.q instead of hardcoding gsa (gsa was backed up and it was actually faster to run on week). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4520 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 22:11:14 +00:00
ebanks	524cb8257c	Renaming for accuracy git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4519 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 18:11:07 +00:00
ebanks	0fe504b748	Use filtered depth for Exact model (just like grid search) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4518 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 18:08:31 +00:00
ebanks	d54d9880d7	Now that G's new genotyping algorithm is live, I've cleaned up the code to completely separate the grid search from the exact model. AlleleFrequencyCalculationModel is now completely abstract. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4517 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 18:04:06 +00:00
ebanks	80e5ac65b4	CAP_BASE_QUALITY needs to be included in the clone() method for it to be usable in UG git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4516 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 03:11:03 +00:00
ebanks	f962039273	Using James P's patch for the liftover script git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4515 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-18 01:44:54 +00:00
hanna	6af9532090	Fix for GATK slowdowns at the ends of intervals. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4514 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 23:21:23 +00:00
chartl	5889138f4a	facepalm forgot to add the samples to the header. How could the VCFWriter let me get away with something so boneheaded?! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4513 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 05:36:29 +00:00
ebanks	e6d038067b	cleanup the temp index too git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4512 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 04:41:42 +00:00
chartl	2bc5971ca1	Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it. Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. IMPORTANT I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC of being one class, but recalculating it casts to another class, and I'm not sure what to do. I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 03:18:01 +00:00
ebanks	7aa030a9a4	Hmm. Apparently variants can get lifted over to different chromosomes. Who knew? Reverting changes from a couple of days ago. The only way to do this correctly (without requiring lots of memory) is to turn off on-the-fly indexing for this walker. Integration tests cover this now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4510 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 02:54:12 +00:00
kshakir	196029c0b4	Removed obsolete -bsubWait from sanity check. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4509 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-16 02:14:34 +00:00
kshakir	9dc2e931b6	Saving the order functions are added to in the QScript. Using the order during submission of ready jobs (but not currently dryrun) and during -status. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4508 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 20:00:35 +00:00
chartl	8b2d387643	Added in an eval module that calculates the dispersion histograms between eval and comp (e.g. M_{i,j} = # of times eval observed to have AC i, comp AC j -- for af it's i/100 vs j/100 ) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4507 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 19:07:43 +00:00
ebanks	f78ff08e2b	This is less correct than my previous change but it's what UGv1 does and now is not the right time to start mucking with things. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4506 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 18:56:45 +00:00
ebanks	471c18054f	Fix for SB calculation: the best overall AF might not have any mass when just looking at reads from a single strand. We need to compute the best AF for each stratification. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4505 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 17:51:18 +00:00
kshakir	7157cb9090	While bkill'ing on the shutdown thread Queue will no longer try to submit more jobs on the original thread. Updated pipeline output structure to current recommendations by Corin. Directories are now automatically before the function runs. Fixed several bugs with scatter gather binding when the script author needs to change the directories. Fixed bug with tracking of log files for CloneFunctions. More error handling and logging of exceptions (good test environment while LSF was down this early AM!) Removed cleanup utility for scatter gather. SG Output structure has changed significantly. Will need to discuss and find a better approach for Queue programatically deleting files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4504 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 17:01:36 +00:00
asivache	42c3d74432	bug fix git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4503 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 16:27:40 +00:00
chartl	c9d473edee	More changes to Variant Eval and Genotype Concordance (passes all integration tests): 1: -sample can now include a file, which will be parsed for sample-name entries 2: If you request a sample to run analysis on, but it is not present in any of your RODs, VEW will exception out 3: Change added to parse Integer, String, and List<Integer> type Allele Count annotations (error otherwise) 4 [slightly problematic]: The count objects now maintain row-keys in order, as the keys were taking an inordinate amount of time in onTraversalDone (multiple calls to getRowKeys(), so many multiple sorts of the same underlying unsorted object, very bad) There is a legacy comparison object which is unused which I will strip out soon. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4502 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 12:40:36 +00:00
ebanks	3d988576a6	updated liftover script: it no longer needs to re-sort the vcf git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4501 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 08:03:41 +00:00
ebanks	954dd84f51	Adding an integration test (against hg18 this time) that requires on-the-fly sorting in order to work properly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4500 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 07:45:21 +00:00
ebanks	9f54170dff	Hooking up the liftover tool to the new on-the-fly sorting VCF writer so that records can now get emitted in order. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4499 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 07:27:01 +00:00
ebanks	d41c252b13	Looking over the calling results with Ryan, it's clear that while the grid search optimization (ignoring samples that are clearly ref) can work for assigning genotypes, it cannot be used for calculating P(AF>0). There's too much area under the likelihood curve that gets lost and the QUALs are negatively affected. However, testing showed that this only slightly affects runtime (~15 minutes per 1Mbase for the 1kg allpops). The optimization does remain for genotyping. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4498 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 19:06:32 +00:00
corin	5e0c4ecc21	Added DbSnp to VariantEval git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4497 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 17:02:17 +00:00
kshakir	63e3848187	Added status email support with -statusTo. Will send emails on failure of an individual function or success/failure of the whole pipeline. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4496 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 15:58:52 +00:00
ebanks	2606e67cf1	Reverting Matt's change from yesterday which I accidentally blew away when trying to cope with the stupid svn update issues we've been plagued with recently. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4495 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 14:40:42 +00:00
ebanks	cfb33d8e12	Filtering optimizations are now live for UGv2. Instead of re-computing filtered bases at every locus, they are computed just once per read and stored in the read itself. Eyeballing the results on the ~600 sample set from 1kg, we cut out ~40% of the runtime! QUALs are now sometimes different from UGv1 because I noticed a bug in v1 where samples with spanning deletions only were assigned ref calls instead of no-calls which ever so slightly affects the QUAL. Not a big deal though. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4494 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 05:04:28 +00:00
chartl	4ac636e288	Minor change: when tabulating concordance by AC, ignore sites with multiple segregating alleles in the population, at least for now git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4493 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-14 01:35:33 +00:00
kshakir	5034ca18dc	...and forgot to sync up the changes to CommandLineFunction with CloneFunction. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4492 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-13 22:40:02 +00:00
chartl	7c9ef59d65	This is simultaneously a minor and major change to VariantEval, so take heed: The core walker has been modified so that when variant contexts (eval and comp) are subset to command-line-specified sample(s), the chromosome count annotations (AC/AN/AF) are altered to reflect the AC/AN/AF of only those samples involved in the comparison. No more getting AC500 when you're comparing a 10-sample overlap. Interestingly enough, this didn't break any integration tests. GenotypeConcordance now has two additional tables: Allele Count Statistics, and Allele Count Summary Statistics. These work exactly identically to the Sample Statistics and Sample Summary Statistics tables, except that the partition being used is no longer the sample, but instead the allele count of the variant sites. These tables stratify by both eval and comp ACs, e.g. evalAC0 evalAC1 evalAC2 compAC0 compAC1 compAC2 Differences with previous integration tests were verified to only be in the Allele Count tables (by grepping them out of the diff); a new test has been added for the simple case of an AC=1 site in the eval becoming an AC=2 site in the comp. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4491 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-13 22:26:15 +00:00
kshakir	5ee12875fb	Emergency fix for Ryan: - Catching errors when LSF fails and retrying. - When LSF retries fail, catching the error, marking the job as failed, and no longer bkilling everything by exiting Queue. - Caching function fields by class instead of each instance of a function saving a list of its fields. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4490 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-13 22:22:01 +00:00

4498 Commits (b954a5a4d5b020e40919afb451c211410ccffcc6)