gatk-3.8

Commit Graph

Author	SHA1	Message	Date
droazen	84dd72e6cb	Adding in some read filters, updating MathUtils. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6042 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 22:54:15 +00:00
droazen	4f7a64a798	Fixing broken walker as per GS; adding integration test to cover it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6040 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 22:54:04 +00:00
droazen	0e057276ae	Changing the default behavior of the IndelRealigner to run without Smith-Waterman. Changed around the integration tests accordingly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6039 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 22:53:58 +00:00
droazen	df71d5b965	bye bye git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6035 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 22:53:42 +00:00
droazen	9b90e9385d	Putting new association files, some qscripts, and the new pick sequenom probes file under local version control. I notice some dumb emacs backup files, I'll kill those off momentarily. Also minor changes to GenomeLoc (getStartLoc() and getEndLoc() convenience methods) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6034 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 22:53:37 +00:00
ebanks	745935ffc2	No longer used - instead see the ConstrainedMateFixingManager class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6030 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 19:38:17 +00:00
ebanks	b35df9a0f7	Removing unnecessary String.format calls git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6028 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 15:30:52 +00:00
delangel	b7a1beff3c	Bug fixes and rewrite of logic of several parts in SelectVariants: a) If we were selecting SNPs or Indels and there was one of each at the same location, only whichever one was pulled first was processed. b) Fixed logic error when selecting Mendelian Violations: if that option was used it wasn't possible to combine with other options. c) Fixed logic error when using -disc option: you shouldn't parse genotypes to check whether a site is present or not because a vcf can be sites-only and this is slow. d) Made -disc option work in the same way as other options: variants are now selected from "variant" track all the time, and variants which are not in disc track are kept. Inverse logic (keep disc variants not present in "variant") is confusing and prevents users from combining with different options. With these changes it is now possible to ask for example "Give me all indels which are Mendelian Violations, not in dbsnp and present in these samples" which was not possible before. Integration tests covering the above are forthcoming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6027 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 14:05:00 +00:00
ebanks	8e149cc52f	Fixing a silly bug of mine: when a realignment target begins at position 1 of a contig, it was possible to have some reads get emitted out of order (triggering an exception in the SAMFileWriter). This is fixed by moving around some parentheses. Tim, if you are reading this: feel free to take this fix in whenever it's convenient. I.e. it's not critical as the only user who has been hit by it has a reference with over 130K short contigs. Committing in SVN so that it gets incorporated immediately (and I can respond on GS now). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6024 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-21 21:42:38 +00:00
asivache	6dd41c8489	Nway writer takes another argument: whether to create index on the fly. Realigner in NWayOut mode currently will ALWAYS create index on the fly as there seems to be no clean way to extract the requested value from argument collection in the presence of a different @Output stream. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6023 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-21 17:26:04 +00:00
asivache	78461bac1e	Default logic (and name) has changed. Now somatic mode is default one. In order to run in single-sample (unpaired) mode, one has to use (hidden) --unpaired option. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6022 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-21 17:08:41 +00:00
chartl	c5de06a641	Fixing up the RefSeqCodec so a bad entry in RefSeq (some transcripts are odd and have a negative length which may signify something special (?) ) doesn't cause failure, but issues a warning instead. Integration tests pass. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6021 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-21 14:07:58 +00:00
asivache	7c322780d3	Nway out fixed: in this mode a special nwayout sam writer is instantiated and passed to constrained manager. All the dispatching of the reads into separate output sam streams is taken care by that writer, so no other special processing is needed at the realigner/manager level. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6020 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-20 19:35:26 +00:00
ebanks	600a6a43a6	Reverting previous commit, as promised. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6019 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-20 16:30:19 +00:00
ebanks	ee18c9b0c2	Temporary commit to please those in 320: re-support the -knownsOnly argument (@Hidden). This will be reverted in a sec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6018 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-20 16:28:58 +00:00
rpoplin	e8738f95c5	This warning message scares too many people. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6016 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-20 13:43:16 +00:00
depristo	4c6d0e6143	Added stratification by discrete allele count, just like AF, but requiring genotypes so it can be exact. Added docs on wiki, and integrationtest using Kiran's very nice fundamental VCF VariantEvalWalker now passes a pointer to itself to the Stratefication setVariantEvalWalker (and assoc. get method) so that stratefications can look at VEWalker variables to obtain information necessary for their calculations, like the list of eval samples. This is a better interface, in my opinion, than the current approach of extending the base abstract Stratefication to include an initialize function that has all arguments necessarily for any Strat. JEXL expressions now provide access to the VariantContext vc object itself, so you can write JEXL's that directly use VariantContext and associated functions from the command line. ExomePostQC Queue script now creates a byAC eval using the new strat, and no longer produces a byAF file (as this was not exact, and lead to strange punctile behavior when actual AF quantization was out of sync with fix quantization of AF strat. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6015 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-19 03:11:00 +00:00
asivache	64196b6c7a	Writer implementation that can dispatch reads to multiple underlying bam files git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6013 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-17 20:44:17 +00:00
depristo	1afd24c831	SliceBams now handles properly the case where multiple read groups clash in the input BAM files git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6012 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-17 20:19:19 +00:00
ebanks	dd1d9cd76f	Forgot to deprecate the old args git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6009 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-16 17:54:44 +00:00
ebanks	4e85416af1	[Foiled yet again when trying to do this in git] Slight modifications in the argument structure for the IndelRealigner. Instead of boolean flags -knownsOnly and -doNotUseSW, we now have an enum --consensusDeterminationModel which lets you specify knowns only, also use indels in reads, or also use SW. Please note that the default behavior of IR has not changed at all (and won't for a few more days) - that'll be done in GIT (fingers crossed). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6008 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-16 17:35:37 +00:00
depristo	43fdd31e20	Significant performance optimization for reduced reads due to better algorithm for including reads in the variable regions. Fixed a critical bug that actually produced multiple copies of the same read in the variable regions with this optimization as well. Scala exploration script updated as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6005 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-16 12:54:59 +00:00
rpoplin	d7430c23f8	Bringing VQSR up to date with the 1000G v2b changes git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6000 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-14 20:23:43 +00:00
asivache	04ecbf10ab	Fixes the constraint-generated error about stop being less than start in GenomeLocParser.createGenomeLoc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5999 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-14 17:44:11 +00:00
ebanks	d00d4fd4d6	Obsolete covariate class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5993 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-14 14:11:47 +00:00
rpoplin	db43e3f1ab	Fixing an apparent parenthesis matching problem git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5986 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-13 18:52:14 +00:00
rpoplin	3534f412c9	Better error message for the case of input variants found in ApplyRecalibration that were never seen during VariantRecalibrator. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5979 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-13 14:45:28 +00:00
rpoplin	6231bba288	Bug fix for mergeInfoWithMaxAC git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5978 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-12 20:10:16 +00:00
ebanks	1f4469976e	Made into UserException with better error message git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5977 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-12 03:38:52 +00:00
rpoplin	0d6ce91614	When running CombineVariants with -mergeInfoWithMaxAC the set field will be added appropriately git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5974 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-10 14:35:48 +00:00
delangel	f8ffda6835	a) Hidden, experimental argument to UnifiedGenotyper that makes code, when in GenotypeGivenAlleles mode, ignore SNP alleles mixed in with indels in complex records - theory is that SNP sites behave statistically differently when doing VQSR so those alleles/sites should be treated separately. b) Bug fix: multiallelic indel records were not being treated properly by VQSR because vc.isIndel() returns false with them. Correct general treatment for now is to do (vc.isIndel()||vc.isMixed()). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5973 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-09 19:19:23 +00:00
rpoplin	17e17d3c3c	Misc cleanup in VQSR. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5972 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-09 18:37:37 +00:00
rpoplin	895e86c544	Annotations used to build the 1000G consensus callsets are now standard annotations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5969 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-09 17:03:39 +00:00
depristo	93d6e17762	Final, documented version of CalibrateGenotypeLikelihoods. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5966 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-08 20:22:28 +00:00
depristo	44287ea8dc	ReducedBAM changes to downsample to a fixed coverage over the variable regions. Evaluation script now has filters and eval. commands. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5965 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-08 19:36:08 +00:00
kiran	49b021d435	Changed the definition of degeneracy (it's at the site level - degeneracy of a position in a codon, not degeneracy of the amino acid itself like I initially thought. Added the ability to supply an ancestral allele track (available in /humgen/gsa-hpprojects/GATK/data/Ancestor/). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5963 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-08 15:07:31 +00:00
depristo	0f43b10c39	Optimization in CombineVariants when merging into a sites_only VCF VariantContextUtils now was a utility function that creates a sitesOnlyVariantContext from an input VC Add complex merge test of SNPs and indels from the new batch merge wiki in : http://www.broadinstitute.org/gsa/wiki/index.php/Merging_batched_call_sets with multiple alleles for an indel. Created a BatchMergeIntegrationTest that uses GGA with the complex merged input alleles to genotype SNPs and Indels with multiple alleles simultaneously in NA12878. Looks great. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5959 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-08 14:31:46 +00:00
delangel	1d6486a28f	First part of fix for correctly processing mixed multi-allelic records: correctly compute start/stop of vc when there are no null alleles (i.e. record is not a simple indel). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5958 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-08 13:36:18 +00:00
delangel	d27800e07c	a) Forgot to commit this ages ago: uncomment code to ignore hard clipped bases when computing indel likelihoods. b) First part of fix for correctly processing mixed multi-allelic records: correctly compute start/stop of vc when there are no null alleles (i.e. record is not a simple indel). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5957 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-08 11:28:17 +00:00
hanna	ad97099df6	Getting rid of a few extra, very explicit qualifications so that the public/ private bifurcation script doesn't have to discover them. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5956 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-08 03:08:47 +00:00
ebanks	bb6c0db783	We found the cause of the inconsistency. Woo hoo! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5955 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-07 15:13:58 +00:00
hanna	ca48ea78df	At Picard team's request, generate md5s for generated BAM files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5954 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-07 04:25:40 +00:00
depristo	311dfa0998	Now builds examples, as I expected. GATKPaperGenotyper lives again. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5953 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-07 00:13:44 +00:00
alecw	2901abf070	Switch from PriorityQueue to TreeSet for better and more consistent performance. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5952 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-06 20:41:30 +00:00
ebanks	2c57721ed2	Updated printouts to help with debugging. Issue does appear to be deterministic though. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5950 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-06 01:04:07 +00:00
ebanks	27dfb53d26	We really don't want to be advising the user to use an unsafe option - really, they should fix their busted bam file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5949 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-05 05:18:16 +00:00
delangel	d534241f35	Major revamp of annotations for indels: a) All rank sum tests now work for indels including multiallelic sites. For the latter cases, rank sum test is REF vs most common allele b) Redid computation of HaplotypeScore for indels. It's now trivially easy to do because we are already computing likelihoods of each read vs haplotypes in GL computation so we reuse that if available. For multiallelic case, we score against N haplotypes where N is total called alleles. Drawback is that all cases need information contained in likelihood table that stores likelihood for each pileup element, for each allele. If this table is not available we dont annotate, so we can only fully annotate indels right now when running UG but not when running VariantAnnotator alone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5947 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-04 15:34:24 +00:00
delangel	a8faacda4e	Major change to UG engine to support: a) Genotype given alleles with indels b) Genotyping and computing likelihoods of multi-allelic sites. When GGA option is enabled, indels will be called on regular pileups, not on extended pileups (extended pileups will be removed shortly in a next iteration). As a result, likelihood computation is suboptimal since we can't see reads that start with an insertion right after a position, and hence quality of some insertions is removed and we could be missing a few marginal calls, but it makes everything else much simpler. For multiallelic sites, we currently can't call them in discovery mode but we can genotype them and compute/report full PL's on them (annotation support comes in next commit). There are several suboptimal approximations made in exact model to compute this. Ideally, joint likelihood Pr(Data | AC1=i,AC2=j..) should be computed but this is hard. Instead, marginal likelihoods are computed Pr(Data | ACi=k) for all i,k, and QUAL is based on highest likelihood allele. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5941 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 22:13:07 +00:00
depristo	cd293f145b	More stable reduced reads representation. Bug fixes throughout. No diffs by <1% of sites in an exome, and the majority of these differences are filtered out, or are obvious artifacts. UnitTests for BaseCounts. BaseCounts extended to handle indels, but not yet enabled in the consensus reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5939 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 20:11:31 +00:00
ebanks	80cbc1924b	Oops, just realized that I forgot to comment my commit from yesterday so it was clear what was happening git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5938 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 18:06:41 +00:00


4416 Commits (84dd72e6cb5d4cfd4a5031a75cda250142530d75)