gatk-3.8

Commit Graph

Author	SHA1	Message	Date
delangel	a8faacda4e	Major change to UG engine to support: a) Genotype given alleles with indels b) Genotyping and computing likelihoods of multi-allelic sites. When GGA option is enabled, indels will be called on regular pileups, not on extended pileups (extended pileups will be removed shortly in a next iteration). As a result, likelihood computation is suboptimal since we can't see reads that start with an insertion right after a position, and hence quality of some insertions is removed and we could be missing a few marginal calls, but it makes everything else much simpler. For multiallelic sites, we currently can't call them in discovery mode but we can genotype them and compute/report full PL's on them (annotation support comes in next commit). There are several suboptimal approximations made in exact model to compute this. Ideally, joint likelihood Pr(Data \| AC1=i,AC2=j..) should be computed but this is hard. Instead, marginal likelihoods are computed Pr(Data \| ACi=k) for all i,k, and QUAL is based on highest likelihood allele. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5941 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 22:13:07 +00:00
kshakir	4c6751ec3c	Added argument to WGP and HSP to allow more memory. Upped the WGP VQSR memory to 32g to power through the filtering whole genome. TODO: Figure out what the right amount is. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5940 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 20:48:37 +00:00
depristo	cd293f145b	More stable reduced reads representation. Bug fixes throughout. No diffs by <1% of sites in an exome, and the majority of these differences are filtered out, or are obvious artifacts. UnitTests for BaseCounts. BaseCounts extended to handle indels, but not yet enabled in the consensus reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5939 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 20:11:31 +00:00
ebanks	80cbc1924b	Oops, just realized that I forgot to comment my commit from yesterday so it was clear what was happening git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5938 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 18:06:41 +00:00
fromer	e4eb8087bc	A VariantContext can now be isSymbolic. More importantly, multi-allelic variants are now properly handled in determining their type [using isMixed only if any of the biallelic variant types differ between the alt alleles]. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5937 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 18:02:47 +00:00
depristo	b4c479bcb0	Support for reducedReads in the pileup and UG. Totally experimental -- the code interface could change, and so could the implementation. Only works for SNPs now. Pileup has contracts as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5936 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 16:39:01 +00:00
delangel	2df12472c2	One more step in commit to support multi-allelic indel genotyping and processing: utility class that supports multi-allelic genotype likelihoods git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5935 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 16:08:29 +00:00
ebanks	420d8feff6	No one should be calling the createHeader method(s) directly, but instead should be going through the full readHeader method (because it first sets the version); therefore I made them package protected and merged them. Updated the various unit tests that were using createHeader and were dangerously assuming that the header version was defaulting to 4.0 (which it no longer does). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5934 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 02:17:37 +00:00
carneiro	32ac7be86a	new name to the pipeline, it's now in core, happy to support it. ps: Can't wait for GIT ! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5933 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 23:34:54 +00:00
carneiro	a4ffae880d	Subversion crashed my intellij BADLY, so now moving the data processing pipeline to core in 2 steps. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5932 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 23:31:24 +00:00
carneiro	36db9bdcd5	Implemented and tested BWA alignment in the data processing pipeline. caveat: Right now bwa only supports one read group, so if the original file had multiple @RG lines, only the first one will be kept. (working on a solution to this) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5931 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 23:03:07 +00:00
carneiro	c85a1d9210	Implemented and tested BWA alignment in the data processing pipeline. Renamed it and moved to core. Happy to support it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5930 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 22:58:55 +00:00
depristo	86df10ec09	UnitTests for ConsensusSpan infrastructure git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5929 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 22:44:52 +00:00
fromer	ef56b48eef	Add CNV sub-dir git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5928 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 21:47:13 +00:00
fromer	74298f6858	Basic walker to calculate statistics of CNV genotyping copy counts git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5927 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 21:46:35 +00:00
depristo	ad9dca9137	Package updated. Copyrights added git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5926 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 21:29:27 +00:00
depristo	3d628f06f0	moved to playground git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5925 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 21:25:26 +00:00
depristo	429833c05a	Intermediate commit (DVCS, where are you?) of a fully operational ReducedRead walker. Now results in minor differences in the raw calls (filtering is a different matter) in an exome but 20x less disk space than the full exome data. Changes to the UG necessary to process reduced reads are not yet committed, as they are being tested. This code is being moved to playground now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5924 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 21:13:31 +00:00
ebanks	dd6d61c031	Adding integration test to cover the case of a read that only covers an insertion (i.e. no M in the CIGAR string). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5923 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 21:02:47 +00:00
ebanks	d0ca6f8a9c	Patch for case that a read spans only an insertion (i.e. no Ms in the CIGAR string): the end position should not be less than the start position (which is how Picard defines it) but instead should be equal to it. This is just a patch; we'll get a proper solution in at some point. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5922 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 20:40:56 +00:00
carneiro	355be57539	fixing the pipeline so that it still works while I'm adding support for BWA. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5921 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 19:32:28 +00:00
ebanks	3302a733ef	Fixed docs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5920 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 16:02:14 +00:00
chartl	84c2c5d7e6	Stop running away from my commits, test modules. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5919 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 13:05:53 +00:00
chartl	092952db44	After verifying that the changes to these tests were all in the RankSum annotations, I'm commiting fixes to the test md5s. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5918 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 13:01:18 +00:00
ebanks	c7fe062cb7	Refactored the VCF codec classes to minimize code duplication (which happened during the VCF3/4 split). Now, both codecs extend the AbstractVCFCodec class and all shared functionality exists there. Only methods that differ between the various codecs (e.g. because FILTER strings are encoded differently) are defined in the actual codecs. While I was in there, I put in checks for invalid empty inputs in the ID, FILTER, and INFO fields. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5917 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 19:40:47 +00:00
ebanks	81d9808eea	Next version of test output for non-determinism git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5916 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 19:36:56 +00:00
chartl	511cd48d7a	There is an edge case ( \|Set1\| = 5, \|Set2\| = 4) where the exact p-value exceeds the range of the normal distribution we want to invert. For the edge cases, this happens exactly at the mean, and so this can be safely replaced with a z value of 0.0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5915 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 17:30:09 +00:00
carneiro	dcd13060e1	created wiki page for Print Reads and changed help to match wiki. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5914 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 16:26:32 +00:00
droazen	8f6af299d8	Remove what is hopefully the last of the evil core -> playground dependencies. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5913 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 16:22:35 +00:00
carneiro	8f3e8f934d	added a quick option to print the first n reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5912 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 16:16:50 +00:00
chartl	a79967d9af	After extensive testing of MannWhitneyU: - Verified that exact calculations do agree with R's dwilcox() - Verified that exact calculations do not agree with R's wilcox.test + This is because R does a correction, and calculates CDFs rather than PDFs (e.g. sums over dwilcox() values) - Can now specify MWU to calculate cumulative exact tests, rather than point probabilities - Z-scores are now calculated properly for exact tests + Previously, z-values calculated by inverting normal CDF from U-statistic PDF + Now both inversions are done, with a smart heuristic (biased variance) to make the point-calculated Z-value more accurate + Additional tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5911 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 15:51:27 +00:00
rpoplin	2b5683909e	Updated VQSR integration tests because of the new Omni file. Fixed overflow condition in FisherStrand when the depth is too high. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5910 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 14:20:37 +00:00
hanna	6cc84c3ce2	Make the set of VariantContextAdaptors dynamic so that Andrey's MafFeature can continue to exist and live in playground (and thus outside of the normal release / git release branch). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5909 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 02:54:55 +00:00
ebanks	44cb7e4980	Renaming to make grepping through the output less confusing git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5908 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-31 19:54:44 +00:00
ebanks	b75583a90b	Adding debug statements for David to aid in testing the non-determinism problem. I wouldn't recommend running with --stats temporarily (or ever in fact, which is why it's @Hidden). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5907 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-31 19:53:59 +00:00
droazen	c50d290133	Removing printf's used for debugging -- they have served their purpose. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5906 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-31 14:06:37 +00:00
delangel	0aef5c0074	Totaly experimental, possibly useless annotation that logs # of MQ0 reads / total depth, TBD if VQSR can use it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5905 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-30 14:05:39 +00:00
kshakir	8d294dd6e6	For the snps to create combine snps and filtered indels, now using a VCF with just snps instead of vcf with snps plus unfiltered indels. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5904 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-29 04:17:18 +00:00
kiran	b4d379584c	Commented out the generation of the GATKReport that I was using for debugging. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5903 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:15:09 +00:00
kiran	2a9c75c5ba	Throw an exception if the programmer tries to access a column that doesn't exist. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5902 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:08:48 +00:00
kiran	f3b38c0d3e	Fixed a bug in my math where I assumed the genotype likelihoods were normalized to 1.0 when they in fact are not. Now genotypes get altered when a different genotype configuration leads to a more consistent answer with regards to inheritance constraints. There's the question of what to do when two configurations are almost equally likely - I should probably filter those events out. But currently there is no threshold on the transmission probability. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5901 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:08:05 +00:00
carneiro	5974675b43	Two intermediate commits, to work over the weekend. ReplicationValidationWalker: Just the skeleton of what will be the implementation of the replication/validation model. dataProcessingV2: Committing an UNTESTED implementation of BWA alignment. I am running tests on it over the weekend. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5900 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 22:03:08 +00:00
carneiro	69d9b5989f	documenting this walker as it may be useful to others in the future. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5899 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 21:58:51 +00:00
carneiro	2524216d4b	Added the R script for VQSR git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5898 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 21:56:56 +00:00
kshakir	77cae39c8e	Step towards tribble precompiled jar, support in build.xml for source with fallback to the checked in jar. Current tribble-129M.jar in SVN does not work with current version of GATK code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5897 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 21:04:27 +00:00
droazen	a50c40ed05	Temporary commit to aid in investigation of recent intermittent IndelRealignerIntegrationTest failures -- yes, it's the classic printf() debugging technique. Will revert in a day or two once I get the data I need :) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5896 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 20:01:57 +00:00
carneiro	260301016a	cleaned up the scripts and created an interval library to facilitate future reuse. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5895 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 19:35:36 +00:00
carneiro	0048f1f6d3	Lots of interval_list file utility scripts 1. findGenes : Parses the Genetic Association Database (from NIH) into an annotated 'genes of interest' interval_list file. 2. sortGenesByCoverage : combines the interval_list of the genes of interest with the report from GATK's DepthOfCoverage, generating an annotated interval_list with total and average coverages on each gene. 3. hasTheseTargets : Give it a list of targets (example: exon targets) and any interval_list (example: genes of interest) and it will generate an annotated interval_list of all the exons that are contained in the list of genes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5894 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 18:31:07 +00:00
rpoplin	2227f49220	misc cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5893 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 16:49:20 +00:00
rpoplin	9e834391fe	We now skip over all covering RODs in the BQSR as intended instead of just those which can be converted into a VariantContext. All the integration tests change because of subtleties in how certain dbsnp rod records are being converted into VCs. Added integration test which uses a bed file as the list of known polymorphic sites. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5892 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 16:32:17 +00:00
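
Commit d0ca6f8a9c above describes a patch for reads whose CIGAR contains no M operators (for example, a read that spans only an insertion): the end position is set equal to the start rather than falling below it. The following is a minimal Python sketch of that idea, not the GATK or Picard code. The function names `reference_length` and `alignment_end` are hypothetical, and the sketch assumes the usual SAM convention that only the M, D, N, = and X operators consume reference bases.

```python
import re

# CIGAR operators that consume reference bases under the SAM convention
# (assumption of this sketch; I, S, H and P do not advance on the reference).
REF_CONSUMING = set("MDN=X")


def reference_length(cigar: str) -> int:
    """Number of reference bases covered by a CIGAR string."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )


def alignment_end(start: int, cigar: str) -> int:
    """1-based inclusive end position, clamped so it is never below start.

    Without the clamp, an insertion-only read covers zero reference bases
    and would get end == start - 1.
    """
    unclamped_end = start + reference_length(cigar) - 1
    return max(unclamped_end, start)
```

With this sketch, an ordinary read such as `alignment_end(100, "10M")` ends at 109, while an insertion-only read such as `alignment_end(100, "5I")` ends at 100 instead of 99.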

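Commit a79967d9af above distinguishes exact point probabilities of the Mann-Whitney U statistic (the quantity R's `dwilcox()` reports) from cumulative probabilities obtained by summing those point values. A small Python sketch of that distinction follows, using the textbook counting recurrence for the null distribution of U. It is an illustration only, not the GATK `MannWhitneyU` implementation; the function names are hypothetical, and ties and the variance heuristic mentioned in the commit are not modelled.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_u(u: int, m: int, n: int) -> int:
    """Number of orderings of m and n untied observations that give U == u."""
    if u < 0:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    # The largest observation comes either from the first sample
    # (adding n to U) or from the second sample (adding nothing).
    return count_u(u - n, m - 1, n) + count_u(u, m, n - 1)


def point_probability(u: int, m: int, n: int) -> float:
    """Exact P(U == u) under the null hypothesis."""
    return count_u(u, m, n) / comb(m + n, m)


def cumulative_probability(u: int, m: int, n: int) -> float:
    """Exact P(U <= u), i.e. the sum of point probabilities up to u."""
    return sum(point_probability(k, m, n) for k in range(u + 1))
```

For the set sizes named in commit 511cd48d7a (5 and 4), U ranges from 0 to 20, and the counts over that range add up to `comb(9, 5)`, the total number of orderings.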
5900 Commits (a8faacda4e57d4cfea2b6f6539a7f924c5d3fab1)