Commit Graph

5900 Commits (a8faacda4e57d4cfea2b6f6539a7f924c5d3fab1)

Author SHA1 Message Date
delangel a8faacda4e Major change to UG engine to support:
a) Genotype given alleles with indels
b) Genotyping and computing likelihoods of multi-allelic sites.

When the GGA option is enabled, indels are called on regular pileups, not on extended pileups (extended pileups will be removed in an upcoming iteration). As a result, likelihood computation is suboptimal since we can't see reads that start with an insertion right after a position, and hence the quality of some insertions is reduced and we could be missing a few marginal calls, but it makes everything else much simpler.
For multi-allelic sites, we currently can't call them in discovery mode but we can genotype them and compute/report full PLs on them (annotation support comes in the next commit). There are several suboptimal approximations made in the exact model to compute this. Ideally, the joint likelihood Pr(Data | AC1=i,AC2=j..) should be computed, but this is hard. Instead, marginal likelihoods Pr(Data | ACi=k) are computed for all i,k, and QUAL is based on the highest-likelihood allele.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5941 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 22:13:07 +00:00
kshakir 4c6751ec3c Added argument to WGP and HSP to allow more memory.
Upped the WGP VQSR memory to 32g to power through the filtering whole genome. TODO: Figure out what the right amount is.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5940 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 20:48:37 +00:00
depristo cd293f145b More stable reduced reads representation. Bug fixes throughout. Now differs at <1% of sites in an exome, and the majority of these differences are filtered out, or are obvious artifacts. UnitTests for BaseCounts. BaseCounts extended to handle indels, but not yet enabled in the consensus reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5939 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 20:11:31 +00:00
ebanks 80cbc1924b Oops, just realized that I forgot to comment my commit from yesterday so it was clear what was happening
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5938 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 18:06:41 +00:00
fromer e4eb8087bc A VariantContext can now be isSymbolic. More importantly, multi-allelic variants are now properly handled in determining their type [using isMixed only if any of the biallelic variant types differ between the alt alleles].
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5937 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 18:02:47 +00:00
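
The type rule described in e4eb8087bc can be illustrated with a small Python sketch. It is not the GATK Java code: the names `VariantType`, `biallelic_type` and `variant_type`, and the length-based classification of each alt allele, are assumptions made for illustration. The only point shown is that a multi-allelic site is called mixed only when its alt alleles disagree on their biallelic type.

```python
from enum import Enum
from typing import Sequence


class VariantType(Enum):
    SNP = "SNP"
    MNP = "MNP"
    INDEL = "INDEL"
    SYMBOLIC = "SYMBOLIC"
    MIXED = "MIXED"


def biallelic_type(ref: str, alt: str) -> VariantType:
    """Classify one ref/alt pair (hypothetical, simplified rules)."""
    if alt.startswith("<") and alt.endswith(">"):
        return VariantType.SYMBOLIC          # e.g. <DEL>
    if len(ref) == len(alt):
        return VariantType.SNP if len(ref) == 1 else VariantType.MNP
    return VariantType.INDEL                 # lengths differ


def variant_type(ref: str, alts: Sequence[str]) -> VariantType:
    """MIXED only if the per-alt biallelic types differ."""
    if not alts:
        raise ValueError("at least one alt allele is required")
    types = {biallelic_type(ref, alt) for alt in alts}
    return types.pop() if len(types) == 1 else VariantType.MIXED
```

With these rules, a site with ref `A` and alts `C`, `G` stays a SNP, while ref `A` with alts `C`, `AT` becomes mixed.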
depristo b4c479bcb0 Support for reducedReads in the pileup and UG. Totally experimental -- the code interface could change, and so could the implementation. Only works for SNPs now. Pileup has contracts as well.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5936 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 16:39:01 +00:00
delangel 2df12472c2 One more step in commit to support multi-allelic indel genotyping and processing: utility class that supports multi-allelic genotype likelihoods
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5935 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 16:08:29 +00:00
ebanks 420d8feff6 No one should be calling the createHeader method(s) directly, but instead should be going through the full readHeader method (because it first sets the version); therefore I made them package protected and merged them. Updated the various unit tests that were using createHeader and were dangerously assuming that the header version was defaulting to 4.0 (which it no longer does).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5934 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 02:17:37 +00:00
carneiro 32ac7be86a new name to the pipeline, it's now in core, happy to support it.
ps: Can't wait for GIT !

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5933 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:34:54 +00:00
carneiro a4ffae880d Subversion crashed my intellij BADLY, so now moving the data processing pipeline to core in 2 steps.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5932 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:31:24 +00:00
carneiro 36db9bdcd5 Implemented and tested BWA alignment in the data processing pipeline.
caveat: Right now bwa only supports one read group, so if the original file had multiple @RG lines, only the first one will be kept. (working on a solution to this)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5931 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:03:07 +00:00
carneiro c85a1d9210 Implemented and tested BWA alignment in the data processing pipeline.
Renamed it and moved to core. Happy to support it.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5930 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:58:55 +00:00
depristo 86df10ec09 UnitTests for ConsensusSpan infrastructure
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5929 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:44:52 +00:00
fromer ef56b48eef Add CNV sub-dir
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5928 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:47:13 +00:00
fromer 74298f6858 Basic walker to calculate statistics of CNV genotyping copy counts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5927 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:46:35 +00:00
depristo ad9dca9137 Package updated. Copyrights added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5926 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:29:27 +00:00
depristo 3d628f06f0 moved to playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5925 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:25:26 +00:00
depristo 429833c05a Intermediate commit (DVCS, where are you?) of a fully operational ReducedRead walker. Now results in minor differences in the raw calls (filtering is a different matter) in an exome but 20x less disk space than the full exome data. Changes to the UG necessary to process reduced reads are not yet committed, as they are being tested. This code is being moved to playground now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5924 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:13:31 +00:00
ebanks dd6d61c031 Adding integration test to cover the case of a read that only covers an insertion (i.e. no M in the CIGAR string).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5923 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:02:47 +00:00
ebanks d0ca6f8a9c Patch for case that a read spans only an insertion (i.e. no Ms in the CIGAR string): the end position should not be less than the start position (which is how Picard defines it) but instead should be equal to it. This is just a patch; we'll get a proper solution in at some point.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5922 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 20:40:56 +00:00
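
A minimal Python sketch of the situation patched in d0ca6f8a9c, assuming the usual convention that the alignment end is `start + reference_length - 1`. The function names are hypothetical and this is not the GATK or Picard code. For a read whose CIGAR has no reference-consuming operators (for example `5I`), that formula gives an end one less than the start, and the patch described above clamps it to the start.

```python
import re

# One CIGAR element: a length followed by an operator.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume reference bases.
_REF_CONSUMING = set("MDN=X")


def reference_length(cigar: str) -> int:
    """Number of reference bases covered by the CIGAR string."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar)
               if op in _REF_CONSUMING)


def unpatched_end(start: int, cigar: str) -> int:
    """End position as start + reference length - 1."""
    return start + reference_length(cigar) - 1


def patched_end(start: int, cigar: str) -> int:
    """End position that is never less than the start."""
    return max(start, unpatched_end(start, cigar))
```

For a read at position 100 with CIGAR `5I`, `unpatched_end` gives 99 and `patched_end` gives 100; reads with at least one `M` are unaffected.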
carneiro 355be57539 fixing the pipeline so that it still works while I'm adding support for BWA.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5921 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 19:32:28 +00:00
ebanks 3302a733ef Fixed docs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5920 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 16:02:14 +00:00
chartl 84c2c5d7e6 Stop running away from my commits, test modules.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5919 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 13:05:53 +00:00
chartl 092952db44 After verifying that the changes to these tests were all in the RankSum annotations, I'm committing fixes to the test md5s.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5918 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 13:01:18 +00:00
ebanks c7fe062cb7 Refactored the VCF codec classes to minimize code duplication (which happened during the VCF3/4 split). Now, both codecs extend the AbstractVCFCodec class and all shared functionality exists there. Only methods that differ between the various codecs (e.g. because FILTER strings are encoded differently) are defined in the actual codecs. While I was in there, I put in checks for invalid empty inputs in the ID, FILTER, and INFO fields.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5917 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 19:40:47 +00:00
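
The shape of the refactoring in c7fe062cb7 can be sketched in Python as follows. Only the name `AbstractVCFCodec` comes from the commit message; the subclass names, the method names, and the specific FILTER encodings (`0` for a passing record in VCF3 and `PASS` in VCF4) are assumptions for illustration, not the actual Java classes. Shared parsing and the empty-field checks live in the base class, and only the FILTER decoding differs per codec.

```python
from abc import ABC, abstractmethod
from typing import List


class AbstractVCFCodec(ABC):
    """Shared record parsing; subclasses supply only what differs."""

    def parse_record(self, line: str) -> dict:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            raise ValueError("a VCF record needs at least 8 columns")
        chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
        # Reject invalid empty inputs in ID, FILTER and INFO.
        for name, value in (("ID", vid), ("FILTER", filt), ("INFO", info)):
            if value == "":
                raise ValueError(f"empty {name} field")
        return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
                "alt": alt.split(","), "filters": self.parse_filters(filt)}

    @abstractmethod
    def parse_filters(self, filter_string: str) -> List[str]:
        """Return the list of failed filters (empty if the record passed)."""


class VCF3Codec(AbstractVCFCodec):
    def parse_filters(self, filter_string: str) -> List[str]:
        # Assumed VCF3 convention: "0" marks a passing record.
        return [] if filter_string == "0" else filter_string.split(";")


class VCF4Codec(AbstractVCFCodec):
    def parse_filters(self, filter_string: str) -> List[str]:
        # Assumed VCF4 convention: "PASS" marks a passing record.
        # Handling of "." (filters not applied) is omitted in this sketch.
        return [] if filter_string == "PASS" else filter_string.split(";")
```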
ebanks 81d9808eea Next version of test output for non-determinism
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5916 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 19:36:56 +00:00
chartl 511cd48d7a There is an edge case ( |Set1| = 5, |Set2| = 4) where the exact p-value exceeds the range of the normal distribution we want to invert. For the edge cases, this happens exactly at the mean, and so this can be safely replaced with a z value of 0.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5915 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 17:30:09 +00:00
carneiro dcd13060e1 created wiki page for Print Reads and changed help to match wiki.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5914 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:26:32 +00:00
droazen 8f6af299d8 Remove what is hopefully the last of the evil core -> playground dependencies.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5913 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:22:35 +00:00
carneiro 8f3e8f934d added a quick option to print the first n reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5912 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:16:50 +00:00
chartl a79967d9af After extensive testing of MannWhitneyU:
 - Verified that exact calculations do agree with R's dwilcox()
 - Verified that exact calculations do not agree with R's wilcox.test
   + This is because R does a correction, and calculates CDFs rather than PDFs (e.g. sums over dwilcox() values)
 - Can now specify MWU to calculate cumulative exact tests, rather than point probabilities
 - Z-scores are now calculated properly for exact tests
   + Previously, z-values were calculated by inverting the normal CDF from the U-statistic PDF
   + Now both inversions are done, with a smart heuristic (biased variance) to make the point-calculated Z-value more accurate
   + Additional tests



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5911 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 15:51:27 +00:00
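
The difference between a point probability and a cumulative probability in the MannWhitneyU commit above can be shown with a short Python sketch of the exact null distribution of U. This uses the standard counting recursion and is not the GATK implementation; the function names are hypothetical. The point probability is P(U = u), and the cumulative probability is the sum of point probabilities up to u.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_u(u: int, m: int, n: int) -> int:
    """Number of orderings of m + n distinct values that give U == u."""
    if u < 0 or u > m * n:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    # The largest value comes either from the first sample (adds n to U)
    # or from the second sample (adds nothing).
    return count_u(u - n, m - 1, n) + count_u(u, m, n - 1)


def point_probability(u: int, m: int, n: int) -> float:
    """Exact P(U == u) under the null hypothesis."""
    return count_u(u, m, n) / comb(m + n, m)


def cumulative_probability(u: int, m: int, n: int) -> float:
    """Exact P(U <= u) under the null hypothesis."""
    return sum(count_u(k, m, n) for k in range(u + 1)) / comb(m + n, m)
```

For two samples of size 2 the counts for U = 0..4 are 1, 1, 2, 1, 1 out of 6 orderings, so the point probability at U = 2 is 2/6 while the cumulative probability is 4/6.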
rpoplin 2b5683909e Updated VQSR integration tests because of the new Omni file. Fixed overflow condition in FisherStrand when the depth is too high.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5910 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 14:20:37 +00:00
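
The commit above does not say how the FisherStrand overflow was fixed. One common way to avoid overflow at high depth, shown here as a Python sketch under that assumption, is to compute the probability of a 2x2 strand table in log space with `lgamma` instead of multiplying factorials directly. The function names are hypothetical.

```python
from math import exp, lgamma


def log_factorial(n: int) -> float:
    """log(n!) computed via the log-gamma function."""
    return lgamma(n + 1)


def log_table_probability(a: int, b: int, c: int, d: int) -> float:
    """Log of the hypergeometric probability of the table [[a, b], [c, d]]
    with its row and column totals held fixed."""
    n = a + b + c + d
    return (log_factorial(a + b) + log_factorial(c + d)
            + log_factorial(a + c) + log_factorial(b + d)
            - log_factorial(n)
            - log_factorial(a) - log_factorial(b)
            - log_factorial(c) - log_factorial(d))


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Probability of the table; may underflow to 0.0 but never overflows."""
    return exp(log_table_probability(a, b, c, d))
```

The log value stays finite even for tables with tens of thousands of reads, where the factorials themselves would not fit in a double.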
hanna 6cc84c3ce2 Make the set of VariantContextAdaptors dynamic so that Andrey's MafFeature can continue to exist and live in playground (and thus outside of the normal release / git release branch).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5909 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 02:54:55 +00:00
ebanks 44cb7e4980 Renaming to make grepping through the output less confusing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5908 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 19:54:44 +00:00
ebanks b75583a90b Adding debug statements for David to aid in testing the non-determinism problem. I wouldn't recommend running with --stats temporarily (or ever in fact, which is why it's @Hidden).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5907 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 19:53:59 +00:00
droazen c50d290133 Removing printf's used for debugging -- they have served their purpose.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5906 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 14:06:37 +00:00
delangel 0aef5c0074 Totally experimental, possibly useless annotation that logs # of MQ0 reads / total depth, TBD if VQSR can use it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5905 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-30 14:05:39 +00:00
kshakir 8d294dd6e6 When combining SNPs and filtered indels, the SNP input is now a VCF with just SNPs instead of a VCF with SNPs plus unfiltered indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5904 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-29 04:17:18 +00:00
kiran b4d379584c Commented out the generation of the GATKReport that I was using for debugging.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5903 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:15:09 +00:00
kiran 2a9c75c5ba Throw an exception if the programmer tries to access a column that doesn't exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5902 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:08:48 +00:00
kiran f3b38c0d3e Fixed a bug in my math where I assumed the genotype likelihoods were normalized to 1.0 when in fact they are not. *Now* genotypes get altered when a different genotype configuration leads to an answer more consistent with the inheritance constraints. There's the question of what to do when two configurations are almost equally likely - I should probably filter those events out. But currently there is no threshold on the transmission probability.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5901 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:08:05 +00:00
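
The bug fixed in f3b38c0d3e was an assumption that genotype likelihoods sum to 1.0. A minimal Python sketch of the normalization step follows, assuming the likelihoods are stored as log10 values (the commit does not say how they are stored, and the function name is hypothetical). Subtracting the maximum before exponentiating keeps very negative values from underflowing to zero.

```python
from typing import List, Sequence


def normalize_log10_likelihoods(log10_likelihoods: Sequence[float]) -> List[float]:
    """Convert log10 likelihoods into probabilities that sum to 1.0."""
    if not log10_likelihoods:
        raise ValueError("at least one likelihood is required")
    max_log = max(log10_likelihoods)
    # Rescale so the largest term is 10**0 == 1.0 before summing.
    scaled = [10.0 ** (x - max_log) for x in log10_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]
```

Comparing genotype configurations on these normalized values, rather than on the raw likelihoods, makes products across family members comparable.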
carneiro 5974675b43 Two intermediate commits, to work over the weekend.
ReplicationValidationWalker: Just the skeleton of what will be the implementation of the replication/validation model.
dataProcessingV2: Committing an UNTESTED implementation of BWA alignment. I am running tests on it over the weekend.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5900 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:03:08 +00:00
carneiro 69d9b5989f documenting this walker as it may be useful to others in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5899 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 21:58:51 +00:00
carneiro 2524216d4b Added the R script for VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5898 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 21:56:56 +00:00
kshakir 77cae39c8e Step towards tribble precompiled jar, support in build.xml for source with fallback to the checked in jar.
Current tribble-129M.jar in SVN does not work with current version of GATK code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5897 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 21:04:27 +00:00
droazen a50c40ed05 Temporary commit to aid in investigation of recent intermittent
IndelRealignerIntegrationTest failures -- yes, it's the classic printf()
debugging technique. Will revert in a day or two once I get the data I need :)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5896 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 20:01:57 +00:00
carneiro 260301016a cleaned up the scripts and created an interval library to facilitate future reuse.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5895 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 19:35:36 +00:00
carneiro 0048f1f6d3 Lots of interval_list file utility scripts
1. findGenes : Parses the Genetic Association Database (from NIH) into an annotated 'genes of interest' interval_list file.
2. sortGenesByCoverage : Combines the interval_list of the genes of interest with the report from GATK's DepthOfCoverage, generating an annotated interval_list with total and average coverages on each gene.
3. hasTheseTargets : Give it a list of targets (example: exon targets) and any interval_list (example: genes of interest) and it will generate an annotated interval_list of all the exons that are contained in the list of genes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5894 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 18:31:07 +00:00
rpoplin 2227f49220 misc cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5893 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 16:49:20 +00:00
rpoplin 9e834391fe We now skip over all covering RODs in the BQSR as intended instead of just those which can be converted into a VariantContext. All the integration tests change because of subtleties in how certain dbsnp rod records are being converted into VCs. Added integration test which uses a bed file as the list of known polymorphic sites.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5892 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 16:32:17 +00:00