depristo
311dfa0998
Now builds examples, as I expected. GATKPaperGenotyper lives again.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5953 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-07 00:13:44 +00:00
alecw
2901abf070
Switch from PriorityQueue to TreeSet for better and more consistent performance.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5952 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-06 20:41:30 +00:00
delangel
78f5309656
Intermediate commit of indel consensus VQSR script, a couple of new features added, not for general use
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5951 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-06 13:27:02 +00:00
ebanks
2c57721ed2
Updated printouts to help with debugging. Issue does appear to be deterministic though.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5950 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-06 01:04:07 +00:00
ebanks
27dfb53d26
We really don't want to be advising the user to use an unsafe option - really, they should fix their busted bam file.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5949 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-05 05:18:16 +00:00
delangel
7e49e1668f
Finished changing md5's due to recent change in definition of mixed and indel vc's.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5948 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-05 00:40:51 +00:00
delangel
d534241f35
Major revamp of annotations for indels:
...
a) All rank sum tests now work for indels including multiallelic sites. For the latter cases, rank sum test is REF vs most common allele
b) Redid computation of HaplotypeScore for indels. It's now trivially easy to do because we are already computing likelihoods of each read vs haplotypes in GL computation so we reuse that if available. For multiallelic case, we score against N haplotypes where N is total called alleles.
Drawback is that all cases need information contained in the likelihood table that stores the likelihood for each pileup element, for each allele. If this table is not available we don't annotate, so we can only fully annotate indels right now when running UG but not when running VariantAnnotator alone.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5947 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-04 15:34:24 +00:00
delangel
1448a1f155
Change md5 because conversion of a tri-allelic dbsnp indel record is now legit
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5946 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-04 11:24:16 +00:00
delangel
53667ce8fa
Disabled test that checks whether output is the same whether in Genotype Given Alleles mode or not - it won't be until extended events are finally removed.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5945 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-04 00:52:54 +00:00
delangel
35df80de14
Updated md5 due to changes in the QUAL field for insertions when in Genotype Given Alleles mode w/indels.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5944 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 23:52:38 +00:00
ebanks
b93829e505
The underlying bam file for this test was busted for many reasons preventing Picard folks from making unrelated changes, so I needed to fix it. Updating md5s accordingly.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5943 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 22:26:06 +00:00
kshakir
ac3f1be7f0
Added a samtools merge CLF.
...
Using samtools to merge the low pass bams before cleaning to avoid "Too many open files." with 1500+ bams.
Other minor cleanup as pointed out by the IntelliJ scala plugin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5942 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 22:20:38 +00:00
delangel
a8faacda4e
Major change to UG engine to support:
...
a) Genotype given alleles with indels
b) Genotyping and computing likelihoods of multi-allelic sites.
When GGA option is enabled, indels will be called on regular pileups, not on extended pileups (extended pileups will be removed shortly in a next iteration). As a result, likelihood computation is suboptimal since we can't see reads that start with an insertion right after a position, and hence quality of some insertions is reduced and we could be missing a few marginal calls, but it makes everything else much simpler.
For multiallelic sites, we currently can't call them in discovery mode but we can genotype them and compute/report full PL's on them (annotation support comes in next commit). There are several suboptimal approximations made in exact model to compute this. Ideally, joint likelihood Pr(Data | AC1=i,AC2=j..) should be computed but this is hard. Instead, marginal likelihoods are computed Pr(Data | ACi=k) for all i,k, and QUAL is based on highest likelihood allele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5941 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 22:13:07 +00:00
kshakir
4c6751ec3c
Added argument to WGP and HSP to allow more memory.
...
Upped the WGP VQSR memory to 32g to power through the filtering whole genome. TODO: Figure out what the right amount is.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5940 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 20:48:37 +00:00
depristo
cd293f145b
More stable reduced reads representation. Bug fixes throughout. Now differs at <1% of sites in an exome, and the majority of these differences are filtered out, or are obvious artifacts. UnitTests for BaseCounts. BaseCounts extended to handle indels, but not yet enabled in the consensus reads.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5939 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 20:11:31 +00:00
ebanks
80cbc1924b
Oops, just realized that I forgot to comment my commit from yesterday so it was clear what was happening
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5938 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 18:06:41 +00:00
fromer
e4eb8087bc
A VariantContext can now be isSymbolic. More importantly, multi-allelic variants are now properly handled in determining their type [using isMixed only if any of the biallelic variant types differ between the alt alleles].
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5937 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 18:02:47 +00:00
depristo
b4c479bcb0
Support for reducedReads in the pileup and UG. Totally experimental -- the code interface could change, and so could the implementation. Only works for SNPs now. Pileup has contracts as well.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5936 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 16:39:01 +00:00
delangel
2df12472c2
One more step in commit to support multi-allelic indel genotyping and processing: utility class that supports multi-allelic genotype likelihoods
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5935 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 16:08:29 +00:00
ebanks
420d8feff6
No one should be calling the createHeader method(s) directly, but instead should be going through the full readHeader method (because it first sets the version); therefore I made them package protected and merged them. Updated the various unit tests that were using createHeader and were dangerously assuming that the header version was defaulting to 4.0 (which it no longer does).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5934 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 02:17:37 +00:00
carneiro
32ac7be86a
new name to the pipeline, it's now in core, happy to support it.
...
ps: Can't wait for GIT !
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5933 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:34:54 +00:00
carneiro
a4ffae880d
Subversion crashed my intellij BADLY, so now moving the data processing pipeline to core in 2 steps.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5932 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:31:24 +00:00
carneiro
36db9bdcd5
Implemented and tested BWA alignment in the data processing pipeline.
...
caveat: Right now bwa only supports one read group, so if the original file had multiple @RG lines, only the first one will be kept. (working on a solution to this)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5931 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:03:07 +00:00
carneiro
c85a1d9210
Implemented and tested BWA alignment in the data processing pipeline.
...
Renamed it and moved to core. Happy to support it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5930 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:58:55 +00:00
depristo
86df10ec09
UnitTests for ConsensusSpan infrastructure
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5929 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:44:52 +00:00
fromer
ef56b48eef
Add CNV sub-dir
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5928 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:47:13 +00:00
fromer
74298f6858
Basic walker to calculate statistics of CNV genotyping copy counts
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5927 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:46:35 +00:00
depristo
ad9dca9137
Package updated. Copyrights added
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5926 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:29:27 +00:00
depristo
3d628f06f0
moved to playground
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5925 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:25:26 +00:00
depristo
429833c05a
Intermediate commit (DVCS, where are you?) of a fully operational ReducedRead walker. Now results in minor differences in the raw calls (filtering is a different matter) in an exome but 20x less disk space than the full exome data. Changes to the UG necessary to process reduced reads are not yet committed, as they are being tested. This code is being moved to playground now.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5924 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:13:31 +00:00
ebanks
dd6d61c031
Adding integration test to cover the case of a read that only covers an insertion (i.e. no M in the CIGAR string).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5923 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:02:47 +00:00
ebanks
d0ca6f8a9c
Patch for case that a read spans only an insertion (i.e. no Ms in the CIGAR string): the end position should not be less than the start position (which is how Picard defines it) but instead should be equal to it. This is just a patch; we'll get a proper solution in at some point.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5922 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 20:40:56 +00:00
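The end-position patch above can be illustrated with a minimal sketch. This is not the committed GATK code; it is written in Python for brevity, and the function names (`reference_length`, `picard_style_end`, `patched_end`) are hypothetical. It assumes the usual SAM convention that only the M, D, N, = and X operators consume reference bases, so an insertion-only CIGAR has reference length 0 and the start + length - 1 definition lands one base before the start.

```python
import re

# CIGAR operators that consume reference bases under the SAM convention.
REF_CONSUMING = set("MDN=X")


def reference_length(cigar):
    """Number of reference bases covered by a CIGAR string."""
    return sum(
        int(count)
        for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )


def picard_style_end(start, cigar):
    """End defined as start + reference length - 1 (can fall below start)."""
    return start + reference_length(cigar) - 1


def patched_end(start, cigar):
    """Illustrative patch: never report an end smaller than the start."""
    return max(start, picard_style_end(start, cigar))


if __name__ == "__main__":
    print(picard_style_end(100, "10I"))  # insertion-only read
    print(patched_end(100, "10I"))
    print(patched_end(100, "5M2I3M"))    # ordinary read is unaffected
```

Clamping with `max` leaves every read that has at least one reference-consuming operator untouched, which matches the commit's description of this as a patch rather than a proper solution.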
carneiro
355be57539
fixing the pipeline so that it still works while I'm adding support for BWA.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5921 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 19:32:28 +00:00
ebanks
3302a733ef
Fixed docs
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5920 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 16:02:14 +00:00
chartl
84c2c5d7e6
Stop running away from my commits, test modules.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5919 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 13:05:53 +00:00
chartl
092952db44
After verifying that the changes to these tests were all in the RankSum annotations, I'm committing fixes to the test md5s.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5918 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 13:01:18 +00:00
ebanks
c7fe062cb7
Refactored the VCF codec classes to minimize code duplication (which happened during the VCF3/4 split). Now, both codecs extend the AbstractVCFCodec class and all shared functionality exists there. Only methods that differ between the various codecs (e.g. because FILTER strings are encoded differently) are defined in the actual codecs. While I was in there, I put in checks for invalid empty inputs in the ID, FILTER, and INFO fields.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5917 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 19:40:47 +00:00
ebanks
81d9808eea
Next version of test output for non-determinism
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5916 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 19:36:56 +00:00
chartl
511cd48d7a
There is an edge case ( |Set1| = 5, |Set2| = 4) where the exact p-value exceeds the range of the normal distribution we want to invert. For the edge cases, this happens exactly at the mean, and so this can be safely replaced with a z value of 0.0
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5915 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 17:30:09 +00:00
carneiro
dcd13060e1
created wiki page for Print Reads and changed help to match wiki.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5914 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:26:32 +00:00
droazen
8f6af299d8
Remove what is hopefully the last of the evil core -> playground dependencies.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5913 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:22:35 +00:00
carneiro
8f3e8f934d
added a quick option to print the first n reads.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5912 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:16:50 +00:00
chartl
a79967d9af
After extensive testing of MannWhitneyU:
...
- Verified that exact calculations do agree with R's dwilcox()
- Verified that exact calculations do not agree with R's wilcox.test
+ This is because R does a correction, and calculates CDFs rather than PDFs (e.g. sums over dwilcox() values)
- Can now specify MWU to calculate cumulative exact tests, rather than point probabilities
- Z-scores are now calculated properly for exact tests
+ Previously, z-values were calculated by inverting the normal CDF from the U-statistic PDF
+ Now both inversions are done, with a smart heuristic (biased variance) to make the point-calculated Z-value more accurate
+ Additional tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5911 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 15:51:27 +00:00
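The distinction drawn above between point probabilities (what R's dwilcox() returns) and cumulative probabilities (sums over dwilcox() values) can be sketched as follows. This is a standalone Python illustration, not the MannWhitneyU class from the commit; the function names are hypothetical, and it assumes no ties, using the standard counting recurrence for the number of orderings that yield a given U.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_u(u, m, n):
    """Number of orderings of m + n distinct values that give statistic U = u."""
    if u < 0:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    # Either the largest value belongs to the first set (adds n to U)
    # or to the second set (adds nothing).
    return count_u(u - n, m - 1, n) + count_u(u, m, n - 1)


def point_probability(u, m, n):
    """Exact P(U == u), the analogue of a single dwilcox() value."""
    return count_u(u, m, n) / comb(m + n, m)


def cumulative_probability(u, m, n):
    """Exact P(U <= u), i.e. the sum of point probabilities up to u."""
    return sum(count_u(k, m, n) for k in range(u + 1)) / comb(m + n, m)


if __name__ == "__main__":
    m, n = 5, 4
    for u in (0, 5, 10):
        print(u, point_probability(u, m, n), cumulative_probability(u, m, n))
```

A test that reports a cumulative value will generally disagree with one that reports the point probability at the same U, which is consistent with the mismatch against wilcox.test noted in the commit (R additionally applies a correction that is not modelled here).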
rpoplin
2b5683909e
Updated VQSR integration tests because of the new Omni file. Fixed overflow condition in FisherStrand when the depth is too high.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5910 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 14:20:37 +00:00
hanna
6cc84c3ce2
Make the set of VariantContextAdaptors dynamic so that Andrey's MafFeature can
...
continue to exist and live in playground (and thus outside of the normal release
/ git release branch).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5909 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 02:54:55 +00:00
ebanks
44cb7e4980
Renaming to make grepping through the output less confusing
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5908 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 19:54:44 +00:00
ebanks
b75583a90b
Adding debug statements for David to aid in testing the non-determinism problem. I wouldn't recommend running with --stats temporarily (or ever in fact, which is why it's @Hidden).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5907 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 19:53:59 +00:00
droazen
c50d290133
Removing printf's used for debugging -- they have served their purpose.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5906 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 14:06:37 +00:00
delangel
0aef5c0074
Totally experimental, possibly useless annotation that logs # of MQ0 reads / total depth, TBD if VQSR can use it.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5905 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-30 14:05:39 +00:00
kshakir
8d294dd6e6
For the snps used to create the combined snps and filtered indels, now using a VCF with just snps instead of a VCF with snps plus unfiltered indels.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5904 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-29 04:17:18 +00:00