Commit Graph

4751 Commits (c2ec2891d1e185b4cc0f30e2dfd18991e2837b69)

Author SHA1 Message Date
rpoplin 0d6ce91614 When running CombineVariants with -mergeInfoWithMaxAC the set field will be added appropriately
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5974 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-10 14:35:48 +00:00
delangel f8ffda6835 a) Hidden, experimental argument to UnifiedGenotyper that makes code, when in GenotypeGivenAlleles mode, ignore SNP alleles mixed in with indels in complex records - theory is that SNP sites behave statistically differently when doing VQSR so those alleles/sites should be treated separately.
b) Bug fix: multiallelic indel records were not being treated properly by VQSR because vc.isIndel() returns false with them. Correct general treatment for now is to do (vc.isIndel()||vc.isMixed()).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5973 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-09 19:19:23 +00:00
rpoplin 17e17d3c3c Misc cleanup in VQSR.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5972 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-09 18:37:37 +00:00
depristo ac3620839c Very basic integration tests for ReducedReads, to allow safe optimization of the code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5970 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-09 17:06:32 +00:00
rpoplin 895e86c544 Annotations used to build the 1000G consensus callsets are now standard annotations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5969 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-09 17:03:39 +00:00
depristo 93d6e17762 Final, documented version of CalibrateGenotypeLikelihoods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5966 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 20:22:28 +00:00
depristo 44287ea8dc ReducedBAM changes to downsample to a fixed coverage over the variable regions. Evaluation script now has filters and eval. commands.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5965 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 19:36:08 +00:00
kiran 49b021d435 Changed the definition of degeneracy (it's at the site level - degeneracy of a position in a codon, not degeneracy of the amino acid itself like I initially thought). Added the ability to supply an ancestral allele track (available in /humgen/gsa-hpprojects/GATK/data/Ancestor/).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5963 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 15:07:31 +00:00
depristo a331e13721 Slightly more extensive test includes a 0/0 site to genotype
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5961 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 14:48:55 +00:00
depristo 0f43b10c39 Optimization in CombineVariants when merging into a sites_only VCF
VariantContextUtils now has a utility function that creates a sitesOnlyVariantContext from an input VC
Add complex merge test of SNPs and indels from the new batch merge wiki in :

http://www.broadinstitute.org/gsa/wiki/index.php/Merging_batched_call_sets

with multiple alleles for an indel.  Created a BatchMergeIntegrationTest that uses GGA with the complex merged input alleles to genotype SNPs and Indels with multiple alleles simultaneously in NA12878.  Looks great.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5959 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 14:31:46 +00:00
delangel 1d6486a28f First part of fix for correctly processing mixed multi-allelic records: correctly compute start/stop of vc when there are no null alleles (i.e. record is not a simple indel).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5958 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 13:36:18 +00:00
delangel d27800e07c a) Forgot to commit this ages ago: uncomment code to ignore hard clipped bases when computing indel likelihoods. b) First part of fix for correctly processing mixed multi-allelic records: correctly compute start/stop of vc when there are no null alleles (i.e. record is not a simple indel).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5957 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 11:28:17 +00:00
hanna ad97099df6 Getting rid of a few extra, very explicit qualifications so that the public/
private bifurcation script doesn't have to discover them.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5956 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-08 03:08:47 +00:00
ebanks bb6c0db783 We found the cause of the inconsistency. Woo hoo!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5955 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-07 15:13:58 +00:00
hanna ca48ea78df At Picard team's request, generate md5s for generated BAM files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5954 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-07 04:25:40 +00:00
depristo 311dfa0998 Now builds examples, as I expected. GATKPaperGenotyper lives again.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5953 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-07 00:13:44 +00:00
alecw 2901abf070 Switch from PriorityQueue to TreeSet for better and more consistent performance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5952 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-06 20:41:30 +00:00
ebanks 2c57721ed2 Updated printouts to help with debugging. Issue does appear to be deterministic though.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5950 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-06 01:04:07 +00:00
ebanks 27dfb53d26 We really don't want to be advising the user to use an unsafe option - really, they should fix their busted bam file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5949 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-05 05:18:16 +00:00
delangel 7e49e1668f Finished changing md5's due to recent change in definition of mixed and indel vc's.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5948 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-05 00:40:51 +00:00
delangel d534241f35 Major revamp of annotations for indels:
a) All rank sum tests now work for indels including multiallelic sites. For the latter cases, rank sum test is REF vs most common allele
b) Redid computation of HaplotypeScore for indels. It's now trivially easy to do because we are already computing likelihoods of each read vs haplotypes in GL computation so we reuse that if available. For multiallelic case, we score against N haplotypes where N is total called alleles.

Drawback is that all cases need information contained in likelihood table that stores likelihood for each pileup element, for each allele. If this table is not available we don't annotate, so we can only fully annotate indels right now when running UG but not when running VariantAnnotator alone.
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5947 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-04 15:34:24 +00:00
delangel 1448a1f155 Change md5 because conversion of a tri-allelic dbsnp indel record is now legit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5946 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-04 11:24:16 +00:00
delangel 53667ce8fa Disabled test that checks whether output is the same whether in Genotype Given Alleles mode or not - it won't as long as extended events are finally removed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5945 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-04 00:52:54 +00:00
delangel 35df80de14 Updated md5 due to changes in QUAL field when in Genotype given alleles mode w/indels when in insertions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5944 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 23:52:38 +00:00
ebanks b93829e505 The underlying bam file for this test was busted for many reasons preventing Picard folks from making unrelated changes, so I needed to fix it. Updating md5s accordingly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5943 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 22:26:06 +00:00
delangel a8faacda4e Major change to UG engine to support:
a) Genotype given alleles with indels
b) Genotyping and computing likelihoods of multi-allelic sites.

When GGA option is enabled, indels will be called on regular pileups, not on extended pileups (extended pileups will be removed shortly in a next iteration). As a result, likelihood computation is suboptimal since we can't see reads that start with an insertion right after a position, and hence quality of some insertions is removed and we could be missing a few marginal calls, but it makes everything else much simpler.
For multiallelic sites, we currently can't call them in discovery mode but we can genotype them and compute/report full PL's on them (annotation support comes in next commit). There are several suboptimal approximations made in exact model to compute this. Ideally, joint likelihood Pr(Data | AC1=i,AC2=j..) should be computed but this is hard. Instead, marginal likelihoods are computed Pr(Data | ACi=k) for all i,k, and QUAL is based on highest likelihood allele. 




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5941 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 22:13:07 +00:00
depristo cd293f145b More stable reduced reads representation. Bug fixes throughout. No diffs by <1% of sites in an exome, and the majority of these differences are filtered out, or are obvious artifacts. UnitTests for BaseCounts. BaseCounts extended to handle indels, but not yet enabled in the consensus reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5939 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 20:11:31 +00:00
ebanks 80cbc1924b Oops, just realized that I forgot to comment my commit from yesterday so it was clear what was happening
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5938 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 18:06:41 +00:00
fromer e4eb8087bc A VariantContext can now be isSymbolic. More importantly, multi-allelic variants are now properly handled in determining their type [using isMixed only if any of the biallelic variant types differ between the alt alleles].
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5937 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 18:02:47 +00:00
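The type-determination rule in the commit above (report a mixed record only when the per-alt-allele types disagree) can be sketched roughly as follows. This is a minimal Python illustration, not the project's Java code: the helper names `biallelic_type` and `variant_type`, the type labels, and the simplified length-based classification are all assumptions made for the example.

```python
def biallelic_type(ref: str, alt: str) -> str:
    """Classify a single REF/ALT pair (hypothetical, simplified rules)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "SYMBOLIC"                      # e.g. <DEL>
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "INDEL"                             # lengths differ


def variant_type(ref: str, alts: list) -> str:
    """Type of a possibly multi-allelic record.

    The record is MIXED only if the biallelic types differ
    between the alt alleles; otherwise it takes the shared type.
    """
    if not alts:
        return "NO_VARIATION"
    types = {biallelic_type(ref, alt) for alt in alts}
    return types.pop() if len(types) == 1 else "MIXED"


def is_indel_like(ref: str, alts: list) -> bool:
    """Analogue of the (isIndel || isMixed) check mentioned elsewhere in this log."""
    return variant_type(ref, alts) in ("INDEL", "MIXED")
```

Under these assumptions a record with two indel alt alleles is typed as an indel, while a record combining an indel allele with a substitution allele is typed as mixed.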
depristo b4c479bcb0 Support for reducedReads in the pileup and UG. Totally experimental -- the code interface could change, and so could the implementation. Only works for SNPs now. Pileup has contracts as well.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5936 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 16:39:01 +00:00
delangel 2df12472c2 One more step in commit to support multi-allelic indel genotyping and processing: utility class that supports multi-allelic genotype likelihoods
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5935 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 16:08:29 +00:00
ebanks 420d8feff6 No one should be calling the createHeader method(s) directly, but instead should be going through the full readHeader method (because it first sets the version); therefore I made them package protected and merged them. Updated the various unit tests that were using createHeader and were dangerously assuming that the header version was defaulting to 4.0 (which it no longer does).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5934 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 02:17:37 +00:00
depristo 86df10ec09 UnitTests for ConsensusSpan infrastructure
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5929 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:44:52 +00:00
fromer 74298f6858 Basic walker to calculate statistics of CNV genotyping copy counts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5927 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:46:35 +00:00
depristo ad9dca9137 Package updated. Copyrights added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5926 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:29:27 +00:00
depristo 3d628f06f0 moved to playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5925 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:25:26 +00:00
depristo 429833c05a Intermediate commit (DVCS, where are you?) of a fully operational ReducedRead walker. Now results in minor differences in the raw calls (filtering is a different matter) in an exome but 20x less disk space than the full exome data. Changes to the UG necessary to process reduced reads are not yet committed, as they are being tested. This code is being moved to playground now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5924 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:13:31 +00:00
ebanks dd6d61c031 Adding integration test to cover the case of a read that only covers an insertion (i.e. no M in the CIGAR string).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5923 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:02:47 +00:00
ebanks d0ca6f8a9c Patch for case that a read spans only an insertion (i.e. no Ms in the CIGAR string): the end position should not be less than the start position (which is how Picard defines it) but instead should be equal to it. This is just a patch; we'll get a proper solution in at some point.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5922 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 20:40:56 +00:00
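The patch described above can be illustrated with a small sketch. It assumes 1-based inclusive coordinates and the standard SAM CIGAR operators; the function name `alignment_end` is hypothetical and this is not the actual GATK or Picard code.

```python
import re

# One CIGAR element: a length followed by an operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")


def alignment_end(start: int, cigar: str) -> int:
    """Return the 1-based inclusive end of a read on the reference.

    A read with no reference-consuming operators (e.g. only an insertion)
    would naively end at start - 1; following the patch described in the
    commit message, the end is clamped so it is never less than the start.
    """
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    naive_end = start + ref_len - 1
    return max(start, naive_end)
```

For a read whose CIGAR is only `5I`, the naive end would be one less than the start, and the clamp returns the start position instead.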
ebanks 3302a733ef Fixed docs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5920 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 16:02:14 +00:00
chartl 84c2c5d7e6 Stop running away from my commits, test modules.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5919 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 13:05:53 +00:00
chartl 092952db44 After verifying that the changes to these tests were all in the RankSum annotations, I'm committing fixes to the test md5s.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5918 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 13:01:18 +00:00
ebanks c7fe062cb7 Refactored the VCF codec classes to minimize code duplication (which happened during the VCF3/4 split). Now, both codecs extend the AbstractVCFCodec class and all shared functionality exists there. Only methods that differ between the various codecs (e.g. because FILTER strings are encoded differently) are defined in the actual codecs. While I was in there, I put in checks for invalid empty inputs in the ID, FILTER, and INFO fields.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5917 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 19:40:47 +00:00
ebanks 81d9808eea Next version of test output for non-determinism
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5916 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 19:36:56 +00:00
chartl 511cd48d7a There is an edge case ( |Set1| = 5, |Set2| = 4) where the exact p-value exceeds the range of the normal distribution we want to invert. For the edge cases, this happens exactly at the mean, and so this can be safely replaced with a z value of 0.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5915 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 17:30:09 +00:00
carneiro dcd13060e1 created wiki page for Print Reads and changed help to match wiki.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5914 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:26:32 +00:00
droazen 8f6af299d8 Remove what is hopefully the last of the evil core -> playground dependencies.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5913 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:22:35 +00:00
carneiro 8f3e8f934d added a quick option to print the first n reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5912 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 16:16:50 +00:00
chartl a79967d9af After extensive testing of MannWhitneyU:
- Verified that exact calculations do agree with R's dwilcox()
 - Verified that exact calculations do not agree with R's wilcox.test
   + This is because R does a correction, and calculates CDFs rather than PDFs (e.g. sums over dwilcox() values)
 - Can now specify MWU to calculate cumulative exact tests, rather than point probabilities
 - Z-scores are now calculated properly for exact tests
   + Previously, z-values calculated by inverting normal CDF from U-statistic PDF
   + Now both inversions are done, with a smart heuristic (biased variance) to make the point-calculated Z-value more accurate
   + Additional tests



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5911 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 15:51:27 +00:00
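The distinction drawn above between point probabilities (as from R's dwilcox()) and cumulative probabilities (sums over dwilcox() values) can be sketched with the standard counting recursion for the Mann-Whitney U statistic. This is an illustrative Python sketch assuming no ties; the function names are hypothetical and it is not the MannWhitneyU class from the codebase.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_orderings(u: int, n: int, m: int) -> int:
    """Number of orderings of n + m untied observations that give U == u."""
    if u < 0:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    # Either the largest observation comes from the first set (adding m to U)
    # or from the second set (adding nothing).
    return count_orderings(u - m, n - 1, m) + count_orderings(u, n, m - 1)


def u_point_probability(u: int, n: int, m: int) -> float:
    """Exact P(U == u): the point probability (PDF)."""
    return count_orderings(u, n, m) / comb(n + m, n)


def u_cumulative_probability(u: int, n: int, m: int) -> float:
    """Exact P(U <= u): the sum of point probabilities (CDF)."""
    return sum(u_point_probability(k, n, m) for k in range(u + 1))
```

The two quantities differ for every u below the maximum n*m, which is one reason a point-probability calculation does not match a test that reports cumulative tail probabilities.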
rpoplin 2b5683909e Updated VQSR integration tests because of the new Omni file. Fixed overflow condition in FisherStrand when the depth is too high.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5910 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 14:20:37 +00:00
hanna 6cc84c3ce2 Make the set of VariantContextAdaptors dynamic so that Andrey's MafFeature can
continue to exist and live in playground (and thus outside of the normal release
 / git release branch).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5909 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 02:54:55 +00:00
ebanks 44cb7e4980 Renaming to make grepping through the output less confusing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5908 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 19:54:44 +00:00
ebanks b75583a90b Adding debug statements for David to aid in testing the non-determinism problem. I wouldn't recommend running with --stats temporarily (or ever in fact, which is why it's @Hidden).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5907 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 19:53:59 +00:00
droazen c50d290133 Removing printf's used for debugging -- they have served their purpose.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5906 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-31 14:06:37 +00:00
delangel 0aef5c0074 Totally experimental, possibly useless annotation that logs # of MQ0 reads / total depth, TBD if VQSR can use it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5905 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-30 14:05:39 +00:00
kiran b4d379584c Commented out the generation of the GATKReport that I was using for debugging.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5903 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:15:09 +00:00
kiran 2a9c75c5ba Throw an exception if the programmer tries to access a column that doesn't exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5902 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:08:48 +00:00
kiran f3b38c0d3e Fixed a bug in my math where I assumed the genotype likelihoods were normalized to 1.0 when they in fact are not. *Now* genotypes get altered when a different genotype configuration leads to a more consistent answer with regards to inheritance constraints. There's the question of what to do when two configurations are almost equally likely - I should probably filter those events out. But currently there is no threshold on the transmission probability.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5901 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:08:05 +00:00
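The normalization issue mentioned above can be illustrated with a short sketch. It assumes the genotype likelihoods are supplied as log10-scaled values, as in the VCF GL field; the function name is hypothetical and this is not the walker's actual code.

```python
def normalize_log10_likelihoods(log10_gls):
    """Convert log10-scaled genotype likelihoods to values that sum to 1.0.

    Subtracting the maximum before exponentiating avoids underflow
    when the likelihoods are very small.
    """
    max_gl = max(log10_gls)
    unnormalized = [10 ** (gl - max_gl) for gl in log10_gls]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]
```

Comparing genotype configurations with values normalized this way avoids the assumption that the raw likelihoods already sum to 1.0.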
carneiro 5974675b43 Two intermediate commits, to work over the weekend.
ReplicationValidationWalker: Just the skeleton of what will be the implementation of the replication/validation model.
dataProcessingV2: Committing an UNTESTED implementation of BWA alignment. I am running tests on it over the weekend.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5900 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:03:08 +00:00
carneiro 69d9b5989f documenting this walker as it may be useful to others in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5899 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 21:58:51 +00:00
droazen a50c40ed05 Temporary commit to aid in investigation of recent intermittent
IndelRealignerIntegrationTest failures -- yes, it's the classic printf()
debugging technique. Will revert in a day or two once I get the data I need :)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5896 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 20:01:57 +00:00
rpoplin 2227f49220 misc cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5893 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 16:49:20 +00:00
rpoplin 9e834391fe We now skip over all covering RODs in the BQSR as intended instead of just those which can be converted into a VariantContext. All the integration tests change because of subtleties in how certain dbsnp rod records are being converted into VCs. Added integration test which uses a bed file as the list of known polymorphic sites.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5892 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 16:32:17 +00:00
depristo 8ed82e5a08 The previous version of the UG was always creating BAQ'd pileups for the underlying site QUAL calculation. This resulted in some slowdown in the code. But as far as I can tell, the code actually didn't apply the BAQ'd base quality anywhere when the BAQ field wasn't in the read, so this just saves us 20% of the runtime when BAQ isn't enabled from heading into the BAQ subsystem when we don't actually want to get the BAQ'd base qualities.
Fixed minor problem with WalkerTest for "" (for parameterization) md5s.
Added an explicit integrationtest for BAQ NONE
Now only creates the BAQ'd pileup if the useBAQPileup parameter is provided in initializeAlternateAllele.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5891 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 14:00:52 +00:00
depristo 136c8c7900 ClipReads now supports HARDCLIP_BASES, though in fact this turned out to be not necessary for my desired tests. In the process of developing the HARDCLIP mode, I added some proper ReadUtils unit tests, which would ideally be expanded to include other ReadUtil functions, as added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5890 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 11:42:22 +00:00
hanna a77ca2d36a Incorporating Guillermo's patch to eliminate compile-time dependency of (core) UG indel model
on oneoffs.  Thanks Guillermo!  We'll polish the patch when you free up a bit.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5888 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 02:22:19 +00:00
delangel 6ecbfa9013 OK, this time REALLY fix cut and paste error
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5880 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 19:47:12 +00:00
delangel efe6602827 Fix copy-paste error from previous commit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5878 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 16:02:08 +00:00
delangel 7a43673599 Bug fix: also enclose fetching FS or HRun in a try/catch block or else code will blow up if an annotation is absent (e.g. when there no evidence for a variant in a vc)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5877 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 15:00:36 +00:00
delangel f7298f4a7f First of many baby steps to redo way in which we trigger events for indel calling and to eliminate extended events: get rid of SpanningDeletions annotation for indels. It's completely useless, and even more so once we no longer trigger at extended events (because we'll trigger by definition a base before a deletion starts, so deletions present in the current pileup are not informative).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5876 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 00:49:23 +00:00
ebanks bafdd4f8f7 Ask for existence of extended pileup before grabbing it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5874 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-25 17:39:03 +00:00
ebanks 6ed71cf683 Annotation that adds a list of samples who are polymorphic at a site based on the GTs. Very useful if you are looking at rare variants among many samples, esp. in Evoker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5868 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 20:12:27 +00:00
depristo 1bd1404aa9 Sometimes md5s can be null
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5867 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 19:17:18 +00:00
depristo e582a92af6 WalkerTest now checks for valid md5s in the integrationtests themselves, so no more stray whitespace errors. Added a WalkerTestTest to ensure that bad MD5s are detected and an error thrown
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5865 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 14:34:55 +00:00
hanna 06486c134a Kill extra space in the md5.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5863 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 12:00:31 +00:00
depristo 57e4693e4c Slightly better error message when failing to create the index on the fly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5861 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 11:04:08 +00:00
depristo cf3dbfee97 Renamed variantMergeOptions to filteredRecordsMergeType, as this is really what it does. Cleaned up the wiki so that it's clear what this does, as well as included an example of how to create an intersection with CombineVariants and SelectVariants. Added integrationtests of CombineVariants with OMNI and HapMap that deal with the two ways to merge filtered/unfiltered records at the same site.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5860 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 01:54:29 +00:00
kiran 653475ce12 Now finds the most likely configuration of genotypes given the genotype likelihoods and inheritance constraints. The parental genotypes are now phased as well (the alleles are ordered as A_transmitted|A_untransmitted). Rewrote the way the transmission probability is calculated. This will probably move into core soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5859 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 01:35:40 +00:00
hanna 4bfec4c55b Reenabling E.coli ValidatingPileup with MV1994 realigned using the BWA/C bindings.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5856 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 21:32:53 +00:00
chartl c7f4674fe2 Great! Contracts is working. Fixing some misspecified ones.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5854 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 21:00:52 +00:00
hanna 5dca1e4d2e Make IntervalIntegrationTest aware of the new alignments in the MV1994.bam
testset.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5852 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 19:59:47 +00:00
chartl 7ff5375493 Removing build-killing dependency on a private package.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5851 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 18:13:15 +00:00
chartl 0b07373909 Incorporating old feedback from eric: @deprecated methods should not be @deprecated, but rather protected, and the test's package moved to where it can access those test methods.
Also allows for the slightly more awesome name "MWUnitTest"



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5850 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 18:06:05 +00:00
kiran f8f37a786d Now emits much more informative filter names and includes all of other the proper VCF header details (filter description line, tag definitions, etc.). Currently rewriting the way the transmission probability is calculated. This is shaping up to be a lovely little piece of code...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5849 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 17:50:59 +00:00
chartl 15dc632570 The U-value can be zero (edge case)
z-value can not be NaN (and can't possibly be null)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5847 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 14:15:36 +00:00
chartl 3c31007da4 Stupid brackets. How did this even compile?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5846 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 14:00:53 +00:00
chartl 480859db50 Contractified version of MannWhitneyU. Some behavior has been changed:
- Running a test when there are no observations of at least one of the sets now breaks the MWU contract
   + MWU returns Pair(Double.NaN,Double.NaN) in these instances to maintain the contract of never returning null
   + No more Double.Infinity values will appear
 - RankSumTests now probe the return values for NaNs, and don't annotate if they appear
 - For small sets where the probability is calculated recursively, the z-value is now the inversion of the error function
    and not the approximate z-value
 - UG and Annotator integration tests updated to reflect changes



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5845 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 13:57:15 +00:00
depristo b814f4bbd6 Contracts for HasGenomeLocation. BAQ iterator variables are all final. Contracts added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5844 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 02:21:59 +00:00
depristo 43057bd15c Remove Param annotation and associated broken processing code, as this was never used in the codebase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5843 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 02:21:15 +00:00
depristo d005c4bf09 GenomeLocProcessingTracker was using SimpleTimer in a non-thread safe way. No longer providing an interface to time parallel operations. Now issues warning if someone enables distributed GATK, as this is considered an unstable, experimental engine feature.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5842 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 02:10:27 +00:00
depristo a18b0152df Contracts for SimpleTimer, as well as UnitTests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5841 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 19:45:31 +00:00
depristo 0dc0d586f1 Phasing-specific utilities are now in the Phasing walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5839 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:51:35 +00:00
depristo f608ed6d5a Removed old (and unused) reporting system, now that Kiran's VE reporting system is working. Refactors dictionary creation error messages into UserExceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5836 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:42:52 +00:00
rpoplin 4e7ecbdcb2 FS values need to be jittered just like HRun
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5835 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 16:44:12 +00:00
depristo 9cc049f80f Contracted ReferenceContext. Removed deprecated accessors that aren't used in the GATK at all
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5834 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 02:41:15 +00:00
depristo d77f4ebe31 CalibrateGenotypeLikelihoods now emits a molten data set with REF and ALT alleles, so that GL calibration can be evaluated as a function of the REF/ALT bases. DigestTable is a stand-alone Rscript that digests the multi-GB molten data table into a tiny table that shows reported vs. empirical GLs, as a function of a variety of features of the data, like REF/ALT, comp GT, eval GT, and GL itself.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5833 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 14:02:30 +00:00
depristo 6a49e8df34 Significant change to the way subsetting by sample works with monomorphic sites. Now keeps the alt allele, even if a record is AC=0 after the subset. Previously, the system dropped the alt allele, which I don't think is the right behavior. If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting. See detailed information below.
Previously, if you selected a multi-sample VCF file (or, I see, one with filters) down to a smaller set of samples, and the site wasn't polymorphic in that subgroup, the alt allele was lost.  For example, when selecting down NA12878 from the OMNI, I previously received the following VCF:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Here the first two records lost the ALT allele because NA12878 is hom-ref at those sites.  My change results in a VCF that looks like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

The genotype remains unchanged, but the ALT allele is now preserved.  I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation.  This is related to the tricky issue of isPolymorphic() vs. isVariant().  

isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?
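
The distinction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Site`, `is_variant`, `is_polymorphic`, `subset`, and the `keep_alts` flag are hypothetical names, not the GATK API, and the second sample ("OTHER") is made up so that the site is polymorphic before subsetting.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Site:
    """Hypothetical stand-in for a variant record (not a GATK class)."""
    ref: str
    alts: List[str]
    genotypes: Dict[str, Tuple[int, ...]]  # sample -> allele indices, 0 = REF


def is_variant(site: Site) -> bool:
    # Is there an ALT allele on the record at all?
    return len(site.alts) > 0


def is_polymorphic(site: Site) -> bool:
    # Does any sample in the record carry a non-REF allele?
    return any(i > 0 for gt in site.genotypes.values() for i in gt)


def subset(site: Site, samples: List[str], keep_alts: bool = True) -> Site:
    kept = {s: site.genotypes[s] for s in samples}
    if keep_alts:
        # Behavior described in this commit: ALT alleles survive subsetting.
        return Site(site.ref, list(site.alts), kept)
    # Earlier behavior: drop ALT alleles no remaining sample carries,
    # then renumber the allele indices in the genotypes.
    used = sorted({i for gt in kept.values() for i in gt if i > 0})
    remap = {0: 0}
    for new, old in enumerate(used, start=1):
        remap[old] = new
    alts = [site.alts[i - 1] for i in used]
    genotypes = {s: tuple(remap[i] for i in gt) for s, gt in kept.items()}
    return Site(site.ref, alts, genotypes)


# Site modeled on 1:82154 (REF=A, ALT=G); "OTHER" is an invented het sample.
full = Site("A", ["G"], {"NA12878": (0, 0), "OTHER": (0, 1)})
new_style = subset(full, ["NA12878"], keep_alts=True)
old_style = subset(full, ["NA12878"], keep_alts=False)
```

With these definitions, `new_style` is still a variant site but is not polymorphic, whereas `old_style` has lost its ALT allele and is neither.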

In part this is complicated by the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic.  Unfortunately, I just don't think there's a consistent convention right now, but it might be worth adopting a single approach to handling this at some point.  Wiki docs updated.

Does anyone have critical infrastructure that depends on the previous convention?  Let me know so we can coordinate the change.

There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 13:59:16 +00:00
depristo 8377424089 Basic error checking to ensure incoming arguments are provided correctly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5831 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 13:43:48 +00:00
depristo e234589240 Contracts for GenomeLocParser and GenomeLoc are now fully implemented.
GenomeLocs can officially have any start/stop values from -Inf to +Inf.  Bounds w.r.t. the reference are enforced, optionally, by GenomeLocParser.  General code cleanup throughout the subsystem.

All validation code for GLs is now centralized, and all I/O systems now validate their inputs.  Because of this, the Picard interval processing code has been changed to examine whether an interval is valid, and only keep the valid intervals.  Note that the scatter/gather test was changed, because the original hg18 chr20 interval files were actually malformed (all records for some reason were on chr20).
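
As a rough sketch of the idea (not the actual GenomeLocParser code), a location can hold arbitrary coordinates while a separate, optional check enforces bounds against the reference, and interval loading keeps only the valid entries. The names `Loc`, `is_valid`, and `keep_valid`, the rule that unknown contigs and start > stop are always invalid, and the contig length used here are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Loc(NamedTuple):
    """Hypothetical location; coordinates are unconstrained on their own."""
    contig: str
    start: int
    stop: int


def is_valid(loc: Loc, contig_lengths: Dict[str, int],
             enforce_bounds: bool = True) -> bool:
    # Assumption: an unknown contig or start > stop is always invalid.
    if loc.contig not in contig_lengths:
        return False
    if loc.start > loc.stop:
        return False
    if not enforce_bounds:
        return True
    # Optional check against the reference: 1-based, inclusive coordinates.
    return loc.start >= 1 and loc.stop <= contig_lengths[loc.contig]


def keep_valid(locs: List[Loc], contig_lengths: Dict[str, int]) -> List[Loc]:
    # Keep only intervals that pass validation, dropping the rest.
    return [loc for loc in locs if is_valid(loc, contig_lengths)]


# Illustrative reference dictionary; the length is made up.
lengths = {"chr20": 1000}
intervals = [
    Loc("chr20", 1, 100),     # valid
    Loc("chr20", 900, 1200),  # runs past the end of the contig
    Loc("chr21", 1, 100),     # contig not in the reference
    Loc("chr20", -5, 10),     # starts before the contig
]
valid = keep_valid(intervals, lengths)
```

Here `valid` contains only the first interval, while `is_valid(Loc("chr20", -5, 10), lengths, enforce_bounds=False)` still passes because bounds checking is switched off.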

Many interval processing routines were moved to IntervalUtils, as this is their natural home.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5830 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 02:01:59 +00:00
kiran 3aa56037af If asked, filters out triple-het situations too (which cannot be simply phased by transmission).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5829 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-20 18:48:19 +00:00