Commit Graph

1091 Commits (f3e63f00bc9e682444c47d5f3de048caecb50207)

Author SHA1 Message Date
aaron 0087234ed7 small code cleanup, a couple of little changes to SSGGenotypeCall
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1343 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 19:47:37 +00:00
ebanks fbc7d44bc7 don't allow users to input priors anymore; they should be using heterozygosity and having the SSG calculate priors.
Note that nothing was changed for dnSNP/hapmap priors (not sure what we want to do with these yet - any thoughts?)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1342 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 19:10:33 +00:00
ebanks b282635b05 Complete reworking of Fisher's exact test for strand bias:
- fixed math bug (pValue needs to be initialized to pCutoff, not 0)
- perform factorial calculations in log space so that huge numbers don't explode
- cache factorial calculations so that each value needs to be computed just once for any given instance of the filter

I've tested it against R and it has held up so far...



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1341 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 18:52:13 +00:00
aaron 4033c718d2 moving some code around for better organizations, some fixes to the fields out of SSG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1340 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 15:09:43 +00:00
ebanks 4366ce16e0 Made sure all RODs have a (good) toString() method - and use it in the Venn walker. (thanks, Mark)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1339 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 14:53:27 +00:00
aaron 9cd53d3273 some initial changes from the first review of the genotype redesign, more to come.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1338 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 07:04:05 +00:00
ebanks feb7238f10 Wasn't always returning the correct alt base
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1337 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 03:08:04 +00:00
hanna 5429b4d4a8 A bit of reorganization to help with more flexible output streams. Pushed construction of data
sources and post-construction validation back into the GATKEngine, leaving the MicroScheduler
to just microschedule.  


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1336 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 23:00:15 +00:00
aaron bca894ebce Adding the intial changes for the new Genotyping interface. The bullet points are:
- SSG is much simpler now
- GeliText has been added as a GenotypeWriter
- AlleleFrequencyWalker will be deleted when I untangle the AlleleMetric's dependance on it
- GenotypeLikelihoods now implements GenotypeGenerator, but could still use cleanup

There is still a lot more work to do, but this is a good initial check-in.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1335 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 19:43:59 +00:00
kiran c5c11d5d1c First attempt at modifying the VFW interfaces to support direct emission of relevant training data per feature and exclusion criterion. This way, you could run the program once, get the training sets, and then feed that training set back to the filters and have them automatically choose the optimal thresholds for themselves. This current version is pretty ugly right now...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1334 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 19:29:03 +00:00
ebanks 3554897222 allow filters to specify whether they want to work with mapping quality zero reads; the VariantFiltrationWalker passes in the appropriate contextual reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1333 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 17:38:15 +00:00
hanna 7a13647c35 Support for specifying SAMFileReaders and SAMFileWriters as @Arguments directly. *Very*
rough initial implementation, but should provide enough support so that people can stop
creating SAMFileWriters in reduceInit.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1332 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 16:11:45 +00:00
depristo 56f769f2ce Output improvements to GenotypeConcordance calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1331 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 12:54:46 +00:00
ebanks 72dda0b85c Fixed calculations for Mark
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1330 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 03:21:43 +00:00
ebanks f0378db9b7 added accuracy numbers
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1329 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 01:38:33 +00:00
ebanks a5a56f1315 At this point, we are convinced that the new priors are the way to go...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1328 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-28 17:25:25 +00:00
depristo df4fd498c5 Improvements and bug fixes galore. (1) Now properly handles Q0 bases, filtering them out, you can disable this if you need to (2) support for three-state base probabilities (see email), which is disabled by default (still experimental) but appears to be more emppowered to detect variants (see email too)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1327 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-28 13:21:38 +00:00
depristo 46643d3724 Improvements and bug fixes galore. (1) Now properly handles Q0 bases, filtering them out, you can disable this if you need to (2) support for three-state base probabilities (see email), which is disabled by default (still experimental) but appears to be more emppowered to detect variants (see email too)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1326 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-28 13:21:27 +00:00
ebanks 3c4410f104 -add basic indel metrics to variant eval
-variants need a length method (can't assume it's a SNP)!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1324 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-28 03:25:03 +00:00
kcibul 1d6d99ed9c walk by reference
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1323 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-27 20:21:04 +00:00
ebanks 089ae85be7 1. output grep-able strings for genotype eval
2. free DB coverage from isSNP restriction


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1322 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-27 17:36:59 +00:00
kcibul 1bca9409a4 calculate freestanding intervals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1321 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-27 16:40:27 +00:00
asivache 2499c09256 added minIndelCount (short: minCnt) command line argument. The call is made only if the number of reads supporting the consensus indel is equal or greater than the specified value (default: 0, so only minFraction filter is on in default runs!)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1320 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-27 15:22:51 +00:00
ebanks 73ddf21bb7 SNPs no longer fail this filter if they are actually hom in reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1319 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-27 15:20:43 +00:00
asivache f2b3fa83ac fix for another bug found by Eric: some indels were printed into the output stream twice (when there's another indel within MISMATCH_WINDOW bases and that other indel requires delayed print in order to accumulate coverage)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1318 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-27 15:07:07 +00:00
aaron f1109e9070 Added the interator to SAMDataSource to prevent seeing dupplicate reads, only in a byReads traversal. The iterator discards any reads in the current interval that would have been seen in the previous interval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1317 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-25 22:36:29 +00:00
asivache 5eca4c353c IndelGenotyper now uses GATK::getMergedReadGroupsByReaders() to sort out which read in the merged stream is for normal, and which is for tumor (in --somatic mode, apparently)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1316 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 23:01:18 +00:00
asivache a361e7b342 SAMDataSource is now exposed by GATK engine; SamFileHeaderMerger is exposed from Resources all the way up to SAMDataSource, so now we can see underlying individual readers should we need them; GATK engine has new methods getSamplesByReaders(), getLibrariesByReaders(), and getMergedReadGroupsByReaders(): each of these methods returns a list of sets, with each element (set) holding, respectively, samples, libraries, or (merged) read groups coming from an individual input bam file (so now when using multiple -I options we can still find out which of the input bams each read comes from)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1315 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 22:59:49 +00:00
hanna 2024fb3e32 Better division of responsibilities between sources and type descriptors.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1314 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 22:15:57 +00:00
asivache 64221907a2 fixed a bug found by Eric: genotyper would crash in the case of an indel too close to the window end, with the next read mapping sufficiently far away on the ref
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1313 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 21:00:31 +00:00
hanna 2db86b7829 Move the cleaned read injector test from playground to core. Remove CovariateCounterTest's dependency on the CleanedReadInjector. Start doing a bit of cleanup on the CLP's FieldParsers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1312 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 19:44:04 +00:00
hanna df44bdce7d Retire the pooled caller...its been eclipsed by other walkers in the tree.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1310 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 14:49:03 +00:00
kiran 884806fc16 Broken and unused. It goes away now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1309 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 14:26:52 +00:00
ebanks d044681fbe change paths to new ones
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1308 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 07:28:43 +00:00
ebanks 59f0c00d77 -set indel cleaning walkers to be in core package
-move Andrey's alignment utility classes to core


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1307 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 05:23:29 +00:00
kiran bb20462a7c A better way: down-scale second-base ratios until the infinities disappear. This way, high-coverage sites don't cause binomialProbability to explode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1306 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 03:02:00 +00:00
aaron 0b16253db3 an iterator to fix the problem where read-based interval traversals are getting duplicate reads because reads span the two intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1305 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 23:59:48 +00:00
kiran 7c20be157c Added ability to sample from a list *without* replacement.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1304 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 21:00:19 +00:00
kiran 038cbcf80e If the result from the secondary-base test is 0.0, replace the result with a minimum likelihood such that the log-likelihood doesn't underflow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1303 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 20:59:52 +00:00
kiran 093550a3f2 Removed secondary-base test from SingleSampleGenotyper. It now lives in the variant filtration system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1302 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 20:58:41 +00:00
ebanks 477502338f moved major indel cleaning pieces to core (yippee!)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1301 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 19:59:51 +00:00
ebanks 4efe26c59a Major: allow genotyper to optionally output in 1KG format, including outputting the samples in which indels are found.
Minor: refactor 454 filtering


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1300 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 19:53:51 +00:00
ebanks f8b1dbe3b3 getBestGenotype() does not necessarily return hets in alphabetical order;
the string (unfortunately) needs to be sorted for lookup in the table (otherwise we throw a NullPointerException)
TO DO: have the table be smarter instead of sorting each genotype string


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1298 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 01:58:47 +00:00
ebanks ee8ed534e0 print full genotype for alt allele
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1297 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 01:35:23 +00:00
hanna 298cc24524 Fix minor bug introduced in filtration, and cleaned up the artificial sam records so that they use SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX and SAMRecord.NO_ALIGNMENT_START rather than hardcoded -1's.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1296 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-22 22:37:41 +00:00
hanna cac04a407a For Manny: filter out reads where the the ref index ==
NO_ALIGNMENT_REFERENCE_INDEX but the alignment start != NO_ALIGNMENT_START.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1295 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-22 21:19:24 +00:00
depristo 9c12c02768 AlleleBalance and on/off primary base filters -- version 0.0.1 -- for experimental use only
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1294 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-22 17:54:44 +00:00
ebanks c54fd1da09 Beautify the genotype concordance printouts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1291 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-22 02:53:02 +00:00
hanna 6e4fd8db4a Better formatting of available walkers, and only output them along with help. Cleanup JVMUtils.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1290 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 22:23:28 +00:00
depristo 761d70faa1 Better printing of multiple rods -- now produces a comma-separated set of values
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1289 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 21:58:27 +00:00