Ryan Poplin
cd9bf1bfc3
Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs.
2012-04-10 00:22:40 -04:00
Roger Zurawicki
9ece93ae9c
DiagnoseTargets now outputs a VCF file
...
- refactored the statistics classes
- concurrent callable statuses by sample are now available.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-04-09 16:40:20 -04:00
Guillermo del Angel
719ec9144a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-09 14:53:19 -04:00
Guillermo del Angel
550179a1f7
Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors.
2012-04-09 14:53:05 -04:00
Eric Banks
d312fcdae8
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-09 14:28:49 -04:00
Eric Banks
f82986ee62
Adding unit tests for the very important log10sumLog10 util method.
2012-04-09 14:28:25 -04:00
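For readers unfamiliar with the utility named in f82986ee62, the following is a minimal Python sketch of what a log10-sum-of-log10 routine typically computes: log10 of a sum of values that are themselves stored as log10. It is not the GATK Java implementation; the function name and the max-subtraction approach are assumptions chosen for illustration.

```python
import math

def log10_sum_log10(log10_values):
    """Return log10(sum(10**v for v in log10_values)) while avoiding underflow.

    Hypothetical illustration only; not the GATK method itself.
    """
    if not log10_values:
        return float("-inf")
    max_value = max(log10_values)
    if max_value == float("-inf"):
        # Every term is zero in real space.
        return float("-inf")
    # Re-center on the largest value so the biggest term becomes 10**0 == 1.
    total = sum(10.0 ** (v - max_value) for v in log10_values)
    return max_value + math.log10(total)
```

Re-centering on the maximum is what keeps very small probabilities such as 10^-1000 from underflowing to zero before they are summed.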
Mark DePristo
63b080e353
Added index for end_time
2012-04-09 14:11:32 -04:00
Eric Banks
ea4300d583
Refactoring so that Unified Argument Collection doesn't use deprecated classes.
2012-04-09 13:45:17 -04:00
Eric Banks
6ddf2170b6
More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice.
2012-04-09 11:46:16 -04:00
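The cache-and-collapse idea described in 6ddf2170b6 can be sketched roughly as follows. This is an assumption-laden Python illustration, not the committed Java code: the class name `Log10SumCache`, its methods, and the default capacity are all hypothetical.

```python
import math

class Log10SumCache:
    """Running log10 sum backed by a fixed-size, pre-allocated buffer (hypothetical)."""

    def __init__(self, capacity=1000):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.values = [0.0] * capacity  # allocated once, reused afterwards
        self.size = 0

    def add(self, log10_value):
        if self.size == len(self.values):
            # Cache is full: collapse to one value and store it at the front.
            self.values[0] = self.total()
            self.size = 1
        self.values[self.size] = log10_value
        self.size += 1

    def total(self):
        """log10 of the sum of everything added so far."""
        if self.size == 0:
            return float("-inf")
        used = self.values[:self.size]
        max_value = max(used)
        if max_value == float("-inf"):
            return float("-inf")
        # Sum in real space after re-centering, then undo the re-centering.
        real_sum = sum(10.0 ** (v - max_value) for v in used)
        return max_value + math.log10(real_sum)
```

The point of the design is that memory stays bounded by the cache capacity regardless of how many posteriors are accumulated.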
Mauricio Carneiro
87e6bea6c1
Adding engine capability to quantize qualities.
...
* Added parameter -qq to quantize qualities using a recalibration report
* Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
* Updated BQSR scripts to make use of the new parameters
2012-04-08 21:07:51 -04:00
Mauricio Carneiro
c055752583
Updated analyze covariates R script for new GATKReport format
2012-04-06 17:08:09 -04:00
Mark DePristo
c22a66870c
Modified UnitTests to respect reference padding
2012-04-06 16:27:20 -04:00
Mark DePristo
3e29841776
Major improvements to VariantCallQC and PostCallingQC scripts
...
-- Now includes one bp and tandem duplication stratifications by default. The indel QC process is getting much better now.
-- Lots of refactoring to make the plots cleaner and easier to digest
2012-04-06 16:07:47 -04:00
Mark DePristo
45fc0ea98d
Improvements to indel analysis capabilities of VariantEval
...
-- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites
-- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly:
// - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a
// downstream frameshift, if we make the simplifying assumptions that 3 bp ins
// and 3bp del (adding/subtracting 1 AA in general) are roughly comparably
// selected against, we should see a consistent 1+2 : 3 bp ratio for insertions
// as for deletions, and certainly would expect consistency between in/dels that
// multiple methods find and in/dels that are unique to one method (since deletions
// are more common and the artifacts differ, it is probably worth looking at the totals,
// overlaps and ratios for insertions and deletions separately in the methods
// comparison and in this case don't even need to make the simplifying in = del functional assumption
-- Added a new VEW argument to bind a gold standard track
-- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do
-- Deleted random unused functions in IndelUtils
2012-04-06 16:07:46 -04:00
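The 1 + 2 : 3 bp ratio from 45fc0ea98d can be illustrated with a short Python sketch. The function names and the convention of signed lengths (positive for insertions, negative for deletions) are assumptions made for this example; they are not taken from VariantEval.

```python
def one_plus_two_to_three_ratio(lengths):
    """Ratio of 1 bp and 2 bp events to 3 bp events, or None if there are no 3 bp events."""
    frameshifting = sum(1 for n in lengths if n in (1, 2))
    in_frame = sum(1 for n in lengths if n == 3)
    return frameshifting / in_frame if in_frame else None

def ratios_by_type(signed_lengths):
    """Compute the ratio separately for insertions (> 0) and deletions (< 0)."""
    insertions = [n for n in signed_lengths if n > 0]
    deletions = [-n for n in signed_lengths if n < 0]
    return (one_plus_two_to_three_ratio(insertions),
            one_plus_two_to_three_ratio(deletions))
```

Under the reasoning quoted in the commit message, the two returned ratios would be expected to be similar for a clean call set, so a large gap between them may point to method-specific artifacts.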
Mark DePristo
52ef4a3e26
Function to compute whether a VariantContext indel is part of a TandemRepeat
...
Returns true iff VC is a non-complex indel where every allele represents an expansion or
contraction of a series of identical bases in the reference.
The logic of this function is pretty simple. Take all of the non-null alleles in VC. For
each insertion allele of n bases, check if that allele matches the next n reference bases.
For each deletion allele of n bases, check if this matches reference bases n through 2n,
as it must necessarily match the first n bases. If this test returns true for all
alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the
base differences between the ref and alt alleles
2012-04-06 16:07:46 -04:00
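The tandem-repeat logic described in 52ef4a3e26 can be sketched in Python as below. This is an illustration under stated assumptions, not the committed Java function: `ref_forward` is assumed to hold the reference bases starting immediately after the indel's anchor position, and alleles are assumed to be passed as the bare inserted or deleted bases without a padding base.

```python
def is_tandem_repeat(ref_forward, insertions=(), deletions=()):
    """Return True iff every allele expands or contracts a repeat in the reference.

    ref_forward: reference bases following the indel position (assumed layout).
    insertions / deletions: inserted or deleted base strings (hypothetical inputs).
    """
    if not insertions and not deletions:
        return False
    for inserted in insertions:
        n = len(inserted)
        # An insertion must copy the next n reference bases.
        if ref_forward[:n] != inserted:
            return False
    for deleted in deletions:
        n = len(deleted)
        # The deleted bases are the first n reference bases by definition,
        # so the informative comparison is against the following n bases.
        if ref_forward[:n] != deleted or ref_forward[n:2 * n] != deleted:
            return False
    return True
```

For example, with `ref_forward = "CACACAGT"`, inserting or deleting `"CA"` counts as a tandem repeat event, while inserting `"GG"` does not.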
Mark DePristo
08fab49d30
Added function to get bases from the current base forward in the window in ReferenceContext
2012-04-06 16:07:46 -04:00
Ryan Poplin
d1f2fe956f
Changes to how active region triggering works around large insertions that manifest as high quality soft clips in read on one end of the insertion and low quality soft clips on the other end. The active region will now be large enough to capture the entire insertion. Need to somehow include unmapped reads with mapped mates to increase power for these events.
2012-04-06 13:17:41 -04:00
Ryan Poplin
c77104b815
Adding function call in HaplotypeCaller right before the VariantContext gets written out to disk which partitions all the reads by which allele gave the read the highest likelihood. This will allow variants to be annotated by the refactored VariantAnnotator. Uninformative reads are mapped to Allele.NO_CALL
2012-04-06 00:22:52 -04:00
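The read partitioning described in c77104b815 might look roughly like the Python sketch below. The input layout (a dict of per-read log10 likelihoods), the `min_gap` threshold, and the rule that a tie between the top two alleles makes a read uninformative are all assumptions for illustration; the commit does not spell out how HaplotypeCaller defines an uninformative read.

```python
NO_CALL = "."  # stand-in for Allele.NO_CALL

def partition_reads_by_allele(read_likelihoods, min_gap=0.0):
    """Group reads under the allele that gave them the highest likelihood.

    read_likelihoods: {read_name: {allele: log10_likelihood}} (hypothetical layout).
    A read whose best allele does not beat the runner-up by more than min_gap
    is treated as uninformative and filed under NO_CALL (an assumption).
    """
    partitions = {}
    for read, per_allele in read_likelihoods.items():
        ranked = sorted(per_allele.items(), key=lambda item: item[1], reverse=True)
        if not ranked:
            best_allele = NO_CALL
        elif len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= min_gap:
            best_allele = NO_CALL
        else:
            best_allele = ranked[0][0]
        partitions.setdefault(best_allele, []).append(read)
    return partitions
```

A downstream annotator could then compute per-allele statistics by iterating over each key's read list.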
Mauricio Carneiro
a19c27297f
continuing the BQSR triage...
...
* fixed the loading of the new reduced size reports
* reduced BQSR scala script memory to 2Gb
* removed dcov parameter from BQSR scala script
* fixed estimatedQReported calculation from -log10(pe) to -10*log10(pe).
* updated md5's with the proper PHRED scaled EstimatedQReported
2012-04-05 14:34:15 -04:00
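The estimatedQReported fix listed in a19c27297f comes down to the standard Phred scaling, Q = -10 * log10(pe). A minimal Python sketch follows; the function name is hypothetical and this is not the BQSR code itself.

```python
import math

def estimated_q_reported(error_probability):
    """Phred-scaled quality for an error probability in (0, 1]."""
    return -10.0 * math.log10(error_probability)
```

With an error probability of 0.001 this gives approximately Q30, whereas the unscaled expression -log10(pe) would give 3.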
Guillermo del Angel
14f6b9cd16
Added two simple integration tests for pool caller with mtDNA calling - more test coverage to follow
2012-04-05 13:06:11 -04:00
Eric Banks
3561056a9c
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-05 10:49:26 -04:00
Eric Banks
5c3ddec4c2
Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update.
2012-04-05 10:49:08 -04:00
Mauricio Carneiro
7c3b3650bb
BQSR bug triage
...
* fixed bug where some keys were using the same recal datum objects
* fixed quantization qual calculations when combining multiple reports
* fixed rounding error with empirical quality reported when combining reports
* fixed combine routine in the gatk reports due to the primary keys being out of order
* added auto-recalibration option to BQSR scala script
* reduced the size of the recalibration report by ~15%
* updated md5's
2012-04-05 09:32:18 -04:00
Eric Banks
2c956efa53
Minor fixups to GenotypeLikelihoods
2012-04-05 09:14:37 -04:00
Mauricio Carneiro
1e65474fec
Added utility to get the reference coordinate given the read coordinate
2012-04-05 09:04:20 -04:00
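A read-to-reference coordinate lookup of the kind mentioned in 1e65474fec is usually done by walking the CIGAR. The Python sketch below is a generic illustration, not the GATK utility: the function name, the `(length, op)` tuple representation, the 0-based read offset, and the choice to return `None` for inserted or soft-clipped bases are all assumptions.

```python
def read_to_reference_coordinate(alignment_start, cigar, read_offset):
    """Map a 0-based read offset to a reference position by walking the CIGAR.

    cigar: list of (length, op) tuples, e.g. [(3, "M"), (2, "I"), (4, "M")].
    Returns None when the base has no reference position (assumed behavior).
    """
    ref_pos = alignment_start
    read_pos = 0
    for length, op in cigar:
        if op in ("M", "=", "X"):      # consumes both read and reference
            if read_offset < read_pos + length:
                return ref_pos + (read_offset - read_pos)
            ref_pos += length
            read_pos += length
        elif op in ("I", "S"):         # consumes read only
            if read_offset < read_pos + length:
                return None
            read_pos += length
        elif op in ("D", "N"):         # consumes reference only
            ref_pos += length
        # "H" and "P" consume neither read nor reference bases.
    return None
```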
Guillermo del Angel
82cad97caf
Fix bad refactoring in previous commit: correct renaming from VariantContextUtils.assignGenotypes to VariantContextUtils.assignDiploidGenotypes
2012-04-04 21:03:10 -04:00
Guillermo del Angel
6913710e89
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-04 20:17:18 -04:00
Guillermo del Angel
f2aafaa3f6
More cosmetic fixes/cleanups of pool caller: use utility functions from List class instead of trying to reinvent the wheel
2012-04-04 20:16:53 -04:00
Mark DePristo
76e4100d89
By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots
...
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
Guillermo del Angel
820216dc68
More pool caller cleanups: move common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation
2012-04-04 16:23:10 -04:00
Ryan Poplin
bfad26353a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-04 16:04:50 -04:00
Ryan Poplin
dda2173c66
Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned.
2012-04-04 16:04:29 -04:00
Mark DePristo
1473e166f7
Update VariantCallQC script
...
-- Fixed indel length histogram to work with new table style
-- Removed unused expectations list
-- by AC plots for indels include log10(n_Indels) as their weight
-- smoothing is now weighted by the log10 count of n_Indels
2012-04-04 15:37:44 -04:00
Mark DePristo
2cc1e8d871
PostCallingQC uses IndelLengthHistogram now
2012-04-04 15:37:43 -04:00
Mark DePristo
fcdd65a0f4
Bugfix for IndelLengthHistogram
...
-- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.
2012-04-04 15:37:43 -04:00
Mark DePristo
3593996a87
G1K summary table needs to use the -keepAC0 flag
...
-- AC = 0 sites look about as good as singletons, and are likely only AC 0 because they cannot be easily imputed. We keep them in our counting.
2012-04-04 15:37:43 -04:00
Mark DePristo
1ccea866d8
VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
...
-- Updated EvalModules to work with new parameter
-- adding test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Guillermo del Angel
15e26fec04
Many cosmetic fixes to pool caller (not done yet): better docs, some code reorganization, change iterator inside PoolGenotypeLikelihoods so that we store conformations and convert to/from PL vectors in the same order as the non-pool case (it used to be flipped), to maintain better legibility. Improved unit tests (not done yet)
2012-04-04 15:08:19 -04:00
Eric Banks
9e32a975f8
Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore.
2012-04-04 13:47:59 -04:00
Eric Banks
337ff7887a
When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.
2012-04-04 10:57:05 -04:00
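The END-tag handling described in 337ff7887a can be sketched in Python as follows. This operates on a raw VCF INFO string for illustration; the function name and the fallback of `pos + len(ref) - 1` when END is absent are assumptions, not the VariantContext construction code.

```python
def variant_stop(pos, ref, info):
    """Return the stop position of a VCF record.

    Uses the END key from the INFO column when present; otherwise falls back
    to the span of the reference allele (assumed fallback).
    """
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "END" and value:
            return int(value)
    return pos + len(ref) - 1
```

A symbolic deletion at position 100 with `INFO=SVTYPE=DEL;END=250` would then span 100 to 250, so an interval such as `-L chr:200-210` overlaps it even though the REF allele is a single base.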
Guillermo del Angel
1248f9025d
a) Clean up and fix bug in PoolGenotypeLikelihoodsUnitTest. b) Cosmetic fixes to PoolAFCalculationModel: don't print PL vectors per pool if they're too long, or else VCFs are too hard to make sense of
2012-04-03 21:27:53 -04:00
Guillermo del Angel
05d8400468
Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)
2012-04-03 20:51:24 -04:00
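To show why the allele-count convention in 05d8400468 matters, here is a Python sketch of the usual combinatorial count of genotype likelihoods given the total number of alleles (reference included). The function name and the `ploidy` parameter are hypothetical; this is not the signature of GenotypeLikelihoods.calcNumLikelihoods.

```python
import math

def num_genotype_likelihoods(total_alleles, ploidy=2):
    """Number of unordered genotypes for the given allele count and ploidy.

    total_alleles counts the reference allele plus all alternate alleles.
    """
    return math.comb(total_alleles + ploidy - 1, ploidy)
```

For a diploid biallelic site (2 total alleles) this gives 3 likelihoods; passing the number of alternate alleles (1) by mistake would give 1, which is the kind of mismatch that breaks tests.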
Guillermo del Angel
5a10f173ea
Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow)
2012-04-03 18:55:52 -04:00
Guillermo del Angel
61e1ec6cdd
More bug fixing PoolAFCalculationModel
2012-04-03 18:12:39 -04:00
Guillermo del Angel
baad840598
Bug fixing PoolAFCalculationModel
2012-04-03 17:03:42 -04:00
Guillermo del Angel
5abb07da5d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 17:00:45 -04:00
Guillermo del Angel
4d33f63986
Bug fixing PoolAFCalculationModel
2012-04-03 16:58:12 -04:00
Christopher Hartl
a6837d31d4
Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting, or at transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it.
2012-04-03 16:13:16 -04:00
Guillermo del Angel
63b1e737c6
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 15:43:50 -04:00
Guillermo del Angel
9e11b4f9a7
Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.
2012-04-03 15:43:32 -04:00