Eric Banks
5b7da3831f
Not sure why this didn't make it into the last push, but here's a working MD5 for the NDA annotation in UG
2012-04-11 13:49:50 -04:00
Eric Banks
7aa654d13f
New interface for some dev work that Ryan and I are doing; only accessible from private walkers right now
2012-04-11 13:49:09 -04:00
Eric Banks
dc90508104
Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful.
2012-04-11 13:47:10 -04:00
Eric Banks
d2142c3aa7
Adding integration test for Flag Stat
2012-04-10 22:40:38 -04:00
Eric Banks
f560611fe8
Merged bug fix from Stable into Unstable
2012-04-10 22:26:53 -04:00
Eric Banks
f46f7d0590
Fix the stats coming out of FlagStat. I will add an integration test in unstable
2012-04-10 22:26:10 -04:00
Mauricio Carneiro
cd842b650e
Optimizing DiagnoseTargets
...
* Fixed output format to get a valid vcf
* Optimized the per sample pileup routine O(n^2) => O(n) pileup for samples
* Added support to overlapping intervals
* Removed expand target functionality (for now)
* Removed total depth (pointless metric)
2012-04-10 17:43:59 -04:00
Ryan Poplin
1df0adf862
Fixing ActivityProfile unit test.
2012-04-10 15:28:27 -04:00
Ryan Poplin
e3cc7cc59c
Resolving merge conflict.
2012-04-10 14:50:27 -04:00
Ryan Poplin
a4634624b7
There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function.
2012-04-10 14:48:23 -04:00
Eric Banks
10e74a71eb
We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior.
2012-04-10 12:30:35 -04:00
Mark DePristo
b43d21056b
Merged bug fix from Stable into Unstable
2012-04-10 09:42:09 -04:00
Mark DePristo
6885e2d065
UserException fixes for GATK_logs recent errors
...
-- SamFileReader.java:525
-- BlockCompressedInputStream:376
These were both instances where we weren't catching and rethrowing picard exceptions as UserExceptions.
2012-04-10 07:37:42 -04:00
Mark DePristo
8507cd7440
Throw UserException for bad dict / chain files
2012-04-10 07:22:43 -04:00
Ryan Poplin
cd9bf1bfc3
Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs.
2012-04-10 00:22:40 -04:00
Roger Zurawicki
9ece93ae9c
DiagnoseTargets now outputs a VCF file
...
- refactored the statistics classes
- concurrent callable statuses by sample are now available.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-04-09 16:40:20 -04:00
Guillermo del Angel
719ec9144a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-09 14:53:19 -04:00
Guillermo del Angel
550179a1f7
Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors.
2012-04-09 14:53:05 -04:00
Eric Banks
f82986ee62
Adding unit tests for the very important log10sumLog10 util method.
2012-04-09 14:28:25 -04:00
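Note on the log10sumLog10 commit above: a minimal sketch of what a log10-sum-of-log10 helper typically computes, i.e. log10 of the sum of 10^x over the inputs, done stably by factoring out the maximum. The class and method names here are hypothetical and this is not the GATK implementation, only an illustration of the technique the unit tests would be guarding.

```java
// Hypothetical illustration; not the GATK utility itself.
final class Log10SumSketch {
    private Log10SumSketch() {}

    /**
     * Returns log10(sum_i 10^log10Values[i]).
     * Subtracting the maximum first keeps every exponent <= 0, so very
     * negative inputs (tiny probabilities) do not underflow to zero.
     */
    static double log10SumLog10(double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            if (v > max) {
                max = v;
            }
        }
        // Empty input, or every term is log10(0): the sum is 0, so return -Infinity.
        if (max == Double.NEGATIVE_INFINITY) {
            return Double.NEGATIVE_INFINITY;
        }
        double sum = 0.0;
        for (double v : log10Values) {
            sum += Math.pow(10.0, v - max);
        }
        return max + Math.log10(sum);
    }
}
```

With inputs such as {-1000, -1000}, a naive Math.pow(10, x) would underflow to 0 for both terms, while the shifted form still returns -1000 + log10(2).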
Eric Banks
ea4300d583
Refactoring so that Unified Argument Collection doesn't use deprecated classes.
2012-04-09 13:45:17 -04:00
Eric Banks
6ddf2170b6
More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice.
2012-04-09 11:46:16 -04:00
Mauricio Carneiro
87e6bea6c1
Adding engine capability to quantize qualities.
...
* Added parameter -qq to quantize qualities using a recalibration report
* Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
* Updated BQSR scripts to make use of the new parameters
2012-04-08 21:07:51 -04:00
Mark DePristo
c22a66870c
Modified UnitTests to respect reference padding
2012-04-06 16:27:20 -04:00
Mark DePristo
45fc0ea98d
Improvements to indel analysis capabilities of VariantEval
...
-- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites
-- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly:
// - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a
// downstream frameshift, if we make the simplifying assumptions that 3 bp ins
// and 3bp del (adding/subtracting 1 AA in general) are roughly comparably
// selected against, we should see a consistent 1+2 : 3 bp ratio for insertions
// as for deletions, and certainly would expect consistency between in/dels that
// multiple methods find and in/dels that are unique to one method (since deletions
// are more common and the artifacts differ, it is probably worth looking at the totals,
// overlaps and ratios for insertions and deletions separately in the methods
// comparison and in this case don't even need to make the simplifying in = del functional assumption
-- Added a new VEW argument to bind a gold standard track
-- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do
-- Deleted random unused functions in IndelUtils
2012-04-06 16:07:46 -04:00
Mark DePristo
52ef4a3e26
Function to compute whether a VariantContext indel is part of a TandemRepeat
...
Returns true iff VC is a non-complex indel where every allele represents an expansion or
contraction of a series of identical bases in the reference.
The logic of this function is pretty simple. Take all of the non-null alleles in VC. For
each insertion allele of n bases, check if that allele matches the next n reference bases.
For each deletion allele of n bases, check if this matches the reference bases at n - 2 n,
as it must necessarily match the first n bases. If this test returns true for all
alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the
base differences between the ref and alt alleles
2012-04-06 16:07:46 -04:00
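Note on the TandemRepeat commit above: a rough sketch of the per-allele check described in the message, using plain strings in place of VariantContext and ReferenceContext. All names are hypothetical. It assumes `refForward` holds the reference bases starting right after the anchor base, and it reads "n - 2 n" as the reference window [n, 2n), which is an interpretation rather than something the message states outright.

```java
// Hypothetical illustration; the real function works on VariantContext alleles.
final class TandemRepeatSketch {
    private TandemRepeatSketch() {}

    /** Insertion of n bases: the inserted bases must equal the next n reference bases. */
    static boolean isInsertionRepeat(String insertedBases, String refForward) {
        return !insertedBases.isEmpty() && refForward.startsWith(insertedBases);
    }

    /**
     * Deletion of n bases: the deleted bases are the first n reference bases by
     * construction, so the informative test is whether bases [n, 2n) repeat them.
     */
    static boolean isDeletionRepeat(String deletedBases, String refForward) {
        int n = deletedBases.length();
        return n > 0
                && refForward.length() >= 2 * n
                && refForward.startsWith(deletedBases)
                && refForward.regionMatches(n, deletedBases, 0, n);
    }
}
```

A site would then count as a tandem repeat only if the relevant check passes for every alternate allele.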
Mark DePristo
08fab49d30
Added function to get bases from the current base forward in the window in ReferenceContext
2012-04-06 16:07:46 -04:00
Ryan Poplin
c77104b815
Adding function call in HaplotypeCaller right before the VariantContext gets written out to disk which partitions all the reads by which allele gave the read the highest likelihood. This will allow variants to be annotated by the refactored VariantAnnotator. Uninformative reads are mapped to Allele.NO_CALL
2012-04-06 00:22:52 -04:00
Mauricio Carneiro
a19c27297f
continuing the BQSR triage...
...
* fixed the loading of the new reduced size reports
* reduced BQSR scala script memory to 2Gb
* removed dcov parameter from BQSR scala script
* fixed estimatedQReported calculation from -log10(pe) to -10*log10(pe).
* updated md5's with the proper PHRED scaled EstimatedQReported
2012-04-05 14:34:15 -04:00
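Note on the estimatedQReported fix above: the Phred scale is Q = -10 * log10(pe), so dropping the factor of 10 reports Q3 where Q30 is meant. A small sketch of the corrected formula follows; the class and method names are hypothetical, not the BQSR code.

```java
// Hypothetical illustration of the corrected Phred scaling.
final class PhredSketch {
    private PhredSketch() {}

    /** Converts an error probability pe in (0, 1] to a Phred-scaled quality. */
    static double phredScaled(double errorProbability) {
        return -10.0 * Math.log10(errorProbability);
    }

    /** The pre-fix formula from the commit message, kept only for comparison. */
    static double unscaled(double errorProbability) {
        return -Math.log10(errorProbability);
    }
}
```

For pe = 0.001 the corrected formula gives roughly 30, while the old one gives roughly 3, which is why the md5s had to be updated.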
Eric Banks
3561056a9c
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-05 10:49:26 -04:00
Eric Banks
5c3ddec4c2
Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update.
2012-04-05 10:49:08 -04:00
Mauricio Carneiro
7c3b3650bb
BQSR bug triage
...
* fixed bug where some keys were using the same recal datum objects
* fixed quantization qual calculations when combining multiple reports
* fixed rounding error with empirical quality reported when combining reports
* fixed combine routine in the gatk reports due to the primary keys being out of order
* added auto-recalibration option to BQSR scala script
* reduced the size of the recalibration report by ~15%
* updated md5's
2012-04-05 09:32:18 -04:00
Eric Banks
2c956efa53
Minor fixups to GenotypeLikelihoods
2012-04-05 09:14:37 -04:00
Mauricio Carneiro
1e65474fec
Added utility to get the reference coordinate given the read coordinate
2012-04-05 09:04:20 -04:00
Guillermo del Angel
6913710e89
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-04 20:17:18 -04:00
Mark DePristo
76e4100d89
By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots
...
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
Guillermo del Angel
820216dc68
More pool caller cleanups: move common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation
2012-04-04 16:23:10 -04:00
Ryan Poplin
bfad26353a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-04 16:04:50 -04:00
Ryan Poplin
dda2173c66
Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned.
2012-04-04 16:04:29 -04:00
Mark DePristo
fcdd65a0f4
Bugfix for IndelLengthHistogram
...
-- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.
2012-04-04 15:37:43 -04:00
Mark DePristo
1ccea866d8
VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
...
-- Updated EvalModules to work with new parameter
-- adding test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Eric Banks
9e32a975f8
Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore.
2012-04-04 13:47:59 -04:00
Eric Banks
337ff7887a
When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.
2012-04-04 10:57:05 -04:00
Guillermo del Angel
05d8400468
Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)
2012-04-03 20:51:24 -04:00
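Note on the calcNumLikelihoods change above: for the diploid case, the number of genotype likelihoods for a site with n total alleles (reference included) is n(n+1)/2, which is why passing the alt-allele count instead of the total gives the wrong array size. A sketch under that diploid assumption follows; the names are hypothetical and the actual GATK signature may differ (for example, it may also take ploidy).

```java
// Hypothetical illustration for diploid genotypes only.
final class NumLikelihoodsSketch {
    private NumLikelihoodsSketch() {}

    /**
     * Number of unordered diploid genotypes over numTotalAlleles alleles,
     * where numTotalAlleles counts the reference allele plus all alt alleles.
     */
    static int numDiploidLikelihoods(int numTotalAlleles) {
        if (numTotalAlleles < 1) {
            throw new IllegalArgumentException("need at least one allele");
        }
        return numTotalAlleles * (numTotalAlleles + 1) / 2;
    }
}
```

A biallelic site (ref plus one alt, n = 2) has 3 likelihoods (AA, AB, BB); calling with the alt count of 1 would return 1 instead.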
Guillermo del Angel
5a10f173ea
Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow)
2012-04-03 18:55:52 -04:00
Guillermo del Angel
5abb07da5d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 17:00:45 -04:00
Christopher Hartl
a6837d31d4
Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting; or at transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it.
2012-04-03 16:13:16 -04:00
Guillermo del Angel
63b1e737c6
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 15:43:50 -04:00
Guillermo del Angel
9e11b4f9a7
Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.
2012-04-03 15:43:32 -04:00
Eric Banks
f9ce9962c4
Minor changes to verbose mode
2012-04-03 10:53:48 -04:00
Eric Banks
f6aa95685d
OutOfMemory exceptions are User Errors
2012-04-02 22:46:56 -04:00