2045 Commits (27b1aa5dd3b8ecfc84aa96b810defc4c0aa2d7e9)

Author SHA1 Message Date
Guillermo del Angel 1ae2ab5b63 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 12:50:29 -04:00
Guillermo del Angel 0e6e0cb907 Merging bug fixes 2012-04-19 12:49:30 -04:00
Eric Banks 79272c5e15 Thanks to Menachem for pointing out that the docs for genotyping_mode and output_mode were the same (and unclear). Fixed. 2012-04-19 12:48:09 -04:00
Guillermo del Angel 02ff930f6a My changes 2012-04-19 12:45:18 -04:00
Eric Banks 2485cef5b8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 11:46:06 -04:00
Eric Banks 76a6e37f4f Don't output callability metrics by default anymore; one can still have them output to the 'metrics' file (which is now @Hidden because they are really for GSA use). Added a TODO to move UG from @By reference to reads and rods once LIBS is cleaned up. 2012-04-19 11:45:56 -04:00
Ryan Poplin 1ea4e48a27 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 11:32:32 -04:00
Ryan Poplin 11001ab9a2 Adding option to HaplotypeCaller to genotype the events on the chosen haplotypes as independent events. The filtered reads are now kept around so they can be passed to the variant annotations. Unfortunately the filtered reads aren't assigned a likelihood yet so they are all thrown in the Allele.NO_CALL bin. 2012-04-19 11:32:10 -04:00
Mauricio Carneiro eb22cd7222 Unit test to guarantee BQSR sequential calculation accuracy
This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.
2012-04-19 09:33:40 -04:00
Mauricio Carneiro 68d0211fa1 Improved BQSR plotting and some new parameters
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
   * Refactored the CycleCovariateUnitTest to test the pairing information
   * Updated BQSR Integration tests accordingly
   * Made quantization levels parameter not hidden anymore
   * Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
   * Added hidden option not to generate the plots automatically (important for scatter/gathering)
2012-04-19 09:31:41 -04:00
Guillermo del Angel 143e92b797 Rebasing 2012-04-18 20:05:43 -04:00
Guillermo del Angel 960e7e6aaf Changes to integration tests 2012-04-18 19:53:42 -04:00
Guillermo del Angel 82efd4457e Revert some bad merge changes 2012-04-18 16:35:09 -04:00
Guillermo del Angel 31c394d588 Resolve merge conflicts 2012-04-18 16:25:03 -04:00
Ryan Poplin 4999ae87ad Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-18 15:02:42 -04:00
Ryan Poplin dcc4871468 minor misc optimizations to PairHMM 2012-04-18 15:02:26 -04:00
Eric Banks d3c84e7b1f This should be a User Error since it's provided from the DoC command-line arguments 2012-04-18 13:09:23 -04:00
Eric Banks 392f1903f7 Handling some of the NumberFormatExceptions seen via Tableau that are really user errors. 2012-04-18 12:57:37 -04:00
Ryan Poplin 8a84456626 Following Eric's awesome update to change the VQSR recal file into a VCF file, the ApplyRecalibration step is now scatter/gather-able and tree reducible. 2012-04-18 11:24:04 -04:00
Eric Banks 4448a3ea76 Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position. 2012-04-17 23:54:10 -04:00
Eric Banks c1f52b773a Minor tweaks and updated integration tests MD5s 2012-04-17 23:17:28 -04:00
Eric Banks 6d03bce0d3 Important refactoring of the VQSR recal file format: we now use a VCF instead of a CSV file.
The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration.  For 1000G calling this was prohibitive in terms of memory requirements.  Now we go through the rod system and pull in just the records we need at a given position.

As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).
2012-04-17 22:38:18 -04:00
Eric Banks ea793d8e27 Khalid pressured me into adding an integration test that makes sure we don't fail on reads with adjacent I and D events. 2012-04-17 21:21:29 -04:00
Mauricio Carneiro 46a212d8e9 Added "simplify reads" option to PrintReads. 2012-04-17 19:32:34 -04:00
Mauricio Carneiro f0c81b59b0 Implementation of the new BQSR plotting infrastructure
* removed low quality bases from the recalibration report.
   * refactored the Datum (Recal and Accuracy) class structure
   * created a new plotting csv table for optimized performance with the R script
   * added a datum object that carries the accuracy information (AccuracyDatum) for plotting
   * added mean reported quality score to all covariates
   * added QualityScore as a covariate for plotting purposes
   * added unit test to the key manager to operate with one required covariate and multiple optional covariates
   * integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)
2012-04-17 19:23:55 -04:00
Ryan Poplin 952280bef1 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-17 17:00:14 -04:00
Ryan Poplin cf705f6c62 Adding read position rank sum test to the list of annotations that get produced with the HaplotypeCaller 2012-04-17 17:00:00 -04:00
Eric Banks 13c800417e Handle NPE in UG indel code: deletions immediately preceding insertions were not handled well in the code. 2012-04-17 15:51:23 -04:00
Guillermo del Angel c78b0eee3a Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller. 2012-04-17 14:22:48 -04:00
Khalid Shakir 91cb654791 AggregateMetrics:
- By porting from jython to java now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
2012-04-17 11:45:32 -04:00
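The GATKReportTable change in the entry above is easiest to see with a sample name that contains a period. This is a minimal Python sketch, not GATK code; the sample name, column name, and helper names are all hypothetical.

```python
def split_dotted_key(key: str) -> list[str]:
    """Recover key parts from a dot-separated string (the old style)."""
    return key.split(".")


def make_array_key(*parts: str) -> tuple[str, ...]:
    """Keep the key parts as a sequence, so no separator is needed."""
    return tuple(parts)


sample = "NA12878.rep1"   # hypothetical sample name containing a period
column = "mean_coverage"  # hypothetical column name

# The dotted form cannot tell the period inside the sample name
# apart from the separator, so it yields three parts instead of two.
dotted_parts = split_dotted_key(sample + "." + column)

# The array form keeps the sample name intact.
array_key = make_array_key(sample, column)
```

With the dotted key the sample name is split in two, while the array key still has exactly two parts.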
Ryan Poplin 1a2e92f8db Merged bug fix from Stable into Unstable 2012-04-17 10:23:05 -04:00
Ryan Poplin adad76b36f Fixing NPE in VQSR for the case of very small callsets. 2012-04-17 10:20:43 -04:00
Mark DePristo 3f6b2423d8 Update VE IT to reflect new fields and bugfixes 2012-04-13 17:00:37 -04:00
Mark DePristo f9190b6fcd VariantEvalUnitTest is better named VariantEvalWalkerUnitTest 2012-04-13 17:00:37 -04:00
Mark DePristo 23ccf772d4 IndelSummary now emits all of the underlying counts for ratios, percentages, etc it computes 2012-04-13 17:00:36 -04:00
Mark DePristo 84d1e8713a Infrastructure for combining VariantEvaluations
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
2012-04-13 17:00:36 -04:00
Mark DePristo 38986e4240 Documentation for StratificationManager 2012-04-13 17:00:36 -04:00
Mark DePristo ab06d53867 Useful test constructor for unit tests in RefMetaDataTracker 2012-04-13 17:00:36 -04:00
Mark DePristo 285e61a227 Bugfix for IndelSummary
-- multi allelic count should be % not ratio
2012-04-13 17:00:35 -04:00
Mark DePristo e6d5cb46d2 Improvements and bugfixes to IndelSummary
-- Now properly includes both bi and multi-allelic variants.  These are actually counted as well, and emitted as counts and % of sites with multiple alleles
-- Bug fix for gold standard rate
2012-04-13 17:00:35 -04:00
Mark DePristo bfa966a4e9 Bugfix for OneBPIndel
-- Previously was only including 1 bp insertions in stratification
2012-04-13 17:00:35 -04:00
Mark DePristo 2aa2d9aec0 Merged bug fix from Stable into Unstable 2012-04-13 09:25:43 -04:00
Mark DePristo 27e7e17dc7 New way to handle exceptions in multi-threaded GATK
-- HMS no longer tries to grab and throw all exceptions.  Exceptions are just thrown directly now.
-- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer
-- Better printing of stack traces in WalkerTest
2012-04-13 09:23:33 -04:00
Mark DePristo e85e9a8cf5 More extensive testing of type of error thrown in multi-threaded walker test
-- Unfortunately the result of the multi-threaded test is non-deterministic, so run the test 10 times to see if the right exception is always thrown
-- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs
2012-04-13 09:23:33 -04:00
Eric Banks 297afc7911 Added unit test to ensure that we genotype correctly cases with really large GLs 2012-04-12 15:43:14 -04:00
Eric Banks 818e8c2fb9 Resolving merge conflicts 2012-04-12 15:19:44 -04:00
Eric Banks 0dd571928d Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones). 2012-04-12 15:16:29 -04:00
Eric Banks f77a6d18b8 Bad conflict merge before 2012-04-12 09:56:49 -04:00
Eric Banks 33a8bdd75f Resolving merge conflicts 2012-04-12 09:51:55 -04:00
Eric Banks b659b16b31 Generate User Error for bad POS value 2012-04-12 09:49:35 -04:00
Eric Banks cc71baf691 Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing. 2012-04-12 09:18:44 -04:00
Eric Banks 5bf9dd2def A framework to get annotations working in the HaplotypeCaller (and ART walkers in general).
Adding support for active-region-based annotation for most standard annotations.  I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets) like e.g. the ReadPosRankSumTest.

IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller).  The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system.  When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.
2012-04-11 16:22:12 -04:00
Guillermo del Angel f9f8589692 Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller. 2012-04-11 13:56:51 -04:00
Eric Banks 5b7da3831f Not sure why this didn't make it into the last push, but here's a working MD5 for the NDA annotation in UG 2012-04-11 13:49:50 -04:00
Eric Banks 7aa654d13f New interface for some dev work that Ryan and I are doing; only accessible from private walkers right now 2012-04-11 13:49:09 -04:00
Eric Banks dc90508104 Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful. 2012-04-11 13:47:10 -04:00
Eric Banks d2142c3aa7 Adding integration test for Flag Stat 2012-04-10 22:40:38 -04:00
Eric Banks f560611fe8 Merged bug fix from Stable into Unstable 2012-04-10 22:26:53 -04:00
Eric Banks f46f7d0590 Fix the stats coming out of FlagStat. I will add an integration test in unstable 2012-04-10 22:26:10 -04:00
Mauricio Carneiro cd842b650e Optimizing DiagnoseTargets
* Fixed output format to get a valid vcf
   * Optimized the per sample pileup routine O(n^2) => O(n) pileup for samples
   * Added support to overlapping intervals
   * Removed expand target functionality (for now)
   * Removed total depth (pointless metric)
2012-04-10 17:43:59 -04:00
Ryan Poplin 1df0adf862 Fixing ActivityProfile unit test. 2012-04-10 15:28:27 -04:00
Ryan Poplin e3cc7cc59c Resolving merge conflict. 2012-04-10 14:50:27 -04:00
Ryan Poplin a4634624b7 There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function. 2012-04-10 14:48:23 -04:00
Eric Banks 10e74a71eb We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior. 2012-04-10 12:30:35 -04:00
Mark DePristo b43d21056b Merged bug fix from Stable into Unstable 2012-04-10 09:42:09 -04:00
Mark DePristo 6885e2d065 UserException fixes for GATK_logs recent errors
-- SamFileReader.java:525
-- BlockCompressedInputStream:376

These were both instances where we weren't catching and rethrowing Picard exceptions as UserExceptions.
2012-04-10 07:37:42 -04:00
Mark DePristo 8507cd7440 Throw UserException for bad dict / chain files 2012-04-10 07:22:43 -04:00
Ryan Poplin cd9bf1bfc3 Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs. 2012-04-10 00:22:40 -04:00
Roger Zurawicki 9ece93ae9c DiagnoseTargets now outputs a VCF file
- refactored the statistics classes
 - concurrent callable statuses by sample are now available.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-04-09 16:40:20 -04:00
Guillermo del Angel 719ec9144a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-09 14:53:19 -04:00
Guillermo del Angel 550179a1f7 Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors. 2012-04-09 14:53:05 -04:00
Eric Banks f82986ee62 Adding unit tests for the very important log10sumLog10 util method. 2012-04-09 14:28:25 -04:00
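For readers unfamiliar with the utility named above, this is a minimal Python sketch of what a log10-sum-of-log10 computation typically looks like. It is not the GATK implementation. The function name is a hypothetical Python spelling, and inputs are assumed to be finite or negative infinity.

```python
import math


def log10_sum_log10(log10_values):
    """Return log10(sum(10**v for v in log10_values)) without underflow.

    Subtracting the maximum before exponentiating keeps every term in
    [0, 1], so very negative log10 likelihoods do not vanish to zero.
    """
    if not log10_values:
        return float("-inf")  # log10 of an empty (zero) sum
    max_value = max(log10_values)
    if max_value == float("-inf"):
        return float("-inf")  # every term is zero in real space
    total = sum(10.0 ** (v - max_value) for v in log10_values)
    return max_value + math.log10(total)
```

Evaluating `10 ** -1000` directly would underflow to zero, but `log10_sum_log10([-1000.0, -1000.0])` still returns a value close to `-1000 + log10(2)`.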
Eric Banks ea4300d583 Refactoring so that Unified Argument Collection doesn't use deprecated classes. 2012-04-09 13:45:17 -04:00
Eric Banks 6ddf2170b6 More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice. 2012-04-09 11:46:16 -04:00
Mauricio Carneiro 87e6bea6c1 Adding engine capability to quantize qualities.
* Added parameter -qq to quantize qualities using a recalibration report
   * Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
   * Updated BQSR scripts to make use of the new parameters
2012-04-08 21:07:51 -04:00
Mark DePristo c22a66870c Modified UnitTests to respect reference padding 2012-04-06 16:27:20 -04:00
Mark DePristo 45fc0ea98d Improvements to indel analysis capabilities of VariantEval
-- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites
-- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately.  This is based on an old email from Mark Daly:

    // - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a
    // downstream frameshift, if we make the simplifying assumptions that 3 bp ins
    // and 3bp del (adding/subtracting 1 AA in general) are roughly comparably
    // selected against, we should see a consistent 1+2 : 3 bp ratio for insertions
    // as for deletions, and certainly would expect consistency between in/dels that
    // multiple methods find and in/dels that are unique to one method  (since deletions
    // are more common and the artifacts differ, it is probably worth looking at the totals,
    // overlaps and ratios for insertions and deletions separately in the methods
    // comparison and in this case don't even need to make the simplifying in = del functional assumption

-- Added a new VEW argument to bind a gold standard track
-- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do
-- Deleted random unused functions in IndelUtils
2012-04-06 16:07:46 -04:00
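The 1 + 2 : 3 bp ratio described in the entry above can be illustrated with a short Python sketch. It only shows the arithmetic and is not the VariantEval code. The function names and the signed-length convention (positive for insertions, negative for deletions) are assumptions made for the example.

```python
from collections import Counter


def one_plus_two_to_three_ratio(lengths):
    """Ratio of (1 bp + 2 bp events) to 3 bp events, or None if no 3 bp events."""
    counts = Counter(lengths)
    if counts[3] == 0:
        return None  # ratio is undefined without any 3 bp events
    return (counts[1] + counts[2]) / counts[3]


def indel_ratios(signed_lengths):
    """Compute the ratio separately for insertions and deletions."""
    insertions = [n for n in signed_lengths if n > 0]
    deletions = [-n for n in signed_lengths if n < 0]
    return {
        "insertions": one_plus_two_to_three_ratio(insertions),
        "deletions": one_plus_two_to_three_ratio(deletions),
    }
```

Events longer than 3 bp are ignored by the ratio, and the two classes are reported separately, as the quoted email suggests.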
Mark DePristo 52ef4a3e26 Function to compute whether a VariantContext indel is part of a TandemRepeat
Returns true iff VC is a non-complex indel where every allele represents an expansion or
 contraction of a series of identical bases in the reference.

 The logic of this function is pretty simple.  Take all of the non-null alleles in VC.  For
 each insertion allele of n bases, check if that allele matches the next n reference bases.
 For each deletion allele of n bases, check if it matches the reference bases at positions n to 2n,
 as it must necessarily match the first n bases.  If this test returns true for all
 alleles you are a tandem repeat, otherwise you are not.  Note that in this context n is the
 difference in length, in bases, between the ref and alt alleles.
2012-04-06 16:07:46 -04:00
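The tandem-repeat test described in the entry above can be sketched in a few lines of Python. This is a simplified illustration on plain strings, not the VariantContext-based function from the commit. The function name, the `("ins" | "del", bases)` event encoding, and the `ref_forward` argument (reference bases starting right after the indel's anchor position) are all assumptions.

```python
def is_tandem_repeat(ref_forward: str, events: list[tuple[str, str]]) -> bool:
    """Return True if every indel event expands or contracts a repeat.

    ref_forward: reference bases immediately after the anchor base. For a
                 deletion, it therefore begins with the deleted bases.
    events:      ("ins", bases) or ("del", bases), one per alternate allele.
    """
    def event_is_repeat(kind: str, bases: str) -> bool:
        n = len(bases)
        if kind == "ins":
            # Inserted bases must match the next n reference bases.
            return ref_forward[:n] == bases
        if kind == "del":
            # Deleted bases equal ref_forward[:n] by construction, so the
            # informative check is the following n bases (n to 2n).
            return ref_forward[n:2 * n] == bases
        raise ValueError(f"unknown event kind: {kind}")

    # Assumption: a site with no indel alleles is not a tandem repeat.
    return bool(events) and all(event_is_repeat(k, b) for k, b in events)
```

For example, with `ref_forward = "ACACGT"`, both an insertion and a deletion of `AC` pass, while an insertion of `GT` does not.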
Mark DePristo 08fab49d30 Added function to get bases from the current base forward in the window in ReferenceContext 2012-04-06 16:07:46 -04:00
Ryan Poplin c77104b815 Adding function call in HaplotypeCaller right before the VariantContext gets written out to disk which partitions all the reads by which allele gave the read the highest likelihood. This will allow variants to be annotated by the refactored VariantAnnotator. Uninformative reads are mapped to Allele.NO_CALL 2012-04-06 00:22:52 -04:00
Mauricio Carneiro a19c27297f continuing the BQSR triage...
* fixed the loading of the new reduced size reports
   * reduced BQSR scala script memory to 2Gb
   * removed dcov parameter from BQSR scala script
   * fixed estimatedQReported calculation from -log10(pe) to -10*log10(pe).
   * updated md5's with the proper PHRED scaled EstimatedQReported
2012-04-05 14:34:15 -04:00
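The estimatedQReported fix in the entry above is the difference between a plain negative log10 and a Phred-scaled quality. A minimal Python sketch of the two formulas follows. The function names are hypothetical, and `pe` from the commit message is taken to be an error probability in (0, 1].

```python
import math


def _check_probability(error_probability: float) -> None:
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")


def unscaled_quality(error_probability: float) -> float:
    """The formula the commit replaced: -log10(pe)."""
    _check_probability(error_probability)
    return -math.log10(error_probability)


def phred_quality(error_probability: float) -> float:
    """Phred-scaled quality: -10 * log10(pe)."""
    _check_probability(error_probability)
    return -10.0 * math.log10(error_probability)
```

An error probability of 0.001 corresponds to a Phred quality of about 30, whereas the unscaled form gives about 3, a factor of ten too small.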
Eric Banks 3561056a9c Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-05 10:49:26 -04:00
Eric Banks 5c3ddec4c2 Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update. 2012-04-05 10:49:08 -04:00
Mauricio Carneiro 7c3b3650bb BQSR bug triage
* fixed bug where some keys were using the same recal datum objects
    * fixed quantization qual calculations when combining multiple reports
    * fixed rounding error with empirical quality reported when combining reports
    * fixed combine routine in the gatk reports due to the primary keys being out of order
    * added auto-recalibration option to BQSR scala script
    * reduced the size of the recalibration report by ~15%
    * updated md5's
2012-04-05 09:32:18 -04:00
Eric Banks 2c956efa53 Minor fixups to GenotypeLikelihoods 2012-04-05 09:14:37 -04:00
Mauricio Carneiro 1e65474fec Added utility to get the reference coordinate given the read coordinate 2012-04-05 09:04:20 -04:00
Guillermo del Angel 6913710e89 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-04 20:17:18 -04:00
Mark DePristo 76e4100d89 By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
Guillermo del Angel 820216dc68 More pool caller cleanups: move common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation 2012-04-04 16:23:10 -04:00
Ryan Poplin bfad26353a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-04 16:04:50 -04:00
Ryan Poplin dda2173c66 Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned. 2012-04-04 16:04:29 -04:00
Mark DePristo fcdd65a0f4 Bugfix for IndelLengthHistogram
-- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.
2012-04-04 15:37:43 -04:00
Mark DePristo 1ccea866d8 VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
-- Updated EvalModules to work with new parameter
-- adding test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Eric Banks 9e32a975f8 Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore. 2012-04-04 13:47:59 -04:00
Eric Banks 337ff7887a When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals. 2012-04-04 10:57:05 -04:00
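The END-tag rule in the entry above can be sketched as follows. This is an illustrative Python version that works on raw VCF field strings, not the GATK VariantContext code. The function names are hypothetical, and the fallback of `POS + len(REF) - 1` when END is absent is an assumption about ordinary alleles.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string such as 'SVTYPE=DEL;END=250' into a dict."""
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True  # flag keys carry no value
    return fields


def stop_position(pos: int, ref: str, info: str) -> int:
    """Use INFO END as the stop if present, else derive it from REF length."""
    end = parse_info(info).get("END")
    if isinstance(end, str) and end:
        return int(end)
    return pos + len(ref) - 1
```

A symbolic deletion at position 100 with `END=250` then spans 100 to 250, which is what allows it to overlap `-L` intervals past its start position.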
Guillermo del Angel 05d8400468 Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet) 2012-04-03 20:51:24 -04:00
Guillermo del Angel 5a10f173ea Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow) 2012-04-03 18:55:52 -04:00
Guillermo del Angel 5abb07da5d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-03 17:00:45 -04:00
Christopher Hartl a6837d31d4 Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting; or at transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it. 2012-04-03 16:13:16 -04:00
Guillermo del Angel 63b1e737c6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-03 15:43:50 -04:00
Guillermo del Angel 9e11b4f9a7 Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced. 2012-04-03 15:43:32 -04:00
Eric Banks f9ce9962c4 Minor changes to verbose mode 2012-04-03 10:53:48 -04:00
Eric Banks f6aa95685d OutOfMemory exceptions are User Errors 2012-04-02 22:46:56 -04:00
Eric Banks 659b82e74d Old -B syntax is long gone at this point. Safe to remove the warning. 2012-04-02 22:25:16 -04:00
Eric Banks 326220c91c Removing extended event related unit tests 2012-04-02 14:40:36 -04:00
Eric Banks 99d27ddcc4 Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now. 2012-04-02 14:27:36 -04:00
Mark DePristo 6b7a00061a VariantsToTable now works with multiple input VCFs 2012-04-02 09:13:35 -04:00
Mark DePristo 4f73ea902f Final update for VE. VCFStreaming wasn't yet updated 2012-03-30 21:52:01 -04:00
Mark DePristo fbbb8509ad Final commits to VariantEval
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integrationtests
2012-03-30 20:11:06 -04:00
Mark DePristo 4b45a2c99d Final version of new VariantEval infrastructure.
*** WAY FASTER ***
 -- 3x performance for multiple sample analysis with 1000 samples
 -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
 -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2

-- Remove the TableType system, as this was way too complex.  No longer possible to embed what were effectively multiple tables in a single Evaluator.  You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis.  IndelLengthHistogram is now a @Molten data type.  GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables as @DataPoints.  You get an error if you do.
-- Simplified entire IO system of VE.  Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
2012-03-30 15:31:56 -04:00
Mark DePristo 8c0718a7c9 Fixed missing import 2012-03-30 15:31:55 -04:00
Mark DePristo 976bac0452 BaseTest now has a global variable to turn off network connection requirement 2012-03-30 15:31:55 -04:00
Mark DePristo 097ed4ecc4 Memory usage optimizations and safety improvements to StratNode and StratificationManager
-- Added memory and safety optimizations to StratNode and StratificationManager.  Fresh, immutable Hashmaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation.  The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement.  Added integration test to cover incompatible strats and evals
2012-03-30 15:31:55 -04:00
Mark DePristo b335c22f6d Fully refactored, mostly cleaned up version of VariantEval using StratificationManager 2012-03-30 15:31:55 -04:00
Mark DePristo c8086a79e3 New StratificationManager based VariantEval passes unmodified integration tests
-- Now needs cleanup and optimizations
2012-03-30 15:31:55 -04:00
Mark DePristo d37f31e349 First version of VariantEval that runs (approximately correctly) with new StratificationManager 2012-03-30 15:31:54 -04:00
Mark DePristo 8971b54b21 Phase II of Stratification manager
-- Renamed and reorganized infrastructure
-- StratificationManager now a Map from List<Object> -> V.  All key functions are implemented.  Less commonly used TODO
-- Ready for hookup to VE
2012-03-30 15:31:54 -04:00
Mark DePristo 9f1cd0ff66 Lots of new functionality for StratificationStates manager
-- Really working according to unit tests
-- A nCombination utils
2012-03-30 15:31:54 -04:00
Mark DePristo a3d896d80e Part I of creating a fast state space lookup for VE
-- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates).  This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
2012-03-30 15:31:53 -04:00
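The idea in the entry above, a static mapping from every combination of stratification states to an offset in a flat vector, can be illustrated with a small Python sketch. Note that this uses mixed-radix indexing rather than the tree structure the commit describes. The function name and the example state names are hypothetical.

```python
def build_offset_function(state_lists):
    """Map each combination of states (one per stratifier) to a unique offset.

    state_lists: one list of possible states per stratifier, in a fixed order.
    Returns (offset, total), where offset(combination) is in range(total).
    """
    positions = [{state: i for i, state in enumerate(states)}
                 for states in state_lists]
    sizes = [len(states) for states in state_lists]

    def offset(combination):
        if len(combination) != len(sizes):
            raise ValueError("need exactly one state per stratifier")
        index = 0
        # Mixed-radix encoding: each stratifier contributes one "digit".
        for lookup, size, state in zip(positions, sizes, combination):
            index = index * size + lookup[state]
        return index

    total = 1
    for size in sizes:
        total *= size
    return offset, total
```

With two stratifiers of 2 and 3 states, every combination lands on a distinct offset in `range(6)`, so per-combination results can live in a preallocated list.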
Eric Banks 533c283783 Deprecating AlignmentContext.getExtendedEventPileup(). At this point the only walkers left with any reliance on extended events are Guillermo's pooled code (he'll update soon) and the Pileup walker. David, I'll leave that last one for you (it should be easy). We can now officially rip the extended event code from the engine. 2012-03-30 10:37:14 -04:00
Eric Banks 6b49af253b Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there. 2012-03-30 10:33:30 -04:00
Eric Banks b467cd1dae Removing dependence on extended events for the remaining Variant Annotator modules. 2012-03-30 09:05:26 -04:00
Eric Banks b21889812d Removing some more usages of extended events. Not done yet, but almost there. 2012-03-30 01:51:37 -04:00
Eric Banks ad6ace2439 Resolving merge conflicts 2012-03-30 01:51:09 -04:00
Eric Banks 16bef191c6 UG integration tests updated. A handful of sites are lost because there are only 5 indels and one starts at the beginning of the read so it no longer passes our min threshold (now consistent with GGA), but mostly the depth changes ever so slightly once in a while between extended and normal pileups (I think the normal pileups are correct). I have looked thoroughly in IGV at ALL differences and am happy with the new results. As an aside, the AD is now calculated more accurately for indels. 2012-03-30 01:35:49 -04:00
Eric Banks f4d4969f23 Don't ever return null for the list of GL models 2012-03-30 00:22:40 -04:00
Eric Banks 44ac49aa34 Removing dependencies in the annotations on extended events. Some refactoring involved in this. 2012-03-30 00:17:02 -04:00
Mauricio Carneiro 962fc352ae unnecessary substitution. 2012-03-29 18:01:43 -04:00
Mauricio Carneiro b7c59d5d43 this was a dummy test I was using to figure out what the problem was. Deleting it. 2012-03-29 18:00:25 -04:00
Mauricio Carneiro cbd21c6339 Nasty, nasty.....
VariantEval is overly abusive of the GATKReport (lack of) spec.

   1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers.
   2. Worse, the stratification modules, instead of sending the actual values to the report table, send a string with the value "unknown" and then abuse the GATKReport spec to replace those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings.

   Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport.

   No wonder the printing needed to be O(n^2).
2012-03-29 17:49:53 -04:00
Eric Banks c2e27729c7 Renaming PileupElement.isBeforeDeletion() to PileupElement.isBeforeDeletedBase() so that it's more clear that it can still be true while inside a deletion. Added PileupElement.isBeforeDeletionStart() to cover the case that I want where we only trigger before the actual deletion event. Similarly for after a deletion. Updated counting code in ConsensusAlleleCounter accordingly. 2012-03-29 17:08:25 -04:00
Ryan Poplin 6da9571829 resolving merge conflicts. 2012-03-29 16:16:28 -04:00
Ryan Poplin ca96544ed0 All the zero quality N bases in the solid reads are adding lots of extra paths in the assembly graph. We now require a minimum base quality for every base in the kmer before adding it to the graph. The large number of solid reads with unmapped mates was also triggering the active region traversal at every base. We now ignore that check for solid reads. 2012-03-29 16:14:29 -04:00
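The minimum-base-quality rule for kmers described above can be sketched as follows. This is an illustration only: the class KmerFilterSketch, the method usableKmers, and the parameter names are hypothetical and do not come from the HaplotypeCaller code. It keeps a kmer only if every base in it meets the threshold, which is what stops zero-quality N bases from adding paths to the graph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the HaplotypeCaller implementation.
class KmerFilterSketch {
    // Returns the kmers of length k in which every base has
    // quality >= minBaseQual, in read order.
    static List<String> usableKmers(String bases, byte[] quals, int k, int minBaseQual) {
        if (bases.length() != quals.length) {
            throw new IllegalArgumentException("bases and quals differ in length");
        }
        if (k < 1) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> kmers = new ArrayList<>();
        int lastLowQual = -1; // index of the most recent base below threshold
        for (int end = 0; end < quals.length; end++) {
            if (quals[end] < minBaseQual) {
                lastLowQual = end;
            }
            int start = end - k + 1;
            // Keep the window only if it is full and contains no low-quality base.
            if (start >= 0 && lastLowQual < start) {
                kmers.add(bases.substring(start, end + 1));
            }
        }
        return kmers;
    }
}
```

Tracking the index of the last low-quality base keeps the scan linear in the read length rather than rechecking all k qualities for every window.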
Eric Banks e4469a83ee First attempt at removing all traces of extended events from UG; integration tests are expected to fail. 2012-03-29 14:59:29 -04:00
Eric Banks e61e162c81 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-29 12:33:13 -04:00
Mauricio Carneiro cf364f26a0 Fixing alignment issue with the GATKReportColumn algorithm
Numeric columns were being left-aligned when they should be right-aligned. Fixed it.
2012-03-29 12:28:49 -04:00
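The alignment rule in this fix (numeric columns right-aligned, everything else left-aligned) can be sketched with standard format strings. The class ColumnAlignSketch and the method formatCell are hypothetical names used for illustration and are not the GATKReportColumn API.

```java
// Hypothetical sketch, not the GATKReportColumn code.
class ColumnAlignSketch {
    // Pads value to the column width: numeric cells are right-aligned,
    // all other cells are left-aligned. Width must be at least 1.
    static String formatCell(String value, int width, boolean numeric) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be positive");
        }
        // "%5s" pads on the left (right-aligns); "%-5s" pads on the right.
        String pattern = numeric ? "%" + width + "s" : "%-" + width + "s";
        return String.format(pattern, value);
    }
}
```

Right-aligning numbers lines up their least significant digits, which makes columns of counts easier to compare by eye.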
Mauricio Carneiro f80bd4276a fixed estimated Q reported calculation in the gatherer 2012-03-29 12:28:43 -04:00
Mauricio Carneiro 8a9fb514b6 simplifying GATKReportColumn constructor logic 2012-03-29 12:28:37 -04:00
Eric Banks e861106398 Accidentally erased important line 2012-03-29 11:08:54 -04:00
Eric Banks e4a225ed09 Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally. 2012-03-29 11:07:37 -04:00
Guillermo del Angel c9c3f6b0fc Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity 2012-03-29 11:05:42 -04:00
Ryan Poplin 9684a2efb0 HaplotypeCaller: Variants found on the same haplotype are now written out with phased genotypes. There are serious eval issues with MNPs so disabling them for now. 2012-03-29 09:41:29 -04:00
Guillermo del Angel a0843f125e Forgot to add file itself for new unit test 2012-03-28 21:08:18 -04:00
Guillermo del Angel 250adca350 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 21:01:49 -04:00
Guillermo del Angel e0ab4e4b30 Refactoring so that ConsensusAlleleCounter can use regular pileups and can operate correctly. This involved adding utility functions to ReadBackedPileup to count # of insertions/deletions right after current position. Added unit test for IndelGenotypeLikelihoods, esp. ConsensusAlleleCounter logic 2012-03-28 21:01:31 -04:00
Mauricio Carneiro 8f0e9d74ce GATKReportTable output refactor
writing out a GATKReportTable was O(n^2)!!!!!
New implementation is O(n). What a difference, when N = 2^16...
2012-03-28 17:19:12 -04:00
Guillermo del Angel 62ee31afba Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 16:00:38 -04:00
Guillermo del Angel 1eee9d512d a) Make computeConsensusAlleles protected inside IndelGenotypeLikelihoodsCalculationModel so we can use it in unit tests, b) make ConsensusAlleleCounter work if no extended event pileup is present (necessary for ext. event removal) 2012-03-28 15:41:39 -04:00
Mauricio Carneiro bb36cd4adf Quick fixes to BQSRGatherer and GATKReportTable
* when gathering, be aware that some keys will be missing from some tables.
* when a gatktable has no elements, it should still output the header so we know it had no records
2012-03-28 09:07:54 -04:00
Roger Zurawicki 63cf7ec7ec Added more primitives to GATK Report Column Type
- The Integer column type now accepts byte and shorts
- Updated Unit Tests and added a new testParse() test

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00