1645 Commits (f9f8589692fece0185a7e8e059b75ee4672d1c8d)

Author SHA1 Message Date
Eric Banks 533c283783 Deprecating AlignmentContext.getExtendedEventPileup(). At this point the only walkers left with any reliance on extended events are Guillermo's pooled code (he'll update soon) and the Pileup walker. David, I'll leave that last one for you (it should be easy). We can now officially rip the extended event code from the engine. 2012-03-30 10:37:14 -04:00
Eric Banks 6b49af253b Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there. 2012-03-30 10:33:30 -04:00
Eric Banks b467cd1dae Removing dependence on extended events for the remaining Variant Annotator modules. 2012-03-30 09:05:26 -04:00
Eric Banks b21889812d Removing some more usages of extended events. Not done yet, but almost there. 2012-03-30 01:51:37 -04:00
Eric Banks ad6ace2439 Resolving merge conflicts 2012-03-30 01:51:09 -04:00
Eric Banks f4d4969f23 Don't ever return null for the list of GL models 2012-03-30 00:22:40 -04:00
Eric Banks 44ac49aa34 Removing dependencies in the annotations on extended events. Some refactoring involved in this. 2012-03-30 00:17:02 -04:00
Mauricio Carneiro cbd21c6339 Nasty, nasty.....
VariantEval is overly abusive of the GATKReport (lack of) spec.

   1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers.
   2. Worse, the stratification modules somehow instead of sending the actual values to the report table, sends a string with the value "unknown" and then abuses the GATKReport spec to convert those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings.

   Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport.

   No wonder, the printing needed to be O(n^2).
2012-03-29 17:49:53 -04:00
Eric Banks c2e27729c7 Renaming PileupElement.isBeforeDeletion() to PileupElement.isBeforeDeletedBase() so that it's more clear that it can still be true while inside a deletion. Added PileupElement.isBeforeDeletionStart() to cover the case that I want where we only trigger before the actual deletion event. Similarly for after a deletion. Updated counting code in ConsensusAlleleCounter accordingly. 2012-03-29 17:08:25 -04:00
Ryan Poplin 6da9571829 resolving merge conflicts. 2012-03-29 16:16:28 -04:00
Ryan Poplin ca96544ed0 All the zero quality N bases in the solid reads are adding lots of extra paths in the assembly graph. We now require a minimum base quality for every base in the kmer before adding it to the graph. The large number of solid reads with unmapped mates was also triggering the active region traversal at every base. We now ignore that check for solid reads. 2012-03-29 16:14:29 -04:00
Eric Banks e4469a83ee First attempt at removing all traces of extended events from UG; integration tests are expected to fail. 2012-03-29 14:59:29 -04:00
Eric Banks e61e162c81 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-29 12:33:13 -04:00
Mauricio Carneiro cf364f26a0 Fixing alignment issue with the GATKReportColumn algorithm
Numeric columns were being left-aligned when they should be right-aligned. Fixed it.
2012-03-29 12:28:49 -04:00
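
The alignment rule described in the commit above (numeric columns right-aligned, others left-aligned) can be sketched with `String.format` width flags. This is only an illustration: `ColumnAlignSketch` and `formatCell` are hypothetical names, not the GATKReportColumn code.

```java
final class ColumnAlignSketch {
    // Pads a cell to the given width: numbers flush right, everything else flush left.
    // Hypothetical helper for illustration; not part of GATKReportColumn.
    static String formatCell(Object value, int width) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be at least 1");
        }
        String text = String.valueOf(value);
        // "%6s" right-aligns within 6 characters, "%-6s" left-aligns.
        String pattern = (value instanceof Number) ? "%" + width + "s" : "%-" + width + "s";
        return String.format(pattern, text);
    }

    public static void main(String[] args) {
        System.out.println("[" + formatCell("QUAL", 6) + "]");
        System.out.println("[" + formatCell(37, 6) + "]");
    }
}
```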
Mauricio Carneiro f80bd4276a fixed estimated Q reported calculation in the gatherer 2012-03-29 12:28:43 -04:00
Mauricio Carneiro 8a9fb514b6 simplifying GATKReportColumn constructor logic 2012-03-29 12:28:37 -04:00
Eric Banks e861106398 Accidentally erased important line 2012-03-29 11:08:54 -04:00
Eric Banks e4a225ed09 Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally. 2012-03-29 11:07:37 -04:00
Guillermo del Angel c9c3f6b0fc Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity 2012-03-29 11:05:42 -04:00
Ryan Poplin 9684a2efb0 HaplotypeCaller: Variants found on the same haplotype are now written out with phased genotypes. There are serious eval issues with MNPs so disabling them for now. 2012-03-29 09:41:29 -04:00
Guillermo del Angel 250adca350 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 21:01:49 -04:00
Guillermo del Angel e0ab4e4b30 Refactoring so that ConsensusAlleleCounter can use regular pileups and can operate correctly. This involved adding utility functions to ReadBackedPileup to count # of insertions/deletions right after current position. Added unit test for IndelGenotypeLikelihoods, esp. ConsensusAlleleCounter logic 2012-03-28 21:01:31 -04:00
Mauricio Carneiro 8f0e9d74ce GATKReportTable output refactor
writing out a GATKReportTable was O(n^2)!!!!!
New implementation is O(n). What a difference, when N = 2^16...
2012-03-28 17:19:12 -04:00
Guillermo del Angel 62ee31afba Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 16:00:38 -04:00
Guillermo del Angel 1eee9d512d a) Make computeConsensusAlleles protected inside IndelGenotypeLikelihoodsCalculationModel so we can use it in unit tests, b) make ConsensusAlleleCounter work if no extended event pileup is present (necessary for ext. event removal) 2012-03-28 15:41:39 -04:00
Mauricio Carneiro bb36cd4adf Quick fixes to BQSRGatherer and GATKReportTable
* when gathering, be aware that some keys will be missing from some tables.
   * when a gatktable has no elements, it should still output the header so we know it had no records
2012-03-28 09:07:54 -04:00
Roger Zurawicki 63cf7ec7ec Added more primitives to GATK Report Column Type
- The Integer column type now accepts byte and shorts
 - Updated Unit Tests and added a new testParse() test

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00
Guillermo del Angel 08f7d47d7c Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 07:42:09 -04:00
Mark DePristo 12aa72f200 Merged bug fix from Stable into Unstable 2012-03-27 22:43:00 -04:00
Mark DePristo 979a84a252 Bugfix for thread unsafe PL cache
-- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Solution is to use a fixed cache that's never updated on the fly.  My changes limit us to having no more than 500 alleles at a site, which I hope is ok but easy enough to up to a ridiculously large number.
2012-03-27 22:42:30 -04:00
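
The "fixed cache that's never updated on the fly" idea can be sketched as a table that is filled once during class initialization and only read afterwards, so concurrent readers never see a partial update. Everything below is an assumption made for illustration: the class name, the `MAX_ALLELES` value, and the choice to cache the allele pair for each diploid PL index (using the VCF ordering, where genotype j/k with j <= k sits at index k*(k+1)/2 + j). The actual cache in the commit may hold something different.

```java
final class FixedPLIndexCacheSketch {
    // Illustrative bound only; the commit message mentions a limit of 500 alleles.
    static final int MAX_ALLELES = 50;

    // Filled once in the static initializer and never written again,
    // so any thread that can see the class sees the complete table.
    private static final int[] FIRST_ALLELE;
    private static final int[] SECOND_ALLELE;

    static {
        int size = MAX_ALLELES * (MAX_ALLELES + 1) / 2;
        int[] first = new int[size];
        int[] second = new int[size];
        int index = 0;
        // VCF ordering for diploid genotypes: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ...
        for (int k = 0; k < MAX_ALLELES; k++) {
            for (int j = 0; j <= k; j++) {
                first[index] = j;
                second[index] = k;
                index++;
            }
        }
        FIRST_ALLELE = first;
        SECOND_ALLELE = second;
    }

    static int firstAllele(int plIndex) {
        check(plIndex);
        return FIRST_ALLELE[plIndex];
    }

    static int secondAllele(int plIndex) {
        check(plIndex);
        return SECOND_ALLELE[plIndex];
    }

    // Out-of-range requests fail loudly instead of growing the cache.
    private static void check(int plIndex) {
        if (plIndex < 0 || plIndex >= FIRST_ALLELE.length) {
            throw new IllegalArgumentException("PL index " + plIndex + " exceeds the fixed cache");
        }
    }
}
```

The trade-off is the one the commit names: a hard upper bound on alleles in exchange for never mutating shared state after startup.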
Guillermo del Angel 8f34412fb8 First Pool Caller exact model: silly straightforward math implementation of biallelic pool caller exact likelihood model, no attempt at any smartness or optimization, no support yet for generalized multiallelic form, just hooking up for testing 2012-03-27 20:59:44 -04:00
Guillermo del Angel ed322bd73f Fix again merge issues 2012-03-27 15:03:13 -04:00
Guillermo del Angel b4a7c0d98d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-27 15:01:03 -04:00
Guillermo del Angel 343a061b1c Fix merge issues when incorporating new AF calculations changes 2012-03-27 15:00:44 -04:00
Mauricio Carneiro 1b75663178 BQSR Gatherer implementation and integration tests
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
   * optimized empirical qual calculation when merging recalibration reports
   * centralized the quality score quantization functionalities
   * unified the creating/loading of all the key manager/hash table structures.
   * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
   * added integration tests for BQSR and on-the-fly recalibration
2012-03-27 13:50:22 -05:00
Ryan Poplin 5dbd3625cd Initial algorithm for choosing best alternate haplotypes to genotype based on the likelihoods from all samples instead of choosing for each sample independently. Simple tradeoff of penalty for increasing model complexity and likelihood of the data. 2012-03-27 13:38:52 -04:00
Eric Banks c112e0824a I was adding verbose output to the Pileup output for a one-off and decided that I might as well commit it as an option. Updated deprecated calls while I was in there. 2012-03-27 11:09:03 -05:00
Mark DePristo a638996fe2 Cleanup of VariantEval, diatribe about performance problems with StateKey
-- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear
-- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.
2012-03-27 11:56:24 -04:00
Mark DePristo 679bb03014 Simple utility function for converting an Iterable<T> to Collection<T> 2012-03-27 11:54:58 -04:00
Mark DePristo 1f5f737c8b Optimizing the GATKReportTable.write
-- Better iteration, caching of strings, better printf calls, to improve the writing performance of GATKReportTables
2012-03-27 11:54:35 -04:00
Mark DePristo 913c8b231f Fix ErrorRatePerCycle to override equals and hashcode
-- Fixes failing integration tests
2012-03-27 10:35:32 -04:00
Eric Banks c07a577ba3 Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously). 2012-03-27 00:27:44 -05:00
Mark DePristo 34ea443cdb Better algorithm for choosing which indel alleles are present in samples
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the counting towards 5.  This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc.  If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general.  We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
2012-03-26 16:28:49 -04:00
Mark DePristo 11b6fd990a GATKReportColumn optimizations
-- Was TreeMap even though the sorting wasn't used.  Replaced with LinkedHashMap.
2012-03-26 16:28:49 -04:00
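
For readers unfamiliar with the two map types named above, a small sketch of the behavioural difference: `TreeMap` keeps keys sorted (paying O(log n) per insert and lookup), while `LinkedHashMap` keeps insertion order with hash-table cost. The class and method names are hypothetical and only illustrate the standard library behaviour, not the GATKReportColumn code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapOrderSketch {
    // Inserts the same three keys into the given map and returns the iteration order.
    static List<String> keysAfterInsert(Map<String, Integer> map) {
        map.put("zeta", 1);
        map.put("alpha", 2);
        map.put("mid", 3);
        return new ArrayList<>(map.keySet());
    }

    static List<String> sortedOrder() {
        return keysAfterInsert(new TreeMap<>());
    }

    static List<String> insertionOrder() {
        return keysAfterInsert(new LinkedHashMap<>());
    }

    public static void main(String[] args) {
        System.out.println("TreeMap:       " + sortedOrder());
        System.out.println("LinkedHashMap: " + insertionOrder());
    }
}
```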
Mark DePristo 6be5e82860 VariantEval scalability optimizations
-- StateKey no longer extends TreeMap.  It's now a final immutable data structure that caches its toString and hashcode values.  TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted tostring function.
-- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils
-- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects.  Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE
-- VariantEvaluator base class has a cached getSimpleName() function
-- VEUtils: general cleanup due to refactoring of StateKey
-- VEWalker: much better iteration of map data structures.  If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet.  This is far better than iterating over the keys and calling get() on each key.
2012-03-26 16:28:48 -04:00
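
The VEWalker note above about `Map.Entry` can be shown side by side. This is a minimal sketch with hypothetical names (`EntryIterationSketch`, `sample`), not code from VariantEval; both methods compute the same sum, but the second avoids a second lookup per key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EntryIterationSketch {
    // Slower pattern: iterate the keys, then pay for a lookup on every key.
    static long sumViaKeys(Map<String, Integer> counts) {
        long total = 0;
        for (String key : counts.keySet()) {
            total += counts.get(key);
        }
        return total;
    }

    // Preferred pattern: each entry already carries both key and value.
    static long sumViaEntries(Map<String, Integer> counts) {
        long total = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }

    // Small made-up data set for demonstration.
    static Map<String, Integer> sample() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("snp", 1);
        counts.put("indel", 2);
        counts.put("mnp", 3);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(sumViaKeys(sample()) + " " + sumViaEntries(sample()));
    }
}
```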
Guillermo del Angel 1c424c0daf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-26 15:15:50 -04:00
Ryan Poplin 019145175b Major optimizations to graph construction through better use of built in graph.containsVertex and vertex.equals methods. Minor optimizations to MathUtils.approximateLog10SumLog10 method 2012-03-26 11:32:44 -04:00
Ryan Poplin 1fa66f76c9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-25 23:04:47 -04:00
Guillermo del Angel ce617b2dfc Bug fix to previous UnifiedGenotyperEngine refactoring, removed debug code 2012-03-25 10:20:21 -04:00
Guillermo del Angel db54c2625f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-25 09:53:35 -04:00
Guillermo del Angel deb4586559 Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGenotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1 2012-03-24 21:49:43 -04:00
Mark DePristo b063bcd38d Removing update0 support in VariantEval
-- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations.
-- This allows us to avoid calling update0 at every genomic base in 100Ks of evaluations when there are a lot of stratifications.
-- No need to modify the integration tests, this optimization doesn't change the result of the calculation
2012-03-23 21:02:21 -04:00
Mauricio Carneiro 0509d316d9 More information in the recalibration report
* added empirical quality counts to allow quantization during on-the-fly recalibration to any level
   * added number of observations and errors to all tables to enable plotting of all covariates
2012-03-23 16:15:19 -04:00
Mauricio Carneiro 9f74969e3a BQSR with GATKReport implementation
* restructured BQSR to report recalibrated tables.
   * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration)
   * linked quality score quantization to the BQSR stage, outputting a quantization histogram
   * included the arguments used in BQSR to the GATK Report
   * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities

On-the-fly recalibration with GATK Report

   * loads all tables from the GATKReport using existing infrastructure (with minor updates)
   * implemented initialization of the covariates using BQSR's argument list
   * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key
   * applied quality quantization to the base recalibration
   * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions
2012-03-23 15:42:32 -04:00
Mauricio Carneiro f421062b55 Updated read group covariate to use sample.lane instead of the id
Added Unit test.
2012-03-23 15:24:07 -04:00
Mauricio Carneiro 539da9e3e1 Fixing GATKReport exception handling when loading a report
* allowing tables with no description to go through
   * GATKReportTable should be more lenient with the format requirements (added to-dos for roger)
2012-03-23 15:23:13 -04:00
Eric Banks 2511839068 Merged bug fix from Stable into Unstable 2012-03-23 13:51:33 -04:00
Eric Banks d3f2bc4361 Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles. 2012-03-23 13:51:00 -04:00
Mark DePristo e4ec90cfce Merged bug fix from Stable into Unstable 2012-03-23 11:27:34 -04:00
Mark DePristo ff26f2bf68 HierarchicalMicroScheduler no longer attempts to wrap exceptions
-- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors.  Now HMS simply checks whether the throwable it received on error was a RuntimeException.  If so, it is stored and rethrown without wrapping later.  If it isn't, only in this case is the exception wrapped in a ReviewedStingException.
-- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line
-- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled.
-- I believe this will finally put to rest all of these annoying HMS captures.
2012-03-23 11:27:21 -04:00
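
The store-and-rethrow rule described in the commit above can be sketched as follows. This is an assumption-laden illustration, not the HierarchicalMicroScheduler code: `ErrorTrackerSketch` and `WrappedEngineException` are hypothetical stand-ins (the latter for ReviewedStingException), and keeping only the first reported error is a choice made here for simplicity.

```java
final class ErrorTrackerSketch {
    // Hypothetical stand-in for the engine's wrapper exception type.
    static final class WrappedEngineException extends RuntimeException {
        WrappedEngineException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private RuntimeException stored;

    // Called when a worker fails. RuntimeExceptions are kept as-is so the
    // caller later sees the original type; anything else is wrapped once.
    synchronized void notifyOfError(Throwable error) {
        if (stored != null) {
            return; // assumption: only the first error is kept
        }
        if (error instanceof RuntimeException) {
            stored = (RuntimeException) error;
        } else {
            stored = new WrappedEngineException("Error in worker thread", error);
        }
    }

    // Called from the controlling thread; rethrows the stored error, if any.
    synchronized void rethrowIfFailed() {
        if (stored != null) {
            throw stored;
        }
    }
}
```

With this shape, a NullPointerException raised in a worker surfaces as a NullPointerException whether or not threading is enabled, which is the property the integration test in the commit checks.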
Ryan Poplin 9d22471b79 Merged bug fix from Stable into Unstable 2012-03-23 10:48:34 -04:00
Ryan Poplin ab288354e9 Better error message for malformed input recal file. 2012-03-23 10:47:01 -04:00
Mark DePristo fee8d86f63 VariantEval optimization
-- Use a LinkedHashMap not a TreeMap so iteration is faster.
-- Note that with a lot of stratifications the update0 is taking up a lot of time.  For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call
2012-03-22 22:13:24 -04:00
Mark DePristo 6df96644d9 Unified, standard IndelSummary metrics for VariantEval
-- Now you always get SNP and indel metrics with VariantEval!
--   Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions
-- Updated VE integration tests as appropriate
2012-03-22 21:24:37 -04:00
Mark DePristo bcf80cc7b3 Cleanup in VariantEval. Example of molten VariantEval output
-- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvaluator.java so everyone can share.  Code updated to use these routines where appropriate
-- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF
-- TableType, which used to be an interface, is now an abstract class, allowing us to implement some general functionality and avoid duplication.
-- This included creating a getRowName() function that used to be hardcoded as "row" but now can be overridden.
-- #### This allows us to implement molten tables, which are vastly easier to use than multi-row data sets.  See IndelHistogram class (in later commit) for example of molten VE output
2012-03-22 21:24:37 -04:00
Mark DePristo 9ddd5aec93 More eval modules being removed from VariantEval
-- IndelStatistics is superseded by IndelStatistics
2012-03-22 21:24:36 -04:00
Mark DePristo bd5b6d1aba Remove no longer in use Eval modules from VariantEval
-- No more IndelLengthHistogram (superseded by IndelSummary in subsequent commit)
-- No more SamplePreviousGenotypes or PhaseStats
-- No more MultiallelicAFs
2012-03-22 21:24:36 -04:00
Menachem Fromer 7faa9938b1 Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-22 17:43:44 -04:00
Menachem Fromer b9b9219ac7 Added respectPhaseInInput flag to RBP and integration tests 2012-03-22 17:40:21 -04:00
Guillermo del Angel f198cec5e2 Temp commit: new structure for pool caller, now all work is in the same framework as in UG. There's a new genotype calculation model, PoolGenotypeCalculationModel, that does all the work and plugs into UnifiedGenotyperEngine. A new AF module for pools is upcoming. Old pool caller will be removed once all work is migrated 2012-03-22 15:46:39 -04:00
Menachem Fromer 1dfaacfeb5 Check for consistency of the BAM and VCF sample names, with a command line disable to throw if you know what you are doing 2012-03-22 12:40:15 -04:00
Guillermo del Angel b02ef95bcf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-22 12:14:12 -04:00
Guillermo del Angel 92676c63ca Make constructor of IndelGenotypeLikelihoodsCalculationModel public so it can be used in unit tests 2012-03-22 12:13:59 -04:00
Guillermo del Angel 58965d6a6e Merged bug fix from Stable into Unstable 2012-03-22 11:04:11 -04:00
Guillermo del Angel b8cd959461 Potential corner condition bug fix: protect against null pointer exceptions when computing consensus indel bases when UG is discovering alt alleles. If an alt allele has non-standard bases, skip allele gracefully instead of adding null object into list 2012-03-22 10:06:22 -04:00
Ryan Poplin a29fc6311a New debug option to output the assembly graph in dot format. Merge nodes in assembly graph when possible. 2012-03-21 15:48:55 -04:00
Eric Banks 8c09ff9459 Merged bug fix from Stable into Unstable 2012-03-21 12:44:43 -04:00
Eric Banks 58245bfa2f Bug fix: check to see whether there's a BasePileup before asking for one. 2012-03-21 12:44:09 -04:00
Eric Banks 07c3bd32b3 Bug fix: merge NO_VARIATION records with those of another type. The sad part is that this WAS covered by integration tests but someone updated the MD5s without actually paying attention... 2012-03-21 12:42:13 -04:00
Eric Banks dcf2fa361d Minor cleanup 2012-03-21 12:14:31 -04:00
Eric Banks ab1c48745b Need to catch RuntimeExceptions coming out of Picard too so that they show up as UserErrors (some BAM errors are thrown as REs). 2012-03-21 12:13:52 -04:00
Ryan Poplin 9e10779fa7 Caching log calculations cut the non-Map runtime of HaplotypeCaller in half. Moved the qual log cache used in HC and PairHMM into a common place and added unit tests. 2012-03-21 08:45:42 -04:00
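
The kind of log caching mentioned in the commit above can be sketched as a lookup table indexed by base quality. Which quantity the real cache stores is not stated in the commit; this sketch assumes it is log10(1 - 10^(-Q/10)), the log probability that a base with Phred quality Q is correct. The class name and the `MAX_QUAL` bound are hypothetical.

```java
final class QualLogCacheSketch {
    // Assumption: largest quality value the table needs to cover.
    static final int MAX_QUAL = 254;

    // Computed once at class initialization, then read-only.
    private static final double[] LOG10_PROB_CORRECT = new double[MAX_QUAL + 1];

    static {
        for (int q = 0; q <= MAX_QUAL; q++) {
            double errorProb = Math.pow(10.0, -q / 10.0);
            // For q = 0 the error probability is 1, so this is log10(0) = -Infinity.
            LOG10_PROB_CORRECT[q] = Math.log10(1.0 - errorProb);
        }
    }

    // Table lookup replaces a pow and a log10 call in the inner loop.
    static double log10ProbCorrect(int qual) {
        if (qual < 0 || qual > MAX_QUAL) {
            throw new IllegalArgumentException("quality out of range: " + qual);
        }
        return LOG10_PROB_CORRECT[qual];
    }
}
```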
Mauricio Carneiro 0e93cf5297 Taking care of bad cigars in the GATK
* fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events.
   * added Unit tests for BadCigarFilter
   * updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar
   * added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)
2012-03-20 14:32:57 -04:00
Eric Banks 5e79046c98 Minor change but I realized from Mark's commit that the code I stole it from was flawed 2012-03-20 08:55:56 -04:00
Eric Banks ade1971581 Since we allow any generic header types, there's no longer any reason to check for supported types 2012-03-20 00:12:17 -04:00
Eric Banks 2324c5a74f Simplified the interface for simple VCF header lines by making the VCFSimpleHeaderLine not abstract anymore - now any arbitrary header line with an ID (e.g. the contig and ALT lines) can be part of this class without having to define new classes. Also, renamed the 'named' header line to 'id' since that's more accurate. 2012-03-19 21:29:24 -04:00
Roger Zurawicki 7afb333811 GATK Report code cleanup
- Updated the documentation on the code
 - Made the table.write() method private and updated necessary files.
 - Added a constructor to GATKReport that takes GATKReportTables
 - Optimized my code

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-19 11:53:57 -04:00
Mauricio Carneiro 0d4ea30d6d Updating the BQSR Gatherer to the new file format
This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.
2012-03-19 09:02:27 -04:00
Eric Banks 9223e451a3 Merged bug fix from Stable into Unstable 2012-03-18 00:54:19 -04:00
Eric Banks 344a938a70 When checking to make sure that we have cached enough data in the PL array, use the converted index value since that's what will be used as an index into the array. 2012-03-18 00:36:30 -04:00
Eric Banks be9e48ba29 Merged bug fix from Stable into Unstable 2012-03-16 14:33:53 -04:00
Mauricio Carneiro ec4a870a0f Added @PG tag to ReduceReads
Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's lives easier if they want to include the PG tag in their walkers.
2012-03-16 14:09:07 -04:00
Mauricio Carneiro 3bfca0ccfd BitSet implementation of the on-the-fly recalibration using the CSV format file.
Infrastructure:
   * Added static interface to all different clipping algorithms of low quality tail clipping
   * Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
   * Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
   * EventType is now an independent enum with added capabilities. All functionality is now centralized.

 BQSR and RecalibrateBases:
   * On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
   * Refactored the object creation to take advantage of the compact key structure
   * Replaced nested hash maps with single hash maps indexed by bitsets
   * Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
   * Excluded contexts with N's from the output file.
   * Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
   * Redefined error for indels to look at the previous base in negative strand reads (using new PE functionality)
   * Added the covariate ID (for optional covariates) to the output for disambiguation purposes
   * Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
   * Reduced memory usage of the BQSR script to 4

 Tests:
   * Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
   * Added tests for keys without optional covariates
   * Added tests for on-the-fly recalibration (but more tests are necessary)
2012-03-16 13:02:15 -04:00
Mauricio Carneiro ca11ab39e7 BitSets keys to lower BQSR's memory footprint
Infrastructure:
	* Generic BitSet implementation with any precision (up to long)
	* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
	* Memoized implementation of the BitSet utils for better performance.
	* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
	* Replace log/sqrt with bitwise logic to get rid of numerical issues

 BQSR:
	* All covariates output BitSets and have the functionality to decode them back into Object values.
	* Covariates are responsible for determining the size of the key they will use (number of bits).
	* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
	* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead

 Tests:
	* Unit tests added to every method of BitSetUtils
	* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
	* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
2012-03-16 13:01:48 -04:00
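
The infrastructure notes above about replacing `Math.pow` and log with bit operations can be illustrated with a short sketch. It is not the BitSetUtils code: the class and method names are hypothetical. The underlying point is that a `double` has a 53-bit significand, so integer arithmetic routed through `Math.pow` loses exactness above 2^53, whereas shifts on a `long` stay exact.

```java
final class BitShiftSketch {
    // Exact 2^exponent as a long, valid for 0 <= exponent <= 62.
    static long powerOfTwo(int exponent) {
        if (exponent < 0 || exponent > 62) {
            throw new IllegalArgumentException("exponent out of range: " + exponent);
        }
        return 1L << exponent;
    }

    // Number of bits needed to store values in 0..maxValue.
    // Bitwise replacement for a ceil(log2(maxValue + 1)) computed in floating point.
    static int bitsNeeded(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be non-negative");
        }
        return maxValue == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxValue);
    }

    public static void main(String[] args) {
        // The double-based sum cannot represent 2^53 + 1; the long-based one can.
        System.out.println((long) (Math.pow(2, 53) + 1));
        System.out.println(powerOfTwo(53) + 1);
        System.out.println(bitsNeeded(255) + " " + bitsNeeded(256));
    }
}
```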
Eric Banks dce6b91f7d Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appropriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing. 2012-03-16 11:14:37 -04:00
Eric Banks 41068b6985 The commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code. 2012-03-15 16:08:58 -04:00
Ryan Poplin 0c6b34e9df Fixing a bug identified by the ActivityProfile unit tests 2012-03-15 14:24:30 -04:00
Ryan Poplin 252b830aa8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-15 11:56:04 -04:00
Ryan Poplin 1429ddcf55 Adding contracts and unit tests for HaplotypeCaller LikelihoodCalculationEngine 2012-03-14 21:25:43 -04:00
Mark DePristo 7c5cdb51c2 UnitTests for ActivityProfile and minor ART cleanup
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
2012-03-14 17:26:37 -04:00