Commit Graph

34 Commits (8ed78b453f1a2c3e07a9efc703df057c0fa27c0c)

Author SHA1 Message Date
Mark DePristo 42d3919ca4 Expanded functionality for writing BAMs from HaplotypeCaller
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment.  Refactored out the big main and supplementary static methods.  Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode.  This test only ensures that the code runs and doesn't exception out.  It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype.  Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made.  Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
2013-03-03 12:07:29 -05:00
Mark DePristo e40d83f00e Final version of PairHMMs with correct edge conditions
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses.  This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length.  I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10.  This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM.  All unit tests from the original HMM implementations pass now
-- Added locs of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes.  Fixed bug.  Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit.  Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum.  This involved moving some initialize() code into the computeLikelihoods function.  That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
2013-02-09 19:19:22 -05:00
David Roazen 1fb182d951 Restore Utils.appendArray()
This utility method was used by the PipelineTest class, and deleting it
was causing tests to not compile.
2013-02-01 13:51:31 -05:00
Mark DePristo 6d9816f1a5 Cleanup unused utils functions, and add unit test for one (append) 2013-02-01 13:51:31 -05:00
Mark DePristo b707331332 Encrypt GATK AWS keys using the GATK private key, and decrypt as needed as a resource when uploading to AWS logs
-- Has the overall effect that the GATK user AWS keys are no longer visible in the gatk source as plain text.  This will stop AWS from emailing me (they crawl the web looking for keys)
-- Added utility EncryptAWSKeys that takes as command line arguments the GATK user AWS access and secret keys, encrypts them with the GATK private key, and writes out the resulting file to resources in phonehome.
-- GATKRunReport now decrypts as needed these keys using the GATK public key as resources in the GATK bundle
-- Refactored the essential function of Resource (reading the resource) from IOUtils into the class itself.  Now how to get the data in the resouce is straightforward
-- Refactored md5 calculation code from a byte[] into Utils.  Added unit tests
-- Committing the encrypted AWS keys
-- #resolves https://jira.broadinstitute.org/browse/GSA-730
2013-01-30 16:42:23 -05:00
Mauricio Carneiro 2a4ccfe6fd Updated all JAVA file licenses accordingly
GSATDG-5
2013-01-10 17:06:41 -05:00
Mauricio Carneiro 97fd5de260 Merging latest CMI updates with UNSTABLE 2012-11-27 09:08:00 -05:00
Mauricio Carneiro a3f5932501 Fixed null pointer exception in Integration Tests
When running Utils.setupWriter with NO_PG_TAG set, the writer was attempting to create a program record with the null pointer. Fixed.
2012-11-26 11:12:27 -05:00
Mark DePristo cc7680e601 NA12878 knowledge base backed by MongoDB
-- Idea is simply to create a persistent database of all TP/FP sites on chr20 in NA12878.  Individual callsets can be imported, and a consensus algorithm is run over all callsets in the database to create a consensus collection, which can be used to assess NA12878 callsets for GATK and methods development
-- Framework for representing simple VariantContexts and Genotypes in MongoDB, querying for records, and iterating over them in the GATK
-- Not hooked up to Tribble, but could be done reasonably easily now (future TODO)
-- Tools to import callsets, create consensus callsets, import and export reviews
-- Scripts to reset the knowledge base and repopulate it with the standard data files (Eric will expand)
-- Actually scales to all of chr20, includes AssessNA12878 that reads a VCF and itemizes it against the truth data set
-- ImportCallset can load OMNI, HM3, CEU best practices, mills/devine sites and genotypes, properly marking sites as poly/mono/unk as well as TP/FP/UNK based on command line parameters
-- Added shell scripts that start up a local mongo db, that connect to a local or BI hosted mongo for NA12878.db for debugging, and a setupNA12878db script that can load OMNI, HM3, CEU best practices, Mills/Devine into the db and then update the consensus.
-- Reviewed sites can be exported to a VCF, and imported again, as a mechanism to safely store the only non-recoverable data from the Mongo DB.
-- Created a NA12878DBWalker that manages the outer DB interaction, and that all MongoDB interacting walkers inherit from.  Added a NA12878DBArgumentCollection.java consolating all of the common command line arguments (though strictly not necessary as all of this occurs in the root walker)

UnitTests
-- Can connect to a test knowledge base for development and unit testing
-- PolymorphicStatus, TruthStatus, SiteIterator
-- NA12878KBUnitTestBase provides simple utilities for connecting to the test mongo db, getting calls, etc
-- MongoVariantContext tests creation, matching, and encoding -> writing -> read -> decoding from the mongodb

AssessNA12878
-- Generic tool for comparing a NA12878 callset against the knowledge base.  See http://gatkforums.broadinstitute.org/discussion/1848/using-the-na12878-knowledge-base for detailed documentation
-- Performs trivial filtering on FS, MQ, QD for SNPs and non-SNPs to separate out variants likely to be filtered from those that are honest-to-goodness FPs

Misc
-- Ability to provide Description for Simplified GATK report
2012-11-20 18:50:52 -05:00
Mauricio Carneiro a079d8d0d1 Breaking the utility to write @PG tags for SAMFileWriters and StingSAMFileWriters 2012-11-14 10:33:22 -05:00
Mark DePristo 3663fe1555 Framework for evaluating the performance and scaling of the ExactAF models 2012-10-03 19:55:11 -07:00
Mark DePristo 17ca543937 More ExactModel cleanup
-- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array.  This is way heavy-weight compared to just making the array each time.
-- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object.  That's the place it should really live
-- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP.  Added TODOs
2012-10-03 19:55:11 -07:00
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Mark DePristo 669c43031a BCF2 optimizations; parallel CombineVariants
-- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header.  Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently)
-- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils.  Fixed bug in edge case that never occurred in practice
-- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping its right)
-- More ways to access the data in VCFHeader
-- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output
-- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary.  We just guess that on average we need ~10 elements for the attribute map
-- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet
-- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement
2012-08-15 21:13:16 -04:00
Mark DePristo be3230a1fd Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Created makeCombinations utility function (very useful!).  Moved template from VariantContextTestProvider
-- UnitTests for basic functionality
2012-08-15 21:13:15 -04:00
Mark DePristo 8014178f2f Algorithmically faster version of DiffEngine
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see.  This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly.  Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration test for new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
2012-06-14 16:42:30 -04:00
Mark DePristo d37a8a0bc8 Efficient Genotype object Intermediate commit
-- Created a new Genotype interface with a more limited set of operations
-- Old genotype object is now SlowGenotype.  New genotype object is FastGenotype.  They can be used interchangable
-- There's no way to create Genotypes directly any longer.  You have to use GenotypeBuilder just like VariantContextBuilder
-- Modified lots and lots of code to use GenotypeBuilder
-- Added a temporary hidden argument to engine to use FastGenotype by default.  Current default is SlowGenotype
-- Lots of bug fixes to BCF2 codec and encoder.
-- Feature additions
  -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes
  -- Cleaned up semantics of subContextFromSamples.  There's one function that either rederives or not the alleles from the subsetted genotypes

-- MASSIVE BUGFIX in SelectVariants.  The code has been decoding genotypes always, even if you were not subsetting down samples.  Fixed!
2012-06-14 16:42:24 -04:00
Mark DePristo 84d1e8713a Infrastructure for combining VariantEvaluations
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
2012-04-13 17:00:36 -04:00
Eric Banks 6ddf2170b6 More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice. 2012-04-09 11:46:16 -04:00
Mark DePristo 4b45a2c99d Final version of new VariantEval infrastructure.
*** WAY FASTER ***
 -- 3x performance for multiple sample analysis with 1000 samples
 -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
 -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2

-- Remove the TableType system, as this was way too complex.  No longer possible to embed what were effectively multiple tables in a single Evaluator.  You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis.  IndelLengthHistogram is now a @Molten data type.  GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables at @DataPoints.  You get an error if you do.
-- Simplified entire IO system of VE.  Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
2012-03-30 15:31:56 -04:00
Mark DePristo 9f1cd0ff66 Lots of new functionality for StratificationStates manager
-- Really working according to unit tests
-- A nCombination utils
2012-03-30 15:31:54 -04:00
Mark DePristo 679bb03014 Simple utility function for converting an Iterable<T> to Collection<T> 2012-03-27 11:54:58 -04:00
Mauricio Carneiro ec4a870a0f Added @PG tag to ReduceReads
Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's life's easier if they want to include the PG tag in their walkers.
2012-03-16 14:09:07 -04:00
Mauricio Carneiro e61e5c7589 Refactor of ReadClipper unit tests
* expanded the systematic cigar string space test framework Roger wrote to all tests
* moved utility functions into Utils and ReadUtils
* cleaned up unused classes
2011-12-15 19:05:43 -05:00
Eric Banks 35fc2e13c3 Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all. 2011-12-14 15:31:09 -05:00
Mark DePristo c2b97030a4 IntervalUtils for completely balanced locus-based scatter/gather
-- scatterLocusIntervals master utility
-- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc
-- Util function for reversing a list (List<T> -> List<T>, unlike Collections version)
-- DoC is PartitionType.INTERVAL
-- Significant unit tests on new functionality (all passing)
-- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work
2011-11-02 10:49:40 -04:00
Khalid Shakir b80d407dc7 No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path.
Other minor cleanup.
2011-10-27 14:17:07 -04:00
Mark DePristo 0b88af4af9 Counts of records failing filters are displayed sorted
-- Stops random ordering of the output, as the counts are returned sorted by string name of the class
-- Deleted now unused sh*tty assessors in Utils
2011-10-06 18:42:26 -07:00
Mark DePristo a91ac0c5db Intermediate commit of bugfixes to CombineVariants 2011-09-21 10:15:05 -04:00
Mark DePristo 7bf006278d Moved ResolveHostname to general utils as a static function 2011-08-28 12:04:16 -04:00
David Roazen d3437e62da Added a simple utility method Utils.optimumHashSize() to calculate the optimum
initial size for a Java hash table (HashMap, HashSet, etc.) given an expected
maximum number of elements. The optimum size is the smallest size that's
guaranteed not to result in any rehash / table-resize operations.

Example Usage:
Map<String, Object> hash = new HashMap<String, Object>(Utils.optimumHashSize(expectedMaxElements));

I think we're paying way too heavy a price in unnecessary rehash operations across
the GATK. If you don't specify an initial size, you get a table of size 16 that gets
completely rehashed and doubles in size every time it becomes 75% full. This means you
do at least twice as much work as you need to in order to populate your table:

(n + n/2 + n/4 + ... 16 ~= (1 + 1/2 + 1/4...) * n ~= 2 * n
2011-08-02 21:59:06 -04:00
Mark DePristo 9992c373be Optimize imports run on the whole project, public and private. I just got too tired of all of the unused imports floating around. Confirmed that the system builds after the changes. 2011-07-17 20:29:58 -04:00
David Roazen 3c9497788e Reorganized the codebase beneath top-level public and private directories,
removing the playground and oneoffprojects directories in the process. Updated
build.xml accordingly.
2011-06-28 06:55:19 -04:00