gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Ryan Poplin	c025e84c8b	Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping. -- Updated a few integration tests for HC, UG, and UG general ploidy	2013-02-19 14:09:24 -05:00
Mark DePristo	be45edeff2	ActivityProfile and ActiveRegions respects engine interval boundaries -- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present. -- UnitTests for ActiveRegion.splitAndTrimToIntervals -- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals -- UnitTesting overlap function in GenomeLocSortedSet -- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet. Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775 -- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals -- Added docs for the constructors in GLSS -- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s -- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser	2013-02-18 10:40:25 -05:00
Ryan Poplin	b7e9c342c7	Reducing the size of the reference padding in the HaplotypeCaller.	2013-02-17 11:09:00 -05:00
Mark DePristo	73a363b166	Update MD5s due to new QualityUtils calculations -- Increase the allowed runtime of one UG integration test -- The GGA indels mode runs two UG commands, and was barely under the 10 minute limit before. Some updates can push this right over the edge. Increased limit -- CalibrateGenotypeLikelihoods runs on a small data set now, so it's faster -- Updating MD5s due to more correct quality utils. DuplicatesWalkers quality estimates have changed. One UG test has different FS and rank sum tests because the conversion to phred scores are slightly (second decimal place) different	2013-02-16 07:31:38 -08:00
Mark DePristo	3231031c1a	Bugfix for FisherStrand -- FisherStrand pValues can sum to slightly greater than 1.0, so they need to be capped to convert to a Phred-scaled quality score	2013-02-16 07:31:38 -08:00
Mark DePristo	9a29d6d4be	Fix a catastrophic bug (WoW!) in the reference calculation of the UG -- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and as a side effect other utils converted this to a meaningless 0.0. This is all because there wasn't a unit test. -- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.	2013-02-16 07:31:38 -08:00
Mark DePristo	9e28d1e347	Cleanup and unit tests for QualityUtils -- Fixed a few conversion bugs with edge case quals (ones that were very high) -- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s -- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK -- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly. -- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate. -- Renamed probToQual to trueProbToQual -- Added goodProbability and log10OneMinusX to MathUtils -- Went through the GATK and cleaned up many uses of QualityUtils -- Cleanup constants in QualityUtils -- Added full docs for all of the constants -- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity -- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's a BQSR-specific feature -- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x) -- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils	2013-02-16 07:31:37 -08:00
MauricioCarneiro	d80b99143f	Merge pull request #37 from broadinstitute/rp_left_alignment_hc_contract_GSA-771	2013-02-15 08:32:45 -08:00
MauricioCarneiro	1dd284a5bb	Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720 PrintReads writes a header when used with -BQSR	2013-02-15 07:18:28 -08:00
MauricioCarneiro	b58a0eca6b	Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62 Refactored GATKDocs categories some more ( GSATDG-62 )	2013-02-14 22:35:07 -08:00
Tad Jordan	6cb80591e3	PrintReads writes a header when used with -BQSR	2013-02-14 22:19:14 -05:00
Guillermo del Angel	b18f216033	Updated md5's from BiasedDownsamplerIntegrationTest that changed due to changes in HaplotypeCaller - changing HashMaps to LinkedHashMaps changed ordering of reads presented to BiasedDownSampler which changed reads chosen, thereby marginally changing PL's and some site info.	2013-02-14 20:18:49 -05:00
Ryan Poplin	871c8b3866	No need to consider haplotypes which Smith-Waterman aligns off the end of the large padded reference.	2013-02-14 11:18:10 -05:00
Geraldine Van der Auwera	6208742f7c	Refactored GATKDocs categories some more ( GSATDG-62 ) -- Renamed ValidatePileup to CheckPileup since validation is reserved word -- Renamed AlignmentValidation to CheckAlignment (same as above) -- Refactored category definitions to use constants defined in HelpConstants -- Fixed a couple of minor typos and an example error -- Reorganized the GATKDocs index template to use supercategories -- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)	2013-02-13 16:49:18 -05:00
depristo	357d196dad	Merge pull request #32 from broadinstitute/yf_per-sample-downsampling_GSA_765 Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.	2013-02-13 10:08:11 -08:00
Yossi Farjoun	6d12e5a54f	Fixed md5s for the per-sample downsampling IntegrationTests that were disabled. - got md5s from an interim version that does not have the per-sample downsampling hooked up - added an integration test that forces the result from flat-downsampling to equal that which results from an equivalent flat contamination file	2013-02-13 12:49:39 -05:00
Guillermo del Angel	4308b27f8c	Fixed non-determinism in HaplotypeCaller and some UG calls - -- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads. -- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast) -- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling). -- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test -- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).	2013-02-12 15:43:29 -05:00
Geraldine Van der Auwera	dff5ef562b	Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details) -- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools -- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities -- Reworded some category names and descriptions to be more explicit and user-friendly	2013-02-12 13:36:15 -05:00
Ryan Poplin	3f2f837b6a	Optimization to ReadPosRankSumTest: Don't do the work of parsing through the cigar string for non-informative reads.	2013-02-11 11:36:09 -05:00
Mark DePristo	b4417dff5b	Updating MD5s due to changes in HMM -- New HMM has two impacts on MD5s. First, all indel calls with UG and all calls by HC no longer have the HaplotypeScore computed. This is for the good, especially given the computational cost of this annotation and unclear value for HC. Second, the BaseQualityRankSum values are changing by tiny amounts because of the changes in the HMM likelihoods. -- Disabled three tests from Yossi that cause strange MD5 differences with calls for HC, created a JIRA for him to enable and fix -- Disabled the non-deterministic GGA test. Assigned JIRA to Guillermo -- With this push I expect all integration tests to pass	2013-02-09 19:19:28 -05:00
Mark DePristo	35139cf990	HaplotypeScore only annotates SNPs -- With the new HMM edge conditions the likelihoods are offset by log10(n possible starts), so the results don't really mean "fits the haplotype well" any longer. This results in grossly inflated HaplotypeScores for indels and with the HaplotypeCaller. So I'm simply not going to emit this annotation value any longer for indels and for the HC	2013-02-09 19:19:28 -05:00
Mark DePristo	e40d83f00e	Final version of PairHMMs with correct edge conditions -- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites -- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done -- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow -- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact -- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC -- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now -- Added locs of docs -- Generalize unit tests with multiple equivalent matches of read to haplotype -- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10 -- Functions to dumpMatrices for debugging -- Fix nasty bug (without original unit tests) in LoglessPairHMM -- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again. -- Added dupString(string, n) method to Utils -- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes -- Unit tests for the hapStartIndex functionality of PairHMM -- Moved computeFirstDifferingPosition to PairHMM, and added unit tests -- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10 -- Still TODOs left in the code that I'll fix up -- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so -- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal -- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors	2013-02-09 19:19:22 -05:00
Mark DePristo	09595cdeb9	Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments	2013-02-09 13:06:54 -05:00
Mark DePristo	2d802e17a4	Delete the CachingPairHMM	2013-02-09 13:06:54 -05:00
Mark DePristo	7dcafe8b81	Preliminary version of LoglessCachingPairHMM that avoids positive likelihoods -- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs -- Actual working unit tests for PairHMMUnitTest -- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results -- PairHMM has protected variables used throughout the subclasses	2013-02-09 13:06:54 -05:00
Mark DePristo	7fb620dce7	Generalize and fixup ContigComparator -- Now uses a SAMSequenceDictionary to do the comparison of contigs (which is the right way to do it) -- Added unit tests	2013-02-09 09:52:13 -05:00
Mauricio Carneiro	d004bfbe6f	walker to calculate per base coverage distribution -- Base distribution optionally includes deletions -- Implemented an optional filtered coverage distribution option -- Integration tests added for every feature of the traversal This walker is specially fast for the task due to the ability to calculate uncovered bases without having to visit the loci. This capability should be made generic in the future for the advantage of DiagnoseTargets and DepthOfCoverage. GSATDG-45 #resolve	2013-02-07 16:33:05 -05:00
Mauricio Carneiro	5f49c95cc1	Added distance across contigs calculation to GenomeLocs -- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader) -- unit test added GSATDG-45	2013-02-07 16:31:41 -05:00
depristo	cd4aec177a	Merge pull request #20 from broadinstitute/aw_reduceread_perf_1_GSA-761 Aw reduceread perf 1 gsa 761	2013-02-07 12:11:05 -08:00
Eric Banks	9826192854	Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class! * AlignmentUtils.getMismatchCount() * AlignmentUtils.calcAlignmentByteArrayOffset() * AlignmentUtils.readToAlignmentByteArray(). * AlignmentUtils.leftAlignIndel()	2013-02-07 13:04:24 -05:00
Alec Wysoker	e88bc753aa	Replace map.containsKey followed by map.get with map.get followed by null check.	2013-02-07 11:58:41 -05:00
Alec Wysoker	72e496d6f3	Eliminate unnecessary zeroing out of primitive arrays immediately after new.	2013-02-07 11:57:43 -05:00
Eric Banks	481982202d	Fixing the failing RR integration tests. * After consulting Tim/David/Mauricio we determined that the md5 changes were due to different encodings of binary arrays in samjdk * However, it made no functional difference to the results (confirmed by Eric) so we agreed to update md5s * Also, the header of one of the test bams was malformed but old picard jar didn't perform checks so it only started failing now * Fixed the bam	2013-02-06 12:40:56 -05:00
Mark DePristo	59df329776	Fast path for biallelic variants in IndependentAllelesDiploidExactAFCalc -- If the VariantContext is a bi-allelic variant already, don't split up the VC (it doesn't do anything) and then combine it back together. This saves us a lot of work on average -- Be more protective of calls to AFCalc with a VariantContext that might only have ref allele, throwing an exception	2013-02-06 10:34:09 -05:00
eitanbanks	584899329c	Merge pull request #13 from broadinstitute/dr_variant_migration_GSA-692 Replace org.broadinstitute.variant with jar built from the Picard repo	2013-02-06 07:22:30 -08:00
Eric Banks	562f2406d7	Added check that BaseRecalibrator is not being run on a reduced bam. - Throws user exception if it is. - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument. - Added test to check this is working. - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end. - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process. - Added comment about not changing the program record name, as per reviewer comments. - Removed unused variable.	2013-02-06 10:14:27 -05:00
Eric Banks	4e5ff3d6f1	Bug fix for NPE in HC with --dbsnp argument. - I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null. - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.	2013-02-05 21:59:19 -05:00
Eric Banks	e7c35a907f	Fixes to BQSR for the --maximum_cycle_value argument. - It's now written into the recal report so that it can be used in the PrintReads step. - Note that we also now write the --deletions_default_quality value which accidentally wasn't being written before! - Added tests to make sure that the value of the --maximum_cycle_value is being used properly by PR with -BQSR. (This is my last non-branch commit; all future pushes will follow new GATK practices)	2013-02-05 17:38:03 -05:00
David Roazen	e7e76ed76e	Replace org.broadinstitute.variant with jar built from the Picard repo The migration of org.broadinstitute.variant into the Picard repo is complete. This commit deletes the org.broadinstitute.variant sources from our repo and replaces it with a jar built from a checkout of the latest Picard-public svn revision.	2013-02-05 17:24:25 -05:00
Ryan Poplin	cb2dd470b6	Moving the random number generator over to using GenomeAnalysisEngine.getRandomGenerator in the logless versus exact pair hmm unit test. We don't believe this will fix the problem with the non-deterministic test failures but it will give us more information the next time it fails.	2013-02-05 12:56:20 -05:00
MauricioCarneiro	050c4794a5	Merge pull request #11 from yfarjoun/per_sample2 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @A...	2013-02-05 08:04:29 -08:00
Eric Banks	23c6aee236	Added in some basic unit tests for polyploid consensus creation in RR. - Uncovered small bug in the fix that I added yesterday, which is now fixed properly. - Uncovered massive general bug: polyploid consensus is totally busted for deletions (because of call to read.getReadBases()[readPos]). - Need to consult Mauricio on what to do here (are we supporting het compression for deletions? (Insertions are definitely not supported)	2013-02-05 10:35:45 -05:00
Yossi Farjoun	de03f17be4	-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling. -Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixture of two broadly consented samples) were added -includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test. -(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.) -New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map). -Unit tests to the new class and function are provided. -Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly. -Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or UnifiedGenotyper, into a new IntegrationTest class.	2013-02-04 18:24:36 -05:00
Eric Banks	70f3997a38	More RR tests and fixes. * Fixed implementation of polyploid (het) compression in RR. * The test for a usable site was all wrong. Worked out details with Mauricio to get it right. * Added comprehensive unit tests in HeaderElement class to make sure this is done right. * Still need to add tests for the actual polyploid compression. * No longer allow non-diploid het compression; I don't want to test/handle it, do you? * Added nearly full coverage of tests for the BaseCounts class.	2013-02-04 15:55:15 -05:00
Ryan Poplin	79ef41e7b1	Added some docs, unit test, and contracts to SimpleDeBruijnAssembler. -- Testing that cycles in the reference graph fail graph construction appropriately. -- Minor bug fix in assembly with reduced reads. Added some docs and contracts to SimpleDeBruijnAssembler Added a unit test to SimpleDeBruijnAssembler	2013-02-04 15:17:22 -05:00
Geraldine Van der Auwera	43e3a040b6	Updated UnifiedGenotyper GATKDoc (note on ploidy model)	2013-02-04 14:18:56 -05:00
Chris Hartl	41a030f4b7	Apparently I'm a failure at rebasing...there should have been only one commit message to write. But whatever, here it is again: Part 1 of Variant Annotator Unit tests: PerReadAlleleLikelihoodMap - Added contract enforcement for public methods - Refactored the conversion from read -> (allele -> likelihood) to allele -> list[read] into its own method - added method documentation for non getters/setters - finals, finals everywhere - Add in a unit test for the PerReadAlleleLikelihoodMap. Complete coverage except for .clear() and a method that is a straight call into a separately-tested utility class.	2013-02-04 14:16:28 -05:00
Ryan Poplin	d9fd89ecaa	Somehow these md5 updates got lost in my previous git rebase disaster. Sorry for the trouble.	2013-02-04 13:26:18 -05:00
Eric Banks	2d518f3063	More RR-related updates and tests. - ReduceReads by default now sets up-front ReadWalker downsampling to 40x per start position. - This is the value I used in my tests with Picard to show that memory issues pretty much disappeared. - This should hopefully take care of the memory issues being reported on the forum. - Added javadocs to SlidingWindow (the main RR class) to follow GATK conventions. - Added more unit tests to increase coverage of BaseCounts class. - Added more unit tests to test I/D operators in the SlidingWindow class.	2013-02-04 12:57:43 -05:00
Guillermo del Angel	971ded341b	Swap java Random generator for GATK one to ensure test determinism	2013-02-04 10:57:34 -05:00
Guillermo del Angel	f31bf37a6f	First step in better BQSR unit tests for covariates (not done yet): more test coverage in basic covariates, test logging several read groups/read lengths and more combinations simultaneously. Add basic Javadocs headers for PerReadAlleleLikehoodMap.	2013-02-03 15:31:30 -05:00
Eric Banks	03df5e6ee6	- Added more comprehensive tests for consensus creation to RR. Still need to add tests for I/D ops. - Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure). - Also added minor cleanup of BaseUtils	2013-02-01 15:37:19 -05:00
Ryan Poplin	2fee000dba	Adding unit tests for KBestPaths class and fixing edge case bugs.	2013-02-01 13:51:31 -05:00
David Roazen	c6581e4953	Update MD5s to reflect version number change in the BAM header I've confirmed via a script that all of these differences only involve the version number bump in the BAM headers and nothing else: < @HD VN:1.0 GO:none SO:coordinate --- > @HD VN:1.4 GO:none SO:coordinate	2013-02-01 13:51:31 -05:00
Guillermo del Angel	a520058ef6	Add option to specify maximum STR length to RepeatCovariates from command line to ease testing	2013-02-01 13:51:31 -05:00
Mark DePristo	22f7fe0d52	Expanded unit tests for AlignmentUtils -- Added JIRA entries for the remaining capabilities to be fixed up and unit tested	2013-02-01 13:51:31 -05:00
Ryan Poplin	ac033ce41a	Intermediate commit of new bubble assembly graph traversal algorithm for the HaplotypeCaller. Adding functionality for a path from an assembly graph to calculate its own cigar string from each of the bubbles instead of doing a massive Smith-Waterman alignment between the path's full base composition and the reference.	2013-01-31 11:32:19 -05:00
Ryan Poplin	495bca3d1a	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-31 10:12:26 -05:00
Ryan Poplin	ca6968d038	Use base List and Map types in the GenotypingEngineUnitTest.	2013-01-31 10:12:18 -05:00
Eric Banks	75ceddf9e5	Adding new unit tests for RR. These tests took a frustratingly long time to get to pass, but now we have a framework for testing the adding of reads into the SlidingWindow plus consensus creation. Will flesh these out more after I take care of some other items on my plate.	2013-01-31 09:46:38 -05:00
Ryan Poplin	bb29bd7df7	Use base List and Map types in the HaplotypeCaller when possible.	2013-01-30 17:09:27 -05:00
Ryan Poplin	5f4a063def	Breaking up my massive commits into smaller pieces that I can successfully merge and digest. This one enables downsampling in the HaplotypeCaller (by lowering the default dcov to 20) and removes my long-standing, temporary region-based downsampling.	2013-01-30 16:14:07 -05:00
David Roazen	591df2be44	Move additional VariantContext utility methods back to the GATK Thanks to Eric for his feedback	2013-01-30 13:58:17 -05:00
Ryan Poplin	ff8ba03249	Updating BQSR integration test md5s to reflect the updates to the hierarchicalBayesianQualityEstimate function	2013-01-30 13:30:18 -05:00
Ryan Poplin	85dabd321f	Adding unit tests for hierarchicalBayesianQualityEstimate function	2013-01-30 13:26:07 -05:00
Ryan Poplin	07fe3dd1ef	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-30 13:19:24 -05:00
David Roazen	9985f82a7a	Move BaseUtils back to the GATK by request, along with associated utility methods	2013-01-30 13:09:44 -05:00
Ryan Poplin	2967776458	The Empirical quality column in the recalibration report can't be compared in the BQSRGatherer because the value is calculated using the Bayesian estimate with different priors. This value should never be used from a recalibration report anyway except during plotting.	2013-01-30 12:28:14 -05:00
Eric Banks	d067c7f136	Resolving merge conflicts	2013-01-30 10:47:59 -05:00
Eric Banks	9025567cb8	Refactoring the SimpleGenomeLoc into the now public utility UnvalidatingGenomeLoc and the RR-specific FinishedGenomeLoc. Moved the merging utility methods into GenomeLoc and moved the unit tests around accordingly.	2013-01-30 10:45:29 -05:00
Mark DePristo	4852c7404e	GenomeLocs are already comparable, so I'm removing the less complete GenomeLocComparator class and updating ReduceReads and CompressionStash to use built-in comparator	2013-01-30 10:12:27 -05:00
Ryan Poplin	59311aeea2	Getting back null values from the tables is perfectly reasonable if those covariates don't appear in your table. Need to handle them gracefully.	2013-01-30 10:06:14 -05:00
Ryan Poplin	e7d7d70247	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-30 10:01:06 -05:00
Mark DePristo	92c5635e19	Cleanup, document, and unit test ActiveRegion -- All functions tested. In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads. New version should be correct -- Enforce correct ordering of supporting states in constructor -- Enforce read ordering when adding reads to an active region in add -- Fix bug in HaplotypeCaller map with new updating read spans. Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence -- Encapsulate isActive field with an accessor function -- Make sure that all state lists are unmodifiable, and that the docs are clear about this -- ActiveRegion equalsExceptReads is for testing only, so make it package protected -- ActiveRegion.hardClipToRegion must resort reads as they can become out of order -- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region. The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed. Has a minor impact on PLs and RankSumTest values -- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy	2013-01-30 09:47:12 -05:00
Mauricio Carneiro	3d9a83c759	BaseCoverageDistributions should be 'by reference' otherwise we miss all the 0 coverage spots.	2013-01-29 22:37:44 -05:00
Mauricio Carneiro	29fd536c28	Updating licenses manually Please check that your commit hook is properly pointing at ../../private/shell/pre-commit Conflicts: public/java/test/org/broadinstitute/variant/VariantBaseTest.java	2013-01-29 17:27:53 -05:00
David Roazen	a536e1da84	Move some VCF/VariantContext methods back to the GATK based on feedback -Moved some of the more specialized / complex VariantContext and VCF utility methods back to the GATK. -Due to this re-shuffling, was able to return things like the Pair class back to the GATK as well.	2013-01-29 16:56:55 -05:00
Eric Banks	e4ec899a87	First pass at adding unit tests for the RR framework: I have added 3 tests and all 3 uncovered RR bugs! One of the fixes was critical: SlidingWindow was not converting between global and relative positions correctly. Besides not being correct, it was resulting in a massive slow down of the RR traversal. That fix definitely breaks at least one of the integration tests, but it's not worth changing md5s now because I'll be changing things all over RR for the next few days, so I am going to let that test fail indefinitely until I can confirm general correctness of the tool.	2013-01-29 15:51:07 -05:00
Ryan Poplin	cba89e98ad	Refactoring the Bayesian empirical quality estimates to be in a single unit-testable function.	2013-01-29 15:50:46 -05:00
Guillermo del Angel	1d5b29e764	Unit tests for repeat covariates: generate 100 random reads consisting of tandem repeat units of random content and size, and check that covariates match expected values at all positions in reads. Fixed corner case where value of covariate at border between 2 tandem repeats of different length/content wasn't consistent	2013-01-29 15:23:02 -05:00
Guillermo del Angel	c11197e361	Refactored repeat covariates to eliminate duplicated code - now all inherit from basic RepeatCovariate abstract class. Comprehensive unit tests coming...	2013-01-29 10:10:24 -05:00
Ryan Poplin	35543b9cba	updating BQSR integration test values for the PR half of BQSR.	2013-01-29 09:47:57 -05:00
Ryan Poplin	bf25196a0b	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-28 22:33:13 -05:00
Ryan Poplin	1f254d29df	Don't set the empirical quality when reading in the recal table because then we won't be using the new quality estimates for the prior since the value is cached.	2013-01-28 22:16:43 -05:00
Guillermo del Angel	ff799cc79a	Fixed bad merge	2013-01-28 20:04:25 -05:00
Guillermo del Angel	5995f01a01	Big intermediate commit (mostly so that I don't have to go again through merge/rebase hell) in expanding BQSR capabilities. Far from done yet: a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push. b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value. c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress. d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing	2013-01-28 19:55:46 -05:00
Ryan Poplin	d665a8ba0c	The Bayesian calculation of Qemp in the BQSR is now hierarchical. This fixes issues in which the covariate bins were very sparse and the prior estimate being used was the original quality score. This resulted in large correction factors for each covariate which breaks the equation. There is also now a new option, qlobalQScorePrior, which can be used to ignore the given (very high) quality scores and instead use this value as the prior.	2013-01-28 15:56:33 -05:00
Ryan Poplin	aab160372a	No need to sort the BQSR tables by default.	2013-01-28 11:26:01 -05:00
David Roazen	f63f27aa13	org.broadinstitute.variant refactor, part 2 -removed sting dependencies from test classes -removed org.apache.log4j dependency -misc cleanup	2013-01-28 09:03:46 -05:00
Mauricio Carneiro	1aee8f205e	Tool to calculate per base coverage distribution GSATDG-29 #resolve	2013-01-27 23:38:46 -05:00
Mark DePristo	804caf7a45	HaplotypeCaller Optimization: return an inactive (p = 0.0) activity if the context has no bases in the pileup -- Allows us to avoid doing a lot of misc. work to set up the genotype when we don't have any data to genotype. Valuable in the case where we are passing through large regions without any data	2013-01-27 14:10:06 -05:00
Ami Levy-Moonshine	b4447cdca2	In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the sample names are unique in VariantContextUtils.simpleMerge for each set of VCs. This caused a bug that was reported on the forum (when the VCs contained 2 VCs from the same sample). Now we check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unit tests in CVunitTest.java.	2013-01-25 15:49:51 -05:00
Mark DePristo	3f95f39be3	Updating HC md5s for new cutting algorithm and default band pass filter parameters	2013-01-25 11:07:29 -05:00
Eric Banks	f7b80116d6	Don't let users play with the different exact model implementations.	2013-01-25 10:52:02 -05:00
Eric Banks	6dd0e1ddd6	Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants. This allows us to move SelectVariants into the public suite of tools now.	2013-01-25 09:42:04 -05:00
Mark DePristo	592f90aaef	ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region -- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visual inspection of both running on 1000G and NA12878 -- Misc. commenting of the code -- Updated ActiveRegionExtension to include a min active region size -- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now	2013-01-24 13:48:00 -05:00
Eric Banks	6790e103e0	Moving lots of walkers back from protected to public (along with several of the VA annotations). Let's see whether Mauricio's automatic git hook really works!	2013-01-24 11:42:49 -05:00
Chris Hartl	a3b98daf1a	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2013-01-23 14:49:34 -05:00
Chris Hartl	7fcfa4668c	Since GenotypeConcordance is now a standalone walker, remove the old GenotypeConcordance evaluation module and the associated integration tests.	2013-01-23 14:47:23 -05:00
Mark DePristo	8026199e4c	Updating md5s for CountReadsInActiveRegions and HaplotypeCaller to reflect new activity profile mechanics -- In this process I discovered a few missed sites in the old code. The new approach actually produces better HC results than the previous version.	2013-01-23 13:46:01 -05:00
Mark DePristo	8d9b0f1bd5	Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile -- Required before I jump in and redo the entire activity profile so it can be run incrementally -- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class. -- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality. Almost all of the misc. walker changes are due to this name update -- Code cleanup and docs for TraverseActiveRegions -- Expanded unit tests for ActivityProfile and ActivityProfileState	2013-01-23 13:45:21 -05:00
Chris Hartl	c500e1d8ac	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2013-01-22 15:31:30 -05:00
Chris Hartl	d33c755aea	Adding docs.	2013-01-22 15:29:33 -05:00
Chris Hartl	7060e01a8e	Fix for broken unit test plus some minor changes to comments. Unit tests were broken by my pulling the site status utility function into the enum. Thankfully the unit tests caught my silly duplication of a line.	2013-01-22 15:14:41 -05:00
Mauricio Carneiro	7b8b064165	Last manual license update (hopefully) if everyone updates their git hook accordingly, this will be the last time I have to manually run the script. GSATDG-5	2013-01-18 16:13:07 -05:00
Ami Levy-Moonshine	0fb7b73107	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-18 15:03:42 -05:00
Ami Levy-Moonshine	826c29827b	change the default VCFs gatherer of the GATK (not just the UG)	2013-01-18 15:03:12 -05:00
Eric Banks	cac439bc5e	Optimized the Allele Biased Downsampling: now it doesn't re-sort the pileup but just removes reads from the original one. Added a small fix that slightly changed md5s.	2013-01-18 11:17:31 -05:00
Chris Hartl	08d2da9057	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2013-01-18 10:28:45 -05:00
Chris Hartl	bf5748a538	Forgot to actually put in the md5. Also with the new change to record pairing and filtering, the multiple-records integration test changed: the indel records (T/TG | T/TGACA) are matched up (rather than left separate) resulting in properly identifying mismatching alleles, rather than HET-UNAVAILABLE and UNAVAILABLE-HET. Very nice.	2013-01-18 10:25:36 -05:00
Chris Hartl	91030e9afa	Bugfix: records that get paired up during the resolution of multiple-records-per-site were not going into genotype-level filtering. Caught via testing. Testing for moltenized output, and for genotype-level filtering. This tool is now fully functional. There are three todo items: 1) Docs 2) An additional output table that gives concordance proportions normalized by records in both eval and comp (not just total in eval or total in comp) 3) Code cleanup for table creation (putting a table together the way I do takes -way- too many lines of code)	2013-01-18 09:49:48 -05:00
Eric Banks	39c73a6cf5	1. Ryan and I noticed that the FisherStrand annotation was completely busted for indels with reduced reads; fixed. 2. While making the previous fix and unifying FS for SNPs and indels, I noticed that FS was slightly broken in the general case for indels too; fixed. 3. I also fixed a minor bug in the Allele Biased Downsampling code for reduced reads.	2013-01-18 03:35:48 -05:00
Eric Banks	6a903f2c23	I finally gave up on trying to get the Haplotype/Allele merging to work in the HaplotypeCaller. I've resigned myself instead to create a mapping from Allele to Haplotype. It's cheap so not a big deal, but really shouldn't be necessary. Ryan and I are talking about refactoring for GATK2.5.	2013-01-18 01:21:08 -05:00
Eric Banks	6db3e473af	Better error message for bad qual	2013-01-17 10:30:04 -05:00
Eric Banks	953592421b	I think we got out of sync with the HC tests as we were clobbering each other's changes. Only differences here are to some RankSumTest values.	2013-01-17 09:19:21 -05:00
Eric Banks	ded659232b	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-16 22:49:56 -05:00
Eric Banks	a623cca89a	Bug fix for HaplotypeCaller, as reported on the forum: when reduced reads didn't completely overlap a deletion call, we were incorrectly trying to find the reference position of a base on the read that didn't exist. Added integration test to cover this case.	2013-01-16 22:47:58 -05:00
Eric Banks	dbb69a1e10	Need to use ints for quals in HaplotypeScore instead of bytes because of overflow (they are summed when haplotypes are combined)	2013-01-16 22:33:16 -05:00
Chris Hartl	e15d4ad278	Addition of moltenize argument for moltenized tabular output. NRD/NRS not moltenized because there are only two columns.	2013-01-16 18:00:23 -05:00
Mark DePristo	3c476a92a2	Add dummy functionality (currently throws an error) to allow HC to include unmapped reads during assembly and calling	2013-01-16 16:25:36 -05:00
Eric Banks	4cf34ee9da	Bug fix to FisherStrand: do not let it output INFINITY. This all needs to be unit tested, but that's coming on the horizon.	2013-01-16 15:35:04 -05:00
Mark DePristo	2a42b47e4a	Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART -- UnitTests now include combinational tiling of reads within and spanning shard boundaries -- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads -- Updating HC and CountReadsInActiveRegions integration tests	2013-01-16 15:30:00 -05:00
Eric Banks	e47a389b26	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-16 14:59:11 -05:00
Eric Banks	d18dbcbac1	Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base. Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0? When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...	2013-01-16 14:55:33 -05:00
Khalid Shakir	4ffb43079f	Re-committing the following changes from Dec 18: Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms. Updated SelectHeaders to print out full interval arguments. Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.	2013-01-16 12:43:15 -05:00
Eric Banks	445735a4a5	There was no reason to be sharing the Haplotype infrastructure between the HaplotypeCaller and the HaplotypeScore annotation since they were really looking for different things. Separated them out, adding efficiencies for the HaplotypeScore version.	2013-01-16 11:10:13 -05:00
Eric Banks	392b5cbcdf	The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases. This way walkers won't see anything except the standard bases plus Ns in the reference. Added option to turn off this feature (to maintain backwards compatibility). As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for each of the bases. This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests. Fortunately that walker is being deprecated soon...	2013-01-16 10:22:43 -05:00
Eric Banks	4fb3e48099	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-16 00:13:38 -05:00
Eric Banks	0d282a7750	Bam writing from HaplotypeCaller seems to be working on all my test cases. Note that it's a hidden debugging option for now. Please let me know if you notice any bad behavior with it.	2013-01-16 00:12:02 -05:00
Chris Hartl	327169b283	Refactor the method that identifies the site overlap type into the type enum class (so it can be used elsewhere potentially). Completed todo item: for sites like (eval) 20 12345 A C 20 12345 A AC (comp) 20 12345 A C 20 12345 A ACCC the records will be matched by the presence of a non-empty intersection of alleles. Any leftover records are then paired with an empty variant context (as though the call was unique). This has one somewhat counterintuitive feature, which is that normally (eval) 20 12345 A AC (comp) 20 12345 A ACCC would be classified as 'ALLELES_DO_NOT_MATCH' (and not counted in genotype tables), in the presence of the SNP, they're counted as EVAL_ONLY and TRUTH_ONLY respectively. + integration test	2013-01-15 12:13:45 -05:00
Eric Banks	d3baa4b8ca	Have Haplotype extend the Allele class. This way, we don't need to create a new Allele for every read/Haplotype pair to be placed in the PerReadAlleleLikelihoodMap (very inefficient). Also, now we can easily get the Haplotype associated with the best allele for a given read.	2013-01-15 11:36:20 -05:00
Mark DePristo	3c37ea014b	Retire original TraverseActiveRegion, leaving only the new optimized version -- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests	2013-01-15 10:24:45 -05:00
Eric Banks	94800771e3	1. Initial implementation of bam writing for the HaplotypeCaller with -bam argument; currently only assembled haplotypes are emitted. 2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs. Until the HC uses meta data though, this won't work.	2013-01-15 10:19:18 -05:00
Chris Hartl	682c59ff04	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2013-01-14 13:27:34 -05:00
Chris Hartl	61bc334df1	Ensure output table formatting does not contain NaNs. For (0 eval ref calls)/(0 comp ref calls), set the proportion to 0.00. Added integration tests (checked against manual tabulation)	2013-01-14 09:21:30 -05:00
Ryan Poplin	a7fe334a3f	calculating the md5s for the new tests.	2013-01-11 15:43:52 -05:00
Ryan Poplin	65afec2a53	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-11 15:22:52 -05:00
Mark DePristo	85b529cced	Updating MD5s in HC and UG that changed due to new LIBS -- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in input list. Not ideal) -- Trivial read count differences in HC	2013-01-11 15:17:19 -05:00
Mark DePristo	8b83f4d6c7	Near final cleanup of PileupElement -- All functions documented and unit tested -- New constructor interface -- Cleanup some uses of old / removed functionality	2013-01-11 15:17:17 -05:00
Mark DePristo	fb9eb3d4ee	PileupElement and LIBS cleanup -- function to create pileup elements in AlignmentStateMachine and LIBS -- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing	2013-01-11 15:17:17 -05:00
Mark DePristo	cc1d259cac	Implement get Length and Bases of OfImmediatelyFollowingIndel in PileupElement -- Added unit tests for this behavior. Updated users of this code	2013-01-11 15:17:17 -05:00
Mark DePristo	2c38310868	Create LIBS using new AlignmentStateMachine infrastructure -- Optimizations to AlignmentStateMachine -- Properly count deletions. Added unit test for counting routines -- AlignmentStateMachine.java is no longer recursive -- Traversals now use new LIBS, not the old one	2013-01-11 15:17:17 -05:00
Mark DePristo	b53286cc3c	HaplotypeCaller mode to skip assembly and genotyping for performance testing -- Added HCPerformance evaluation Qscript -- Added some docs about one of the HC integration tests -- HaplotypeCaller / ART performance evaluation script	2013-01-11 15:17:16 -05:00
Ryan Poplin	e952296c10	Adding HC GGA integration test to cover duplicated input alleles.	2013-01-11 15:01:27 -05:00
Ryan Poplin	7f7f40f851	Adding additional HC GGA integration tests to cover more complicated input alleles.	2013-01-11 14:36:21 -05:00
Eric Banks	85baf71b39	Merged bug fix from Stable into Unstable	2013-01-11 11:05:27 -05:00
Eric Banks	d78539774f	Another RR bug: off by one error led to ArrayIndexOutOfBoundsException when working with multiple samples and the variant region ended 1 base after the end of the last read for a given sample.	2013-01-11 11:05:09 -05:00
Eric Banks	79b93f659c	Merged bug fix from Stable into Unstable	2013-01-11 09:20:13 -05:00
Eric Banks	67fafbb625	Forgot an include	2013-01-11 09:19:46 -05:00
Eric Banks	6bf0cc32f9	When reducing multiple samples it is possible to try to close a region that for a given sample has no reads. Currently we'd NPE. Fixed.	2013-01-11 09:16:19 -05:00
Eric Banks	e7906713d9	Moving some random walkers back to public as requested by Mark. Mauricio, will the licenses get updated automatically?	2013-01-11 02:03:43 -05:00
Eric Banks	3a51823c2a	Clean up imports	2013-01-10 23:35:01 -05:00
Eric Banks	e4b7b1955c	Forgot to add the note about length normalization to the QD docs	2013-01-10 23:34:06 -05:00
Eric Banks	ff5ac986d8	Fix docs for QD	2013-01-10 23:31:46 -05:00
Mauricio Carneiro	2a4ccfe6fd	Updated all JAVA file licenses accordingly GSATDG-5	2013-01-10 17:06:41 -05:00
Mauricio Carneiro	dd177b1714	Removing fully commented out varianteval evaluators - Files were completely commented out, and were screwing up my license script. Don't like them. Removed them. GSATDG-5	2013-01-10 17:06:12 -05:00
Chris Hartl	80dec72c53	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2013-01-10 14:35:59 -05:00
Chris Hartl	31a5f88c4f	Expanded unit tests to cover the Concordance Metrics class fairly uniformly.	2013-01-10 14:33:47 -05:00
Ryan Poplin	1a18947abf	Adding new command line argument requested on the forum to control the maximum number of haplotypes that are sent forward for genotyping. In the presence of a large degree of heterozygosity the current algorithm breaks down and so this argument would need to be increased.	2013-01-09 15:54:02 -05:00
Ryan Poplin	487fb2afb4	Bug fix for the case of overlapping assembled and partially-assembled events created by the HC. Unfortunately the symbolic allele can't be combined with the indel allele because the reference basis will change.	2013-01-09 15:30:46 -05:00
Chris Hartl	6787f86803	Eliminate the import of DiploidGenotype, which switched public/private underneath me but for some reason didn't stop me from compiling...	2013-01-09 13:23:24 -05:00
Chris Hartl	c1de92b511	Add in some todo items	2013-01-09 13:16:06 -05:00
Chris Hartl	8d126161e2	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2013-01-09 13:15:04 -05:00
Eric Banks	3a0dd4b175	Oops, I broke the build. NOW we shouldn't have any more public->protected dependencies.	2013-01-09 11:12:28 -05:00
Eric Banks	a921b06e02	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-09 11:06:17 -05:00
Eric Banks	4fa439d89e	Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependencies now	2013-01-09 11:06:10 -05:00
Ryan Poplin	396bce1f28	Reverting this change until we can figure out the right thing to do here.	2013-01-09 10:51:30 -05:00
Eric Banks	676e79542a	Bring CombineVariants back to public since it's used for SG. I needed to break ChromosomeCountConstants out of ChromosomeCounts to make this work.	2013-01-09 10:39:48 -05:00
Ryan Poplin	c87ad8c0ef	Bug fixes related to HC's GGA mode. Tracking just the artificial allele isn't sufficient when there are multiple GGA records that change the reference basis. Also, duplicated records screw up the tracking of merged alleles.	2013-01-09 10:00:46 -05:00
Chris Hartl	ad7c2a08d4	Normalize by the event type counts, not the total genotype counts: more useful normalization.	2013-01-09 09:12:41 -05:00
Chris Hartl	b56754606b	Initial break-out of GenotypeConcordance as a standalone walker. Some basic functionality testing. Currently performs only a pairwise comparison, but is very careful about proper tabulation through the GenotypeType enum.	2013-01-09 00:34:07 -05:00
Eric Banks	264cc9e78d	Resolve protected->public dependencies for BQSR by wrapping the BQSR-specific arguments in a new class. Instead of the GATK Engine creating a new BaseRecalibrator (not clean), it just keeps track of the arguments (clean). There are still some dependency issues, but it looks like they are related to Ami's code. Need to look into it further.	2013-01-08 16:23:29 -05:00
Eric Banks	ee7d85c6e6	Move around the DiploidGenotype classes (so it can be used by the GATKPaperGenotyper)	2013-01-08 15:53:11 -05:00
Eric Banks	0e2e672521	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-08 15:46:39 -05:00
Eric Banks	f0bd1b5ae5	Okay, all public->protected dependencies are gone except for the BQSR arguments. I'll need to think through this but should be able to make that work too.	2013-01-08 15:46:32 -05:00
Tad Jordan	9cbb2b868f	ErrorRatePerCycleIntegrationTest fix -- sorting by row is required	2013-01-08 14:53:07 -05:00
Eric Banks	b099e2b4ae	Moving integration tests to protected	2013-01-08 09:34:08 -05:00
Eric Banks	dfe4cf1301	When merging the PerReadAlleleLikelihoodMap classes, I forgot to initialize the underlying objects. This was causing the LargeScaleTests to fail.	2013-01-08 09:24:12 -05:00
Eric Banks	9e6c2afb28	Not sure why IntelliJ didn't add this for commit like the other dirs	2013-01-07 18:11:07 -05:00
Ami Levy-Moonshine	3787ee6de7	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-07 17:07:29 -05:00
Eric Banks	47d030a52d	Oops, move the covariates over too	2013-01-07 15:47:25 -05:00
Eric Banks	35699a8376	Move bqsr utils to protected	2013-01-07 15:41:21 -05:00
Eric Banks	a0219acfaa	Collapse the PerReadAlleleLikelihoodMap classes into 1 now that Lite is gone	2013-01-07 14:55:21 -05:00
Eric Banks	35d9bd377c	Moved (nearly) all Walkers from public to protected and removed GATKLite utils	2013-01-07 14:42:40 -05:00
Ryan Poplin	4f95f850b3	Bug fix in the HC's allele mapping for multi-allelic events. Using the allele alone as a key isn't sufficient because alleles change when the reference allele changes during VariantContextUtils.simpleMerge for multi-allelic events.	2013-01-07 11:05:44 -05:00
Ami Levy-Moonshine	d3c2c97fb2	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-06 23:35:47 -05:00
Ami Levy-Moonshine	81eef3aa37	merge development branches of log-less HMM and FastGatherer to master	2013-01-06 23:01:58 -05:00
Eric Banks	52067f0549	Handle merge conflicts	2013-01-06 12:29:12 -05:00
Chris Hartl	41bc416b65	Remove AAL and update MD5s.	2013-01-04 16:46:14 -05:00
Eric Banks	bce6fce58d	Resolving merge conflicts after Mark's latest push	2013-01-04 14:46:39 -05:00
Eric Banks	dd7f5e2be7	Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests.	2013-01-04 14:43:11 -05:00
Mark DePristo	bbdf9ee91b	BQSR cleanup: merge Advanced and Standard recalibration engine into just the RecalibrationEngine -- As we are no longer maintaining a public/protected system we need only have one RecalibrationEngine. -- Misc. code cleanup and docs along the way	2013-01-04 11:39:24 -05:00
Mark DePristo	7df47418d8	BQSR optimization: make RecalibrationTables thread-local, and merge results in onTraversalDone -- With the newer, faster BQSR, scaling was limited by the NestedIntegerArray. The solution to this is to make the entire table thread-local, so that each nct thread has its own data and doesn't have any collisions. -- Removed the previous partial solution of having a thread-local quality score table -- Added a new argument -lowMemory	2013-01-04 11:39:24 -05:00
Chris Hartl	3753209584	One md5sum slipped past in the HC integration test.	2013-01-02 15:09:28 -05:00
Chris Hartl	e1d09ab0db	QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests should all pass.	2013-01-02 14:41:29 -05:00
Eric Banks	275575462f	Protect against non-standard ref bases. Ryan, please review.	2012-12-26 15:46:21 -05:00
Mark DePristo	7bf1f67273	BQSR optimization: read group x quality score calibration table is thread-local -- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table. Radically reduces the contention for RecalDatum in this very highly used table -- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables. Updated combine in RecalibrationReport to use this table combiner function -- Made several core functions in RecalDatum into final methods for performance -- Added RecalibrationTestUtils, a home for recalibration testing utilities	2012-12-24 13:35:58 -05:00
Mark DePristo	0f0188ddb1	Optimization of BQSR -- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine. This object has a restrictive interface to just get information about specific qual and error values at offset and for event type. This restriction allows us to avoid creating a vector of byte 45 for each read to represent BI and BD values not in the reads. Shaves 5% of the runtime off the entire code. -- Cleaned up code and added lots more docs -- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR. 95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.	2012-12-24 13:35:09 -05:00
Ami Levy-Moonshine	6590039bc3	add fast gather to UG; change UG to work with log-lessHMM (work in progress)	2012-12-20 14:58:57 -05:00
Ryan Poplin	c8cd6ac465	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-20 14:58:04 -05:00
Ryan Poplin	a098888f4d	Updating missed UG md5	2012-12-20 14:57:53 -05:00
Tad Jordan	b491c177ff	Added functionality of outputting sorted GATKReport Tables - Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables - Modified BQSR Integration Tests to include the optional argument. Tests now produce sorted tables	2012-12-20 14:02:21 -05:00
Eric Banks	4a7e0427a3	Pushing the RR bug fix that I pushed into unstable into stable, as requested by Tim	2012-12-19 11:47:16 -05:00
Ryan Poplin	54e5c84018	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-19 11:31:40 -05:00
Ryan Poplin	aa39037be8	updating UG integration tests.	2012-12-19 11:31:35 -05:00
Eric Banks	70479cb71d	RR bug fix: we were failing when a read started with an insertion just at the edge of the consensus region. The weird part is that the comments claimed it was doing what it was supposed to, but it didn't actually do it. Now we maintain the last header element of the consensus (but without bases and quals) if it adjoins an element with an insertion. Added the user's test file as an integration test.	2012-12-19 10:59:07 -05:00
David Roazen	07b369ca7e	Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package This is an intermediate commit so that there is a record of these changes in our commit history. Next step is to isolate the test classes as well, and then move the entire package to the Picard repository and replace it with a jar in our repo. -Removed all dependencies on org.broadinstitute.sting (still need to do the test classes, though) -Had to split some of the utility classes into "GATK-specific" vs generic methods (eg., GATKVCFUtils vs. VCFUtils) -Placement of some methods and choice of exception classes to replace the StingExceptions and UserExceptions may need to be tweaked until everyone is happy, but this can be done after the move.	2012-12-19 10:25:22 -05:00
Ryan Poplin	92185dd5f4	updating HC integration tests.	2012-12-19 10:12:07 -05:00
Ryan Poplin	902ca7ea70	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-18 15:45:33 -05:00
Ryan Poplin	b5d590ba92	Based on NA12878 knowledge base experiments updating HC to allow for a much smaller minimum kmer length in the assembly graph.	2012-12-18 15:43:56 -05:00
Mark DePristo	a481d006f0	Optimizations for applying BQSR table with PrintReads -- Cleaned up code in updateDataForRead so that constant values are not computed in inner loops -- BaseRecalibrator doesn't create its own fasta index reader, it just piggy backs on the GATK one -- ReadCovariates <init> now uses a thread local cache for its int[][][] keys member variable. This stops us from recreating an expensive array over and over. In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values it's skipping because of low quality base clipping. Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.	2012-12-17 16:47:27 -05:00
Mark DePristo	5ec25797b3	Optimizations for BaseRecalibrator -- No longer computes at each update the overall read group table. Now computes this derived table only at the end of the computation, using the ByQual table as input. Reduces BQSR runtime by 1/3 in my test	2012-12-17 16:47:27 -05:00
Ryan Poplin	98f18b5f9e	Changing the HC over to using the non-contamination-downsampled read maps for the purposes of annotations. This behavior now matches the UG. There is a new command line option to go back to the older behavior to explore the differences.	2012-12-17 11:27:44 -05:00
Mauricio Carneiro	5f1afb4136	Fixing an off-by-one clipping error in ReduceReads for reads off the contig Reads that are soft-clipped off the contig (before the beginning of the contig) were being soft-clipped to position 0 instead of 1 because of an off-by-one issue. Fixed and included in the integration test.	2012-12-13 22:10:11 -05:00
Mauricio Carneiro	74344a3871	Bringing in the changes from the CMI repo	2012-12-13 21:59:37 -05:00
Mark DePristo	aeab932c63	Actual working version of unflushing VCFWriter -- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream. -- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping -- Same (or better) IO performance as previous implementation -- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed -- Updated compressed VCF output file	2012-12-13 16:15:08 -05:00
Ryan Poplin	211a6e78ea	Further related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed.	2012-12-12 14:53:02 -05:00
Mauricio Carneiro	33290bfe0c	Added integration test to catch the read off contig in ReduceReads. So upstream changes won't break it again.	2012-12-12 13:49:54 -05:00
Mark DePristo	5632c13bf2	Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush(). -- Now compressed output VCFs are properly block compressed (i.e., they are actually smaller than the uncompressed VCF)	2012-12-12 10:27:07 -05:00
Mark DePristo	dd52a70d45	Fix AFCalcResult unit test -- I was simply passing in the wrong values into the function. Fixed the calls, and expanded the docs on what needs to be passed in.	2012-12-11 10:40:12 -05:00
Mauricio Carneiro	8a115edbaf	ReduceReads is now scattered by contig It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.	2012-12-10 15:25:27 -05:00
Eric Banks	bdda63d973	Related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed.	2012-12-10 14:47:04 -05:00
David Roazen	46edab6d6a	Use the new downsampling implementation by default -Switch back to the old implementation, if needed, with --use_legacy_downsampler -LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and the original LocusIteratorByState becomes LegacyLocusIteratorByState -Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer, with the old one renamed to LegacyReadShardBalancer -Performance improvements: locus traversals used to be 20% slower in the new downsampling implementation, now they are roughly the same speed. -Tests show a very high level of concordance with UG calls from the previous implementation, with some new calls and edge cases that still require more examination. -With the new implementation, can now use -dcov with ReadWalkers to set a limit on the max # of reads per alignment start position per sample. Appropriate value for ReadWalker dcov may be in the single digits for some tools, but this too requires more investigation.	2012-12-10 09:44:50 -05:00
Eric Banks	574d5b467f	Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype.	2012-12-09 02:09:34 -05:00
Eric Banks	406adb8d44	The allele biased downsampling should not abort if there's a reduced read. Rather it should always keep the RR and downsample only original reads in the pileup.	2012-12-05 23:15:36 -05:00
Mark DePristo	d0cab795b7	Got caught in the middle of a bad integration test, that was fixed in independent push. Moved test bam into testdata.	2012-12-05 14:49:22 -05:00
Eric Banks	ef87b18e09	In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then).	2012-12-05 02:00:35 -05:00
Eric Banks	726332db79	Disabling the testNoCmdLineHeaderStdout test in UG because it keeps crashing when I run it locally	2012-12-05 00:54:00 -05:00
Eric Banks	bca860723a	Updating tests to handle bad validation data files (that used the wrong qual score encoding); overrides push from stable.	2012-12-03 22:01:07 -05:00
Ryan Poplin	d5ed184691	Updating the HC integration test md5s. According to the NA12878 knowledge base this commit cuts down the FP rate by more than 50 percent with no loss in sensitivity.	2012-12-03 15:38:59 -05:00
Ryan Poplin	156d6a5e0b	misc minor bug fixes to GenotypingEngine.	2012-12-03 12:47:35 -05:00
Ryan Poplin	18b002c99c	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-03 10:08:56 -05:00
Ryan Poplin	1bdf17ef53	Reworking of how the likelihood calculation is organized in the HaplotypeCaller to facilitate the inclusion of per allele downsampling. We now use the downsampling for both the GL calculations and the annotation calculations.	2012-12-02 11:58:32 -05:00
Mark DePristo	2849889af5	Updating md5 for UG	2012-12-01 14:24:19 -05:00
Joel Thibault	198923b597	Add ActiveRegionReadState handling	2012-11-28 13:59:57 -05:00
Mark DePristo	c676853731	Merged bug fix from Stable into Unstable. Updating md5s Conflicts: protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java	2012-11-28 12:54:36 -05:00
Mark DePristo	a1d6461121	Critical bugfix to AFCalcResult affecting UG/HC quality score emission thresholds. As reported by Menachem Fromer: a critical bug in AFCalcResult: Specifically, the implementation: public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) { return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef; } seems incorrect and should probably be: getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef. The issue here is that the 30 represents a Phred-scaled probability of error and it's currently being compared to a log probability of non-error. Instead, we need to require that our probability of error be less than the error threshold. This bug has only a minor impact on the calls -- hardly any sites change -- which is good. But the inverted logic affects multi-allelic sites significantly. Basically you only hit this logic with multiple alleles, and in that case it's including extra alt alleles incorrectly, and throwing out good ones. Change was to create a new function that properly handles thresholds that are PhredScaled quality scores: /** * Same as #isPolymorphic but takes a phred-scaled quality score as input */ public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) { if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 "); final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual)); return isPolymorphic(allele, log10Threshold); }	2012-11-28 12:08:02 -05:00
Mauricio Carneiro	97fd5de260	Merging latest CMI updates with UNSTABLE	2012-11-27 09:08:00 -05:00
Ryan Poplin	59cef880d1	Updating HC integration tests because experimental, HC-specific annotations have been removed.	2012-11-26 12:20:07 -05:00
Ryan Poplin	c3b7dd1374	Misc cleanup in the HaplotypeCaller. Cleaning up unused arguments after recent changes to HC-GenotypingEngine	2012-11-26 12:19:11 -05:00
Ryan Poplin	fedc4fde6c	Merged bug fix from Stable into Unstable	2012-11-25 21:55:55 -05:00
Ryan Poplin	d978cfe835	Soft clipped bases shouldn't be counted in the delocalized BQSR.	2012-11-25 21:55:29 -05:00
Eric Banks	937ac7290f	Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results.	2012-11-20 16:13:29 -05:00
Eric Banks	f0b8a0228f	Quick fix for HC refactoring: when copying over Haplotype objects, make sure to copy over the artificial allele used to create it too.	2012-11-19 09:57:55 -05:00
Eric Banks	ff180a8e02	Significant refactoring of the Haplotype Caller to handle problems with GGA. The main fix is that we now maintain a mapping from 'original' allele to 'Smith-Waterman-based' allele so that we no longer need to do a (buggy) matching throughout the calling process.	2012-11-19 09:09:57 -05:00
Mauricio Carneiro	8b749673bc	centralize header element removal in reduce reads	2012-11-14 13:59:34 -05:00
Mauricio Carneiro	e35fd1c717	Merging CMI-0.5.0 and GATK-2.2 together.	2012-11-14 10:42:03 -05:00
Mauricio Carneiro	a079d8d0d1	Breaking the utility to write @PG tags for SAMFileWriters and StingSAMFileWriters	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	dba31018f4	Implementation of BySampleSAMFileWriter ReduceReads now works with the n-way-out capability, splitting by sample. DEV-27 #resolve #time 3m	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	a17cd54b68	Co-Reduction implementation in ReduceReads ReduceReads now co-reduces bams if they're passed in together with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples. Also: * Added integrationtest for co-reduction * Fixed bug with new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5's accordingly) DEV-200 #resolve #time 8m	2012-11-14 10:33:21 -05:00
Eric Banks	e93d461910	Adding integration test to BQSR for the csv file	2012-11-09 09:11:04 -05:00
Eric Banks	2da76db945	Updating integration tests	2012-11-06 22:23:05 -08:00
Eric Banks	0a2dded093	Fixes for bugs uncovered by unit tests	2012-11-06 16:07:40 -08:00
Eric Banks	b07106b3a7	Reimplement the allele biased downsampling to be smarter. Now we don't blindly pull n% of reads off of each allele. Instead, we try all possible genotype conformations for the contaminating sample and choose the one that provides the best genotype for the target sample (based heuristically on allele balance). This method allows us to save some of the reads that belong to the target sample, which should make Daniel M happy. Added unit tests to test the biased downsampling functionality.	2012-11-06 14:39:58 -08:00
Mark DePristo	1444cd753b	Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome -- The logic for determining active regions was a bit broken in the HC when intervals were used in the system -- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR -- Non-overlapping intervals are always treated as separate objects for determining active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive -- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future. -- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC -- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions -- Remove excessively expensive calls to check bases are upper cased in ReferenceContext -- Update md5s after a lot of manual review and discussion with Ryan	2012-11-01 15:34:04 -04:00
Eric Banks	f8af8a2355	Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite.	2012-10-31 21:28:07 -04:00
Guillermo del Angel	51a9ce28e1	Merge remote-tracking branch 'unstable/master' into develop	2012-10-31 10:29:48 -04:00
Ryan Poplin	4e661847b2	DelocalizedBaseRecalibrator becomes the BaseRecalibrator.	2012-10-29 12:53:39 -04:00
David Roazen	35483a7eef	Update MD5s for PrintReads with BQSR Integration Test The MD5s for these tests were changed in commit 87435f1074615b2cd016f042980109fd53962c8d to match the output of a broken version of BaseRecalibration. With the patch in commit c397102ecc1fd1d2cd8f209a8f358ab4a60b50a7, the output once again matches the original MD5s for these tests, and does not vary as you increase -nct. Final resolution to GSA-632	2012-10-26 14:25:25 -04:00
Eric Banks	ed11b7dab2	Fix UG parallelization test	2012-10-26 12:10:44 -04:00
Eric Banks	7a706ed345	Fix some of the broken integration tests	2012-10-26 11:23:44 -04:00
Eric Banks	b06f689d4b	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 02:13:26 -04:00
Eric Banks	a53e03d525	Do not let reduced reads get removed in the contamination down-sampling	2012-10-26 02:13:04 -04:00
Eric Banks	bf3d61ce82	The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it.	2012-10-26 01:04:51 -04:00
Mark DePristo	cc8c12b954	Committing a broken version of BaseRecalibration -- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA. -- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails. But that's because of this incorrect calculation -- Updating GATKPerformanceOverTime with the new @ClassType annotation	2012-10-25 14:46:35 -04:00
Eric Banks	df9e0b7045	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 02:49:54 -04:00
Eric Banks	72714ee43e	Minor patches to get the contamination down-sampling working for indels. Adding @Hidden logging output for easy debugging.	2012-10-25 02:47:42 -04:00
Eric Banks	c6b57fffda	Added allele biased down-sampling capabilities to the PerReadAlleleLikelihoodMap object, which means that both the UG and HC can use this functionality. Note that it's only available in protected, so GATK-lite users won't be allowed to enable it. Needs more testing.	2012-10-24 22:52:25 -04:00
Eric Banks	9da7bbf689	Refactoring the PerReadAlleleLikelihoodMap in preparation for adding contamination downsampling into protected only.	2012-10-24 15:49:07 -04:00
David Roazen	02018ca764	Legacy BaseRecalibrator walker is neither TreeReducible nor NanoSchedulable. The old BaseRecalibrator walker is not and never will be thread-safe, since it's a LocusWalker that uses read attributes to track state. ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1 for testing purposes, but wait for further testing to be done before using it for production purposes in multithreaded mode.	2012-10-24 15:22:50 -04:00
David Roazen	991658acf4	BQSR: use more granular locking for concurrency control -With this change, BQSR performance scales properly by thread rather than gaining nothing from additional threads. -Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct (NanoScheduler) -Removes high-level locks in the recalibration engines and NestedIntegerArray in favor of maximally-granular locks on and around manipulation of the leaf nodes of the NestedIntegerArray. -NestedIntegerArray now creates all interior nodes upfront rather than on the fly to avoid the need for locking during tree traversals. This uses more memory in the initial part of BQSR runs, but the BQSR would eventually converge to use this memory anyway over the course of a typical run. IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator walker with multiple threads. The BaseRecalibrator walker is not and will never be thread-safe, as it's a LocusWalker that uses read attributes to track state information. ONLY the newer DelocalizedBaseRecalibrator can be made thread-safe (and will hopefully be made so in my subsequent commits). This commit addresses performance, not correctness.	2012-10-24 15:22:50 -04:00
Ryan Poplin	a27ee26481	updating HC integration test.	2012-10-24 14:08:39 -04:00
Ryan Poplin	094db7bf24	We now require at least 10 samples to merge variants into complex events in the HC. Added a new population based bam for the complex event integration test.	2012-10-24 14:07:36 -04:00
Mauricio Carneiro	4cd1a92358	Updating RR integration tests Forgot to update the integration tests after merging DEV-117 with optimizations from GATK main repo.	2012-10-23 11:26:26 -04:00
Mauricio Carneiro	c210b7cde4	Merge GATK repo into CMI-GATK Bringing in the following relevant changes: * Fixes the indel realigner N-Way out null pointer exception DEV-10 * Optimizations to ReduceReads that bring the run time to 1/3rd. Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java DEV-10 #resolve #time 2m	2012-10-23 10:59:11 -04:00
Mauricio Carneiro	bbf7a0fb09	Adding integration test to ReduceReads coreduction DEV-117 #resolve	2012-10-23 10:56:33 -04:00
Mark DePristo	90f59803fd	MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels -- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles -- All constructors, factory methods, and test builders and their users updated to provide just a single argument -- Updating MD5s for integration tests that change due to genotyping more alleles -- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic. That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele. The effect can be large (new qual = old qual / 2 in one case here). -- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating. This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime. Might be worth it though	2012-10-22 13:47:56 -04:00
Eric Banks	ccae6a5b92	Fixed the RR bug I (knowingly) introduced last week: turns out we can't trust a context size's worth of data from the previous marking. I think Mauricio warned me about this but I forgot.	2012-10-22 11:48:34 -04:00
Mark DePristo	9f2851d769	Updating UnifiedGenotyperGeneralPloidyIntegrationTest following rebasing -- Created a JIRA ticket https://jira.broadinstitute.org/browse/GSA-623 for Guillermo to look at the differences as the multi-allelic nature of many sites seems to change with the new more protected infrastructure. This may be due to implementation issues in the pooled caller, problems with my interface, or could be a genuine improvement.	2012-10-21 20:23:11 -04:00
Mark DePristo	d21e42608a	Updating integration tests for minor changes due to switching to EXACT_INDEPENDENT model by default	2012-10-21 12:43:46 -04:00
Mark DePristo	0fcd358ace	Original EXACT model implementation lives, providing another reference (bi-allelic only) EXACT model -- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case -- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations -- Added systematic unit tests covering this implementation, and comparing it to others -- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0, even after summing up multiple likelihoods, which is just not safe to do and was causing us to lose likelihood in some cases -- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right	2012-10-21 12:42:31 -04:00
Mark DePristo	eaffb814d3	IndependentExactAFCalc is now the default EXACT model implementation -- Changed UG / HC to use this one via the StandardCallerArgumentCollection -- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums	2012-10-21 12:42:31 -04:00
Mark DePristo	326f429270	Bugfixes to make new AFCalc system pass integrationtests -- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests -- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel, return a null VC when the only allele we get from our final alleles to use method is the reference base -- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF == 0) = X for some meaningful value of X. Added unit test to ensure this behavior is correct -- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order. Add contract to ensure this doesn't ever happen again -- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model --	2012-10-21 12:42:31 -04:00
Mark DePristo	695cf83675	More docs and contracts for classes in genotyper.afcalc -- Future protection of the output of GeneralPloidyExactAFCalc, which produces in some cases bad likelihoods (positive values)	2012-10-21 12:42:31 -04:00
Mark DePristo	99c9031cb4	Merge AFCalcResultTracker into StateTracker, cleanup -- These two classes were really the same, and now they are actually the same! -- Cleaned up the interfaces, removed duplicate data -- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed) -- Moved goodProbability and goodProbabilityVector utilities to MathUtils. Very useful for contracts!	2012-10-21 12:42:31 -04:00
Guillermo del Angel	e9b7324dc1	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-21 12:38:49 -04:00
Guillermo del Angel	67b9e7319e	Fix for integration tests: new criterion in AF exact calculation model to trim alleles based on likelihoods does produce better results and resulting alleles changed in 2 sites at integration tests (and all subsequent sites after this had minor annotation differences due to RankSum dithering)	2012-10-21 12:38:33 -04:00
Eric Banks	0616b98551	Not sure why we were setting the UAC variables instead of the simpleUAC ones when that's what we wanted.	2012-10-21 08:26:26 -04:00
Eric Banks	2c624f76c8	Refactoring the Unified (and Standard) Argument Collections because it was really ugly that the subclass had to do all the cloning for the super class. The clone() method is really not recommended best practice in Java anyways, so I changed it so that we use standard overloaded constructors. Confirmed that the Haplotype Caller --help docs do not include UG-specific arguments.	2012-10-20 20:35:54 -04:00
Ryan Poplin	a647f1e076	Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore.	2012-10-20 16:38:18 -04:00
Eric Banks	4622896312	Oops, killed contracts	2012-10-19 13:04:05 -04:00
Eric Banks	f7bd4998fc	No need for dummy GLs	2012-10-19 12:13:59 -04:00
Eric Banks	deca564aef	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-19 12:01:49 -04:00
Eric Banks	d3cf37dfaf	Bug fix for general ploidy model: when choosing the most likely alternate allele(s), you need to weight the likelihood mass by the ploidy of the specific alleles (otherwise all alt alleles will have the same probability). This fixes Yossi's issue with pooled validation calling. This may break integration tests, but I will leave that to GdA to handle.	2012-10-19 12:01:45 -04:00
Eric Banks	27d8d3f51e	RR optimization: don't recalculate the entire bitset of variant sites for every read added to the sliding window. Instead, reuse as much of the previously calculated bitset as you can (basically from the window start until the start of the new read minus the context size). In some awfully performing regions this cuts down the runtime in half, although in others this doesn't seem to help much (so clearly something else is going on). Note that I still need to fix one last bug here, but it's almost done.	2012-10-19 11:59:34 -04:00
Ryan Poplin	b4e69239dd	In order to be considered an informative read in the PerReadAlleleLikelihoodMap it has to be informative compared to all other alleles not just the worst allele. Also, fixing a bug when there is only one allele in the map.	2012-10-18 14:31:15 -04:00
Eric Banks	20ffbcc86e	RR optimization: profiling was showing that the BaseCounts class was a major bottleneck because the underlying implementation was a HashMap. Given that the map index was an indexable Enum anyways, it makes a lot more sense to implement as a native array. Knocks 30% off the runtime in bad regions.	2012-10-17 21:44:53 -04:00
Mauricio Carneiro	32ee2c7dff	Refactored the compression interface per sample in ReduceReads. The CompressionStash is now responsible for keeping track of all intervals that must be kept uncompressed by all samples. In general this is a list generated by a tumor sample that will enforce all normal samples to abide. - Updated ReduceReads integration tests - Sliding Window is now using the CompressionStash (single sample). DEV-104 #resolve #time 3m	2012-10-17 16:40:40 -04:00
Mauricio Carneiro	b57df6cac8	Bringing CMI changes into the main GATK repo. Merge remote-tracking branch 'cmi/master'	2012-10-17 15:23:19 -04:00
Mark DePristo	fa93681f51	Scalability test for EXACT models	2012-10-17 14:15:11 -04:00
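
Commit a1d6461121 above fixes a scale mismatch: a Phred-scaled quality threshold was being compared against a log10 posterior. The following is a minimal, self-contained sketch of the conversion that commit describes, not the GATK code itself. The class name `PhredThresholdSketch` is hypothetical, the `Allele` argument is replaced by a plain log10 posterior value, and `qualToProb` is a local stand-in that assumes the usual definition 1 - 10^(-Q/10) rather than a call into `QualityUtils`.

```java
// Sketch only: illustrates converting a Phred-scaled threshold into log10 space
// before comparing it with a log10 posterior. Names here are hypothetical.
final class PhredThresholdSketch {

    // Assumed definition: probability of being correct for a Phred quality Q.
    static double qualToProb(final double qual) {
        return 1.0 - Math.pow(10.0, -qual / 10.0);
    }

    // Both arguments are in log10 space, mirroring the isPolymorphic comparison
    // quoted in the commit message.
    static boolean isPolymorphic(final double log10PosteriorAFGt0,
                                 final double log10MinPNonRef) {
        return log10PosteriorAFGt0 >= log10MinPNonRef;
    }

    // Takes a Phred-scaled threshold and converts it before delegating.
    static boolean isPolymorphicPhredScaledQual(final double log10PosteriorAFGt0,
                                                final double minPhredScaledQual) {
        if (minPhredScaledQual < 0)
            throw new IllegalArgumentException("phredScaledQual " + minPhredScaledQual + " < 0");
        final double log10Threshold = Math.log10(qualToProb(minPhredScaledQual));
        return isPolymorphic(log10PosteriorAFGt0, log10Threshold);
    }

    public static void main(final String[] args) {
        final double confident = Math.log10(0.9999); // error probability 1e-4
        final double weak = Math.log10(0.99);        // error probability 1e-2
        System.out.println(isPolymorphicPhredScaledQual(confident, 30.0)); // true
        System.out.println(isPolymorphicPhredScaledQual(weak, 30.0));      // false
    }
}
```

With a threshold of Q30 the converted cutoff is log10(0.999), so a posterior of 0.9999 passes and a posterior of 0.99 does not.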


772 Commits (0018af0c0af3100d220315cc0b21b76b86f0e415)
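
Commit 20ffbcc86e in the log above replaces a HashMap keyed by an enum with a native array indexed by that enum. A rough sketch of the idea follows, under stated assumptions: the class `BaseCountsSketch`, the `Base` enum, and all method names are hypothetical and are not taken from the GATK source. The point is only that `ordinal()` gives a dense index, so each lookup is a plain array access with no hashing or boxing.

```java
// Sketch only: per-base counters stored in an int[] indexed by enum ordinal.
// The enum and method names are illustrative assumptions.
final class BaseCountsSketch {

    enum Base { A, C, G, T, N }

    // One slot per enum constant; replaces a Map<Base, Integer>.
    private final int[] counts = new int[Base.values().length];

    void incr(final Base base) {
        counts[base.ordinal()]++;
    }

    int get(final Base base) {
        return counts[base.ordinal()];
    }

    int total() {
        int sum = 0;
        for (final int c : counts) sum += c;
        return sum;
    }

    // Returns the base with the highest count; ties go to the earliest constant.
    Base mostCommon() {
        Base best = Base.A;
        for (final Base b : Base.values())
            if (counts[b.ordinal()] > counts[best.ordinal()]) best = b;
        return best;
    }

    public static void main(final String[] args) {
        final BaseCountsSketch c = new BaseCountsSketch();
        c.incr(Base.A);
        c.incr(Base.A);
        c.incr(Base.G);
        System.out.println(c.get(Base.A) + " " + c.total() + " " + c.mostCommon()); // 2 3 A
    }
}
```

The standard library's `java.util.EnumMap` is backed by an array in a similar way, and is an alternative when a `Map` interface is still wanted.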