Commit Graph

312 Commits (6ed9eb3da9eed02e54dd893a4f4fb60b4caa514b)

Author SHA1 Message Date
Mauricio Carneiro 2c3dc291c0 Added positive/negative strand to the synthetic reads 2012-09-21 10:00:48 -04:00
Mauricio Carneiro 51cb5098e4 Fixed the alignment issues with reads that started with empty consensus headers 2012-09-21 10:00:47 -04:00
Mauricio Carneiro aa1d2f3a5b Not every consensus is well aligned. Need to check more, but starting position has been fixed. 2012-09-21 10:00:45 -04:00
Mauricio Carneiro 97874b92d1 Program runs, but the consensus reads are all out of place and need more tags 2012-09-21 10:00:44 -04:00
Mauricio Carneiro 3494a52ddc another intermediate commit to update changes from stable 2012-09-21 10:00:43 -04:00
Mauricio Carneiro a89ff7b5dd Intermediate commit to resolve conflicts coming from stable 2012-09-21 10:00:41 -04:00
Eric Banks 1316b579f0 Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table. 2012-09-20 14:14:34 -04:00
Eric Banks 4b7edc72d1 Fixing edge case bug in the Exact model (both standard and generalized) where we could abort prematurely in the special case of multiple polymorphic alleles and samples with widely different depths of coverage (e.g. exome and low-pass). In these cases it was possible to call the site bi-allelic when in fact it was multi-allelic (but it wouldn't cause it to create a monomorphic call). 2012-09-20 10:59:42 -04:00
Mauricio Carneiro ee31a54a03 Merged bug fix from Stable into Unstable 2012-09-19 16:09:45 -04:00
Mauricio Carneiro 7cf9911924 Fixed ReduceReads bug where variant regions were missing.
This affected variant regions with more than 100 reads and less than 250 reads. Only bams reduced with GATK v2 and 2.1 were affected.
2012-09-19 16:09:08 -04:00
Ryan Poplin 26e35e5ee2 updating BQSR integration tests 2012-09-19 14:10:34 -04:00
Ryan Poplin b99099f05c The BaseRecalibrator and DelocalizedBaseRecalibrator have gotten out of sync. Fixing. 2012-09-19 12:30:26 -04:00
Ryan Poplin 7a7103a757 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-19 10:39:18 -04:00
Guillermo del Angel bebd5c14b8 Update general ploidy md5's due to bad merge of md5's in previous commit, and new shortened interval definition for EMIT_ALL_CONFIDENT_SITES was buggy 2012-09-18 20:12:15 -04:00
Guillermo del Angel ca010160a9 Merge fix 2012-09-14 14:05:21 -04:00
Guillermo del Angel 6b37350bc0 Two hairy bugs in pool caller: a) Site error model wasn't counting errors in insertions correctly - Alleles passed in had padded ref byte, but event base in PileupElement doesn't have it. As a result, mismatch rate was grossly overestimated with insertions and we missed several calls we should have made. Integration test reflects changes. b) Adding a ref GL to the exact model is correct mathematically but AFResult wasn't filled properly. As a result, QUAL was junk in pure ref sites, and in all other sites the last ref GL introduced wasn't properly updating Pr(AF>0). c) Added integration test that covers -out_mode EMIT_ALL_CONFIDENT_SITES. Not fully sure if the math is 100% correct (for both diploid and generalized case) but at least now diploid and non-diploid cases behave similarly. md5 of this new test will fail since it's taking me a long time to run so I'll update from Bamboo output shortly 2012-09-14 13:13:22 -04:00
Eric Banks 0206e09a6a Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 15:18:27 -04:00
Eric Banks d94d0d15c2 Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout. 2012-09-12 15:15:40 -04:00
Ryan Poplin c9111bb23e Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 14:46:50 -04:00
Ryan Poplin 849a2b8839 Adding HC integration test for _structural_ insertions and deletions. 2012-09-12 12:23:00 -04:00
Eric Banks 994a4ff387 Track all outputs from BQSR (.table, .csv., and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings). 2012-09-12 11:24:53 -04:00
Mark DePristo bfbf1686cd Fixed nasty bug with defaulting to diploid no-call genotypes
-- For the pooled caller we were writing diploid no-calls even when other samples were haploid.  Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing.
-- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all.
-- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)
2012-09-12 07:08:03 -04:00
Ryan Poplin 35d15278af Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-11 14:34:17 -04:00
Guillermo del Angel 13831106d5 Fix GSA-535: storing likelihoods in allele map was busted when running HaplotypeCaller, only the last likelihood of a haplotype was being stored, as opposed to the max likelihood of all haplotypes mapping to an allele 2012-09-11 11:01:26 -04:00
Ryan Poplin aa9829b55c fixing typo 2012-09-10 13:36:37 -04:00
Guillermo del Angel 10c720cbba Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-10 09:56:47 -04:00
Guillermo del Angel 2d4b00833b Bug fix for logging likelihoods in new read allele map: reads which were filtered out were being excluded from map, but they should be included in annotations 2012-09-09 20:35:45 -04:00
Ryan Poplin 36913706c0 Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events. 2012-09-09 13:47:54 -04:00
Ryan Poplin 688fc9fb56 Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events. 2012-09-09 10:36:09 -04:00
David Roazen cb84a6473f Downsampling: experimental engine integration
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet

-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.

-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
2012-09-06 15:03:27 -04:00
Yossi Farjoun d6884e705a Revert "fixed a typo in StringText.properties"
This reverts commit b74c1c17e748f75e59d23545084b983e2a8d2fa6.
2012-09-05 15:21:00 -04:00
Yossi Farjoun f4b39a7545 Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable
merging trivially after a commit
2012-09-05 14:33:39 -04:00
Yossi Farjoun 6e517df5d9 fixed a typo in StringText.properties 2012-09-05 14:33:08 -04:00
Ryan Poplin 9cc1a9931b Resolving merge conflicts. 2012-09-04 10:47:38 -04:00
Ryan Poplin c9944d81ef Skip array needs to also be used in the updateDataForRead function of the delocalized BQSR. 2012-09-04 10:33:37 -04:00
Mark DePristo 1b0ce511a6 Updating BQSR tests due to my change to reset BQSR calibration data 2012-08-31 19:51:09 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, of handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR is excepting in parallel, which subsequent commit with fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Mark DePristo 1200848bbf Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
2012-08-30 10:15:10 -04:00
Ryan Poplin 57d997f06f Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start 2012-08-30 10:10:43 -04:00
Ryan Poplin 35baf0b155 This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test. 2012-08-30 09:07:36 -04:00
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Mark DePristo 0f4acaae1b Update MD5s with new FS score 2012-08-28 08:06:47 -04:00
Mark DePristo b3fd74f0c4 HaplotypeCaller forbids BAQ 2012-08-24 13:25:05 -04:00
Ryan Poplin fe3069b278 Merged bug fix from Stable into Unstable 2012-08-22 14:40:34 -04:00
Ryan Poplin e5cfdb4811 Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt() 2012-08-22 14:39:35 -04:00
Ryan Poplin 63213e8eb5 Expanding the HaplotypeCaller integration tests to cover a wider range of data 2012-08-22 14:18:44 -04:00
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00
Guillermo del Angel 6a8cf1c84a Enable and adapt HaplotypeScore and MappingQualityZero as active region annotations now that we have per-read likelihoods passed in to annotations 2012-08-21 14:35:40 -04:00
Guillermo del Angel d0644b3565 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:35:23 -04:00
Ryan Poplin 94e7f677ad Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:21:47 -04:00
Guillermo del Angel 418ace463a More merge conflict resolution 2012-08-21 10:15:52 -04:00
Ryan Poplin 10961db3ce Another round of FindBugs fixes. Object returns its internal reference to an externally mutable array. Very dangerous. 2012-08-21 09:35:55 -04:00
Ryan Poplin 605acaae9c Another round of FindBugs fixes. Object internally stores a reference to an externally mutable array. Very dangerous. 2012-08-21 09:33:58 -04:00
Ryan Poplin 55b7949d68 Another round of FindBugs fixes. Comparator doesn't implement Serializable. 2012-08-21 09:20:55 -04:00
Eric Banks 286b658fab Re-enabling parallelism in the BaseRecalibrator now that the release is out. 2012-08-20 21:25:14 -04:00
Guillermo del Angel 7bbd2a7a20 Fixing merge conflicts 2012-08-20 20:38:25 -04:00
Ryan Poplin 77fbaec044 Another round of FindBugs fixes. Class implements its own compareTo() but uses base Object.equals() which can lead to unpredictable behavior. 2012-08-20 16:55:00 -04:00
Ryan Poplin a9472c1980 Another round of FindBugs fixes. Inefficient use of keySet iterator instead of entrySet iterator. 2012-08-20 16:11:45 -04:00
Ryan Poplin 5db3bd6fd2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 15:28:57 -04:00
Ryan Poplin 464d49509a Pulling out common caller arguments into its own StandardCallerArgumentCollection base class so that every caller isn't exposed to the unused arguments from every other caller. 2012-08-20 15:28:39 -04:00
Ryan Poplin c67d708c51 Bug fix in HaplotypeCaller for non-regular bases in the reference or reads. Those events don't get created any more. Bug fix for advanced GenotypeFullActiveRegion mode: custom variant annotations created by the HC don't make sense when in this mode so don't try to calculate them. 2012-08-20 13:41:08 -04:00
Eric Banks 154f65e0de Temporarily disabling multi-threaded usage of BaseRecalibrator for performance reasons. 2012-08-20 12:43:17 -04:00
Guillermo del Angel 963ad03f8b Second step of interface cleanup for variant annotator: several bug fixes, don't hash pileup elements to Maps because the hashCode() for a pileup element is not implemented and strange things can happen. Still several things to do, not done yet 2012-08-19 21:18:18 -04:00
Guillermo del Angel b61ecc7c19 Fix merge conflicts 2012-08-16 20:45:52 -04:00
Guillermo del Angel d26183e0ec First preliminary big refactoring of UG annotation engine. Goals: a) Remove gigantic hack that cached per-read haplotype likelihoods in a static array so that annotations would go back and retrieve them, b) unify interface for annotations between HaplotypeCaller and UnifiedGenotyper, c) as a consequence, removed and cleaned duplicated code. As a bonus, annotations have now more relevant info to help them compute values.
Major idea is that per-read haplotype likelihoods are now stored in a single unified object of class PerReadAlleleLikelihoodMap. Class implementation in theory hides internal storage details from outside work (still may need work cleaning up interface), and this object(or rather, a Map from Sample->perReadAlleleLikelihoodMap) is produced by UGCalcLikelihoods. The genotype calculation is also able to potentially use this info if needed. All InfoFieldAnnotations now get an extra argument with this map. Currently, this map is only produced for indels in UG, or for all variants within HaplotypeCaller. If this map is absent (SNPs in UG), the old Pileup interface is used, but it's avoided whenever possible. FORMAT annotations are not yet changed but will be focus of second step. Major benefit will be that annotations will be able to very easily discard non-informative reads for certain events. HaplotypeCaller also uses this new class, and no longer hard-codes the mapping of allele ->list(reads) but instead uses the same objects and interfaces as the rest of the modules. Code still needs further testing/cleaning/reviewing/debugging
2012-08-16 20:36:53 -04:00
Eric Banks 05cbf1c8c0 FindBugs 'Efficiency' fixes 2012-08-16 15:40:52 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Eric Banks 2df04dc48a Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests. 2012-08-16 13:05:17 -04:00
Eric Banks 9035b554fb Adding tests for the --solid_nocall_strategy argument 2012-08-15 23:13:24 -04:00
Mark DePristo 3556c36668 Disable general ploidy integration tests because they are running forever 2012-08-15 21:13:16 -04:00
Mark DePristo 243af0adb1 Expanded the BQSR reporting script
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libaries now, include gplots, grid, and gsalib to run correctly
2012-08-12 13:45:14 -04:00
Eric Banks eca9613356 Adding support of X and = CIGAR operators to the GATK 2012-08-10 14:54:07 -04:00
Ryan Poplin 2a113977a9 Resolving merge conflicts with the new MD5s 2012-08-10 11:47:00 -04:00
Ryan Poplin 5f82ffd5d8 Adding LowQual filter to the output of the HaplotypeCaller. 2012-08-10 11:25:14 -04:00
Ryan Poplin 9887bc4410 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 16:31:06 -04:00
Mauricio Carneiro abb168e1ba Merged bug fix from Stable into Unstable 2012-08-09 16:09:58 -04:00
Mauricio Carneiro 67d4148b32 Fixing but reported by Thomas in the forum where reads were soft-clipped beyond the limits of the contig and ReduceReads was failing with a NoSuchElement exception. Now we hard clip anything that goes beyond the boundaries of the contig. 2012-08-09 15:58:18 -04:00
Mauricio Carneiro 58420098ac Merged bug fix from Stable into Unstable 2012-08-09 13:02:23 -04:00
Mauricio Carneiro c6132ebe26 Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out. Added Guillermo's bug report as an integration test 2012-08-09 13:02:11 -04:00
Ryan Poplin e48727dae3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 10:31:10 -04:00
Guillermo del Angel 5be7e0621d Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 09:58:34 -04:00
Guillermo del Angel 71ee8d87b3 Rename per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarify wording in VCF header 2012-08-09 09:58:20 -04:00
Mauricio Carneiro 250ffd2ad7 Merged bug fix from Stable into Unstable 2012-08-08 15:50:07 -04:00
Mauricio Carneiro 78c1556186 Fixing ReduceReads downsampling bug -- downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception 2012-08-08 15:49:31 -04:00
Ryan Poplin 1223d77546 Removing argument from HaplotypeCaller that was made unneccesary by recent improvements to triggering around large events 2012-08-08 15:13:20 -04:00
Eric Banks 4b2e3cec0b Quick pass of FindBugs 'inefficient use of keySet iterator instead of entrySet iterator' fixes for core tools. 2012-08-08 14:29:41 -04:00
Guillermo del Angel 3e2752667c Intermediate checkin for ReducedReads with HaplotypeCaller - change min read count over k-mer to average count over k-mer when doing assembly of a reduced read (not optimal, currently trying max and then will decide on best approach), fix merge conflicts 2012-08-08 12:07:33 -04:00
Eric Banks 2c76f71a03 Update -maxAlleles argument in integration tests 2012-08-06 22:48:04 -04:00
Guillermo del Angel c66a896b8e Fix UG integration test broken by new -maxAltAlleles nomenclature 2012-08-06 21:29:21 -04:00
Guillermo del Angel 97c5ed4feb Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 20:22:31 -04:00
Guillermo del Angel 238d55cb61 Fixes for running HaplotypeCaller with reduced reads: a) minor refactoring, pulled out code to compute mean representative count to ReadUtils, b) Don't use min representative count over kmer when constructing de Bruijn graph - this creates many paths with multiplicity=1 and makes us lose a lot of SNP's at edge of capture targets. Use mean instead 2012-08-06 20:22:12 -04:00
Ryan Poplin d85b38e4da Updating HaplotypeCaller integration tests 2012-08-06 12:02:19 -04:00
Ryan Poplin b8709d8c67 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 11:41:28 -04:00
Ryan Poplin 973d1d47ed Merging together the computeDiploidHaplotypeLikelihoods functions in the HaplotypeCaller's LikelihoodEngine so they both benefit from the ReducedRead's RepresentativeCount 2012-08-06 11:40:07 -04:00
Ryan Poplin b7eec2fd0e Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly. 2012-08-05 12:29:10 -04:00
Guillermo del Angel d2e8eb7b23 Fixed 2 haplotype caller unit tests: a) new interface for addReadLikelihoods() including read counts, b) disable test that test basic DeBruijn graph assembly, not ready yet 2012-08-03 14:26:51 -04:00
Ryan Poplin c3b6e2b143 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-03 13:14:43 -04:00
Ryan Poplin ff80f17721 Using PathComparatorTotalScore in the assembly graph traversal does a better job of capturing low frequency branches that are inside high frequnecy haplotypes. 2012-08-03 13:14:37 -04:00
Guillermo del Angel 6f8e7692d4 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-03 12:24:37 -04:00
Guillermo del Angel 9e25b209e0 First pass of implementation of Reduced Reads with HaplotypeCaller. Main changes: a) Active region: scale PL's by representative count to determine whether region is active. b) Scale per-read, per-haplotype likelihoods by read representative counts. A read representative count is (temporarily) defined as the average representative count over all bases in read, TBD whether this is good enough to avoid biases in GL's. c) DeBruijn assembler inserts kmers N times in graph, where N is min representative count of read over kmer span - TBD again whether this is the best approach. d) Bug fixes in FragmentUtils: logic to merge fragments was wrong in cases where there is discrepancy of overlaps between unclipped/soft clipped bases. Didn't affect things before but RR makes prevalence of hard-clipped bases in CIGARs more prevalent so this was exposed. e) Cache read representative counts along with read likelihoods associated with a Haplotype. Code can/should be cleaned up and unified with PairHMMIndelErrorModelCode, as well as refactored to support arbitrary ploidy in HaplotypeCaller 2012-08-03 12:24:23 -04:00
Ryan Poplin 3ece4c4993 Merged bug fix from Stable into Unstable 2012-08-02 11:41:36 -04:00
Ryan Poplin cb8bc18aeb Fix for error in HaplotypeCaller. HC has a UG argument collection for the UG engine but some of those arguments aren't appropriate to set. 2012-08-02 11:41:06 -04:00
Guillermo del Angel 9ac72dbd4d Merged bug fix from Stable into Unstable 2012-08-01 10:56:45 -04:00
Guillermo del Angel 01265f78e6 Add sanity check and possible bug fix for forum user: if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) in pool mode, empty allele list, signifying site can't be genotyped 2012-08-01 10:50:00 -04:00
Guillermo del Angel 4a23f3cd11 Simple cleanup of pool caller code - since usage is much more general than just calling pools, AF calculation models and GL calculation models are renamed from Pool -> GeneralPloidy. Also, don't have users specify special arguments for -glm and -pnrm. Instead, when running UG with sample ploidy != 2, the correct general ploidy modules are automatically detected and loaded. -glm now reverts to old [SNP|INDEL|BOTH] usage 2012-07-31 16:34:20 -04:00
Mark DePristo e00ed8bc5e Cleanup BQSR classes
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration.  It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers.  As code becomes embedded throughout GATK its should be refactored to live in utils
-- Removed unncessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces.  This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
Guillermo del Angel e6b326c189 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 21:32:19 -04:00
Guillermo del Angel 6c9d3ec155 Remerge after changes to allele construction code. More cleanups/fixes to artificial read pileup provider 2012-07-30 21:32:03 -04:00
Ryan Poplin 3dabb90eb0 Updating example active region walker integration test. 2012-07-30 21:26:16 -04:00
Ryan Poplin c2b57ee444 updating HC integration tests after these changes. 2012-07-30 12:41:40 -04:00
Ryan Poplin 13591b169f Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 12:13:24 -04:00
Ryan Poplin 48b9495460 Fixes to the likelihood based LD calculation for deciding when to combine consecutive events. 2012-07-30 12:12:56 -04:00
Ryan Poplin 9002758ede Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 12:10:00 -04:00
Ryan Poplin 7a73042cd3 Bug fix for the case of merging two VCs when a deletion deletes the padding base for a consecutive indel. Added unit test to cover this case. 2012-07-30 12:09:23 -04:00
Eric Banks 5743694196 Merged bug fix from Stable into Unstable 2012-07-30 11:35:28 -04:00
Eric Banks 79195b97a3 Adding categories for the remaining uncategorized walkers 2012-07-30 11:35:08 -04:00
Guillermo del Angel 5b9a1af7fe Intermediate fix for pool GL unit test: fix up artificial read pileup provider to give consistent data. b) Increase downsampling in pool integration tests with reference sample, and shorten MT tests so they don't last too long 2012-07-30 09:56:10 -04:00
Eric Banks 99b15b2b3a Final checkpoint: all tests pass. Note that there were bugs in the PoolGenotypeLikelihoodsUnitTest that needed fixing and eventually led to my needing to disable one of the tests (with a note for Guillermo to look into it). Also note that while I have moved over the GATK to use the new non-null representation of Alleles, I didn't remove all of the now-superfluous code throughout to do padding checking on merges; we'll need to do this on a subsequent push. 2012-07-29 01:07:59 -04:00
Eric Banks 2b1b00ade5 All integration tests and VC/Allele unit tests are passing 2012-07-27 17:03:49 -04:00
Eric Banks beb7610195 Resolving merge conflicts 2012-07-27 15:52:02 -04:00
Eric Banks 27e7e11ec0 Allele refactoring checkpoint #3: all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this. 2012-07-27 15:48:40 -04:00
Ryan Poplin 22bb4804f0 HaplotypeCaller now use an excessive number of high quality soft clips as a triggering signal in order to capture both end points of a large deletion in a single active region. 2012-07-27 12:44:02 -04:00
Ryan Poplin a0890126a8 ActiveRegionWalker's isActive function returns a results object now instead of just a double. 2012-07-27 11:01:39 -04:00
Eric Banks baf3e33730 Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass. 2012-07-26 23:27:11 -04:00
Ryan Poplin 35e803e110 Merged bug fix from Stable into Unstable 2012-07-26 14:00:04 -04:00
Ryan Poplin 4f741b4cd7 Smoothing in the BQSR bins should be one error observation and one non-error observation. 2012-07-26 13:59:02 -04:00
Guillermo del Angel 2ae890155c Improvements to indel calling in pool caller: a) Compute per-read likelihoods in reference sample to determine wheter a read is informative or not. b) Fixed bugs in unit tests. c) Fixed padding-related bugs when computing matches/mismatches in ErrorModel, d) Added a couple of more integration tests to increase test coverage, including testing odd ploidy 2012-07-26 13:43:00 -04:00
Eric Banks a694d1b5de Merge branch 'master' into allelePadding 2012-07-26 01:53:14 -04:00
Eric Banks 32516a2f60 Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels with fail at this point. 2012-07-26 01:50:39 -04:00
Eric Banks 7eb3f54750 Added category docs for the remaining public walkers (I think I got them all). I removed a couple of totally unnecessary walkers. 2012-07-25 21:40:28 -04:00
Eric Banks 0a98a6aa8d Adding extraDocs tag per Mauricio's request 2012-07-25 18:23:18 -04:00
Eric Banks 05fa377a8e Adding GATK categories to standard walkers. Will add to remaining walkers after the next successful release (so that I can see which walkers are public and still need it). 2012-07-25 16:05:47 -04:00
Eric Banks a5721a8846 Context covariate optimizations were not suited for multiple threads, so I removed them (since that ended up being much, much easier than trying to make the covariates thread local). Added -nt 2 layer to BQSR integration tests to confirm that it now works with multiple threads. 2012-07-25 13:38:07 -04:00
Eric Banks 675ccab2fa Renaming BQSR to BaseRecalibrator 2012-07-23 10:17:17 -04:00
Ryan Poplin 2e486d83e2 Updating HaplotypeCaller docs and expanding integration tests. 2012-07-23 10:05:42 -04:00
Mauricio Carneiro df965d4a5a Fixing BQSR integration test 2012-07-21 11:11:45 -04:00
Mauricio Carneiro 116885a450 Removed the "Walker" suffix from all walkers that had it.
* Did not touch archived walkers... those can be named whatever.
   * Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...)
   * Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.
2012-07-20 17:27:11 -04:00
Ryan Poplin 1592841c93 New function for merging nearby events into MNPs or complex substitutions. Added extensive unit tests. 2012-07-19 13:16:33 -04:00
Guillermo del Angel c16f9f2f15 a) Use new method to check for GATK Like, b) minor improvements to indel pool caller (more to come): brain-dead, quick way to limit number of alt alleles to genotype. We can't process too many alt alleles because of the combinatorial explosion of GL values with high ploidy, and some STR validation targets had up to 12 alt alleles, resulting of GL vectors of > 1e8 elements. Can't use pileup elements since typically not many alleles will be in one pileup, and different alleles will appear in different samples, TBD a nicer solution. c) Commit to posterity scala script for large scale validation calling, still work in progress 2012-07-19 10:24:08 -04:00
Eric Banks 5f5edeca63 Reverting move of BQSR tests to public, as per DR's email 2012-07-19 10:02:05 -04:00
Eric Banks 9c1ab1b0c0 Move BQSR integration test and its dependent files into public; previously there was a protected->private dependency. 2012-07-18 21:11:33 -04:00
Mark DePristo 74e153ff4a FisherStrand now uses RankSumTest isUsableBase to decide if a read should be included in testing
-- Previously used hardcoded MAPQ > 20 && QUAL > 20 but now uses isUsableBase
-- Updating MD5s as appropriate
2012-07-18 16:07:47 -04:00
Eric Banks e4db8dde91 Enabled a whole other bunch of integration tests for BQSRv2. While I was there I also changed the default context size for indels to 3 (from 8) since that's what works best in the current implementation (as suggested by Ryan). At this point, all of the new core tools (ReduceReads, BQSRv2, HaplotypeCaller, UG extensions) have been moved over to protected and should be stable. Looks like we are pretty much ready for GATK 2.0! 2012-07-17 23:36:43 -04:00
Guillermo del Angel 731bbba2e6 Bug fixes for integration test, use correct new UG syntax 2012-07-17 16:57:59 -04:00
Guillermo del Angel 40b8c7172c Pool Caller refactoring in preparation of GATK 2.0: a) PoolCallerUnifiedArgumentCollection disappeared, and arguments moved to UnifiedArgumentCollection. b) PoolCallerWalker is no longer needed and redundant, all functionality subsumed by UG. UG now checks if GATK is lite - if so, don't allow ploidy > 2. c) Moved pool classes from private to protected. d) Changed the way to specify ploidy. Instead of specifying samples per pool and having ploidy = 2*samplesPerPool, have user specify ploidy directly, which is cleaner. Update tests accordingly. We can now call triploid seedless grape genotypes correctly in theory. e) Renamed argument -reference to -reference_sample_calls since the former is ambiguous and it's not clear what it refers to. 2012-07-17 15:27:04 -04:00
Eric Banks b0d99fd10d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 15:12:28 -04:00
Eric Banks 305db8c0d1 Total rewrite of the isGATKLite() functionality with help of Khalid/David. PluginManager was not working for us. 2012-07-17 15:11:03 -04:00
Ryan Poplin bf2d5efe4d Moving HaplotypeCaller integration and unit tests over to protected as well. 2012-07-17 14:51:26 -04:00
Ryan Poplin c55934043e Moving HaplotypeCaller from private to protected 2012-07-17 14:41:19 -04:00
Eric Banks 3a64398d07 Cleaned up the isGATKLite check 2012-07-17 12:46:16 -04:00
Eric Banks 62c5228048 1) Revert previous change - indel recalibration is turned on by default and users of the Lite version will need to turn it off to avoid a User Error. 2) Implemented the engine.isGATKLite() method. 2012-07-17 12:23:40 -04:00
Eric Banks 40618ac471 A bunch of BQSR changes: 1) by default we do not emit indel quals, but they can be turned on with --enable_indel_quals. 2) We check whether or not we are running in Lite mode (not done yet) and if so and the user is trying to recalibrate indels, we throw a User Error (not supported). 3) Like v1 we now allow the user to set the qual value below which we don't recalibrate (this was the remaining source of differences in the v1 vs. v2 plots). 2012-07-17 10:52:43 -04:00
Eric Banks d5b3a2eabf Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 00:32:53 -04:00
Eric Banks f657b8bda8 Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow. 2012-07-17 00:32:34 -04:00
Eric Banks 0a89adbcdb Add utility decorators so that classes can tell you which package source they come from if they want to (suggested by Khalid). Using those decorators, we can easily pull out the BQSR updateDataForPileupElement() method into a standard RecalibrationEngine and an AdvancedRecalibrationEngine and use the protected one (AdvancedRE) if available (otherwise, the public one). 2012-07-16 15:34:50 -04:00
Eric Banks 52baac1e16 Move BQSRv2 into public and v1 into the archive. 2012-07-16 14:23:38 -04:00
Joel Thibault 6c6a324583 Loosen a restriction on isOriginalRead()
* no longer needs to satisfy ReadAndIntervalOverlap.OVERLAP_CONTAINED
2012-07-16 14:07:10 -04:00
Eric Banks d7bf74fb7e Updating default value for -mindel to the one used by Khalid in the pipeline and me in my tests. 2012-07-10 02:04:26 -04:00
Mauricio Carneiro 6c17c50fa2 Updates to ReduceReads
* Added optional parameter to not hard clip on the interval border
   * Made not clipping the default behavior (hence integration tests changed)
   * Updated integration tests.
2012-07-09 13:46:51 -04:00
Mauricio Carneiro 12d1c594df Moving ReduceReads into protected 2012-06-25 17:01:33 -04:00
David Roazen 9c6bccfd8b build system overhaul
* Added support for a protected directory whose contents are only made public in binary form

* Simplified and reorganized build.xml to improve readability and maintainability

* build.xml now autodetects most build properties:
    -Includes private/protected if they exist
    -No more STING_BUILD_TYPE or specialized targets for public-only, etc.

* Build targets have changed! There are now two main build options:

"ant"       build everything (GATK and Queue)
"ant gatk"  build just the GATK

It was too hard to build everything before -- now it is the default.

* To run tests with debugging, use -Dtest.debug=true -Dtest.debug.port=XXXX on the command line.
  Much better than the old comment/uncomment method!
2012-05-17 15:16:29 -04:00