Commit Graph

2971 Commits (0a56fe5bc33f9dfd40e25005f2e037bc36e9ffdc)

Author SHA1 Message Date
Mark DePristo 980685af16 Fix GSA-137: Having both DataSource.REFERENCE and DataSource.REFERENCE_BASES is confusing to end users.
-- Removed REFERENCE_BASES option.  You only have REFERENCE now.  There are no efficiency savings for the REFERENCE_BASES option any longer, since the reference bases are loaded lazily, so if you don't use them there's effectively no cost to making the RefContext that could load them.
2012-08-17 14:55:38 -04:00
Eric Banks 2676b7fc2e Put in a sanity check that MLEAC <= AN 2012-08-17 11:49:53 -04:00
Mark DePristo daa26cc64e Print to logger not to System.out in CachingIndexFastaSequenceFile when profiling cache performance 2012-08-17 11:49:02 -04:00
Mark DePristo be0f8beebb Fixed GSA-434: GATK should generate error when gzipped FASTA is passed in.
-- The GATK sort of handles this now, but only if you have exactly the correct sequence dictionary and FAI files associated with the reference.  If you do, the file can be .gz.  If not, the GATK will fail on creating the FAI and DICT files.  Added an error message that handles this case and clearly says what to do.
2012-08-17 11:49:02 -04:00
Mark DePristo a3d2764d11 Fixed: GSA-392 @arguments with just a short name get the wrong argument bindings
-- Now blows up if an argument begins with -.  Implementation isn't pretty, as it actually blows up during Queue extension creation with a somewhat obscure error message, but at least it's something.
2012-08-17 11:49:01 -04:00
Mark DePristo 4c0f198d48 Potential fix for GSA-484: Incomplete writing of temp BCF when running CombineVariants in parallel
-- Keep reading from BCF2 input stream when read(byte[]) returns < number of needed bytes
-- It's possible (I think) that the failure in GSA-484 is due to multi-threaded writing/reading of BCF2 records where the underlying stream is not yet flushed, so read(byte[]) returns a partial result.  Now loops until we get all of the needed bytes or EOF is encountered
2012-08-17 11:49:01 -04:00
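The partial-read behavior described in this entry can be illustrated with a minimal sketch. It assumes a plain `java.io.InputStream`; the class and method names are hypothetical and are not taken from the BCF2 decoder.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not the GATK implementation.
final class ReadFullySketch {
    /**
     * Keeps reading until buf is full or the stream ends.
     * Returns the number of bytes read, which is smaller than
     * buf.length only if EOF was reached first.
     */
    static int readUntilFullOrEof(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            // A single read() may return fewer bytes than requested.
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // EOF before the buffer was filled
            }
            total += n;
        }
        return total;
    }
}
```

A caller can compare the return value with `buf.length` to tell a complete record from a truncated one.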
Mark DePristo de3be45806 Proper function call in BCF2Decoder to validateReadBytes 2012-08-17 11:49:01 -04:00
Eric Banks 53383e82ec Hmm, not good. Fixing the math in PBT resulted in changed MD5s for integration tests that look like significant changes. I am reverting and will report this to Laurent. 2012-08-16 21:41:18 -04:00
Eric Banks 65c594afff Better error message for reads that begin/end with a deletion in LIBS 2012-08-16 21:27:07 -04:00
Guillermo del Angel b61ecc7c19 Fix merge conflicts 2012-08-16 20:45:52 -04:00
Guillermo del Angel d26183e0ec First preliminary big refactoring of UG annotation engine. Goals: a) Remove gigantic hack that cached per-read haplotype likelihoods in a static array so that annotations would go back and retrieve them, b) unify interface for annotations between HaplotypeCaller and UnifiedGenotyper, c) as a consequence, removed and cleaned duplicated code. As a bonus, annotations now have more relevant info to help them compute values.
Major idea is that per-read haplotype likelihoods are now stored in a single unified object of class PerReadAlleleLikelihoodMap. Class implementation in theory hides internal storage details from the outside world (still may need work cleaning up interface), and this object (or rather, a Map from Sample->perReadAlleleLikelihoodMap) is produced by UGCalcLikelihoods. The genotype calculation is also able to potentially use this info if needed. All InfoFieldAnnotations now get an extra argument with this map. Currently, this map is only produced for indels in UG, or for all variants within HaplotypeCaller. If this map is absent (SNPs in UG), the old Pileup interface is used, but it's avoided whenever possible. FORMAT annotations are not yet changed but will be the focus of the second step. Major benefit will be that annotations will be able to very easily discard non-informative reads for certain events. HaplotypeCaller also uses this new class, and no longer hard-codes the mapping of allele ->list(reads) but instead uses the same objects and interfaces as the rest of the modules. Code still needs further testing/cleaning/reviewing/debugging
2012-08-16 20:36:53 -04:00
Mark DePristo 6a2862e8bc GSA-483: Bug in GATKdocs for Enums
-- Fixed to no longer show constants in enums as constant values in the gatkdocs
2012-08-16 16:24:17 -04:00
Eric Banks 3253fc216b FindBugs 'Maintainability' fixes 2012-08-16 15:53:06 -04:00
Eric Banks 05cbf1c8c0 FindBugs 'Efficiency' fixes 2012-08-16 15:40:52 -04:00
Mark DePristo d8071c66ed Removing SlowGenotype object from GATK 2012-08-16 15:23:06 -04:00
Eric Banks a22e7a5358 Should've run 'ant clean' instead of just 'ant'. In any event, these are 2 cases where we are setting a class's internal static variable directly. Very dangerous. 2012-08-16 15:07:32 -04:00
Eric Banks 47b4f7b7e5 One final FindBugs related fix. I think it's safe to consider these changes 'fixes' that are allowed to go in during a code freeze. 2012-08-16 14:59:05 -04:00
Eric Banks ded0e11b45 Killing off some FindBugs 'Reliability' issues 2012-08-16 14:00:48 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Eric Banks 611d9b61e2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-16 13:05:36 -04:00
Eric Banks 2df04dc48a Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests. 2012-08-16 13:05:17 -04:00
Mark DePristo 132cdfd9c1 GSA-488: MLEAC > AN error when running variant eval fixed 2012-08-16 13:03:14 -04:00
Mark DePristo 4e42988c66 GSA-485: Remove repairVCFHeader from GATK codebase
-- Removed half-a*ssed attempt to automatically repair VCF files with bad headers, which allowed users to provide a replacement header overwriting the file's actual header on the fly.  Not a good idea, really.  Eric has promised to create a utility that walks through a VCF file and creates a meaningful header field based on the file's contents (if this ever becomes a priority)
2012-08-16 13:03:13 -04:00
Mark DePristo 52bfe8db8a Make sure the storage writer is closed before running mergeInfo in multi-threaded output management
-- It's not clear this is cause of GSA-484 but it will help confirm that it's not the cause
2012-08-16 13:03:13 -04:00
Mark DePristo 7a247df922 Added -bcf argument to VCFWriter output to force BCF regardless of file extension
-- Now possible to do -o /dev/stdout -bcf -l DEBUG > tmp.bcf and create a valid BCF2 file
-- Clean up code to make extensions easier by moving to a setX model in VariantContextWriterStub
2012-08-16 13:03:13 -04:00
Mark DePristo 28c8e3e6d7 Cleanup BCF2Codec
-- Remove FORBID_SYMBOLIC global that is no longer necessary
-- all error handling goes via error() function
2012-08-16 13:03:13 -04:00
Mark DePristo 9dc694b2e9 Meaningful error message and keeping tmp file when mergeInfo fails
-- BCF2 is failing for some reason when merging tmp. files with parallel combine variants.  ThreadLocalOutputTracker no longer sets deleteOnExit on the tmp file, as this prevents debugging.  And it's unnecessary because each mergeInto was deleting files as appropriate
-- MergeInfo in VariantContextWriterStorage only deletes the intermediate output if an error occurs
2012-08-16 13:03:13 -04:00
Mark DePristo a9a1c499fd Update md5 in VariantRecalibrationWalkers test for BCF2 -- only encoding differences 2012-08-16 13:03:13 -04:00
Eric Banks f368e568db Implementing support in BaseRecalibrator for SOLiD no call strategies other than throwing an exception. For some reason we never transferred these capabilities into BQSRv2 earlier. 2012-08-15 22:52:56 -04:00
Eric Banks 9d09230c26 Better docs for verbose output of Pileup 2012-08-15 21:55:08 -04:00
Mark DePristo c0a31b2e5b CombineVariants parallel integration tests
-- All tests but one (using old bad VCF3 input) run unmodified with parallel code.
-- Disabled UNSAFE_VCF_PROCESSING for all but that test, which changes md5s because the output files have fixed headers
-- Minor optimizations to simpleMerge
2012-08-15 21:13:16 -04:00
Mark DePristo 669c43031a BCF2 optimizations; parallel CombineVariants
-- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header.  Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently)
-- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils.  Fixed bug in edge case that never occurred in practice
-- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping it's right)
-- More ways to access the data in VCFHeader
-- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output
-- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary.  We just guess that on average we need ~10 elements for the attribute map
-- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet
-- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement
2012-08-15 21:13:16 -04:00
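One reading of the filter optimization in this entry is "accumulate in a hash set, sort once at the end". The sketch below illustrates that pattern only; the class and method names are hypothetical and the real CombineVariants code may differ.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not the CombineVariants code.
final class FilterMergeSketch {
    /** Unions the filter sets cheaply, then sorts the result a single time. */
    static Set<String> mergeFilters(List<Set<String>> perRecordFilters) {
        Set<String> seen = new HashSet<>();
        for (Set<String> filters : perRecordFilters) {
            seen.addAll(filters); // constant-time inserts on average, no ordering cost
        }
        return new TreeSet<>(seen); // one sort at the end
    }
}
```

The design choice is that ordering is paid for once per merged record rather than on every insertion.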
Mark DePristo dafa7e3885 Temporarily disable StateMonitoringThreadTests while I get them reliably working across platforms 2012-08-15 21:13:16 -04:00
Mark DePristo d70fd18900 Minor increase in tolerance to sum of states in UnitTest for StateMonitoringThreadFactory 2012-08-15 21:13:15 -04:00
Mark DePristo ae4d4482ac Parallel combine variants!
-- CombineVariants is now TreeReducible!
-- Integration tests running in parallel all pass except one (will fix) due to incorrect use of db=0 flag on input from old VCF format
2012-08-15 21:13:15 -04:00
Mark DePristo bd7ed0d028 Enable efficient parallel output of BCF2
-- Previous IO stub was hardcoded to write VCF.  So when you ran -nt 2 -o my.bcf you actually created intermediate VCF files that were then encoded single threaded as BCF.  Now we emit natively per thread BCF, and use the fast mergeInfo code to read BCF -> write BCF.  Upcoming optimizations to avoid decoding genotype data unnecessarily will enable us to really quickly process BCF2 in parallel
-- VariantContextWriterStub forces BCF output for intermediate files
-- Nicer debug log message in BCF2Codec
-- Turn off debug logging of BCF2LazyGenotypesDecoder
-- BCF2FieldWriterManager now uses .debug not .info, so you won't see all of that field manager debugging info with BCF2 any longer
-- VariantContextWriterFactory.isBCFOutput now has version that accepts just a file path, not path + options
2012-08-15 21:13:15 -04:00
Mark DePristo 9459e6203a Clean, documented implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Expanded unit tests
-- Support for clean logging of results to logger
-- Refactored MyTime into AutoFormattingTime in Utils, out of TraversalEngine, for cleanliness and reuse
-- Added docs and contracts to StateMonitoringThreadFactory
2012-08-15 21:13:15 -04:00
Mark DePristo be3230a1fd Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Created makeCombinations utility function (very useful!).  Moved template from VariantContextTestProvider
-- UnitTests for basic functionality
2012-08-15 21:13:15 -04:00
Mark DePristo f277d7c09e Removing parallelism bottleneck in the GATK
-- GenomeLocParser cache was a major performance bottleneck in parallel GATK performance.  With 10 threads, > 50% of each thread's time was spent blocking on the MasterSequencingDictionary object.  Made this a thread local variable.
-- Now we can run the GATK with 48 threads efficiently on GSA4!
  -- Running -nt 1 => 75 minutes (didn't let it run all of the way through, so it likely would take longer)
  -- Running -nt 24 => 3.81 minutes
2012-08-15 21:13:15 -04:00
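The thread-local approach described in this entry can be sketched as follows. This is an illustration under assumptions: the class, the map-based cache, and the stand-in lookup are all hypothetical and do not reproduce GenomeLocParser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a per-thread cache, not the GATK code.
final class PerThreadCacheSketch {
    // Each thread lazily receives its own map, so lookups never block on a shared lock.
    private static final ThreadLocal<Map<String, Integer>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    /** Returns a cached index for the contig, computing it once per thread. */
    static int contigIndex(String contig) {
        return CACHE.get().computeIfAbsent(contig, PerThreadCacheSketch::expensiveLookup);
    }

    /** Number of entries cached by the calling thread. */
    static int cachedEntries() {
        return CACHE.get().size();
    }

    // Stand-in for a costly dictionary lookup (assumption for the sketch).
    private static int expensiveLookup(String contig) {
        return Math.abs(contig.hashCode() % 1000);
    }
}
```

The trade-off is memory for concurrency: every thread keeps a private copy of whatever it looks up, in exchange for never contending with other threads.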
Eric Banks 87e41c83c5 In AlleleCount stratification, check to make sure the AC (or MLEAC) is valid (i.e. not higher than number of chromosomes) and throw a User Error if it isn't. Added a test for bad AC. 2012-08-14 15:02:30 -04:00
Eric Banks 8e3774fb0e Fixing behavior of the --regenotype argument in SelectVariants to properly run in GenotypeGivenAlleles mode. Added integration tests to cover recent SV changes. 2012-08-14 14:21:42 -04:00
Eric Banks 34b62fa092 Two changes to SelectVariants: 1) don't add DP INFO annotation if DP wasn't used in the input VCF (it was adding DP=0 previously). 2) If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the VC. 2012-08-14 12:54:31 -04:00
Eric Banks cfb994abd2 Trivial removal of unused variable (mentioned in resolved JIRA entry) 2012-08-13 22:55:02 -04:00
Khalid Shakir f809f24afb Removed SelectHeader's --include_reference_name option since the reference is always included.
In SelectHeaders instead of including the path to the file, only include the name of the reference since dbGaP does not like paths in headers.
2012-08-13 16:49:27 -04:00
Mark DePristo 6ad75d2f5c Reverting changes to BCF2 ranges
-- The previously expanded ones are actually the missing values in the range.  The previous ranges were correct.  Removed the TODO to confirm them, as they are now officially confirmed
2012-08-13 15:06:28 -04:00
Mark DePristo 4d3fad38e9 Increase allowable range for BCF2 by -1 on low-end 2012-08-13 14:20:26 -04:00
Mark DePristo aab417c94d Fix missing argument in unittest 2012-08-12 13:58:14 -04:00
Mark DePristo f032e0aba4 A bit better output for ContextCovariate context size logging 2012-08-12 13:45:52 -04:00
Mark DePristo 243af0adb1 Expanded the BQSR reporting script
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libraries now, including gplots, grid, and gsalib, to run correctly
2012-08-12 13:45:14 -04:00
Mark DePristo 458bbdee8f Add useful logger.info telling us the mismatch and indel context sizes 2012-08-12 10:27:05 -04:00
Ami Levy Moonshine 6fefdaf428 "update integration tests in CombineVariantsIntegrationTest"
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-10 17:00:35 -04:00
Ami Levy Moonshine 4968daf0a5 update integration tests at CombineVariantsIntegrationTest 2012-08-10 16:58:05 -04:00
Eric Banks 40f0320a1c When adding a unit test to LIBS for X and = CIGAR operators, I uncovered a bug with the implementation of the ReadBackedPileup.depthOfCoverage() method. 2012-08-10 14:58:29 -04:00
Eric Banks eca9613356 Adding support of X and = CIGAR operators to the GATK 2012-08-10 14:54:07 -04:00
Ami Levy Moonshine 68fb04b8f7 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable into testing 2012-08-09 16:48:22 -04:00
Mark DePristo 06258c8a01 BCF2 optimizations
-- Added Write method to BCF2 types that directly converts int value to byte stream.  Deleted writeRawBytes(int)
-- encodeTypeDescriptor semi-inlined into encodeType so that the tests for overflow are done in just one place
-- Faster implementation of determineIntegerType for int[] values
2012-08-09 16:36:18 -04:00
Mark DePristo c6bd9b15ff BCF2 optimizations
-- BCF2Type enum has an overloaded method to read the type as an int from an input stream.  This gets rid of a case statement and replaces it with just minimal, tiny methods that should be better optimized.  A side effect of this optimization is overall cleaner code organization
2012-08-09 16:36:18 -04:00
Mark DePristo 9a0dda71d4 BCF2 optimizations
-- All low-level reads throw IOException instead of catching it directly.  This allows us to not try/catch in readByte, improving performance by 5% or so
-- Optimize encodeTypeDescriptor with final variables.  Avoid using Math.min instead do inline comparison
-- Inlined willOverflow directly in its single use
2012-08-09 16:36:18 -04:00
Ryan Poplin 9887bc4410 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 16:31:06 -04:00
Ryan Poplin f4c72a26d5 A few quick, minor findbugs fixes. 2012-08-09 16:30:58 -04:00
Ryan Poplin c7f22e410f A few quick, minor findbugs fixes. 2012-08-09 16:22:08 -04:00
Eric Banks def077c4e5 There's actually a subtle but important difference between foo++ and ++foo 2012-08-09 12:42:50 -04:00
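The difference this entry refers to is standard Java semantics and can be shown in a few lines. The class name is hypothetical and the snippet is unrelated to the code that was actually fixed.

```java
// Hypothetical demonstration class.
final class IncrementSketch {
    /** Returns {value of foo++, value of ++foo, final foo} starting from foo = 5. */
    static int[] postThenPre() {
        int foo = 5;
        int post = foo++; // expression yields the old value 5, then foo becomes 6
        int pre = ++foo;  // foo becomes 7 first, then the expression yields 7
        return new int[] {post, pre, foo};
    }
}
```

Here `postThenPre()` returns `{5, 7, 7}`: the post-increment hands back the value before the update, the pre-increment the value after it.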
Ryan Poplin e48727dae3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 10:31:10 -04:00
Guillermo del Angel 5be7e0621d Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 09:58:34 -04:00
Guillermo del Angel 71ee8d87b3 Rename per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarify wording in VCF header 2012-08-09 09:58:20 -04:00
Eric Banks 35cec8530c Make coverage threshold in FindCoveredIntervals a command-line argument 2012-08-08 21:44:24 -04:00
Ryan Poplin 1223d77546 Removing argument from HaplotypeCaller that was made unnecessary by recent improvements to triggering around large events 2012-08-08 15:13:20 -04:00
Eric Banks 0a2a646a52 Other random FindBugs fixes 2012-08-08 14:56:27 -04:00
Eric Banks 4c84cc9486 Quick pass of FindBugs 'should be static inner class' fixes. 2012-08-08 14:42:06 -04:00
Eric Banks a0196c9f5b Quick pass of FindBugs 'method invokes inefficient Number constructor' fixes. 2012-08-08 14:34:16 -04:00
Eric Banks 4b2e3cec0b Quick pass of FindBugs 'inefficient use of keySet iterator instead of entrySet iterator' fixes for core tools. 2012-08-08 14:29:41 -04:00
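The FindBugs pattern named in this entry can be sketched as below. The class and method names are hypothetical; both methods compute the same sum, but the first performs an extra map lookup per key.

```java
import java.util.Map;

// Hypothetical illustration of the keySet vs entrySet pattern.
final class EntrySetSketch {
    /** Flagged style: iterates keys, then looks each value up again. */
    static int sumViaKeySet(Map<String, Integer> counts) {
        int sum = 0;
        for (String key : counts.keySet()) {
            sum += counts.get(key); // second lookup for every key
        }
        return sum;
    }

    /** Preferred style: key and value arrive together in one entry. */
    static int sumViaEntrySet(Map<String, Integer> counts) {
        int sum = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            sum += entry.getValue();
        }
        return sum;
    }
}
```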
Guillermo del Angel 3e2752667c Intermediate checkin for ReducedReads with HaplotypeCaller - change min read count over k-mer to average count over k-mer when doing assembly of a reduced read (not optimal, currently trying max and then will decide on best approach), fix merge conflicts 2012-08-08 12:07:33 -04:00
David Roazen a7811d673f Update URL for phone home / GATK key documentation output by the GATK upon error 2012-08-08 09:29:54 -04:00
Mark DePristo cda8d944b7 Bugfixes for BCF with VQSR
-- Old version converted doubles directly from strings.  New version uses VariantContext getAttributeAsDouble() that looks at the values directly to determine how to convert from Object to Double (via Double.valueOf, (Double), or (Double)(Integer)).
-- getAttributeAsDouble() is now smart in converting integers to doubles as needed
-- Removed unnecessary logging info in BCF2Codec
-- Added integration tests to ensure that VQSR works end-to-end with BCF2 using sites version of the file khalid sent to me
-- Added vqsr.bcf_test.snps.unfiltered.bcf file for this integration test
2012-08-07 17:22:39 -04:00
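The type-driven conversion described in this entry might look roughly like the sketch below. It is a standalone helper with a hypothetical name and signature, not the actual `VariantContext.getAttributeAsDouble()` implementation.

```java
// Hypothetical helper; the real method lives on VariantContext and may differ.
final class AttributeAsDoubleSketch {
    /** Converts an attribute value to double based on its runtime type. */
    static double asDouble(Object value, double defaultValue) {
        if (value == null) {
            return defaultValue;                     // attribute missing
        }
        if (value instanceof Double) {
            return (Double) value;                   // already a double
        }
        if (value instanceof Integer) {
            return ((Integer) value).doubleValue();  // e.g. integers decoded from BCF2
        }
        if (value instanceof String) {
            return Double.parseDouble((String) value); // e.g. text parsed from VCF
        }
        throw new IllegalArgumentException("Cannot convert to double: " + value);
    }
}
```

Inspecting the runtime type avoids the round trip through strings that fails when a binary format has already decoded the value.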
Mark DePristo 80b94a4f9a AdaptiveContexts implement pruning to a given chi2 p value
-- Added Bonferroni-corrected p-value pruning, so you tell it how significant a difference you are willing to collapse in the tree, and it prunes the tree down to this maximum threshold
-- Penalty is now a phred-scaled p-value not the raw chi2 value
-- Split command line arguments in VisualizeContextTree into separate arguments for each type of pruning
2012-08-07 17:22:39 -04:00
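The two quantities mentioned in this entry can be sketched with their textbook definitions: a Bonferroni correction multiplies a p-value by the number of tests (capped at 1), and phred scaling is -10 * log10(p). How the commit combines them is not stated here, so the class and method names below are hypothetical.

```java
// Hypothetical helper using the standard definitions only.
final class PruningPenaltySketch {
    /** Bonferroni correction: scale the p-value by the number of tests, capped at 1. */
    static double bonferroni(double pValue, int numTests) {
        return Math.min(1.0, pValue * numTests);
    }

    /** Phred scaling: -10 * log10(p). A p-value of 0 maps to positive infinity. */
    static double phredScale(double pValue) {
        return -10.0 * Math.log10(pValue);
    }
}
```

Under these definitions a smaller p-value gives a larger phred-scaled penalty, which makes thresholds easier to read than raw chi2 values.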
Mark DePristo 982c735c76 VisualizeAdaptiveTree now considers only leaf nodes when computing max/min penalty 2012-08-07 17:22:39 -04:00
Ryan Poplin 15085bf03e The UnifiedGenotyper now makes use of base insertion and base deletion quality scores if they exist in the reads. 2012-08-07 13:58:22 -04:00
Guillermo del Angel 97c5ed4feb Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 20:22:31 -04:00
Guillermo del Angel 238d55cb61 Fixes for running HaplotypeCaller with reduced reads: a) minor refactoring, pulled out code to compute mean representative count to ReadUtils, b) Don't use min representative count over kmer when constructing de Bruijn graph - this creates many paths with multiplicity=1 and makes us lose a lot of SNP's at edge of capture targets. Use mean instead 2012-08-06 20:22:12 -04:00
Mark DePristo 00858f16a6 Deleting empty unit test for AdaptiveContexts 2012-08-06 12:58:13 -04:00
Ryan Poplin f1c30c3a59 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 12:02:26 -04:00
Mark DePristo 44f160f29f indelGOP and indelGCP are now advanced, not hidden arguments 2012-08-06 11:42:55 -04:00
Mark DePristo 2f004665fb Fixing public -> private dep 2012-08-06 11:42:55 -04:00
Mark DePristo 7bf5ca51ee Major bugfix for adaptive contexts
-- Basically I was treating the context history in the wrong direction, effectively predicting the further bases in the context based on the closer one.  Totally backward.  Updated the code to build the tree in the right direction.
-- Added a few more useful outputs for analysis (minPenalty and maxPenalty)
-- Misc. cleanup of the code
-- Overall I'm not 100% certain this is even the right way to think about the problem.  Clearly this is producing a reasonable output but the sum of chi2 values over the entire tree is just enormous.  Perhaps a MCMC convergence / sampling criterion would be a better way to think about this problem?
2012-08-06 11:42:55 -04:00
Mark DePristo b4841548f1 Bug fixes and misc. improvements to running the adaptive context tools
-- Better output file name defaults
-- Fixed nasty bug where I included non-existent quals in the contexts to process because they showed up in the Cycle covariate
-- Data is processed in qual order now, so it's easier to see progress
-- Logger messages explaining where we are in the process
-- When in UPDATE mode we still write out the information for an equivalent prune by depth for post analysis
2012-08-06 11:42:55 -04:00
Ryan Poplin b8709d8c67 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 11:41:28 -04:00
Eric Banks 210db5ec27 Update -maxAlleles argument to -maxAltAlleles to make it more accurate. The hidden GSA production -capMaxAllelesForIndels argument also gets updated. 2012-08-06 11:31:18 -04:00
Eric Banks 8f95a03bb6 Prevent NumberFormatExceptions when parsing the VCF POS field 2012-08-06 11:19:54 -04:00
Ryan Poplin b7eec2fd0e Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly. 2012-08-05 12:29:10 -04:00
Mark DePristo e1bba91836 Ready for full-scale evaluation adaptive BQSR contexts
-- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees
-- Docs, algorithm descriptions, etc so that it makes sense what's going on
-- VisualizeContextTree should really be simplified into a single tool that just visualizes the trees when / if we decide to make adaptive contexts a standard part of BQSR
-- Misc. cleaning, organization of the code (recalibration tests were in private but corresponding actual files were public)
2012-08-03 16:02:53 -04:00
Guillermo del Angel 6f8e7692d4 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-03 12:24:37 -04:00
Guillermo del Angel 9e25b209e0 First pass of implementation of Reduced Reads with HaplotypeCaller. Main changes: a) Active region: scale PL's by representative count to determine whether region is active. b) Scale per-read, per-haplotype likelihoods by read representative counts. A read representative count is (temporarily) defined as the average representative count over all bases in read, TBD whether this is good enough to avoid biases in GL's. c) DeBruijn assembler inserts kmers N times in graph, where N is min representative count of read over kmer span - TBD again whether this is the best approach. d) Bug fixes in FragmentUtils: logic to merge fragments was wrong in cases where there is discrepancy of overlaps between unclipped/soft clipped bases. Didn't affect things before, but RR makes hard-clipped bases in CIGARs more prevalent, so this was exposed. e) Cache read representative counts along with read likelihoods associated with a Haplotype. Code can/should be cleaned up and unified with PairHMMIndelErrorModelCode, as well as refactored to support arbitrary ploidy in HaplotypeCaller 2012-08-03 12:24:23 -04:00
Ryan Poplin 8817fc70d1 Merged bug fix from Stable into Unstable 2012-08-03 10:45:01 -04:00
Ryan Poplin f40d0a0a28 Updating VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller. Integration tests change because of the MNPs in dbSNP. 2012-08-03 10:44:36 -04:00
Joel Thibault 51bd03cc36 Add RemoveProgramRecords annotation to ActiveRegionWalker 2012-08-03 09:54:16 -04:00
Joel Thibault addbfd6437 Add a RemoveProgramRecords annotation
* Add the RemoveProgramRecords annotation to LocusWalker
2012-08-03 09:54:16 -04:00
Joel Thibault 524d7ea306 Choose whether to keep program records based on Walker
* Add keepProgramRecords argument
* Make removeProgramRecords / keepProgramRecords override default
2012-08-03 09:54:16 -04:00
Mark DePristo e04989f76d Bugfix for new PASS position in dictionary in BCF2 2012-08-03 09:42:21 -04:00
Mark DePristo fb5dabce18 Update BCF2 to include a minor version number so we can rev (and report errors) with BCF2
-- We are now likely to fail with an error when reading old BCF files, rather than just giving bad results
-- Added new class BCFVersion that consolidates all of the version management of BCF
2012-08-02 17:30:30 -04:00
Eric Banks e3f89fb054 Missing/malformed GATK report files are user errors 2012-08-02 11:33:21 -04:00
Mark DePristo c3c3d18611 Update BCF2 to put PASS as offset 0 not at the end
-- Unfortunately this commit breaks backward compatibility with all existing BCF2 files...
2012-08-01 17:09:22 -04:00
Mark DePristo ccac77d888 Bugfix for incorrect allele counting in IndelSummary
-- Previous version would count all alt alleles as present in a sample, even if only 1 were present, because of the way VariantEval subsetted VCs
-- Updated code for subsetting VCs by sample to be clearer about how it handles rederiving alleles
-- Update a few pieces of code to get previous correct behavior
-- Updated a few MD5s as now ref calls at sites in dbSNP are counted as having a comp sites, and therefore show up in known sites when Novelty strat is on (which I think is correct)
-- Walkers that used old subsetting function with true are now using clearer version that does rederive alleles by default
2012-08-01 15:45:12 -04:00
Joel Thibault 2b25df3d53 Add removeProgramRecords argument
* Add unit test for the removeProgramRecords
2012-08-01 15:33:05 -04:00
Ryan Poplin d53105668b Merged bug fix from Stable into Unstable 2012-08-01 14:53:06 -04:00
Ryan Poplin fabca66d09 Another fix to VQSR docs 2012-08-01 14:52:49 -04:00
Ryan Poplin 2be29ebd22 Merged bug fix from Stable into Unstable 2012-08-01 14:35:30 -04:00
Ryan Poplin 4093909a56 Updating VQSR docs. Removing references to old best practices pages. 2012-08-01 14:30:24 -04:00
Eric Banks 52b93cab62 Merged bug fix from Stable into Unstable 2012-08-01 13:17:36 -04:00
Eric Banks 22bf052828 Fixing BQSR GATK docs 2012-08-01 13:17:16 -04:00
Eric Banks 459832ee16 Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions as reported a while back on GS 2012-08-01 10:45:04 -04:00
Eric Banks a4a41458ef Update docs of FastaAlternateReferenceMaker as promised in older GS thread 2012-08-01 10:33:41 -04:00
Eric Banks 38e5419b11 Merged bug fix from Stable into Unstable 2012-08-01 09:50:31 -04:00
Eric Banks 56f8afab97 Requested by Geraldine: adding a utility to register deprecated walkers (and the major version of the first release since they were removed) so that the User Error printed out for e.g. CountCovariates now states: Walker CountCovariates is no longer available in the GATK; it has been deprecated since version 2.0. 2012-08-01 09:50:00 -04:00
Guillermo del Angel 0528337467 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-31 18:17:50 -04:00
Guillermo del Angel 4a23f3cd11 Simple cleanup of pool caller code - since usage is much more general than just calling pools, AF calculation models and GL calculation models are renamed from Pool -> GeneralPloidy. Also, don't have users specify special arguments for -glm and -pnrm. Instead, when running UG with sample ploidy != 2, the correct general ploidy modules are automatically detected and loaded. -glm now reverts to old [SNP|INDEL|BOTH] usage 2012-07-31 16:34:20 -04:00
Eric Banks 6cb10cef96 Fixed older GS reported bug. Actually, the problem really lies in Picard (can't set max records in RAM without it throwing an exception, reported on their JIRA) so I just masked out the problem by removing this never-used argument from this rarely-used tool. 2012-07-31 16:00:36 -04:00
Eric Banks ab53d73459 Quick fix to user error catching 2012-07-31 15:50:32 -04:00
Eric Banks 10111450aa Fixed AlignmentUtils bug for handling Ns in the CIGAR string. Added a UG integration test that calls a BAM with such reads (provided by a user on GetSatisfaction). 2012-07-31 15:37:22 -04:00
Mark DePristo f7133ffc31 Cleanup syntax errors from BQSR reorganization 2012-07-31 08:11:05 -04:00
Mark DePristo dad9bb1192 Changes order of writing BaseRecalibrator results so that if R blows up you still get a meaningful tree 2012-07-31 08:11:04 -04:00
Mark DePristo 0c4e729e13 Working version of adaptive context calculations
-- Uses chi2 test for independence to determine if subcontext is worth representing.  Gives excellent visual results
-- Writes out analysis output file producing excellent results in R
-- Trivial reformatting of MathUtils
2012-07-31 08:11:04 -04:00
Mark DePristo 93640b382e Preliminary version of adaptive context covariate algorithm
-- Works according to visual inspection of output tree
2012-07-31 08:11:04 -04:00
Mark DePristo 315d25409f Improvement to RecalDatum and VisualizeContextTree
-- Reorganize functions in RecalDatum so that error rate can be computed independently.  Added unit tests.  Removed equals() method, which is buggy without its associated implementation of hashCode()
-- New class RecalDatumTree based on QualIntervals that inherits from RecalDatum but includes the concept of sub data
-- VisualizeContextTree now uses RecalDatumTree and can trivially compute the penalty function for merging nodes, which it displays in the graph
2012-07-31 08:11:04 -04:00
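The hazard behind removing `equals()` in this entry is the Java contract that equal objects must have equal hash codes. The sketch below shows a consistent pair on a hypothetical two-field class; it is not RecalDatum, which resolved the issue by dropping `equals()` instead.

```java
import java.util.Objects;

// Hypothetical value class; field names are assumptions for illustration.
final class DatumSketch {
    private final long observations;
    private final long errors;

    DatumSketch(long observations, long errors) {
        this.observations = observations;
        this.errors = errors;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof DatumSketch)) return false;
        DatumSketch that = (DatumSketch) other;
        return observations == that.observations && errors == that.errors;
    }

    // Must be derived from the same fields as equals(); otherwise equal
    // objects can land in different hash buckets and HashSet/HashMap misbehave.
    @Override
    public int hashCode() {
        return Objects.hash(observations, errors);
    }
}
```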
Mark DePristo 57b45bfb1e Extensive unit tests, contracts, and documentation for RecalDatum 2012-07-31 08:11:03 -04:00
Mark DePristo e00ed8bc5e Cleanup BQSR classes
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration.  It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers.  As code becomes embedded throughout GATK it should be refactored to live in utils
-- Removed unnecessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces.  This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
Mark DePristo 191294eedc Initial cleanup of RecalDatum for move and further refactoring
-- Moved Datum, the now unnecessary superclass, into RecalDatum
-- Fixed some obviously dangerous synchronization errors in RecalDatum, though these may not have caused problems because they may not have been called in parallel mode
2012-07-31 08:11:03 -04:00
Mark DePristo 0670316288 Be clearer that dcov 50 is good for 4x, should use 200 for >30x 2012-07-31 08:11:02 -04:00
Mark DePristo 874dbf5b58 Maximum wait for GATK run report upload reduced to 10 seconds 2012-07-31 08:11:02 -04:00
Guillermo del Angel e6b326c189 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 21:32:19 -04:00
Guillermo del Angel 6c9d3ec155 Remerge after changes to allele construction code. More cleanups/fixes to artificial read pileup provider 2012-07-30 21:32:03 -04:00
Ryan Poplin 7ed06ee7b9 Updating FindCoveredIntervals to use the changes to the ActiveRegionWalker. 2012-07-30 12:16:27 -04:00
Ryan Poplin 13591b169f Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 12:13:24 -04:00
Eric Banks 0b30588d67 Catch yet another class of User Errors 2012-07-30 11:59:56 -04:00
Eric Banks 5743694196 Merged bug fix from Stable into Unstable 2012-07-30 11:35:28 -04:00
Eric Banks 79195b97a3 Adding categories for the remaining uncategorized walkers 2012-07-30 11:35:08 -04:00
Guillermo del Angel 5b9a1af7fe Intermediate fix for pool GL unit test: a) Fix up artificial read pileup provider to give consistent data. b) Increase downsampling in pool integration tests with reference sample, and shorten MT tests so they don't last too long 2012-07-30 09:56:10 -04:00
Eric Banks 7630c929a7 Re-enabling the unit tests for reverse allele clipping 2012-07-29 22:24:56 -04:00
Eric Banks b07bf1950b Adding an integration test for another feature that I snuck in during a previous commit: we now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them (this had been turned off because the previous version used Strings to do the uppercasing whereas we stick with byte operations now). 2012-07-29 22:19:49 -04:00
Eric Banks c4ae9c6cfb With the new Allele representation we can finally handle complex events (because they aren't so complex anymore). One place this manifests itself is with the strict VCF validation (ValidateVariants used to skip these events but doesn't anymore) so I've added a new test with complex events to the VV integration test. 2012-07-29 19:22:02 -04:00
Eric Banks 99b15b2b3a Final checkpoint: all tests pass. Note that there were bugs in the PoolGenotypeLikelihoodsUnitTest that needed fixing and eventually led to my needing to disable one of the tests (with a note for Guillermo to look into it). Also note that while I have moved over the GATK to use the new non-null representation of Alleles, I didn't remove all of the now-superfluous code throughout to do padding checking on merges; we'll need to do this on a subsequent push. 2012-07-29 01:07:59 -04:00
Eric Banks 2b1b00ade5 All integration tests and VC/Allele unit tests are passing 2012-07-27 17:03:49 -04:00
Eric Banks beb7610195 Resolving merge conflicts 2012-07-27 15:52:02 -04:00
Eric Banks 27e7e11ec0 Allele refactoring checkpoint #3: all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this. 2012-07-27 15:48:40 -04:00
Ryan Poplin 22bb4804f0 HaplotypeCaller now uses an excessive number of high quality soft clips as a triggering signal in order to capture both end points of a large deletion in a single active region. 2012-07-27 12:44:02 -04:00
Ryan Poplin a0890126a8 ActiveRegionWalker's isActive function returns a results object now instead of just a double. 2012-07-27 11:01:39 -04:00
Eric Banks ef335b6213 Several more walkers have been brought up to use the new Allele representation. 2012-07-27 02:14:25 -04:00
Eric Banks 9e2209694a Re-enable reverse trimming of alleles in UG engine when sub-selecting alleles after genotyping. UG integration tests now pass. 2012-07-27 00:47:15 -04:00
Eric Banks baf3e33730 Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass. 2012-07-26 23:27:11 -04:00
Ryan Poplin 35e803e110 Merged bug fix from Stable into Unstable 2012-07-26 14:00:04 -04:00
Ryan Poplin 4f741b4cd7 Smoothing in the BQSR bins should be one error observation and one non-error observation. 2012-07-26 13:59:02 -04:00
Guillermo del Angel 2ae890155c Improvements to indel calling in pool caller: a) Compute per-read likelihoods in reference sample to determine whether a read is informative or not. b) Fixed bugs in unit tests. c) Fixed padding-related bugs when computing matches/mismatches in ErrorModel. d) Added a couple more integration tests to increase test coverage, including testing odd ploidy 2012-07-26 13:43:00 -04:00
Eric Banks a694d1b5de Merge branch 'master' into allelePadding 2012-07-26 01:53:14 -04:00
Eric Banks 32516a2f60 Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels will fail at this point. 2012-07-26 01:50:39 -04:00
Mark DePristo 8c418a15da Sorting out HMS error handling (fingers crossed)
-- Check if a traversal error occurred in the last shard
-- Catch ExecutionException from the TreeReducer and throw as our HMS exception
-- ShardTraverser just throws the exception as formatted by the HMS, rather than wrapping it as a RuntimeException itself
-- EngineFeaturesIntegrationTests now uses public exampleFASTA (faster), and does 1000x iterations (slower)
2012-07-25 23:13:12 -04:00
Mark DePristo 9242f63a4d On the way to really sorting out HMS error handling
-- Better error message when a traversal error occurs (a real bug)
-- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50x times
-- A bit of cleanup in WalkerTest
2012-07-25 22:11:10 -04:00
Mark DePristo 5671992db3 RMDTrackBuilderUnitTest now uses private/testdata file to avoid filesystem race conditions 2012-07-25 22:05:04 -04:00
Eric Banks 7eb3f54750 Added category docs for the remaining public walkers (I think I got them all). I removed a couple of totally unnecessary walkers. 2012-07-25 21:40:28 -04:00
Eric Banks 2982b24c4b Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-07-25 20:36:53 -04:00
Eric Banks 0a98a6aa8d Adding extraDocs tag per Mauricio's request 2012-07-25 18:23:18 -04:00
Mauricio Carneiro fce5cb9f35 Few category changes 2012-07-25 17:23:02 -04:00
Eric Banks 05fa377a8e Adding GATK categories to standard walkers. Will add to remaining walkers after the next successful release (so that I can see which walkers are public and still need it). 2012-07-25 16:05:47 -04:00
Mauricio Carneiro d46cf47bd1 Updating Read Filter documentation 2012-07-25 15:05:47 -04:00
Eric Banks 6a3bfa3811 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-07-25 14:11:11 -04:00
Eric Banks 357e0b35af Register GATK-full-only walkers and rethrow the missing walker error as a not supported in GATK lite error 2012-07-25 14:11:03 -04:00
Roger Zurawicki 5b74763096 Removed Categories.
We will use DocumentedGATKFeatures to create categories in our documentation. Eric I guess will be in charge of this. We need to remove walkers and think how to categorize everything.

Tools can be hidden from GATKdocs with the @Hidden annotation

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-07-25 13:46:24 -04:00
Eric Banks a5721a8846 Context covariate optimizations were not suited for multiple threads, so I removed them (since that ended up being much, much easier than trying to make the covariates thread local). Added -nt 2 layer to BQSR integration tests to confirm that it now works with multiple threads. 2012-07-25 13:38:07 -04:00
Eric Banks e0c07f5567 Reverting old commits that made error handling better because ultimately they made things worse. 2012-07-25 12:37:59 -04:00
Mark DePristo 16947e93f2 Integration test to ensure VariantFiltration makes . -> PASS/FAIL like VQSR
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:39 -04:00
Mark DePristo fcefa61bce Remove reference dependence in BCF2Codec
-- Adding BCF2Codec to VCF.jar and associated unit tests

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Mark DePristo 19a257a5c1 Multiple bugfixes
-- VariantFiltration now properly sets passFilters in VC
-- BCF2 writer now properly decodes lazy BCF genotype data that it uses.  Improper use generated a horrible subtle bug but the good news is that the extra checks I put in (unnecessarily a few days ago) caught the bug!

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Mark DePristo 3066894215 Bugfix for BCF2
-- Always decode genotypes block when writing out a BCF file.  If the header changes (and we currently don't know this easily) then the dictionary keys used in the genotypes block may be invalid.  Temporarily added a private static boolean that turns off writing of the blocks until Eric and his team rewrite the header.

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Guillermo del Angel eb55061fd0 a) Document BEAGLE codec, b) Bug fix: inbreeding coefficient shouldn't be computed for non-diploid organisms in current implementation 2012-07-24 12:16:15 -04:00
Mauricio Carneiro 348e86159e Moving doclets to public 2012-07-23 23:52:14 -04:00
Mauricio Carneiro 5cd98a36b9 Making ForumAPIUtils public 2012-07-23 17:44:24 -04:00
Mauricio Carneiro 3d92f041f3 forgot to delete the merging line 2012-07-23 17:35:07 -04:00
Roger Zurawicki f3c504769b Added the ability to update the Forum
GATKDocs looks for a key on gsa4, and updates the forum with new walker if it exists.
More changes were made to the GATKDocs. Works nicely with bootstrap on and offline.
Cleaned up the code as well

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-07-23 17:17:33 -04:00
Khalid Shakir 46ca49b63d Removed 'Walker' suffix from packages/GATKEngine.xml that were breaking the packaged release.
Archived AnalyzeCovariates scripts and removed references in build packages / GATK extensions.
2012-07-23 16:32:31 -04:00
Ryan Poplin 2a14bbe4f0 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-23 11:28:26 -04:00
Ryan Poplin 10d143c35c Adding error model header names in the BQSR recal plot. Making the downsampling of points look a little nicer. 2012-07-23 11:28:17 -04:00
Eric Banks 675ccab2fa Renaming BQSR to BaseRecalibrator 2012-07-23 10:17:17 -04:00
Ryan Poplin 2e486d83e2 Updating HaplotypeCaller docs and expanding integration tests. 2012-07-23 10:05:42 -04:00
Guillermo del Angel 39f45127f3 Fix md5's broken by recent changes to FisherStrand calculation 2012-07-21 14:41:38 -04:00
Mauricio Carneiro 65f4b67b86 Fixing walker unit test with the new naming convention 2012-07-20 17:50:29 -04:00
Mauricio Carneiro 921eaad33f Generalized the default platform parameter in BQSRv2
Parameter wasn't working outside of the BQSR walker. It now takes the information on the recalibration report in other tools (PrintReads for example) and treats all reads as coming from the defined default platform.
2012-07-20 17:29:13 -04:00
Mauricio Carneiro 5dc2143142 Removed support for walkers ending with "Walker" from the engine.
If your walker has "Walker" in the name, you will have to use "Walker" on the -T to access it.
2012-07-20 17:27:11 -04:00
Mauricio Carneiro d446d34227 GATK Error messages now point to the new website instead of GetSatisfaction. 2012-07-20 17:27:11 -04:00
Mauricio Carneiro 116885a450 Removed the "Walker" suffix from all walkers that had it.
* Did not touch archived walkers... those can be named whatever.
* Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...)
* Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.
2012-07-20 17:27:11 -04:00
Christopher Hartl 3ee46cced2 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-19 21:25:40 -04:00
Christopher Hartl af383c30b5 Ensure that the gene summary has a header line 2012-07-19 21:24:04 -04:00
Mark DePristo 2ca5fc62a2 Support for MISSING BCF2 type
-- Heng wants to use 0x0? to represent any missing type value, which in our implementation was invalid.  Updated our codebase to support this construct.  Heng said he'll update the BCF2 quick reference.
-- Enabled integration test reading Heng's ex2.bcf file
-- GATK now only warns in the case where the END info field isn't the same (or +1 due to padding) as the getEnd() function as determined by the GATK.  Turns out there's a single record in the 1000G SV call set that doesn't have the right length
-- VariantContextTestProvider now tests that X = Y where X -> writing -> reading -> writing -> reading = Y for a variety of variant context inputs X
-- Added integration test reading 1000G SV chr1 calls (from Chris)
2012-07-19 16:14:26 -04:00
Guillermo del Angel c16f9f2f15 a) Use new method to check for GATK Lite, b) minor improvements to indel pool caller (more to come): brain-dead, quick way to limit number of alt alleles to genotype. We can't process too many alt alleles because of the combinatorial explosion of GL values with high ploidy, and some STR validation targets had up to 12 alt alleles, resulting in GL vectors of > 1e8 elements. Can't use pileup elements since typically not many alleles will be in one pileup, and different alleles will appear in different samples, TBD a nicer solution. c) Commit to posterity scala script for large scale validation calling, still work in progress 2012-07-19 10:24:08 -04:00
Eric Banks 5f5edeca63 Reverting move of BQSR tests to public, as per DR's email 2012-07-19 10:02:05 -04:00
Eric Banks e370030e6c As requested by Mark, I've broken out the code to pull out the protected subclass when available (and otherwise use the public version) into the GATKLiteUtils class. People should use this code instead of reimplementing all of the java reflection on their own. 2012-07-18 22:44:37 -04:00
Eric Banks d46ccec04e Adding Unit Tests to cover the exception catching for Picard errors: because we are using String matching, we want to ensure that we know if/when the exception text changes underneath us. 2012-07-18 21:48:58 -04:00
Eric Banks 9c1ab1b0c0 Move BQSR integration test and its dependent files into public; previously there was a protected->private dependency. 2012-07-18 21:11:33 -04:00
Mark DePristo 994c5c31c1 Enabling VariantEval integration tests for ValidationReport 2012-07-18 16:07:47 -04:00
Mark DePristo 74e153ff4a FisherStrand now uses RankSumTest isUsableBase to decide if a read should be included in testing
-- Previously used hardcoded MAPQ > 20 && QUAL > 20 but now uses isUsableBase
-- Updating MD5s as appropriate
2012-07-18 16:07:47 -04:00
Mark DePristo dede3a30e9 Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
-- TODO: actually run integration tests when I have an internet connection
2012-07-18 16:07:47 -04:00
Mark DePristo 559a4826be Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
2012-07-18 16:07:46 -04:00
Mark DePristo dc292c0317 FisherStrand now includes all reads and bases, regardless of mapping quality and base quality, just like other annotations
-- This actually proved to be a problem with Ion Torrent data where the base quality can be quite low, and so we need to include Q15 bases for calling effectively.
2012-07-18 16:07:46 -04:00
Eric Banks 2c0f073ab1 Make -qq arg hidden for now since it's still very experimental 2012-07-18 15:43:25 -04:00
Eric Banks b46c85e8b4 More bad BAM file catching 2012-07-18 15:26:31 -04:00
Eric Banks 659eee13a6 Handle NPE generated in UG when non-standard reference bases are present in the fasta 2012-07-18 15:16:27 -04:00
Eric Banks 9af2cfe283 Catch underlying file system problems that get masked as Tribble index errors. There's also a quick patch to the HMS that isn't really the ultimate fix needed; Mark and I will review at a later point. 2012-07-18 15:11:38 -04:00
Eric Banks 4c730542f0 Handle RuntimeExceptions thrown by Picard that are really User Errors. I will add unit tests for these as best I can later. 2012-07-18 13:56:35 -04:00
Eric Banks ae08d35138 Catch 'too many open files' errors that show up when trying to read the bam index. All that needs to be done is to flesh out the original error message (because it will get caught later and rethrown correctly). 2012-07-18 12:57:34 -04:00
Eric Banks f2fe59a9d4 Wow, there are a ton of errors captured having to do with being unable to merge the temp Tribble output. I'm expanding the error message a bit to help see if we can do anything going forward. 2012-07-18 12:31:59 -04:00
Eric Banks e4db8dde91 Enabled a whole other bunch of integration tests for BQSRv2. While I was there I also changed the default context size for indels to 3 (from 8) since that's what works best in the current implementation (as suggested by Ryan). At this point, all of the new core tools (ReduceReads, BQSRv2, HaplotypeCaller, UG extensions) have been moved over to protected and should be stable. Looks like we are pretty much ready for GATK 2.0! 2012-07-17 23:36:43 -04:00
Eric Banks a8d08ea18d As a user pointed out, it is not valid for a GenomeLoc to have a start or stop equal to 0. 2012-07-17 22:18:43 -04:00
Guillermo del Angel 29273abab7 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 16:58:12 -04:00
Guillermo del Angel 731bbba2e6 Bug fixes for integration test, use correct new UG syntax 2012-07-17 16:57:59 -04:00
Eric Banks 33be41ecf5 Cleaning up integration test 2012-07-17 16:06:04 -04:00
Eric Banks 8dbc9cb29c Add the ability to emit the original quals in the OQ tag 2012-07-17 15:52:56 -04:00
Guillermo del Angel 40b8c7172c Pool Caller refactoring in preparation of GATK 2.0: a) PoolCallerUnifiedArgumentCollection disappeared, and arguments moved to UnifiedArgumentCollection. b) PoolCallerWalker is no longer needed and redundant, all functionality subsumed by UG. UG now checks if GATK is lite - if so, don't allow ploidy > 2. c) Moved pool classes from private to protected. d) Changed the way to specify ploidy. Instead of specifying samples per pool and having ploidy = 2*samplesPerPool, have user specify ploidy directly, which is cleaner. Update tests accordingly. We can now call triploid seedless grape genotypes correctly in theory. e) Renamed argument -reference to -reference_sample_calls since the former is ambiguous and it's not clear what it refers to. 2012-07-17 15:27:04 -04:00
Laurent Francioli 68d0e4dd6d - Multi-allelic sites are now correctly ignored - Reporting of mendelian violations enhanced - Corrected TP overflow by capping it to Byte.MAX_VALUE
- Updated integrationtests to reflect changes in MVF file output

Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-07-17 15:21:10 -04:00
Eric Banks b0d99fd10d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 15:12:28 -04:00
Eric Banks 305db8c0d1 Total rewrite of the isGATKLite() functionality with help of Khalid/David. PluginManager was not working for us. 2012-07-17 15:11:03 -04:00
Ryan Poplin 6efbcd99f1 HaplotypeCaller is now an AnnotatorCompatibleWalker with all the rights and privileges pertaining thereto. Enabling the ClippingRankSumTest after showing it was useful for 1000 Genomes calling. 2012-07-17 14:38:36 -04:00
Eric Banks 110886e8b9 Oops, got the logic wrong. 2012-07-17 13:37:11 -04:00
Eric Banks a963b37424 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 13:15:37 -04:00
Eric Banks 3a64398d07 Cleaned up the isGATKLite check 2012-07-17 12:46:16 -04:00
Eric Banks 62c5228048 1) Revert previous change - indel recalibration is turned on by default and users of the Lite version will need to turn it off to avoid a User Error. 2) Implemented the engine.isGATKLite() method. 2012-07-17 12:23:40 -04:00
Chris Saunders 1913d1bbd0 Put RunReport S3 upload on timeout thread
Move the RunReport S3 upload process onto a separate thread with a timeout allowing the parent to continue.

Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
2012-07-17 12:19:39 -04:00
Eric Banks 40618ac471 A bunch of BQSR changes: 1) by default we do not emit indel quals, but they can be turned on with --enable_indel_quals. 2) We check whether or not we are running in Lite mode (not done yet) and if so and the user is trying to recalibrate indels, we throw a User Error (not supported). 3) Like v1 we now allow the user to set the qual value below which we don't recalibrate (this was the remaining source of differences in the v1 vs. v2 plots). 2012-07-17 10:52:43 -04:00
Eric Banks d5b3a2eabf Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 00:32:53 -04:00
Eric Banks f657b8bda8 Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow. 2012-07-17 00:32:34 -04:00
Eric Banks a003148d50 Move AnalyzeCovariates over too. 2012-07-16 16:11:56 -04:00
Eric Banks 0a89adbcdb Add utility decorators so that classes can tell you which package source they come from if they want to (suggested by Khalid). Using those decorators, we can easily pull out the BQSR updateDataForPileupElement() method into a standard RecalibrationEngine and an AdvancedRecalibrationEngine and use the protected one (AdvancedRE) if available (otherwise, the public one). 2012-07-16 15:34:50 -04:00
Eric Banks 52baac1e16 Move BQSRv2 into public and v1 into the archive. 2012-07-16 14:23:38 -04:00
Khalid Shakir 07822d6c0f Fixed input annotations for master/test files on DiffObjectsWalker. 2012-07-16 13:33:11 -04:00
Eric Banks 2a830939df Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-14 23:49:59 -04:00
Eric Banks f29cadd7e2 By default, don't quantize quals in BQSRv2 2012-07-14 23:49:48 -04:00
Eric Banks 75543a3f22 ReadClipper.clipRead's claim that it doesn't modify the original read was false. Ultimately, GATKSAMRecord.clone (as documented) creates a soft copy of the read - so modifying e.g. the bases of the cloned read means that you modify the bases of the original read too. Because of this, when the BQSRv2 Context covariate was writing Ns over the low quality tails of the reads they got propagated out to the output BAM file (very bad). I've updated the ReadClipper docs and cleaned up the code (no reason to use a clone of the read anymore given that we are already modifying the original). For now, the simplest thing is to have the Context covariate store the original bases, overwrite the low quality tails with Ns, compute covariates, and restore the original bases; we can update later if needed. 2012-07-13 18:50:27 -04:00
Ryan Poplin 443f02ffc2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-13 16:09:24 -04:00
Khalid Shakir 6dfcc486e8 In ApplyRecalibration marking filter as PASS instead of '.' when the site passes by calling .passFilters(). 2012-07-13 15:40:56 -04:00
Ami Levy Moonshine 5d0a7335ea remove unnecessary use in the PRIORITY list
remove unneeded imports
2012-07-13 15:27:08 -04:00
Ryan Poplin d70bb59182 HaplotypeCaller now calls insertion events that aren't fully assembled as symbolic alleles. 2012-07-10 14:22:23 -06:00
Guillermo del Angel 279dff9f81 Bug fix when specifying a JEXL expression for a field that doesn't exist: we should treat the whole expression as false, but we were rethrowing the JEXL exception in this case. Added integration test to cover this in SelectVariants 2012-07-10 13:59:00 -04:00
Mauricio Carneiro 7eb45b4038 Fixed BQSR IntegrationTests
* BinaryTag covariate is Experimental, not Standard (this was breaking integration tests)
* New parameter in the Recalibration report requires new MD5 for one of the integration tests.
2012-07-09 13:55:12 -04:00
Eric Banks dd0c47ab7e Don't cast to a specific walker type since any walker can use the VA engine 2012-07-09 10:25:58 -04:00
Mark DePristo 5b0ade67c8 Updates to VCF processing for better BCF processing
-- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order].  Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy
-- Updating GATK code to use the appropriate header function (this is why so many files have changed)
-- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this)
-- Bugfix for adding contig lines to BCF2 header dictionary
-- VCFHeader metaData no longer sorted internally.  The system now maintains the data in header order, and only sorts output as requested in API
-- VCFWriter and BCF2Writer now explicitly sort their header lines
-- Don't allow filters to be added that are PASS in the contract
2012-07-08 15:44:33 -07:00
Mark DePristo 63f5262e45 mergeInfoWithMaxAC is no longer hidden in CombineVariants 2012-07-08 15:44:32 -07:00
Mark DePristo 66aee613e2 Bugfix for set key in mergeInfoWithMaxAC.
-- Previous version was always setting set=source of info with highest AC.  Should actually have been set to the set annotation value itself.
2012-07-08 15:44:32 -07:00
Mark DePristo 91f0ed8059 Fixed nasty Rscript typo in VariantRecalibrator when compactPDF is available 2012-07-08 15:44:32 -07:00
Mark DePristo 87b090c362 Update VariantRecalibrator error message to use -resource not old -B syntax 2012-07-08 15:44:31 -07:00
Mauricio Carneiro 125e6c1a47 added BinaryTagCovariate for ancient dna analysis 2012-07-06 15:03:20 -04:00
Mauricio Carneiro e93b025b39 Fixing unit test
with the new clipping behavior for weird cigars, we no longer can assert the final number of bases in the unit test, so I'm taking this bit off the unit test.
2012-07-06 12:08:09 -04:00
Mauricio Carneiro f603d4c48c Fixing PairHMMIndelErrorModel boundary issue
When checking the limits of a read to clip, it wasn't considering reads that may already have been clipped.
2012-07-06 11:48:04 -04:00
Eric Banks dd571d9aa0 Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags. 2012-07-04 01:22:20 -04:00
Eric Banks 33306d2e20 Changing the logic of the -standard argument; the way it stands currently one can never turn off the cycle or context covariates. Now they are on by default and users must opt out of them to turn them off. 2012-07-04 00:21:21 -04:00
Eric Banks 7d30558e6f Only 'pad' the cycle covariate for indels, not substitutions 2012-07-03 23:47:01 -04:00
Mauricio Carneiro 17efbbf8b1 Fixed ReadClipperUnitTest
The behavior of the clipping on weird cigar strings such as 1I1S1H and 9S56H has changed, and the test has to change accordingly.
2012-07-03 16:38:51 -04:00
Eric Banks 22f1afddaa Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 14:55:59 -04:00
Eric Banks 617eebd204 More misc cleanup 2012-07-03 14:55:37 -04:00
Eric Banks 344c3aeb1d Cleanup from previous commit 2012-07-03 14:42:44 -04:00
Ryan Poplin 9e8e78de15 Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels. 2012-07-03 14:30:37 -04:00
Eric Banks 0b37d44b0d Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup. 2012-07-03 13:05:11 -04:00
Eric Banks 031322ff00 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 00:12:59 -04:00
Eric Banks a4670113bd Refactored/renamed the nested integer array; cleaned up code a bit. 2012-07-03 00:12:33 -04:00
Ryan Poplin f92139dd82 Ooops, UG VA path for rank sum tests isn't happy with empty lists. Disabling clipping rank sum test for now. 2012-07-02 21:12:42 -04:00
Ryan Poplin 7e7b4cd1b9 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-02 16:37:54 -04:00
Ryan Poplin b807ff63ef HaplotypeCaller now creates MNP and complex substitutions by using LD information to decide if events segregate together on haplotypes. Added unit test. 2012-07-02 16:37:39 -04:00
Mauricio Carneiro 3cea080aa8 Cache SoftStart() and SoftEnd() in the GATKSAMRecord
these are costly operations when done repeatedly on the same read.
2012-07-02 16:22:00 -04:00
Mauricio Carneiro 88a02fa2cb Fixing bug for reads with cigars like 9S54H
When hard-clipping, predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read, and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.
2012-07-02 16:22:00 -04:00
Mark DePristo 1b0a775773 Disabling bcf2 reading from samtools because it's 1-based; updating select variants integrationtest 2012-07-02 15:55:42 -04:00
Eric Banks cac72bce91 Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit. 2012-07-02 14:33:33 -04:00
Mark DePristo 602729c09d Moved parallel tests from SelectVariants to separate SelectVariantsParallelIntegrationTest
-- Enabled previous tests -- all now working
-- Added modern test against new VCF as well
2012-07-02 11:40:28 -04:00
Mark DePristo bcd2e13d8b Adding duplicate header line keys is a logger.debug not logger.warn message now 2012-07-02 11:39:34 -04:00
Mark DePristo 01e04992f8 Fixed incompatibilities in AbstractVCFCodec that resulted in key=; being parsed and written as key; in VCF output 2012-07-02 11:38:59 -04:00
Eric Banks c94c8a9c09 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-02 08:53:01 -04:00
Mark DePristo 7aff4446d4 Added unit tests for header repairing capabilities in the GATK engine 2012-07-01 15:38:10 -04:00
Mark DePristo 480b32e759 BCF2 is now officially zero-based open-interval, and that's how the GATK does it now 2012-07-01 14:59:27 -04:00
Ryan Poplin b6093ff02c Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-01 10:32:37 -04:00
Mark DePristo 9b87dcda4f Fixing remaining integration test errors. Adding missing ex2.bcf 2012-06-30 16:23:11 -04:00
Mark DePristo 5ad9a98a15 Minor bugfixes / consistency fixes to filter strings of Genotypes and AC/AF annotations
-- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order
-- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values
2012-06-30 11:22:49 -04:00
Mark DePristo 385a3c630f Added check in VariantContext.validate to ensure that getEnd() == END value when present
-- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward
-- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF
2012-06-30 11:22:48 -04:00
Mark DePristo 893630af53 Enabling symbolic alleles in BCF2
-- Bugfix for VCFDiffableReader: don't add null filters to object
-- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles
-- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication.  Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where there's a biallelic symbolic allele
-- Brand new home for allele clipping / padding routines in VCFAlleleClipper.  Actually documented this code, which results in lots of **** negative comments on the code quality.  Eric has promised that he and Ami are going to rethink this code from scratch.  Fixed many nasty bugs in here, cleaning up unnecessary branches, etc.  Added UnitTests in VCFAlleleClipper that actually test the code fully.  In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how these tests need to be fixed.  Even introduced some minor optimizations
-- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason.  Fixed.
-- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles
--
2012-06-30 11:22:48 -04:00
Mark DePristo 16276f81a1 BCF2 with support for symbolic alleles
-- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys
-- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest
2012-06-30 11:22:48 -04:00
Mark DePristo 86feea917e Updating MD5s to reflect new FT fixed count of 1 not UNBOUNDED 2012-06-30 11:22:47 -04:00
Mark DePristo 6bea28ae6f Genotype filters are now just Strings, not Set<String> 2012-06-30 11:22:47 -04:00
Guillermo del Angel f631be8d80 UnifiedGenotyperEngine.calculateGenotypes() is not only used in UG but in other walkers - vc attributes shouldn't be inherited by default or it may cause undefined behaviour in those walkers, so only inherit attributes from input vc in case of UG calling this function 2012-06-29 23:51:52 -04:00
Guillermo del Angel 65037b87da Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-29 11:08:44 -04:00
Guillermo del Angel 5a9a37ba01 Pool caller improvements: a) Log ref sample depth at every called site (will add more ref-related annotations later), b) Make -glm POOLBOTH work in case we want to genotype snp's and indels together, c) indel bug fix (pool and non-pool): prevent a bad GenomeLoc to be formed if we're running GGA and incoming alleles are larger than ref window size (typically 400 bp) 2012-06-29 11:08:16 -04:00
Eric Banks 96ea334bf2 Disable caching in BQSR for now since it significantly slows down computation; will look into this in a bit. 2012-06-28 15:27:44 -04:00
Ryan Poplin 05791ebf80 Adding the Clipping rank sum test: If alternate-supporting reads have more hard clipping than reference-supporting reads this is evidence for error. 2012-06-28 13:22:56 -04:00
Ryan Poplin d12ec92a55 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-28 12:57:59 -04:00
Ryan Poplin 5bb0693888 Bug fix for HC GGA mode. Shouldn't try to add an indel into the haplotype if that haplotype already contains the event of interest. Misc minor assembly param changes. Turning off capping of base qualities by base indel qualities until we can evaluate that change. 2012-06-28 12:57:51 -04:00
Khalid Shakir 1ce0b9d519 Throwing UnknownTribbleType exception instead of CommandLineException when an unknown tribble type is specified. 2012-06-28 11:28:04 -04:00
Mark DePristo 734bb5366b Special case the situation where we have ploidy == 0 (no GT values) to implicitly assume we have diploid samples
-- numLikelihoods no longer allows even ploidy == 0 in requires
-- VCFCompoundHeaderLine handles the case where ploidy == 0 => implicit ploidy == 2
2012-06-28 10:06:07 -04:00
Mark DePristo 064cc56335 Update integration tests to reflect new FT header line standard and new DiagnoseTargets field names 2012-06-28 10:06:06 -04:00
Mark DePristo 64d7e93209 Massive bugfixes
-- Previous version was reading the size of the encoded genotypes vector for each genotype.  This only worked because I never wrote out genotype field values with > 15 elements.  Mauricio's killer DiagnoseTargets VCF uncovered the bug.  Unfortunately since symbolic allele clipping is still busted those tests are still disabled
-- GenotypeContext getMaxPloidy was returning -1 in the case where there are no genotypes, but the answer should be 0.
2012-06-28 10:06:06 -04:00
Mark DePristo 7144154f53 VCFWriter and BCFWriter no longer allow missing samples in the VC compared to their header
-- They now throw an error, as it's really unsafe to write out ./. as a special case in the VCFWriter as occurred previously.
-- Added convenience method in VariantContextUtils.addMissingSamples(vc, allSamples) that returns a complete VC where samples are given ./. Genotype objects
-- This allows us to properly pass tests of creating / writing / reading VCFs and BCFs, which previously differed because the VC from the VCF would actually be different from its original VC
-- Updated UG, UGEngine, GenotypeAndValidateWalker, CombineVariants, and VariantsToVCF to manage the master list of samples they are writing out and addMissingSamples via the VCU function
2012-06-28 10:06:06 -04:00
Mark DePristo 4811a00891 GENOTYPE_FILTER_KEY is now a VCFStandardHeaderLine 2012-06-28 10:06:05 -04:00
Mark DePristo 93426a44b1 Fixes for DiagnoseTargets to be VCF/BCF2 spec compliant
-- Don't use DP for average interval depth but rather AVG_INTERVAL_DP, which is a float now, not an int
-- Don't add PASS filter value to genotypes, as this is actually considered failing filters in the GATK.  Genotype filters should be empty for PASSing sites
2012-06-28 10:06:05 -04:00
Eric Banks dc7636b923 Refactor the ContextCovariate to significantly reduce runtime 2012-06-28 02:29:35 -04:00
Eric Banks 1fafd9f6c8 NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation. 2012-06-27 16:55:49 -04:00
Khalid Shakir 746a5e95f3 Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding.
Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals.
IntervalUtils return unmodifiable lists so that utilities don't mutate the collections.
Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus.
Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values.
JobRunInfo handles the null done times when jobs crash with strange errors.
2012-06-27 01:15:22 -04:00
Mark DePristo 016b25be87 Update annoying md5s in unit tests, also failing because of header fixing 2012-06-26 17:32:42 -04:00
Mark DePristo cd32b6ae54 CombineVariantsUnitTest was failing because the header repair was fixing the problem it wanted to detect 2012-06-26 17:32:42 -04:00
Mark DePristo 1f45551a15 Bugfixes to G count types in VCF header
-- Previously VCF header lines of count type G assumed that the sample would be diploid.
-- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class
-- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class
-- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations
-- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples
-- Added extensive unit tests that compare A and G type values in genotypes
2012-06-26 15:28:34 -04:00
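As a rough illustration of the G count type generalization described in the commit above: for a sample with ploidy P at a site with N alleles (REF included), the number of unordered genotypes is the binomial coefficient C(N + P - 1, P). The class and method names below are hypothetical and are not the GATK's GenotypeLikelihoods code; this is only a minimal sketch of the arithmetic, without the static cache the commit mentions.

```java
// Hypothetical sketch: count unordered genotypes for a given allele count and ploidy.
// Not the GATK implementation; all names here are made up for illustration.
final class GenotypeCountSketch {
    /** Number of multisets of size ploidy drawn from numAlleles alleles: C(numAlleles + ploidy - 1, ploidy). */
    static long numGenotypes(final int numAlleles, final int ploidy) {
        if (numAlleles < 1 || ploidy < 1)
            throw new IllegalArgumentException("numAlleles and ploidy must be >= 1");
        long result = 1;
        // After step i, result == C(numAlleles - 1 + i, i), so each division is exact.
        for (int i = 1; i <= ploidy; i++) {
            result = result * (numAlleles - 1 + i) / i;
        }
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(numGenotypes(2, 2)); // diploid, REF + 1 ALT
        System.out.println(numGenotypes(3, 2)); // diploid, REF + 2 ALTs
        System.out.println(numGenotypes(2, 4)); // tetraploid, REF + 1 ALT
    }
}
```

For a diploid biallelic site this gives the familiar three genotypes (0/0, 0/1, 1/1), which is the value the old diploid-only assumption hard-coded.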
Mark DePristo 7ef5ce28cc VariantRecalibrator test currently doesn't work with shadowBCF 2012-06-26 15:28:34 -04:00
Mark DePristo 5f5885ec78 Updating many MD5s to reflect correct fixed headers
-- Previous bugfix ensures that header fixing is always on in the GATK by default, even after integration tests that failed and went through the VCFDiffableReader.  Updating md5s to reflect this.
2012-06-26 15:28:34 -04:00
Mark DePristo 39c849aced Bugfix to ensure the DB=1 old files decode properly 2012-06-26 15:28:33 -04:00
Mark DePristo c1ac0e2760 BCF2 cleanup
-- allowMissingVCFHeaders is now part of -U argument.  If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING.  Updated lots of files to use this
-- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup.  This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.
2012-06-26 15:28:33 -04:00
Mark DePristo 0b5980d7b3 Added Heng's nasty ex2.vcf to standard tests 2012-06-26 15:28:33 -04:00
Mark DePristo 11dbfc92a7 Horrible bugfix to decodeLoc() in BCF2Codec
-- Just completely wrong.
-- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem
-- Added samtools ex2.bcf file for decoding to our integrationtests
2012-06-26 15:28:32 -04:00
Mark DePristo fb26c0f054 Update integration tests to reflect header changes 2012-06-26 15:28:32 -04:00
Mark DePristo 7b96263f8b Disable shadowBCF for VariantRecalibrationWalkers tests because it cannot handle symbolic alleles yet 2012-06-26 15:28:32 -04:00
Mark DePristo 7dbba465ee Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf 2012-06-26 15:28:32 -04:00
Mark DePristo 6e9a81aabe Minor bugfix -- now that the testfile is in our testdata regenerate the idx file as needed to pass tests 2012-06-26 15:28:32 -04:00
Roger Zurawicki 7eb3e4da41 Added integration Tests for DiagnoseTargets
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-25 17:02:46 -04:00
Joel Thibault f0c54d99ed Account for a null attributes object
* field attributesCanBeModified - a null attributes object can't be modified in its current state
* method makeAttributesModifiable() - initialize a null attributes object to empty
2012-06-25 12:07:36 -04:00
Joel Thibault d0cf8bcc80 Add unit tests for VariantContextBuilder.rmAttribute() and .attribute()
* These generated NPEs when the attribute object is null
2012-06-25 12:05:04 -04:00
Joel Thibault fd9effbfe2 Fix Exception typo 2012-06-25 12:05:04 -04:00
Ryan Poplin 429ad44421 Bug fix for read pos rank sum test annotation. Shouldn't be using the un-hardclipped start as the alignment start. 2012-06-22 14:53:29 -04:00
Ryan Poplin 735b59d942 Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero. 2012-06-22 12:38:48 -04:00
Ryan Poplin 0650b349d7 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-22 10:42:49 -04:00
Guillermo del Angel eed32df30d a) Sanity check in PoolCaller: if user didn't specify correct -glm or -pnrm models then error out with useful message, b) Have VariantsToTable deal with case where sample names have spaces: technically they're allowed (or at least not explicitly forbidden) but they'll produce R-incompatible tables. TBD which other tools have issues, or whether there's a generic fix for this 2012-06-21 21:19:55 -04:00
Mark DePristo d17369e0ac A few misc. residual errors in last commit 2012-06-21 16:04:25 -04:00
Mark DePristo 734756d6b2 Final fixes before BCF2 mark III push
-- Added MLEAC and MLEAF format lines to PoolCallerWalker
-- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation
-- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing
2012-06-21 15:17:22 -04:00
Mark DePristo 31ee8aa01a JEXL update
-- Update to 2.1.1 from 2.0
-- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching.  So "AF < 0.5" works even in the presence of multi-allelics now.
--
2012-06-21 15:17:21 -04:00
Mark DePristo 549293b6f7 Bugfixes towards final BCF2 implementation
-- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF
-- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing
-- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one
-- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing
-- Keys whose value is null are put into the VariantContext info attributes now
2012-06-21 15:17:21 -04:00
Mark DePristo 567dba0f76 Cleanup of VCF header lines and constants, BCF2 bugfixes
-- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller
-- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place.  Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers
-- VCF parsers now automatically repair standard VCF header lines when reading the header
-- Updating integration tests to reflect header changes
-- Created private and public testdata directories (public/testdata and private/testdata).  Updated tests to use test
-- SelectHeaders now always updates the header to include the contig lines
-- SelectVariants add UG header lines when in regenotype mode
-- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY
-- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs)
-- Throw error when VCF has unbounded non-flag values that don't have = value bindings
-- By default we no longer allow writing of BCF2 files without contig lines in the header
2012-06-21 15:16:31 -04:00
Mark DePristo fba7dafa0e Finalizing BCF2 mark III commit
-- Moved GENOTYPE_KEY vcf header line to VCFConstants.  This general migration and cleanup is on Eric's plate now
-- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header.  Still doesn't work...
-- Updating integration test files.  Moved many more files into public/testdata.  Updated their headers to all work correctly with new strict VCF header checking.
-- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt
-- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine.  DB = 0 is never seen in the output VCFs now
-- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status
-- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be
-- VariantsToVCF now properly writes out the GT FORMAT field
-- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code.  Eric said he and Ami will clean up this whole piece of infrastructure
-- Fixed bug in BCF2Codec that wasn't setting the phase field correctly.  UnitTested now
-- PASS string now added at the end of the BCF2 dictionary after discussion with Heng
-- Fixed bug where I was writing out all field values as BigEndian.  Now everything is LittleEndian.
-- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException
-- Cleaned up unused code
-- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding
-- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples
-- We always write the number of genotype samples into the BCF2 nSamples header.  How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names...
-- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally
-- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele
-- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values.  Genotype objects are all PASS implicitly, or explicitly filtered.  We only write out the FT values if at least one sample is filtered.  Removed interface functions and cleaned up code
-- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works.  In general, **** NEVER COPY CODE **** if you need to share functionality make a function, that's why they were invented!
-- Increased the default number of records to read for DiffObjects to 1M
2012-06-21 15:16:27 -04:00
Mark DePristo 0c8b830db7 Updating MD5s for inclusion of RPA field header 2012-06-21 15:16:26 -04:00
Mark DePristo d015a5738d Bugfixes for VCFWriterUnitTest and TestProvider to deal with stricter VCFWriter behavior 2012-06-21 15:16:26 -04:00
Mark DePristo 9c81f45c9f Phase I commit to get shadowBCFs passing tests
-- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header.  This helps avoid some of the low-level errors I saw in SelectVariants.  This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument
-- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA
-- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample
-- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original
-- Performance optimizations in BCF2 codec and writer
    -- using arrays not lists for intermediate data structures
    -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over
-- VCFHeader (which needs a complete rewrite, FYI Eric)
    -- Warn and fix on the fly flag values with counts > 0
    -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow.  Duplicates are detected once at header creation.
    -- Explicitly track FILTER fields for efficient lookup in their own hashmap
    -- Automatically add PL field when we see a GL field and no PL field
    -- Added get and has methods for INFO, FILTER, and FORMAT fields
-- No longer add AC and AF values to the INFO field when there's no ALT allele
-- Memory efficient comparison of VCF and BCF files for shadow BCF testing.  Now there's no (memory) constraint on the size of the files we can compare
-- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF
2012-06-21 15:16:26 -04:00
Mauricio Carneiro ab53220635 Refactor on how RR treats soft clips
* Sites with more soft clipped bases than regular will force-trigger a variant region
   * No more unclipping/reclipping, RR machinery now handles soft clips natively.
   * implemented support for base insertion and base deletion quality scores in synthetic and regular reads.
   * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present.

note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!
2012-06-21 14:02:03 -04:00
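The shallow-copy pitfall noted in the commit above can be sketched as follows. This is a hypothetical stand-in class, not GATKSAMRecord or SAMRecord; the field and method names are assumptions, and copying the existing entries into the fresh map is also an assumption (the commit only says a fresh object is created).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a read with temporary attributes; not the GATK class.
final class ReadSketch implements Cloneable {
    private Map<Object, Object> temporaryAttributes; // null until first use

    void setTemporaryAttribute(final Object key, final Object value) {
        if (temporaryAttributes == null)
            temporaryAttributes = new HashMap<Object, Object>();
        temporaryAttributes.put(key, value);
    }

    Object getTemporaryAttribute(final Object key) {
        return temporaryAttributes == null ? null : temporaryAttributes.get(key);
    }

    @Override
    public ReadSketch clone() {
        try {
            // super.clone() is shallow: at this point the copy shares the original's map.
            final ReadSketch copy = (ReadSketch) super.clone();
            // Give the copy its own map so later edits do not leak between reads.
            if (temporaryAttributes != null)
                copy.temporaryAttributes = new HashMap<Object, Object>(temporaryAttributes);
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class implements Cloneable
        }
    }
}
```

Without the extra map allocation, two reads derived from the same original would see each other's temporary attribute changes, which is the behavior the note warns about.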
Ryan Poplin 769e190202 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-20 09:59:55 -04:00
Christopher Hartl fe1d6e3953 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-19 08:02:00 -04:00
Christopher Hartl 79ef3325bd Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations. 2012-06-19 07:50:13 -04:00
Eric Banks 15ae906f32 Once I was playing with integration tests it was simple to fix the ones I left broken from earlier today. 2012-06-18 21:54:58 -04:00
Eric Banks 62cee2fb5b Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature. 2012-06-18 21:36:27 -04:00
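A minimal sketch of what padding an interval on both ends amounts to, assuming 1-based closed coordinates. The class and method names are hypothetical, and clamping to the contig bounds is an assumption rather than something the commit message states; this is not the GATK's interval code.

```java
// Hypothetical sketch of interval padding; names and clamping behavior are assumptions.
final class IntervalPaddingSketch {
    /**
     * Pads the 1-based closed interval [start, stop] by padding bases on each side,
     * clamping the result to [1, contigLength]. Returns {paddedStart, paddedStop}.
     */
    static int[] pad(final int start, final int stop, final int padding, final int contigLength) {
        final int paddedStart = Math.max(1, start - padding);
        final int paddedStop = Math.min(contigLength, stop + padding);
        return new int[]{paddedStart, paddedStop};
    }

    public static void main(final String[] args) {
        final int[] padded = pad(100, 200, 50, 1000);
        System.out.println(padded[0] + "-" + padded[1]);
    }
}
```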
Eric Banks 4393adf9e7 If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it. 2012-06-18 13:36:14 -04:00
Ryan Poplin 707151f0a4 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 12:55:58 -04:00
Eric Banks 82a2c40338 Emit the MLE AC and AF in the INFO field of the UG output 2012-06-18 12:19:36 -04:00
Ryan Poplin 5ec737f008 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 08:51:48 -04:00
Ryan Poplin e3147969d9 Smith Waterman parameters have somehow gotten too diverged from what is used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled. 2012-06-18 08:51:41 -04:00
Eric Banks 677babf546 Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort. 2012-06-15 15:55:03 -04:00
Eric Banks 783b7f6899 Misc cleanup 2012-06-15 10:39:19 -04:00
Eric Banks 0c218e4822 Refactoring mostly for readability (and small performance improvement) 2012-06-15 10:36:41 -04:00
Eric Banks c54e84e739 Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations. 2012-06-15 09:28:56 -04:00
Eric Banks 61fcbcb190 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-15 02:45:57 -04:00
Eric Banks 4895fe2289 No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java) 2012-06-15 02:32:00 -04:00
Mark DePristo 5c23ab0817 Final cleanup of VCFWriterUnitTest 2012-06-14 16:42:39 -04:00
Mark DePristo 0384ce5d34 Simple optimizations for BCF2Encoder
-- Inline encodeString that doesn't go via List<Byte> intermediate
-- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2
-- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation
-- Final UG integration test update
2012-06-14 16:42:39 -04:00
Mark DePristo 68eed7b313 Optimizations for VCF and BCF2
-- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking.  encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version.  Lots of contracts.  User code updated to use specialized versions where possible
-- Misc code refactoring
-- Updated VCF float formatting to always include 3 sig digits for values < 1, and 2 for > 1.  Updating MD5s accordingly
-- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations
2012-06-14 16:42:39 -04:00
Mark DePristo 09df584788 Fixed nasty bug where we weren't closing the underlying PositionalOutputStream in IndexingVariantContextWriter 2012-06-14 16:42:39 -04:00
Mark DePristo fbc45e14d3 Cleanup formatting of VCF floats
-- Final integrationtest update before commit (and fixing new formatting changes)
2012-06-14 16:42:38 -04:00
Mark DePristo 8b01969762 More code cleanup and optimizations to BCF2 writer
-- Cleanup a few contracts
-- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields
-- A few simple optimizations
    -- VCF header samples stored in String[] in the writer for fast access
    -- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset
    -- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files
-- A few editorial comments in VCFHeader
2012-06-14 16:42:38 -04:00
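The buffered-writer optimization in the commit above can be sketched with the standard java.io.BufferedOutputStream and an explicit 1 MB buffer. The wrapper class and its method names are hypothetical, not VariantContextWriterStorage; the sketch only shows that small writes are held in the buffer until a flush.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: wrap an output stream in a 1 MB buffer.
final class BufferedWriterSketch {
    static final int BUFFER_SIZE = 1 << 20; // 1 MB

    static OutputStream buffered(final OutputStream underlying) {
        return new BufferedOutputStream(underlying, BUFFER_SIZE);
    }

    /** Returns {bytes visible in the sink before flush, bytes visible after flush} for a 3-byte write. */
    static int[] demo() {
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            final OutputStream out = buffered(sink);
            out.write(new byte[]{1, 2, 3});
            final int before = sink.size(); // still held in the buffer
            out.flush();
            return new int[]{before, sink.size()};
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Batching many small record writes into fewer large writes to the underlying stream is the usual reason such a buffer helps.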
Mark DePristo e34ca0acb1 Passing all unittests
-- Final merge conflicts resolved
-- BCF2Writer now supports case where a sample is present in the header but the sample isn't in the VC, in which case we create an empty sample and encode that
2012-06-14 16:42:38 -04:00
Mark DePristo 71da76039e Final support for variable length lists of strings in BCF2
-- Updating many MD5s as well.
2012-06-14 16:42:38 -04:00
Mark DePristo bd9d40fb84 Code cleanup and more documentation for BCFFieldWriters
-- Update integration tests where appropriate
2012-06-14 16:42:37 -04:00
Mark DePristo dc07067265 Fix bug in incorrectly reporting relative paths in log 2012-06-14 16:42:37 -04:00
Mark DePristo 856905ee5b Cleanup Genotypes
-- Renamed getAttribute to getExtendedAttribute, as this is really what this function does
-- Added a few more genotype tests
2012-06-14 16:42:36 -04:00
Mark DePristo aa2178cc68 Updating MD5s to latest version to reflect inclusion of contigs in headers 2012-06-14 16:42:36 -04:00
Mark DePristo 31997f8092 Bugfixes on the way to passing integration tests
-- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate
-- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it.  Very expensive but convenient.
-- Fixed nasty subsetting bug in SelectVariants with excluding samples
-- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT
-- Bugfix for dropping old style GL field values
-- Added test to VCFWriter to ensure that we have the same number of samples in the VC as in the header
-- Bugfix for Allele.getBaseString to properly show NO_CALL alleles
-- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes
2012-06-14 16:42:33 -04:00
Mark DePristo ea1b699778 Cleanup the interface for BCF2FieldEncoder
-- Now uses a much clearer approach.  Update all user classes to new interface
2012-06-14 16:42:33 -04:00
Mark DePristo dd6aee347a Genotype encoding uses the BCF2FieldEncoder system 2012-06-14 16:42:33 -04:00
Mark DePristo 9ac4203254 GenotypeAnnotations now accept a GenotypeBuilder and directly update the builder with their value
-- Cleans up interface and avoids significant amounts of gross typing code
2012-06-14 16:42:32 -04:00
Mark DePristo 7506994d09 Nearing final BCF commit
-- Cleanup some (but not all) VCF3 files.  Turns out there are lots so...
-- Refactored genotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec.  Now VCF3 properly handles the new GenotypeBuilder interface
-- Misc. bugfixes in GenotypeBuilder
2012-06-14 16:42:32 -04:00
Mark DePristo 6272612808 Testing utility to perform diffs N times 2012-06-14 16:42:32 -04:00
Mark DePristo 8014178f2f Algorithmically faster version of DiffEngine
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see.  This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly.  Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration test for new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
2012-06-14 16:42:30 -04:00
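The leaf-only summary described in the commit above can be sketched as a single pass that collapses each difference path to the form "*.*....*.X" and counts occurrences in a hash map, which is linear in the number of differences. The class, the method names, and the dot-separated path format are assumptions for illustration; this is not the DiffEngine code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a linear-time leaf summary; not the GATK DiffEngine.
final class LeafSummarySketch {
    /** Replaces every path element except the last with '*', e.g. "a.b.X" becomes "*.*.X". */
    static String leafKey(final String path) {
        final String[] parts = path.split("\\.");
        final StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length - 1; i++)
            sb.append("*.");
        return sb.append(parts[parts.length - 1]).toString();
    }

    /** Counts differences per leaf key in one pass over the input. */
    static Map<String, Integer> summarize(final List<String> differencePaths) {
        final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (final String path : differencePaths) {
            final String key = leafKey(path);
            final Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }
}
```

Each path is visited once and each map update is expected constant time, in contrast to comparing every pair of differences.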
Mark DePristo 2a86b81a3f Initial version of clean, fast formatting routines built dynamically from a VCF header
-- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level.
-- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values
-- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder
-- Allowed us to naturally support encoding of lists of strings
-- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion
-- Fixes for integration test failures
-- Enabling contig updates
-- WalkerTest now prints out relative paths where possible to make cut/paste/run easier
2012-06-14 16:42:30 -04:00
Mark DePristo 51a3b6e25e No more makePrecisionFormatStringFromDenominatorValue
-- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formatting.
-- Created a smart float formatter in VCFWriter, with unit tests
-- Removed makePrecisionFormatStringFromDenominatorValue and its uses
-- Fix broken contract
-- Refactored some code from the encoder to utils in BCF2
-- HaplotypeCaller's GenotypingEngine was using old version of subset to context.  Replaced with a faster call that I think is correct. Ryan, please confirm.
2012-06-14 16:42:30 -04:00
Mark DePristo 43ad890fcc Finalizing BCF2 v2
-- FastGenotypes are the default in the engine.  Use --useSlowGenotypes engine argument to return to old representation
-- Cleanup of BCF2Codec.  Good error handling.  Added contracts and docs.
-- Added a few more contracts and docs to BCF2Decoder
-- Optimized encodePrimitive in BCF2Encoder
-- Removed genotype filter field exceptions
-- Docs and cleanup of BCF2GenotypeFieldDecoders
-- Deleted unused BCF2TestWalker
-- Docs and cleanup of BCF2Types
-- Faster version of decodeInts in VCFCodec
-- BCF2Writer
    -- Support for writing a sites only file
    -- Lots of TODOs for future optimizations
    -- Removed lack of filter field support
    -- No longer uses the alleleMap from VCFWriter, which was an Allele -> String map; now uses Allele -> Integer, which is faster and more natural
    -- Lots of docs and contracts
-- Docs for GenotypeBuilder.  More filter creation routines (unfiltered, for example)
-- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters.  Better genotype comparisons
2012-06-14 16:42:29 -04:00
Mark DePristo 37e5d32019 Remove logger.info statement 2012-06-14 16:42:29 -04:00
Mark DePristo 6cfb2d1393 Restoring SelectVariantsIntegrationTest 2012-06-14 16:42:28 -04:00
Mark DePristo 01ddf9555a Performance optimizations for Genotype field decoding for GT field
-- Fast path decoder for biallelic diploid GT fields that avoids allocating the same genotypes over and over
-- Contracts
-- final classes
2012-06-14 16:42:28 -04:00
Mark DePristo 7fbca7013e Don't add missing value binding from field to Genotype object in VCF3Codec 2012-06-14 16:42:28 -04:00
Mark DePristo cfd1e50068 Minor updates to test code 2012-06-14 16:42:28 -04:00
Mark DePristo 54817f8d16 VCFHeaderUnitTest needed to be updated to reflect that we are doing VCF4.1 not VCF4.0 2012-06-14 16:42:28 -04:00
Mark DePristo 982192e2e4 MD5DB for integrationtest management now writes out an md5mismatches file for clean analysis
-- This file is in integrationtests/md5mismatches.txt, and looks like:

expected        observed        test
7fd0d0c2d1af3b16378339c181e40611        2339d841d3c3c7233ebba9a6ace895fd        test BeagleOutputToVCF
43865f3f0d975ee2c5912b31393842f8        1b9c4734274edd3142a05033e520beac        testBeagleChangesSitesToRef
daead9bfab1a5df72c5e3a239366118e        27be14f9fc951c4e714b4540b045c2df        testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e

-- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods
2012-06-14 16:42:27 -04:00
Mark DePristo 249d5e5533 Better tests for Genotype parsing 2012-06-14 16:42:27 -04:00
Mark DePristo 4a4d3cde3d UnitTests for decodeIntArray method 2012-06-14 16:42:27 -04:00
Mark DePristo 5b8bd81991 An option to not actually write out the results of select variants
-- Useful for performance testing of the SV operations themselves.
2012-06-14 16:42:26 -04:00
Mark DePristo 6f7a01e00d Bugfix for BCF2 reader / writer for > 0x0FFF samples :-)
-- Should be 0x00FFFFFF in the mask
2012-06-14 16:42:26 -04:00
Mark DePristo 1d4eb46606 Efficient reading of genotype fields v1
-- decodeIntArray in BCF2 decoder allows us to more efficiently read ints and int[] from stream directly into Genotype object
-- Code cleanup / contracts added where appropriate
-- V2 will have a yet more optimized path...
2012-06-14 16:42:26 -04:00
Mark DePristo 37b8d70321 Hidden option to SelectVariants to force the genotypes information to be decoded by computing AC 2012-06-14 16:42:25 -04:00
Mark DePristo 17fbd103d0 Smarter infrastructure to decode genotypes in BCF
-- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder.  The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2
-- Now we create, once, decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder.  In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form
-- Reduced the amount of data further to be computed in the DiffEngine.  The DiffEngine algorithm needs to be rethought to be efficient...
2012-06-14 16:42:25 -04:00
Mark DePristo 889e3c4583 Code cleanup before major refactor 2012-06-14 16:42:25 -04:00
Mark DePristo cebd37609c Finalizing new Genotype object and associated routines
-- Builder now provides a deprecated log10pError function to make a new GQ value
-- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions
-- Lots of contracts
-- Bugfixes throughout
2012-06-14 16:42:25 -04:00
Mark DePristo 8b0a629a31 Terrible bugfix
-- The way I was handling the contig offset ordering wasn't correct.  Now the contigs are always indexed in the order in which their corresponding populate() functions are called, so that the order of the contigs is given by the order in which they are in the file, or in our refDict.  It has nothing to do with the contig index itself.
-- SelectVariants no longer prints all samples to the screen if you aren't selecting any explicitly
2012-06-14 16:42:24 -04:00
Mark DePristo d37a8a0bc8 Efficient Genotype object Intermediate commit
-- Created a new Genotype interface with a more limited set of operations
-- Old genotype object is now SlowGenotype.  New genotype object is FastGenotype.  They can be used interchangeably
-- There's no way to create Genotypes directly any longer.  You have to use GenotypeBuilder just like VariantContextBuilder
-- Modified lots and lots of code to use GenotypeBuilder
-- Added a temporary hidden argument to engine to use FastGenotype by default.  Current default is SlowGenotype
-- Lots of bug fixes to BCF2 codec and encoder.
-- Feature additions
  -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes
  -- Cleaned up semantics of subContextFromSamples.  There's one function that either rederives or not the alleles from the subsetted genotypes

-- MASSIVE BUGFIX in SelectVariants.  The code has been decoding genotypes always, even if you were not subsetting down samples.  Fixed!
2012-06-14 16:42:24 -04:00
Mark DePristo a648b5e65e First step towards an efficient Genotype object
-- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness.  Tested utility of this approach by rewriting -- and then commenting out -- a path in BCF2Codec that could use this new code.  Much cleaner interface now, but not yet hooked up to anything
-- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code base passes integration tests, before switching the code to new Genotype class
-- Code cleanup.  Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS.  Uses in code replaced with constant
2012-06-14 16:42:23 -04:00
Mark DePristo ff9ac4b5f8 BCF2 genotype decoding is now lazy
-- Refactored BCF2Codec into a LazyGenotypesDecoder object that provides on-demand genotype decoding of BCF2 data blocks a la VCFCodec.
-- VCFHeader has getters for sampleNamesInOrder and sampleNameToOffset instead of protected variables directly accessed by vcfcodec
2012-06-14 16:42:23 -04:00
Mark DePristo 9eb83a0771 Enable adding contigs to VariantContextWriters on output 2012-06-14 16:42:23 -04:00
Mark DePristo 8fc1a26ac7 Fixed comparison of VCFHeader as the set.equals() isn't working as expected 2012-06-14 16:42:22 -04:00
Mark DePristo b0ea14ef0f VCFHeader getMetaData returns 4.1 version not 4.0 2012-06-14 16:42:22 -04:00
Mark DePristo 5fda16bea9 Enable shadow BCF2 2012-06-14 16:42:22 -04:00
Mauricio Carneiro 7d12429917 First step towards indel qualities in RR
Let the BI's and BD's pass through the reduce reads machinery
2012-06-14 15:37:39 -04:00
Mauricio Carneiro e68038c5d8 Refactor post-processing downsampling using David's generic downsampler interface 2012-06-14 15:37:32 -04:00
Eric Banks 0398ae9695 I hate these disabled unit tests, #2 2012-06-14 15:19:27 -04:00
Eric Banks 676a57de7b I hate these disabled unit tests 2012-06-14 14:03:58 -04:00
Eric Banks de5508fcea Bug fixes for cycle and context covariates 2012-06-14 13:01:14 -04:00
Eric Banks 5c3c6cbc40 Long -> long conversions in BQSR 2012-06-14 09:07:02 -04:00
Eric Banks 29a74908bb The next round of BQSR optimizations: no more Long[] array creation 2012-06-14 00:05:42 -04:00
Guillermo del Angel cd2074b1dc Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-13 20:59:30 -04:00
Guillermo del Angel 92669a0468 Second intermediate commit for indel pool caller - now works (more or less) in reference sample-free mode. Still needs a lot of cleanups/add more tests and not done w/refactoring quite yet 2012-06-13 20:59:17 -04:00
David Roazen 0550b27799 Make downsampler classes themselves generic (instead of just the Downsampler interface)
This is in response to a request from Mauricio to make it easier
to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords)
without having to do any cumbersome typecasting. Sadly, Java
language limitations make this sort of solution the best choice.

Thanks to Khalid for his feedback on this issue.

Also:

-added a unit test to verify GATKSAMRecord support with no typecasting required

-added some unit tests for the FractionalDownsampler that Mauricio will/might be using

-moved classes from private to public to better sync up with my local development
branch for engine integration
2012-06-13 16:43:39 -04:00
Guillermo del Angel 67c0569f9c Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-13 11:50:00 -04:00
Eric Banks 81993b08e2 Don't put null entries into the key array 2012-06-13 11:43:44 -04:00
Roger Zurawicki bdf5945dcc Fixed bugs in DiagnoseTargets
DT would not report bad mates!
that has been fixed

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-13 11:15:26 -04:00
Roger Zurawicki 538cdf9210 Created the FindCoveredIntervals
Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class
Minor tweaks
FindCoveredIntervals supports Gathering
FindCoveredIntervals outputs an interval list instead of GATKReport

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-13 11:15:25 -04:00
Guillermo del Angel aee66ab157 Big UG refactoring and intermediate commit to support indels in pool caller (not done yet). Lots of code pulled out of long spaghetti-like functions and modularized to be easily shareable. Add functionality in ErrorModel to count indel matches/mismatches (but left part disabled so as not to change integration tests in this commit), add computation of pool genotype likelihoods for indels (not fully working yet in more realistic cases, only working in artificial nice pools). Lots of TBDs still but existing UG and pool SNP functionality should be intact 2012-06-13 11:14:44 -04:00
Eric Banks bb77aa88c3 Drat, forgot the unit tests again 2012-06-12 19:00:47 -04:00
Eric Banks 37f56ce8fd A couple of minor updates to BQSR 2012-06-12 16:12:13 -04:00
Eric Banks 277493dd83 Yet more instances of Lists changed over to native arrays 2012-06-12 15:56:09 -04:00
Eric Banks 613badc835 Very minor optimizations for the context covariate 2012-06-12 15:47:32 -04:00
Eric Banks 0f79adb2aa Changing more Java Lists to native arrays in BQSR for performance optimization. 2012-06-12 15:41:01 -04:00
Eric Banks 1da3e43679 Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come. 2012-06-12 13:32:56 -04:00
Eric Banks a96c5da884 Oops, forgot to push the unit tests 2012-06-12 11:38:30 -04:00
Eric Banks fec0bd5e11 Fixing UG argument docs 2012-06-12 09:46:16 -04:00
Eric Banks a4defdfb29 Adding a GT header line to SomaticIndelDetector output 2012-06-12 09:39:17 -04:00
Eric Banks 891ce51908 Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements. 2012-06-12 09:19:36 -04:00
Eric Banks ff5749599d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 15:46:17 -04:00
Eric Banks fea625632f Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one 2012-06-11 15:45:58 -04:00
Ryan Poplin e4d371dc80 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 10:38:50 -04:00
Ryan Poplin 683d4b508e Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller. 2012-06-11 10:38:35 -04:00
Mauricio Carneiro 4aad7e23ef New ReduceReads v2 with unclipped variant regions and soft-clipped bases
* Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it.
   * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute
   * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads
   * Updated all integration tests
2012-06-08 14:58:31 -04:00
Eric Banks afa9b2718a Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-08 13:54:48 -04:00
Eric Banks 92280b4068 BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime. 2012-06-08 13:54:37 -04:00
Eric Banks 898a0e6161 Minor optimizations 2012-06-08 12:07:58 -04:00
Ryan Poplin 0a37e19998 Bug fix in VQSR so that the VCF index will be created for the recalFile. 2012-06-08 11:51:28 -04:00
Eric Banks d463ab2cbf BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible. 2012-06-08 10:42:42 -04:00
Eric Banks 2bd48a7351 Bad comments made it into the previous commit 2012-06-07 23:12:56 -04:00
Eric Banks 31c3a6be48 BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR. 2012-06-07 20:04:10 -04:00
Eric Banks 0fb9179f76 BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array 2012-06-07 19:41:03 -04:00
Ryan Poplin d449f169d3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-07 10:56:55 -04:00
Ryan Poplin 0b4281fdd0 misc minor update to HC debug output for when there are a lot of samples 2012-06-07 10:56:41 -04:00
Eric Banks bad50a1b05 Fix docs 2012-06-06 22:45:38 -04:00
Eric Banks b093ba9dcc Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records) 2012-06-06 15:17:30 -04:00
Eric Banks 54f682a99c Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX. 2012-06-06 11:44:37 -04:00
Eric Banks dd46d843fb IR should skip Ion reads just like it does with 454 reads; Tim has confirmed the official platform name for Ion. 2012-06-06 11:04:55 -04:00
Guillermo del Angel 2cbd6e5f90 Merged bug fix from Stable into Unstable 2012-06-05 15:58:23 -04:00
Guillermo del Angel ce4dc2128d Adding minor clarification to -mbq argument documentation 2012-06-05 15:17:56 -04:00
Eric Banks e02ec8c8b6 Don't update the record ID unless we are actually going to emit the record 2012-06-04 14:58:50 -04:00
Eric Banks 8405156ae1 Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities. 2012-06-04 14:28:32 -04:00
Ryan Poplin f11e7ebc3a Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA. 2012-06-04 12:49:36 -04:00
Ryan Poplin 320956ee4b Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage. 2012-06-04 10:55:36 -04:00
Guillermo del Angel 7a54baf08c Merged bug fix from Stable into Unstable 2012-06-03 08:42:08 -04:00
Guillermo del Angel 47df7bbc14 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-06-03 08:38:54 -04:00
Guillermo del Angel 2ddbdee3bc Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow 2012-06-03 08:38:38 -04:00
Mauricio Carneiro 12a8c54f9a Fixing VCF header for filter elements (thanks Eric) 2012-06-01 15:45:15 -04:00
Eric Banks 3a15ba2102 Malformed VCF headers should be User Errors 2012-05-31 16:05:53 -04:00
Khalid Shakir c4f7df4dce When an underlying exception occurs because of a user error and the exception instance does not include a message, tell the user "because <exception class name>" instead of "because null". 2012-05-30 16:39:06 -04:00
Ryan Poplin 421d0d1435 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-30 15:21:35 -04:00
Ryan Poplin 5dd811f84a Adding genotype given alleles mode to the HaplotypeCaller. 2012-05-30 15:07:01 -04:00
Eric Banks d09b8d5584 Fixing docs 2012-05-30 13:24:08 -04:00
Mauricio Carneiro d6e1205310 Updating default values for DiagnoseTargets 2012-05-30 12:43:07 -04:00