Commit Graph

3232 Commits (0f5bb706ffd3e02a2fec45f32317f29e4e45cb6d)

Author SHA1 Message Date
Mark DePristo 39c849aced Bugfix to ensure the DB=1 old files decode properly 2012-06-26 15:28:33 -04:00
Mark DePristo c1ac0e2760 BCF2 cleanup
-- allowMissingVCFHeaders is now part of -U argument.  If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING.  Updated lots of files to use this
-- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup.  This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.
2012-06-26 15:28:33 -04:00
Mark DePristo 11dbfc92a7 Horrible bugfix to decodeLoc() in BCF2Codec
-- Just completely wrong.
-- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem
-- Added samtools ex2.bcf file for decoding to our integrationtests
2012-06-26 15:28:32 -04:00
Mark DePristo 7dbba465ee Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf 2012-06-26 15:28:32 -04:00
Roger Zurawicki 7eb3e4da41 Added integration Tests for DiagnoseTargets
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-25 17:02:46 -04:00
Joel Thibault f0c54d99ed Account for a null attributes object
* field attributesCanBeModified - a null attributes object can't be modified in its current state
* method makeAttributesModifiable() - initialize a null attributes object to empty
2012-06-25 12:07:36 -04:00
Joel Thibault fd9effbfe2 Fix Exception typo 2012-06-25 12:05:04 -04:00
Ryan Poplin 429ad44421 Bug fix for read pos rank sum test annotation. Shouldn't be using the un-hardclipped start as the alignment start. 2012-06-22 14:53:29 -04:00
Ryan Poplin 735b59d942 Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero. 2012-06-22 12:38:48 -04:00
Ryan Poplin 0650b349d7 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-22 10:42:49 -04:00
Guillermo del Angel eed32df30d a) Sanity check in PoolCaller: if user didn't specify correct -glm or -pnrm models then error out with useful message, b) Have VariantsToTable deal with case where sample namess have spaces: technically they're allowed (or at least not explicitly forbidden) but they'll produce R-incompatible tables. TBD which other tools have issues, or whether there's a generic fix for this 2012-06-21 21:19:55 -04:00
Mark DePristo 734756d6b2 Final fixes before BCF2 mark III push
-- Added MLEAC and MLEAF format lines to PoolCallerWalker
-- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation
-- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing
2012-06-21 15:17:22 -04:00
Mark DePristo 31ee8aa01a JEXL update
-- Update to 2.1.1 from 2.0
-- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching.  So "AF < 0.5" works even in the presence of multi-allelics now.
--
2012-06-21 15:17:21 -04:00
Mark DePristo 549293b6f7 Bugfixes towards final BCF2 implementation
-- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF
-- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing
-- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one
-- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing
-- Keys whose value is null are put into the VariantContext info attributes now
2012-06-21 15:17:21 -04:00
Mark DePristo 567dba0f76 Cleanup of VCF header lines and constants, BCF2 bugfixes
-- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller
-- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place.  Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers
-- VCF parsers now automatically repair standard VCF header lines when reading the header
-- Updating integration tests to reflect header changes
-- Created private and public testdata directories (public/testdata and private/testdata).  Updated tests to use test
-- SelectHeaders now always updates the header to include the contig lines
-- SelectVariants add UG header lines when in regenotype mode
-- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY
-- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs)
-- Throw error when VCF has unbounded non-flag values that don't have = value bindings
-- By default we no longer allow writing of BCF2 files without contig lines in the header
2012-06-21 15:16:31 -04:00
Mark DePristo fba7dafa0e Finalizing BCF2 mark III commit
-- Moved GENOTYPE_KEY vcf header line to VCFConstants.  This general migration and cleanup is on Eric's plate now
-- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header.  Still doesn't work...
-- Updating integration test files.  Moved many more files into public/testdata.  Updated their headers to all work correctly with new strict VCF header checking.
-- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt
-- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine.  DB = 0 is never seen in the output VCFs now
-- Fixed bug in VCFDiffableReader that didn't differeniate between "." and "PASS" VC filter status
-- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be
-- VariantsToVCF now properly writes out the GT FORMAT field
-- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code.  Eric said he and Ami will clean up this whole piece of instructure
-- Fixed bug in BCF2Codec that wasn't setting the phase field correctly.  UnitTested now
-- PASS string now added at the end of the BCF2 dictionary after discussion with Heng
-- Fixed bug where I was writing out all field values as BigEndian.  Now everything is LittleEndian.
-- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException
-- Cleaned up unused code
-- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding
-- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples
-- We always write the number of genotype samples into the BCF2 nSamples header.  How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names...
-- Removed old filtersWereAppliedToContext code in VCF as properly handle unfiltered, filtered, and PASS records internally
-- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele
-- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values.  Genotype objects are all PASS implicitly, or explicitly filtered.  We only write out the FT values if at least one sample is filtered.  Removed interface functions and cleaned up code
-- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works.  In general, **** NEVER COPY CODE **** if you need to share funcitonality make a function, that's why there were invented!
-- Increased the default number of records to read for DiffObjects to 1M
2012-06-21 15:16:27 -04:00
Mark DePristo 9c81f45c9f Phase I commit to get shadowBCFs passing tests
-- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header.  This helps avoid some of the low-level errors I saw in SelectVariants.  This behavior can be disable in the engine with the --allowMissingVCFHeaders argument
-- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA
-- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample
-- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original
-- Performance optimizations in BCF2 codec and writer
    -- using arrays not lists for intermediate data structures
    -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over
-- VCFHeader (which needs a complete rewrite, FYI Eric)
    -- Warn and fix on the way flag values with counts > 0
    -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow.  Duplicates are detected once at header creation.
    -- Explicitly track FILTER fields for efficient lookup in their own hashmap
    -- Automatically add PL field when we see a GL field and no PL field
    -- Added get and has methods for INFO, FILTER, and FORMAT fields
-- No longer add AC and AF values to the INFO field when there's no ALT allele
-- Memory efficient comparison of VCF and BCF files for shadow BCF testing.  Now there's no (memory) constraint on the size of the files we can compare
-- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF
2012-06-21 15:16:26 -04:00
Mauricio Carneiro ab53220635 Refactor on how RR treats soft clips
* Sites with more soft clipped bases than regular will force-trigger a variant region
   * No more unclipping/reclipping, RR machinery now handles soft clips natively.
   * implemented support for base insertion and base deletion quality scores in synthetic and regular reads.
   * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present.

note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!
2012-06-21 14:02:03 -04:00
Ryan Poplin 769e190202 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-20 09:59:55 -04:00
Christopher Hartl fe1d6e3953 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-19 08:02:00 -04:00
Christopher Hartl 79ef3325bd Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations. 2012-06-19 07:50:13 -04:00
Eric Banks 62cee2fb5b Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature. 2012-06-18 21:36:27 -04:00
Eric Banks 4393adf9e7 If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it. 2012-06-18 13:36:14 -04:00
Ryan Poplin 707151f0a4 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 12:55:58 -04:00
Eric Banks 82a2c40338 Emit the MLE AC and AF in the INFO field of the UG output 2012-06-18 12:19:36 -04:00
Ryan Poplin 5ec737f008 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 08:51:48 -04:00
Ryan Poplin e3147969d9 Smith Waterman parameters have somehow gotten too diverged from what it is used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled. 2012-06-18 08:51:41 -04:00
Eric Banks 677babf546 Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort. 2012-06-15 15:55:03 -04:00
Eric Banks 783b7f6899 Misc cleanup 2012-06-15 10:39:19 -04:00
Eric Banks 0c218e4822 Refactoring mostly for readability (and small performance improvement) 2012-06-15 10:36:41 -04:00
Eric Banks c54e84e739 Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations. 2012-06-15 09:28:56 -04:00
Eric Banks 61fcbcb190 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-15 02:45:57 -04:00
Eric Banks 4895fe2289 No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java) 2012-06-15 02:32:00 -04:00
Mark DePristo 0384ce5d34 Simple optimizations for BCF2Encoder
-- Inline encodeString that doesn't go via List<Byte> intermediate
-- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2
-- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation
-- Final UG integration test update
2012-06-14 16:42:39 -04:00
Mark DePristo 68eed7b313 Optimizations for VCF and BCF2
-- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking.  encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version.  Lots of contracts.  User code updated to use specialized versions where possible
-- Misc code refactoring
-- Updated VCF float formating to always include 3 sig digits for values < 1, and 2 for > 1.  Updating MD5s accordingly
-- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations
2012-06-14 16:42:39 -04:00
Mark DePristo 09df584788 Fixed nasty bug where we weren't closing the underlying PositionalOutputStream in IndexingVariantContextWriter 2012-06-14 16:42:39 -04:00
Mark DePristo fbc45e14d3 Cleanup formatting of VCF floats
-- Final integrationtest update before commit (and fixing new formatting changes)
2012-06-14 16:42:38 -04:00
Mark DePristo 8b01969762 More code cleanup and optimizations to BCF2 writer
-- Cleanup a few contracts
-- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields
-- A few simple optimizations
    -- VCF header samples stored in String[] in the writer for fast access
    -- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset
    -- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files
-- A few editorial comments in VCFHeader
2012-06-14 16:42:38 -04:00
Mark DePristo e34ca0acb1 Passing all unittests
-- Final merge conflicts resolved
-- BCF2Writer now supports case where a sample is present in the header but the sample isn't in the VC, in which case we create an empty sample and encode that
2012-06-14 16:42:38 -04:00
Mark DePristo 71da76039e Final support for variable length lists of strings in BCF2
-- Updating many MD5s as well.
2012-06-14 16:42:38 -04:00
Mark DePristo bd9d40fb84 Code cleanup and more documentation for BCFFieldWriters
-- Update integration tests where appropriate
2012-06-14 16:42:37 -04:00
Mark DePristo 856905ee5b Cleanup Genotypes
-- Renamed getAttribute to getExtendedAttribute, as this is really what this function does
-- Added a few more genotype tests
2012-06-14 16:42:36 -04:00
Mark DePristo 31997f8092 Bugfixes on the way to passing integration tests
-- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate
-- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it.  Very expensive but convenient.
-- Fixed nasty subsetting bug in SelectVariants with excluding samples
-- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT
-- Bugfix for dropping old style GL field values
-- Added test to VCFWriter to ensure that we have the sample number of samples in the VC as in the header
-- Bugfix for Allele.getBaseString to properly show NO_CALL alleles
-- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes
2012-06-14 16:42:33 -04:00
Mark DePristo ea1b699778 Cleanup the interface for BCF2FieldEncoder
-- Now uses a much clearer approach.  Update all user classes to new interface
2012-06-14 16:42:33 -04:00
Mark DePristo dd6aee347a Genotype encoding uses the BCF2FieldEncoder system 2012-06-14 16:42:33 -04:00
Mark DePristo 9ac4203254 GenotypeAnnotations now accept a GenotypeBuilder and directly update the builder with their value
-- Cleans up interface and avoids significant amounts of gross typing code
2012-06-14 16:42:32 -04:00
Mark DePristo 7506994d09 Nearing final BCF commit
-- Cleanup some (but not all) VCF3 files.  Turns out there are lots so...
-- Refactored gneotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec.  Now VCF3 properly handles the new GenotypeBuilder interface
-- Misc. bugfixes in GenotypeBuilder
2012-06-14 16:42:32 -04:00
Mark DePristo 6272612808 Testing utility to perform diffs N times 2012-06-14 16:42:32 -04:00
Mark DePristo 8014178f2f Algorithmically faster version of DiffEngine
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see.  This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly.  Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration test for new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
2012-06-14 16:42:30 -04:00
Mark DePristo 2a86b81a3f Initial version of clean, fast formatting routines built dynamically from a VCF header
-- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level.
-- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values
-- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder
-- Allowed us to naturally support encoding of lists of strings
-- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion
-- Fixes for integration test failures
-- Enabling contig updates
-- WalkerTest now prints out relative paths where possible to make cut/paste/run easier
2012-06-14 16:42:30 -04:00
Mark DePristo 51a3b6e25e No more makePrecisionFormatStringFromDenominatorValue
-- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formating.
-- Created a smart float formatter in VCFWriter, with unit tests
-- Removed makePrecisionFormatStringFromDenominatorValue and its uses
-- Fix broken contracted
-- Refactored some code from the encoder to utils in BCF2
-- HaplotypeCaller's GenotypingEngine was using old version of subset to context.  Replaced with a faster call that I think is correct. Ryan, please confirm.
2012-06-14 16:42:30 -04:00
Mark DePristo 43ad890fcc Finalizing BCF2 v2
-- FastGenotypes are the default in the engine.  Use --useSlowGenotypes engine argument to return to old representation
-- Cleanup of BCF2Codec.  Good error handling.  Added contracts and docs.
-- Added a few more contacts and docs to BCF2Decoder
-- Optimized encodePrimitive in BCF2Encoder
-- Removed genotype filter field exceptions
-- Docs and cleanup of BCF2GenotypeFieldDecoders
-- Deleted unused BCF2TestWalker
-- Docs and cleanup of BCF2Types
-- Faster version of decodeInts in VCFCodec
-- BCF2Writer
    -- Support for writing a sites only file
    -- Lots of TODOs for future optimizations
    -- Removed lack of filter field support
    -- No longer uses the alleleMap from VCFWriter, which was a Allele -> String, now uses Allele -> Integer which is faster and more natural
    -- Lots of docs and contracts
-- Docs for GenotypeBuilder.  More filter creation routines (unfiltered, for example)
-- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters.  Better genotype comparisons
2012-06-14 16:42:29 -04:00
Mark DePristo 37e5d32019 Remove logger.info statement 2012-06-14 16:42:29 -04:00
Mark DePristo 01ddf9555a Performance optimizations for Genotype field decoding for GT field
-- Fast path decoder for biallelic diploid GT fields that avoids allocating the same genotypes over and over
-- Contracts
-- final classes
2012-06-14 16:42:28 -04:00
Mark DePristo 7fbca7013e Don't add missing value binding from field to Genotype object in VCF3Codec 2012-06-14 16:42:28 -04:00
Mark DePristo 4a4d3cde3d UnitTests for decodeIntArray method 2012-06-14 16:42:27 -04:00
Mark DePristo 5b8bd81991 An option to not actually write out the results of select variants
-- Useful for performance testing of the SV operations themselves.
2012-06-14 16:42:26 -04:00
Mark DePristo 6f7a01e00d Bugfix for BCF2 reader / writer for > 0x0FFF samples :-)
-- Should be 0x00FFFFFF in the mask
2012-06-14 16:42:26 -04:00
Mark DePristo 1d4eb46606 Efficient reading of genotype fields v1
-- decodeIntArray in BCF2 decoder allows us to more efficiently read ints and int[] from stream directly into Genotype object
-- Code cleanup / contracts added were appropriate
-- V2 will have a yet more optimized path...
2012-06-14 16:42:26 -04:00
Mark DePristo 37b8d70321 Hidden option to SelectVariants to force the genotypes information to be decoded by computing AC 2012-06-14 16:42:25 -04:00
Mark DePristo 17fbd103d0 Smarter infrastructure to decode genotypes in BCF
-- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder.  The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2
-- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder.  In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form
-- Reduced the amount of data further to be computed in the DiffEngine.  The DiffEngine algorithm needs to be rethought to be efficient...
2012-06-14 16:42:25 -04:00
Mark DePristo 889e3c4583 Code cleanup before major refactor 2012-06-14 16:42:25 -04:00
Mark DePristo cebd37609c Finalizing new Genotype object and associated routines
-- Builder now provides a depreciated log10pError function to make a new GQ value
-- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions
-- Lots of contracts
-- Bugfixes throughout
2012-06-14 16:42:25 -04:00
Mark DePristo 8b0a629a31 Terrible bugfix
-- The way I was handling the contig offset ordering wasn't correct.  Now the contigs are always indexed in the order in which their corresponding populate() functions are called, so that the order of the contigs is given by the order in which they are in the file, or in our refDict.  It has nothing to do with the contig index itself.
-- SelectVariants no longers prints all samples to the screen if you aren't selecting any explicitly
2012-06-14 16:42:24 -04:00
Mark DePristo d37a8a0bc8 Efficient Genotype object Intermediate commit
-- Created a new Genotype interface with a more limited set of operations
-- Old genotype object is now SlowGenotype.  New genotype object is FastGenotype.  They can be used interchangable
-- There's no way to create Genotypes directly any longer.  You have to use GenotypeBuilder just like VariantContextBuilder
-- Modified lots and lots of code to use GenotypeBuilder
-- Added a temporary hidden argument to engine to use FastGenotype by default.  Current default is SlowGenotype
-- Lots of bug fixes to BCF2 codec and encoder.
-- Feature additions
  -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes
  -- Cleaned up semantics of subContextFromSamples.  There's one function that either rederives or not the alleles from the subsetted genotypes

-- MASSIVE BUGFIX in SelectVariants.  The code has been decoding genotypes always, even if you were not subsetting down samples.  Fixed!
2012-06-14 16:42:24 -04:00
Mark DePristo a648b5e65e First step towards an efficient Genotype object
-- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness.  Tested utility of this approach by rewritting -- and then commenting out -- a path in BCF2Codec that could use this new code.  Much cleaner interface now, but not yet hooked up to anything
-- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class
-- Code cleanup.  Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS.  Uses in code replaced with constant
2012-06-14 16:42:23 -04:00
Mark DePristo ff9ac4b5f8 BCF2 genotype decoding is now lazy
-- Refactored BCF2Codec into a LazyGenotypesDecoder object that provides on-demand genotype decoding of BCF2 data blocks a la VCFCodec.
-- VCFHeader has getters for sampleNamesInOrder and sampleNameToOffset instead of protected variables directly accessed by vcfcodec
2012-06-14 16:42:23 -04:00
Mark DePristo 9eb83a0771 Enable adding contigs to VariantContextWriters on output 2012-06-14 16:42:23 -04:00
Mark DePristo b0ea14ef0f VCFHeader getMetaData returns 4.1 version not 4.0 2012-06-14 16:42:22 -04:00
Mauricio Carneiro 7d12429917 First step towards indel qualities in RR
Let the BI's and BD's pass through the reduce reads machinery
2012-06-14 15:37:39 -04:00
Mauricio Carneiro e68038c5d8 Refactor post-processing downsampling using David's generic downsampler interface 2012-06-14 15:37:32 -04:00
Eric Banks de5508fcea Bug fixes for cycle and context covariates 2012-06-14 13:01:14 -04:00
Eric Banks 5c3c6cbc40 Long -> long conversions in BQSR 2012-06-14 09:07:02 -04:00
Eric Banks 29a74908bb The next round of BQSR optimizations: no more Long[] array creation 2012-06-14 00:05:42 -04:00
Guillermo del Angel cd2074b1dc Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-13 20:59:30 -04:00
Guillermo del Angel 92669a0468 Second intermediate commit for indel pool caller - now works (more or less) in reference sample-free mode. Still needs a lot of cleanups/add more tests and not done w/refactoring quite yet 2012-06-13 20:59:17 -04:00
David Roazen 0550b27799 Make downsampler classes themselves generic (instead of just the Downsampler interface)
This is in response to a request from Mauricio to make it easier
to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords)
without having to do any cumbersome typecasting. Sadly, Java
language limitations make this sort of solution the best choice.

Thanks to Khalid for his feedback on this issue.

Also:

-added a unit test to verify GATKSAMRecord support with no typecasting required

-added some unit tests for the FractionalDownsampler that Mauricio will/might be using

-moved classes from private to public to better sync up with my local development
branch for engine integration
2012-06-13 16:43:39 -04:00
Guillermo del Angel 67c0569f9c Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-13 11:50:00 -04:00
Eric Banks 81993b08e2 Don't put null entries into the key array 2012-06-13 11:43:44 -04:00
Roger Zurawicki bdf5945dcc Fixed bugs in DiagnoseTargets
DT would not report bad mates!
that has been fixed

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-13 11:15:26 -04:00
Roger Zurawicki 538cdf9210 Created the FindCoveredIntervals
Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class
Minor tweaks
FindCoveredIntervals supports Gathering
FindCoveredIntervals outputs an interval list instead of GATKReport

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-13 11:15:25 -04:00
Guillermo del Angel aee66ab157 Big UG refactoring and intermediate commit to support indels in pool caller (not done yet). Lots of code pulled out of long spaghetti-like functions and modularized to be easily shareable. Add functionality in ErrorModel to count indel matches/mismatches (but left part disabled as not to change integration tests in this commit), add computation of pool genotype likelihoods for indels (not fully working yet in more realistic cases, only working in artificial nice pools). Lot's of TBD's still but existing UG and pool SNP functionality should be intact 2012-06-13 11:14:44 -04:00
Eric Banks 37f56ce8fd A couple of minor updates to BQSR 2012-06-12 16:12:13 -04:00
Eric Banks 277493dd83 Yet more instances of Lists changed over to native arrays 2012-06-12 15:56:09 -04:00
Eric Banks 613badc835 Very minor optimizations for the context covariate 2012-06-12 15:47:32 -04:00
Eric Banks 0f79adb2aa Changing more Java Lists to native arrays in BQSR for performance optimization. 2012-06-12 15:41:01 -04:00
Eric Banks 1da3e43679 Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come. 2012-06-12 13:32:56 -04:00
Eric Banks fec0bd5e11 Fixing UG argument docs 2012-06-12 09:46:16 -04:00
Eric Banks a4defdfb29 Adding a GT header line to SomaticIndelDetector output 2012-06-12 09:39:17 -04:00
Eric Banks 891ce51908 Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements. 2012-06-12 09:19:36 -04:00
Eric Banks ff5749599d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 15:46:17 -04:00
Eric Banks fea625632f Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one 2012-06-11 15:45:58 -04:00
Ryan Poplin e4d371dc80 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 10:38:50 -04:00
Ryan Poplin 683d4b508e Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller. 2012-06-11 10:38:35 -04:00
Mauricio Carneiro 4aad7e23ef New ReduceReads v2 with unclipped variant regions and soft-clipped bases
* Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it.
   * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute
   * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads
   * Updated all integration tests
2012-06-08 14:58:31 -04:00
Eric Banks afa9b2718a Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-08 13:54:48 -04:00
Eric Banks 92280b4068 BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime. 2012-06-08 13:54:37 -04:00
Eric Banks 898a0e6161 Minor optimizations 2012-06-08 12:07:58 -04:00
Ryan Poplin 0a37e19998 Bug fix in VQSR so that the VCF index will be created for the recalFile. 2012-06-08 11:51:28 -04:00
Eric Banks d463ab2cbf BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible. 2012-06-08 10:42:42 -04:00
Eric Banks 2bd48a7351 Bad comments made it into the previous commit 2012-06-07 23:12:56 -04:00
Eric Banks 31c3a6be48 BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR. 2012-06-07 20:04:10 -04:00
Eric Banks 0fb9179f76 BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array 2012-06-07 19:41:03 -04:00
Ryan Poplin d449f169d3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-07 10:56:55 -04:00
Ryan Poplin 0b4281fdd0 misc minor update to HC debug output for when there are a lot of samples 2012-06-07 10:56:41 -04:00
Eric Banks bad50a1b05 Fix docs 2012-06-06 22:45:38 -04:00
Eric Banks b093ba9dcc Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records) 2012-06-06 15:17:30 -04:00
Eric Banks 54f682a99c Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX. 2012-06-06 11:44:37 -04:00
Eric Banks dd46d843fb IR should skip Ion reads just like it does with 454 reads; Tim has confirmed that official platform name for Ion. 2012-06-06 11:04:55 -04:00
Guillermo del Angel 2cbd6e5f90 Merged bug fix from Stable into Unstable 2012-06-05 15:58:23 -04:00
Guillermo del Angel ce4dc2128d Adding minor clarification to -mbq argument documentation 2012-06-05 15:17:56 -04:00
Eric Banks e02ec8c8b6 Don't update the record ID unless we are actually going to emit the record 2012-06-04 14:58:50 -04:00
Eric Banks 8405156ae1 Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities. 2012-06-04 14:28:32 -04:00
Ryan Poplin f11e7ebc3a Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA. 2012-06-04 12:49:36 -04:00
Ryan Poplin 320956ee4b Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage. 2012-06-04 10:55:36 -04:00
Guillermo del Angel 7a54baf08c Merged bug fix from Stable into Unstable 2012-06-03 08:42:08 -04:00
Guillermo del Angel 47df7bbc14 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-06-03 08:38:54 -04:00
Guillermo del Angel 2ddbdee3bc Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow 2012-06-03 08:38:38 -04:00
Mauricio Carneiro 12a8c54f9a Fixing VCF header for filter elements (thanks Eric) 2012-06-01 15:45:15 -04:00
Eric Banks 3a15ba2102 Malformed VCF headers should be User Errors 2012-05-31 16:05:53 -04:00
Khalid Shakir c4f7df4dce When an underlying exception occurs because of the user error, if the exception instance does not include a message instead of telling the user "because null", tell them "because <exception class name>". 2012-05-30 16:39:06 -04:00
Ryan Poplin 421d0d1435 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-30 15:21:35 -04:00
Ryan Poplin 5dd811f84a Adding genotype given alleles mode to the HaplotypeCaller. 2012-05-30 15:07:01 -04:00
Eric Banks d09b8d5584 Fixing docs 2012-05-30 13:24:08 -04:00
Mauricio Carneiro d6e1205310 Updating default values for DiagnoseTargets 2012-05-30 12:43:07 -04:00
Khalid Shakir c3c7f17d90 Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000.
Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the non-existant dir, now checking to see if they network directory is available and throwing a SkipException to bypass the test when it cannot be run.
TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors popup later when the networked fastas aren't found.
2012-05-29 11:18:22 -04:00
Roger Zurawicki b8b139841d DiagnoseTargets with working Q1,Median,Q3
- Merged Roger's metrics with Mauricio's optimizations
 - Added Stats for DiagnoseTargets
     - now has functions to find the median depth, and upper/lower quartile
     - the REF_N callable status is implemented
 - The walker now runs efficiently
 - Diagnose Targets accepts overlapping intervals
 - Diagnose Targets now checks for bad mates
 - The read mates are checked in a memory efficient manner
 - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker.
 - Fixed some bugs
 - Removed rod binding

Added more Unit tests

 - Test callable statuses on the locus level
 - Test bad mates

 - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-05-29 10:16:45 -04:00
Eric Banks 50031b63c5 Fix possible NPE from NBaseCount annotation module 2012-05-29 09:46:00 -04:00
Mark DePristo 454c8e63e6 Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s
-- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec.  As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)
2012-05-28 20:20:05 -04:00
Mark DePristo 7ce24a96f1 PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present 2012-05-28 20:18:16 -04:00
Mark DePristo 1818c29371 Fixed long-standing bug in beagle codec that was passing on the header record for decoding 2012-05-28 20:17:26 -04:00
Mark DePristo 5894d045cb Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests
-- This version of BCF should actually work properly for most files, assuming headers are properly defined.
-- Lots of bug fixes to BCF2 codec
-- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL.  NOTE THIS SEMANTICS change
-- Equals() method for GenotypeLikelihoods, using PLs.
-- VCFCodec now longer adds empty bindings to missing input field values.  NOTE THIS CHANGE
-- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work.  The BCF2 codec now makes VCs marked as fully decoded
-- stringToBytes returns empty list for null or "" string in BCF2Encoder
-- Proper handling of genotype ordering in BCF2 reader / writer
-- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily
-- Many failing MD5s now due to double -> int change in GQ, will update later
2012-05-27 11:17:17 -04:00
Mark DePristo 86e5a066fc Even more conservative limit on number of differences to summarize at 1000 2012-05-27 11:17:13 -04:00
Mark DePristo 31f4e5b52e Stop unlimited runtimes in DiffEngine when you have lots of differences
-- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000
2012-05-27 11:17:13 -04:00
Mauricio Carneiro 4109fcbb08 Merged bug fix from Stable into Unstable 2012-05-25 13:03:05 -04:00
Mauricio Carneiro 2be5704a25 Fixed haplotype boundary bug in PairHMMIndelErrorModel
haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception).

   * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate.
   * removed unnecessary code to remove hard clips after processing.
   * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).
2012-05-25 13:00:45 -04:00
Guillermo del Angel 175bb35e70 Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superceded by former) 2012-05-25 12:56:23 -04:00
Mark DePristo 7280cdf937 Bugfixes and testdata cleanup
-- Cut down the size of a few large files in public/testdata that were only used in part
-- Refactor vcf Filename => shadow BCF filename to BCF2Utils.  Fix bug in WalkerTest due to the way this was handled previously
2012-05-24 13:26:05 -04:00
Mark DePristo e9c22b9aad Final updates to integration tests for BCF2
-- Fully working version
-- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf
-- Moved MedianUnitTest to its proper home in Utils
-- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests.  From this website it's easy to see md5 diffs, etc.  This is a vastly better way to manage unit and integration test output
2012-05-24 10:58:59 -04:00
Mark DePristo ade1843818 Bugfix for not setting header in AbstractVCFCodec 2012-05-24 10:58:58 -04:00
Mark DePristo 6ca71fe3b4 GATK tests use public/testdata not /humgen/ as much as possible 2012-05-24 10:58:58 -04:00
Mark DePristo 69ee4d0454 Moved getMetaDataForField to VariantContextUtils 2012-05-24 10:57:09 -04:00
Mark DePristo f77d2e6965 Renamed NO_HEADER to the more accurate no_cmdline_in_header
-- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well
2012-05-24 10:57:08 -04:00
Mark DePristo 4bde24f020 Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header 2012-05-24 10:57:08 -04:00
Mark DePristo 4846bf5c8e @Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf
-- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically
2012-05-24 10:57:07 -04:00
Mark DePristo bb0d87666a Finally just deleted equals() method in GATKArgumentCollection.
-- We never compare these things in the codebase anyway...
2012-05-24 10:57:07 -04:00
Mark DePristo c8ed0bfc4c Edge case fixes for BCF2
--handle entirely missing GT in a sample in decodeGenotypeAlleles
--Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code
-- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples
-- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys
-- Removed restriction on getPloidy requiring ploidy > 1.  It's logically find to return 0 for a no called sample
-- getMaxPloidy() in VC that does what it says
-- Support for padding / depadding of generic genotype fields
2012-05-24 10:57:06 -04:00
Mark DePristo 40431890be -- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it
-- BCF2 writer can now work without the contig lines being in the header
-- Made GenomeLocParser a final class
2012-05-24 10:57:06 -04:00
Mark DePristo 6301572009 GenotypeLikelihood PLs are capped at Short.MAX_INT now
-- UserExceptions in BCF2 now where appropriate
-- Asserts for code safety
-- Public -> protected encode(Object v) method is for testing only
2012-05-24 10:57:06 -04:00
Mark DePristo d52bc31a47 Bugfix for doNotWriteGenotypes mode
-- Was outputing GT ./. in sites only mode.  Fixed
2012-05-24 10:57:05 -04:00
Mark DePristo 64d4238e2f 99% working version of BCF2 encoder / decoder
-- fixed final bugs with PL encoding / decoding
-- Ready for testing by other members of the group
-- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations
-- Fixed a nasty bug in the filter field
-- Not that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types

Read 1500 genotypes file in VCF -> VCF : 11 seconds
Read 1500 genotypes file in VCF -> BCF : 9.5 seconds

VariantEval 1500 genotypes file in VCF : 3 seconds
VariantEval 1500 genotypes file in BCF : 3 seconds
2012-05-24 10:57:05 -04:00
Mark DePristo b5bce8d3f9 AD should be UNBOUNDED, actually
-- Pass in # alt alleles as appropriate for getCount in VCF header line
2012-05-24 10:57:05 -04:00
Mark DePristo aaf11f00e3 Near final BCF2 implementation
-- Trivial import changes in some walkers
-- SelectVariants has a new hidden mode to fully decode a VCF file
-- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED by A type, which is actually the right type
-- GenotypeLikelihoods now implements List<Double> for convenience.  The PL duality here is going to be removed in a subsequent commit
-- BugFixes in BCF2Writer.  Proper handling of padding.  Bugfix for nFields for a field
-- padAllele function in VariantContextUtils
-- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.
2012-05-24 10:57:02 -04:00
Mark DePristo dfee17a672 Generalize / unify code for handling strings
-- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder.
-- Unified the type conversion code in BCFWriter to simply the mapping from VCF type => BCF type and special value recoding
-- Code cleanup and renaming
2012-05-24 10:57:02 -04:00
Mark DePristo b4a5acd6f4 Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't 2012-05-24 10:57:01 -04:00
Mark DePristo 373ae39e86 Testing of BCF codec
-- Rev.d tribble
-- Minor code cleanup
-- BCF2 encoder / decoder use Double not Float internally everywhere
-- Generalized VC testing framework
2012-05-24 10:57:01 -04:00
Mark DePristo fb1911a1b6 -- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder
-- Convenience routine for creating alleles from strings of bases
-- Convenience constructor for VCFFilterHeader line whose description is the same as name
-- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes.  Can be reused throughtout code for BCF, VCF, etc.
-- Created basic BCF2WriterCodec tests that consumes VariantContextTestProvider contexts, writes them to disk with BCF2 writer, and checks that they come back equals to the original VariantContexts. Actually worked for some complex tests in the first go
2012-05-24 10:57:01 -04:00
Mark DePristo 4968dcd36a Throw an error when genotype fields with mixed vector lengths are encountered 2012-05-24 10:57:00 -04:00
Mark DePristo afd2f1a3f9 Individual VariantContextWriters are now package protected
-- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it
-- Update build.xml to build vcf.jar with updated paths and bcf2 support.
2012-05-24 10:57:00 -04:00
Mark DePristo 24864fd5b0 GATK now writes BCF output to any file with .bcf extension
-- Moved VCF and BCF writers to variantcontext.writers
-- Updated vcf.jar build path
-- Refactored VCFWriter and other code.  Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory.  Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.
2012-05-24 10:57:00 -04:00
Mark DePristo e2311294c0 Removed unused ManualSortingVCFWriter 2012-05-24 10:56:59 -04:00
Mark DePristo 93cef82637 BCF2 header encoding decoding at final spec 2012-05-24 10:56:58 -04:00
Mark DePristo ce9e9eebb1 No dictionary in header. Now built dynamically from the header in the writer and codec
-- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there
2012-05-24 10:56:58 -04:00
Mark DePristo c3b8048e2e Moving around classes in VCF and BCF2
-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
2012-05-24 10:56:58 -04:00
Mark DePristo 679ffdd333 Move BCF2 from private utils to public codecs 2012-05-24 10:56:56 -04:00
Mark DePristo 450f098a61 BCF2 encoder / decoder implement new site / genotype block organization
-- Supports final organization of data blocks into sites data and genotypes data
2012-05-24 10:56:55 -04:00
Mark DePristo 27b51d4dea Enable on the fly indexing of BCF2 2012-05-24 10:56:54 -04:00
Mark DePristo 81bd7646d6 Fix for MISSING floats
-- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int).
-- Now handles correctly MISSING qual fields
2012-05-24 10:56:53 -04:00
Mark DePristo 3afbc50511 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from Intellij
2012-05-24 10:56:52 -04:00
Mark DePristo 0799855479 Archiving GCF
-- Rider update to CramByPiece.scala
2012-05-24 10:56:51 -04:00
Guillermo del Angel 43919078cd Merged bug fix from Stable into Unstable 2012-05-23 21:21:01 -04:00
Guillermo del Angel 4bc04e2a9e Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler. 2012-05-23 21:19:30 -04:00
Ryan Poplin 08dfd6cab6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 16:47:07 -04:00
Ryan Poplin 04000d920c Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads. 2012-05-21 16:46:59 -04:00
Eric Banks 666862af19 Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs 2012-05-21 16:03:29 -04:00
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
Eric Banks 7f5ec17d22 Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table. 2012-05-21 14:16:13 -04:00
Eric Banks 92d8aa3d4c Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels 2012-05-21 09:38:52 -04:00
Eric Banks 3af3834d50 Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian):
* For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE.  Switched over to booleans.
* We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to sdout.
2012-05-18 11:55:41 -04:00
Eric Banks 52c206d5db Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports. 2012-05-18 02:32:20 -04:00
Eric Banks 03d40272c8 Removed old GATKReport code and moved the new stuff in its place. 2012-05-18 01:44:31 -04:00
Eric Banks a26b04ba17 Extensive refactoring of the GATKReports. This was a beast.
The practical differences between version 1.0 and this one (v1.1) are:

* the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables.
* no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table.
* no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables.

Integration tests change because table headers are different.
Old classes are still lying around.  Will clean those up in a subsequent commit.
2012-05-18 01:11:26 -04:00
Guillermo del Angel 5189b06468 New annotation for indels that describe if they're STR's and their characteristics. If an indel is a STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality. 2012-05-17 15:28:19 -04:00
Eric Banks 0f7c917e7a Better error checking and messages for bad alleles 2012-05-17 13:36:42 -04:00
Eric Banks d44886d9e8 Very naughty bug: VE output is not at all gatherable but no one told this to Queue. Fixed. 2012-05-15 10:29:04 -04:00
Eric Banks 819c3d0c15 Adding to the Hrun docs 2012-05-15 10:27:52 -04:00
Guillermo del Angel 5fc3adbb04 One more VariantsToTable bug fix 2012-05-14 14:10:07 -04:00
Guillermo del Angel 04d691f04a Forgot to update MD5's due to new Exact AF model in pool caller (all changes legit, minor QUAL/QD/SB differences). Fixed bug in VariantsToTable from previous commit 2012-05-14 14:01:29 -04:00
Guillermo del Angel ae26f0fe14 a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing 2012-05-14 10:55:35 -04:00
Ryan Poplin c9dd0f3173 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-10 13:09:10 -04:00
Ryan Poplin 0cdadffe14 Committing the best of the frantic pre-CSHL experiments: Better algorithm for partioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes. 2012-05-10 13:09:03 -04:00
Guillermo del Angel 27b1aa5dd3 Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's 2012-05-10 10:29:19 -04:00
Eric Banks 4f37d6d399 Fixing docs 2012-05-10 00:56:00 -04:00
Mark DePristo c81acfc15d Working implementation of BCF2
-- Nearly complete on spec implementation.  Slow but clean
-- Some refactoring of VariantContext to support common functions for BCF and VCF
2012-05-08 19:46:51 -04:00
Mark DePristo a5193c2399 Mostly complete reference implementation of BCF2
-- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF
2012-05-08 19:46:51 -04:00
Eric Banks 473d07b0c5 fixing up docs from previous Pool Caller commit 2012-05-08 11:02:55 -04:00
Eric Banks b4999d14c1 updating docs 2012-05-08 10:58:46 -04:00
Guillermo del Angel 33a1dd2048 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 10:42:12 -04:00
Eric Banks 5cf4fd63c2 Catch malformed base qualities and throw as a User Error 2012-05-08 09:34:57 -04:00
Guillermo del Angel a4f4b5007b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 09:34:33 -04:00
Guillermo del Angel 605984353f Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model. 2012-05-08 09:33:38 -04:00
Eric Banks c40cda7e3c Nope, loads of integration tests had to be changed. 2012-05-07 14:30:42 -04:00
Eric Banks 66838a073e Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed... 2012-05-07 12:20:11 -04:00
David Roazen 6b769e91d8 BCF2: third checkpoint
* writer mostly implemented
* walkers to convert BCF2 <-> VCF
* almost working for sites-only files; genotypes still need work
* initial performance tests this afternoon will be on sites-only files
2012-05-04 13:00:15 -04:00
Eric Banks f3433201b1 Merged bug fix from Stable into Unstable 2012-05-03 11:11:00 -04:00
Eric Banks 557da77a1a Don't compute QD if there is no QUAL; added integration test for this 2012-05-03 11:02:37 -04:00
Eric Banks 1fc7b5d58b Merged bug fix from Stable into Unstable 2012-05-03 10:37:58 -04:00
Laurent Francioli 567d01cee8 - Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:49 -04:00
Laurent Francioli 96e5a26223 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:20 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionBufferedStream
as an argument not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides to its subclass the same String decode,
readHeader functionality before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (its a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's a particularly
clean why of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader obtain that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so its no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Khalid Shakir b8b7f28aa9 Revving Picard to pick up new SamFileHeaderMerger.
Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut().
In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124.
- Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs.
- Updated md5s to match new expectations after looking at TLEN diff engine output.
2012-05-02 16:47:28 -04:00
Mauricio Carneiro f51a1d0d61 Better error message to the BAMScheduler
In the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.
2012-05-02 16:10:00 -04:00
Mauricio Carneiro 940029fa5d Fixing on-the-fly recalibration (caught by Ryan)
low quality bases in the tails were being turned to N's in the final read.
2012-05-02 16:06:04 -04:00
Eric Banks 623b36fbc4 Add header lines for AC,AF, and AN tags 2012-05-02 15:33:34 -04:00
Guillermo del Angel 429800a192 Fix corner case rounding issue in MathUtils unit test: 10^logFactorial(4)) was 23.999999... which if cast directly yielded 23 - so, do pre-rounding to ensure correct integer result if caller will cast value. 2012-05-02 09:57:06 -04:00
Guillermo del Angel 76a95fdedf Full implementation of multiallelic exact model for pools. Still super-linear so not useable at scale but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation which is more expensive (doesn't seem to save much time either in PoolCaller nor in UG though). 2012-05-02 09:24:28 -04:00
Joel Thibault 4d732fa586 Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb 2012-05-01 18:23:51 -04:00
Eric Banks 619a69a5f1 As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?). 2012-05-01 16:18:24 -04:00
Joel Thibault c255dd5917 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-01 16:10:38 -04:00
Ryan Poplin 51af61b5d7 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-01 16:07:23 -04:00
Ryan Poplin fc55dcec3c Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now. 2012-05-01 16:02:36 -04:00
Ryan Poplin 20a0078f23 Merging active regions across shard boundries if they are contiguous, have the same active status and don't grow too big. 2012-05-01 15:51:36 -04:00
Eric Banks 0f3af9555b Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case. 2012-05-01 14:58:06 -04:00
Joel Thibault aa4d41cce0 Minor cleanup before push 2012-05-01 14:16:44 -04:00
Joel Thibault b101b9c30b Add Mongo switch 2012-05-01 14:00:48 -04:00
Joel Thibault 1b609e9075 Move Mongo to server couchdb 2012-05-01 13:59:47 -04:00
Joel Thibault fd57d27f45 Move MongoDB connection handling to a separate class 2012-05-01 13:59:37 -04:00
Joel Thibault db3cd1abd5 Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes. 2012-05-01 13:57:23 -04:00
Joel Thibault 04e1be9106 Better handling of Mongo errors + exceptions 2012-05-01 13:57:23 -04:00
Joel Thibault ca737479cf Query for stop locations because we don't have that information in the reference 2012-05-01 13:57:23 -04:00
Joel Thibault 1cda87a4ad Set ROD priority list to input 2012-05-01 13:57:23 -04:00
Joel Thibault a7fe847faf Set the priority list and don't bother combining if not needed 2012-05-01 13:57:23 -04:00
Joel Thibault f739305f43 Combine the variants found at a location 2012-05-01 13:57:23 -04:00
Joel Thibault 020f884d5a Use new key of source ROD plus alleles 2012-05-01 13:57:23 -04:00
Joel Thibault 221ce9c3d6 Add alleles to the primary key 2012-05-01 13:57:23 -04:00
Joel Thibault 3198ce5471 Can have multiple variants at a location 2012-05-01 13:57:22 -04:00
Joel Thibault 11ed8e61c9 Add referenceBaseForIndel to the Mongo VariantContext objects 2012-05-01 13:53:44 -04:00
Joel Thibault 7ed0ee7ed0 Skip locations with no genotypes instead of throwing a NPE 2012-05-01 13:53:44 -04:00
Joel Thibault 4bdfeacdaa Handle multiple samples/genotypes per location
TODO: sample selection
2012-05-01 13:53:43 -04:00
Joel Thibault 1f7c628796 Insert the ROD filename into MongoDB as part of the primary key 2012-05-01 13:53:43 -04:00
Joel Thibault bb8a6e9b0a Initial test of write and read from MongoDB 2012-05-01 13:53:43 -04:00
David Roazen c0084c741b Pilot BCF2 Implementation: Checkpointing the code
* Not working yet, still very much a work-in-progress with lots of placeholders
* Needed to check this in to enable possible collaboration, since it's
  going slower than anticipated and the conference deadline looms.
2012-05-01 12:23:10 -04:00
Christopher Hartl 7d029b9a28 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-30 12:16:30 -04:00
Christopher Hartl 944a7d815e Bringing VQSRV3 up to date. Lots of new features (un-classifying the worst-performing training sites, treating the x% best/worst sites as postive/negative points, ability to pass in a monomorphic track to see ROC curves output). Minor changes to AlleleBalance: weighted average was incorrectly specified (using logscale actually biased the average towards the AB of low-quality genotypes), and breaking out AB by het, hom, and diploid to bring it in line with some (private) changes to the indel likelihood model that (correctly) computes these values for indels. 2012-04-28 11:31:03 -04:00
Ryan Poplin 54a9bc2da2 Bug fix in reverse trim alleles for the case of mixed records that become non-mixed after subsetting the alleles. 2012-04-28 09:12:26 -04:00
Ryan Poplin e332aeaf70 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-27 16:21:21 -04:00
Ryan Poplin 2b5dd28550 Bug fix in reverse trim alleles for the case of mixed records. 2012-04-27 16:21:02 -04:00
Mauricio Carneiro 1db2d1ba82 Do not add the first and last 4 cycles to the recalibration tables. 2012-04-27 15:18:07 -04:00
Mauricio Carneiro 08dbd756f3 Quick QC walkers to look at the error profile of indels in the read 2012-04-27 15:18:07 -04:00
Guillermo del Angel 730208133b Several fixes and improvements to Pool caller with ancillary test functions (not done yet):
a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value.
b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time.
c) Expand unit tests and add an exhaustive test for ErrorModel class.
d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10.
e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp), can only specify pileups in one position with a given number of matches and mismatches to ref) but functionality will be expanded in future to cover more test cases.
f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done).
g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model.
h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math
2012-04-27 14:41:17 -04:00
Eric Banks 0439047269 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-27 10:49:45 -04:00
Eric Banks 05b44dd017 The genotypeCounts array wasn't always being initialized before it was accessed, leading to a NPE (which got caught and thrown as a JEXL expression when used in selection). Added unit test to cover all genotype count methods. 2012-04-27 10:49:36 -04:00
Khalid Shakir 9801dd114f Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped
The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag()
Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.
2012-04-27 09:58:38 -04:00
Guillermo del Angel 972d6531b6 Corner case fix for indel GL computation: sometimes (depending on surrounding context) reads which are not informative of two candidate haplotypes end up having marginally higher likelihoods with one haplotype as opposed to another, depending on uncertainty on alignments in surrounding regions. So, a sample whose GL is -0.0001,-0.0005,-0.001 may have its genotype set to 1/1 due to this statistical noise. We already have a tolerance comparing max(gl)-min(gl) to avoid genotyping, so this tolerance is now increased from 0.001 to 0.1 (equivalent to 1 PL unit) to avoid genotyping a sample if all PLs are within this threshold. Changed 2 integration test md5s that hit this case. 2012-04-26 10:15:26 -04:00
Laurent Francioli 219b0a128b PED support for ChromosomeCounts annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:50:04 -04:00
Laurent Francioli 19d5213d5a Added function to get founders IDs in SampleDB
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:49:36 -04:00
Mauricio Carneiro 902277856e fix for RBP getPileupsForSamples()
do not differentiate per sample pileups from generic pileups. Do the same for both -- it's O(n) either way.
2012-04-24 17:20:30 -04:00
Mauricio Carneiro 82b4798913 CountBasesWalker -- a quick QC walker. 2012-04-24 17:20:30 -04:00
Mauricio Carneiro e440d0ce69 BQSR triage #4
* fixed queue script plot file names
   * updated the ReadGroupCovariate to use the platform unit instead of sample + lane.
   * fixed plotting of marginalized reported qualities
2012-04-24 17:19:54 -04:00
Eric Banks d6277b70d8 Forgot to consider the optimized case in hasAllele 2012-04-24 11:32:28 -04:00
Eric Banks 91bad244d5 Using a VCF whose ALT is the reference in GGA mode is a User Error 2012-04-24 11:08:37 -04:00
Eric Banks 74ad008163 Adding VariantContext.hasAlternateAllele functionality 2012-04-24 11:07:46 -04:00
Eric Banks 66f3315548 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-24 09:39:55 -04:00
Eric Banks bcb93dda5f Fixing docs (rank sum test values are not phred-scaled) 2012-04-24 09:39:42 -04:00
Mauricio Carneiro e39a59594a BQSR triage and test routines
* updated BQSR queue script for faster turnaround
   * implemented plot generation for scatter/gatherered runs
   * adjusted output file names to be cooperative with the queue script
   * added the recalibration report file to the argument table in the report
   * added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read
   * added RecalibrationReport unit test -- guarantees the integrity of the delta tables
2012-04-23 11:23:00 -04:00
Eric Banks a733723439 Merged bug fix from Stable into Unstable 2012-04-23 10:30:30 -04:00
Eric Banks 2761da975e Handle null VCs (which can arise when indels are present in the file) 2012-04-23 10:30:00 -04:00
Eric Banks 63aa79df82 Slightly better error message 2012-04-23 09:37:28 -04:00
Eric Banks 7b5fbf9567 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-23 09:34:08 -04:00
Eric Banks 4edb005411 Catch poorly formatted PL/GL fields 2012-04-23 09:33:50 -04:00
Ryan Poplin 35bb55f562 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-22 13:23:36 -04:00
Ryan Poplin 18e4532d10 Turning down the amount of assembly graph pruning slightly in the case of low coverage. 2012-04-22 13:23:24 -04:00
Eric Banks 1f23d99dfa If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did. 2012-04-20 17:00:05 -04:00
Eric Banks 4b81c75642 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-20 14:30:19 -04:00
Eric Banks f1c5510ec0 When running SelectVariants with the excludeNonVariants option, remove alleles from the ALT field that are no longer polymorphic. 2012-04-20 14:30:04 -04:00
Ryan Poplin a1596791af Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-20 14:03:04 -04:00
Ryan Poplin a57295eb75 Fixing a bug when breaking up active regions where the resulting regions would overlap by one base. Adding quality score manipulation from the UG into the haplotype caller (qual capped by mapping quality, min qual threshold). 2012-04-20 14:02:55 -04:00
Guillermo del Angel de68363c23 Removed experimental feature (aka hack) that was meant for 1000G consensus but remained in VQSR data manager - QD was being scaled by indel length. There's no evidence any more that QD is length-dependent, neither in CEU trio data nor in latest 1000G P2 calls 2012-04-20 10:58:34 -04:00
Guillermo del Angel d2488dfb81 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 19:40:03 -04:00
Guillermo del Angel c44c7b9a97 Restored optimization in Pair HMM only to compute HMM matrices starting in index where haplotypes start to diverge - saves about 15-20% of runtime which is what we lost by disabling banding in latest version, so runtime should be now about the same as what it was before refactoring. Output is bit-true to previous commit 2012-04-19 19:39:43 -04:00
Mauricio Carneiro 0f8c77391d BQSR bug triage #3
* fixed context covariate famous "off by one" error
   * reduced maximum quality score to Q50 (following Eric/Ryan's suggestion)
   * remove context downsampling in BQSR R script
2012-04-19 17:31:04 -04:00
Khalid Shakir df5dd841af AC strat now checks if evals will be merged before throwing an error on multiple eval files.
Minor tweaks to WGP script based on new recal VCF format.
2012-04-19 16:08:55 -04:00
Guillermo del Angel 1ae2ab5b63 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 12:50:29 -04:00
Guillermo del Angel 0e6e0cb907 Merging bug fixes 2012-04-19 12:49:30 -04:00
Eric Banks 79272c5e15 Thanks to Menachem for pointing out that the docs for genotyping_mode and output_mode were the same (and unclear). Fixed. 2012-04-19 12:48:09 -04:00
Guillermo del Angel 02ff930f6a My changes 2012-04-19 12:45:18 -04:00
Eric Banks 2485cef5b8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 11:46:06 -04:00
Eric Banks 76a6e37f4f Don't output callability metrics by default anymore; one can still have them output to the 'metrics' file (which is now @Hidden because they are really for GSA use). Added a TODO to move UG from @By reference to reads and rods once LIBS is cleaned up. 2012-04-19 11:45:56 -04:00
Ryan Poplin 1ea4e48a27 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 11:32:32 -04:00
Ryan Poplin 11001ab9a2 Adding option to HaplotypeCaller to genotype the events on the chosen haplotypes as independent events. The filtered reads are now kept around so they can be passed to the variant annotations. Unfortunately the filtered reads aren't assigned a likelihood yet so they are all thrown in the Allele.NO_CALL bin. 2012-04-19 11:32:10 -04:00
Mauricio Carneiro eb22cd7222 Unit test to guarantee BQSR sequential calculation accuracy
This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.
2012-04-19 09:33:40 -04:00
Mauricio Carneiro 68d0211fa1 Improved BQSR plotting and some new parameters
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
   * Refactored the CycleCovariateUnitTest to test the pairing information
   * Updated BQSR Integration tests accordingly
   * Made quantization levels parameter not hidden anymore
   * Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
   * Added hidden option not to generate the plots automatically (important for scatter/gathering)
2012-04-19 09:31:41 -04:00
Guillermo del Angel 143e92b797 Rebasing 2012-04-18 20:05:43 -04:00
Guillermo del Angel 82efd4457e Revert some bad merge changes 2012-04-18 16:35:09 -04:00
Guillermo del Angel 31c394d588 Resolve merge conflicts 2012-04-18 16:25:03 -04:00
Ryan Poplin 4999ae87ad Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-18 15:02:42 -04:00
Ryan Poplin dcc4871468 minor misc optimizations to PairHMM 2012-04-18 15:02:26 -04:00
Eric Banks d3c84e7b1f This should be a User Error since it's provided from the DoC command-line arguments 2012-04-18 13:09:23 -04:00
Eric Banks 392f1903f7 Handling some of the NumberFormatExceptions seen via Tableau that are really user errors. 2012-04-18 12:57:37 -04:00
Ryan Poplin 8a84456626 Following Eric's awesome update to change the VQSR recal file into a VCF file, the ApplyRecalibration step is now scatter/gather-able and tree reducible. 2012-04-18 11:24:04 -04:00
Eric Banks 4448a3ea76 Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position. 2012-04-17 23:54:10 -04:00
Eric Banks c1f52b773a Minor tweaks and updated integration tests MD5s 2012-04-17 23:17:28 -04:00
Eric Banks 6d03bce0d3 Important refactoring of the VQSR recal file format: we now use a VCF instead of a CSV file.
The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration.  For 1000G calling this was prohibitive in terms of memory requirements.  Now we go through the rod system and pull in just the records we need at a given position.

As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).
2012-04-17 22:38:18 -04:00
Mauricio Carneiro 46a212d8e9 Added "simplify reads" option to PrintReads. 2012-04-17 19:32:34 -04:00
Mauricio Carneiro f0c81b59b0 Implementation of the new BQSR plotting infrastructure
* removed low quality bases from the recalibration report.
   * refactored the Datum (Recal and Accuracy) class structure
   * created a new plotting csv table for optimized performance with the R script
   * added a datum object that carries the accuracy information (AccuracyDatum) for plotting
   * added mean reported quality score to all covariates
   * added QualityScore as a covariate for plotting purposes
   * added unit test to the key manager to operate with one required covariate and multiple optional covariates
   * integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)
2012-04-17 19:23:55 -04:00
Ryan Poplin 952280bef1 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-17 17:00:14 -04:00
Ryan Poplin cf705f6c62 Adding read position rank sum test to the list of annotations that get produced with the HaplotypeCaller 2012-04-17 17:00:00 -04:00
Eric Banks 13c800417e Handle NPE in UG indel code: deletions immediately preceding insertions were not handled well in the code. 2012-04-17 15:51:23 -04:00
Guillermo del Angel c78b0eee3a Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller. 2012-04-17 14:22:48 -04:00
Khalid Shakir 91cb654791 AggregateMetrics:
- By porting from jython to java now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
2012-04-17 11:45:32 -04:00
Ryan Poplin 1a2e92f8db Merged bug fix from Stable into Unstable 2012-04-17 10:23:05 -04:00
Ryan Poplin adad76b36f Fixing NPE in VQSR for the case of very small callsets. 2012-04-17 10:20:43 -04:00
Mark DePristo 23ccf772d4 IndelSummary now emits all of the underlying counts for ratios, percentages, etc it computes 2012-04-13 17:00:36 -04:00
Mark DePristo 84d1e8713a Infrastructure for combining VariantEvaluations
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
2012-04-13 17:00:36 -04:00
Mark DePristo 38986e4240 Documentation for StratificationManager 2012-04-13 17:00:36 -04:00
Mark DePristo ab06d53867 Useful test constructor or Unit tests in RefMetaDataTracker 2012-04-13 17:00:36 -04:00
Mark DePristo 285e61a227 Bugfix for IndelSummary
-- multi allelic count should be % not ratio
2012-04-13 17:00:35 -04:00
Mark DePristo e6d5cb46d2 Improvements and bugfixes to IndelSummary
-- Now properly includes both bi and multi-allelic variants.  These are actually counted as well, and emitted as counts and % of sites with multiple alleles
-- Bug fix for gold standard rate
2012-04-13 17:00:35 -04:00
Mark DePristo bfa966a4e9 Bugfix for OneBPIndel
-- Previously was only including 1 bp insertions in stratification
2012-04-13 17:00:35 -04:00
Mark DePristo 2aa2d9aec0 Merged bug fix from Stable into Unstable 2012-04-13 09:25:43 -04:00
Mark DePristo 27e7e17dc7 New way to handle exceptions in multi-threaded GATK
-- HMS no longer tries to grab and throw all exceptions.  Exceptions are just thrown directly now.
-- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer
-- Better printing of stack traces in WalkerTest
2012-04-13 09:23:33 -04:00
Eric Banks 818e8c2fb9 Resolving merge conflicts 2012-04-12 15:19:44 -04:00
Eric Banks 0dd571928d Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones). 2012-04-12 15:16:29 -04:00
Eric Banks f77a6d18b8 Bad conflict merge before 2012-04-12 09:56:49 -04:00
Eric Banks 33a8bdd75f Resolving merge conflicts 2012-04-12 09:51:55 -04:00
Eric Banks b659b16b31 Generate User Error for bad POS value 2012-04-12 09:49:35 -04:00
Eric Banks cc71baf691 Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing. 2012-04-12 09:18:44 -04:00
Eric Banks 5bf9dd2def A framework to get annotations working in the HaplotypeCaller (and ART walkers in general).
Adding support for active-region-based annotation for most standard annotations.  I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets) like e.g. the ReadPosRankSumTest.

IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller).  The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system.  When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.
2012-04-11 16:22:12 -04:00
Guillermo del Angel f9f8589692 Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller. 2012-04-11 13:56:51 -04:00
Eric Banks 7aa654d13f New interface for some dev work that Ryan and I are doing; only accessible from private walkers right now 2012-04-11 13:49:09 -04:00
Eric Banks dc90508104 Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful. 2012-04-11 13:47:10 -04:00
Eric Banks f560611fe8 Merged bug fix from Stable into Unstable 2012-04-10 22:26:53 -04:00
Eric Banks f46f7d0590 Fix the stats coming out of FlagStat. I will add an integration test in unstable 2012-04-10 22:26:10 -04:00
Mauricio Carneiro cd842b650e Optimizing DiagnoseTargets
* Fixed output format to get a valid vcf
   * Optimzed the per sample pileup routine O(n^2) => O(n) pileup for samples
   * Added support to overlapping intervals
   * Removed expand target functionality (for now)
   * Removed total depth (pointless metric)
2012-04-10 17:43:59 -04:00
Ryan Poplin e3cc7cc59c Resolving merge conflict. 2012-04-10 14:50:27 -04:00
Ryan Poplin a4634624b7 There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function. 2012-04-10 14:48:23 -04:00
Eric Banks 10e74a71eb We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior. 2012-04-10 12:30:35 -04:00
Mark DePristo b43d21056b Merged bug fix from Stable into Unstable 2012-04-10 09:42:09 -04:00
Mark DePristo 6885e2d065 UserException fixes for GATK_logs recent errors
-- SamFileReader.java:525
-- BlockCompressedInputStream:376

These were both instances were we weren't catching and rethrowing picard exceptions as UserExceptions.
2012-04-10 07:37:42 -04:00
Mark DePristo 8507cd7440 Throw UserException for bad dict / chain files 2012-04-10 07:22:43 -04:00
Ryan Poplin cd9bf1bfc3 Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs. 2012-04-10 00:22:40 -04:00
Roger Zurawicki 9ece93ae9c DiagnoseTargets now outputs a VCF file
- refactored the statistics classes
 - concurrent callable statuses by sample are now available.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-04-09 16:40:20 -04:00
Guillermo del Angel 719ec9144a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-09 14:53:19 -04:00
Guillermo del Angel 550179a1f7 Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors. 2012-04-09 14:53:05 -04:00
Eric Banks ea4300d583 Refactoring so that Unified Argument Collection doesn't use deprecated classes. 2012-04-09 13:45:17 -04:00
Eric Banks 6ddf2170b6 More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice. 2012-04-09 11:46:16 -04:00
Mauricio Carneiro 87e6bea6c1 Adding engine capability to quantize qualities.
* Added parameter -qq to quantize qualities using a recalibration report
   * Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
   * Updated BQSR scripts to make use of the new parameters
2012-04-08 21:07:51 -04:00
Mark DePristo 45fc0ea98d Improvements to indel analysis capabilities of VariantEval
-- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites
-- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately.  This is based on an old email from Mark Daly:

    // - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a
    // downstream frameshift, if we make the simplifying assumptions that 3 bp ins
    // and 3bp del (adding/subtracting 1 AA in general) are roughly comparably
    // selected against, we should see a consistent 1+2 : 3 bp ratio for insertions
    // as for deletions, and certainly would expect consistency between in/dels that
    // multiple methods find and in/dels that are unique to one method  (since deletions
    // are more common and the artifacts differ, it is probably worth looking at the totals,
    // overlaps and ratios for insertions and deletions separately in the methods
    // comparison and in this case don't even need to make the simplifying in = del functional assumption

-- Added a new VEW argument to bind a gold standard track
-- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do
-- Deleted random unused functions in IndelUtils
2012-04-06 16:07:46 -04:00
Mark DePristo 52ef4a3e26 Function to compute whether a VariantContext indel is part of a TandemRepeat
Returns true iff VC is an non-complex indel where every allele represents an expansion or
 contraction of a series of identical bases in the reference.

 The logic of this function is pretty simple.  Take all of the non-null alleles in VC.  For
 each insertion allele of n bases, check if that allele matches the next n reference bases.
 For each deletion allele of n bases, check if this matches the reference bases at n - 2 n,
 as it must necessarily match the first n bases.  If this test returns true for all
 alleles you are a tandem repeat, otherwise you are not.  Note that in this context n is the
 base differences between the ref and alt alleles
2012-04-06 16:07:46 -04:00
Mark DePristo 08fab49d30 Added function to get bases from the current base forward in the window in ReferenceContext 2012-04-06 16:07:46 -04:00
Ryan Poplin c77104b815 Adding function call in HaplotypeCaller right before the VariantContext gets written out to disk which partitions all the reads by which allele gave the read the highest likelihood. This will allow variants to be annotated by the refactored VariantAnnotator. Uninformative reads are mapped to Allele.NO_CALL 2012-04-06 00:22:52 -04:00
Mauricio Carneiro a19c27297f continuing the BQSR triage...
* fixed the loading of the new reduced size reports
   * reduced BQSR scala script memory to 2Gb
   * removed dcov parameter from BQSR scala script
   * fixed estimatedQReported calculation from -log10(pe) to -10*log10(pe).
   * updated md5's with the proper PHRED scaled EstimatedQReported
2012-04-05 14:34:15 -04:00
Eric Banks 3561056a9c Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-05 10:49:26 -04:00
Eric Banks 5c3ddec4c2 Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update. 2012-04-05 10:49:08 -04:00
Mauricio Carneiro 7c3b3650bb BQSR bug triage
* fixed bug where some keys were using the same recal datum objects
    * fixed quantization qual calculations when combining multiple reports
    * fixed rounding error with empirical quality reported when combining reports
    * fixed combine routine in the gatk reports due to the primary keys being out of order
    * added auto-recalibration option to BQSR scala script
    * reduced the size of the recalibration report by ~15%
    * updated md5's
2012-04-05 09:32:18 -04:00
Eric Banks 2c956efa53 Minor fixups to GenotypeLikelihoods 2012-04-05 09:14:37 -04:00
Mauricio Carneiro 1e65474fec Added utility to get the reference coordinate given the read coordinate 2012-04-05 09:04:20 -04:00
Guillermo del Angel 6913710e89 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-04 20:17:18 -04:00
Mark DePristo 76e4100d89 By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
Guillermo del Angel 820216dc68 More pool caller cleanups: ove common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation 2012-04-04 16:23:10 -04:00
Ryan Poplin bfad26353a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-04 16:04:50 -04:00
Ryan Poplin dda2173c66 Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned. 2012-04-04 16:04:29 -04:00
Mark DePristo fcdd65a0f4 Bugfix for IndelLengthHistogram
-- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.
2012-04-04 15:37:43 -04:00
Mark DePristo 1ccea866d8 VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
-- Updated EvalModules to work with new paramter
-- adding test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Eric Banks 9e32a975f8 Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore. 2012-04-04 13:47:59 -04:00
Eric Banks 337ff7887a When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals. 2012-04-04 10:57:05 -04:00
Guillermo del Angel 05d8400468 Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet) 2012-04-03 20:51:24 -04:00
Guillermo del Angel 5abb07da5d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-03 17:00:45 -04:00
Christopher Hartl a6837d31d4 Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting; or at transposting. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it. 2012-04-03 16:13:16 -04:00
Guillermo del Angel 63b1e737c6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-03 15:43:50 -04:00
Guillermo del Angel 9e11b4f9a7 Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced. 2012-04-03 15:43:32 -04:00
Eric Banks f9ce9962c4 Minor changes to verbose mode 2012-04-03 10:53:48 -04:00
Eric Banks f6aa95685d OutOfMemory exceptions are User Errors 2012-04-02 22:46:56 -04:00
Eric Banks 659b82e74d Old -B syntax is long gone at this point. Safe to remove the warning. 2012-04-02 22:25:16 -04:00
Eric Banks 99d27ddcc4 Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now. 2012-04-02 14:27:36 -04:00
Mark DePristo 6b7a00061a VariantsToTable now works with multiple input VCFs 2012-04-02 09:13:35 -04:00
Mark DePristo fbbb8509ad Final commits to VariantEval
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integrationtests
2012-03-30 20:11:06 -04:00
Mark DePristo 4b45a2c99d Final version of new VariantEval infrastructure.
*** WAY FASTER ***
 -- 3x performance for multiple sample analysis with 1000 samples
 -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
 -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2

-- Remove the TableType system, as this was way too complex.  No longer possible to embed what were effectively multiple tables in a single Evaluator.  You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis.  IndelLengthHistogram is now a @Molten data type.  GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables at @DataPoints.  You get an error if you do.
-- Simplified entire IO system of VE.  Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
2012-03-30 15:31:56 -04:00
Mark DePristo 8c0718a7c9 Fixed missing import 2012-03-30 15:31:55 -04:00
Mark DePristo 097ed4ecc4 Memory usage optimizations and safety improvements to StratNode and StratificationManager
-- Added memory and safety optimizations to StratNode and StratificationManager.  Fresh, immutable Hashmaps are allocated for final data structures, so they exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation.  The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement.  Added integration test to cover incompatible strats and evals
2012-03-30 15:31:55 -04:00
Mark DePristo b335c22f6d Fully refactored, mostly cleaned up version of VariantEval using StratificationManager 2012-03-30 15:31:55 -04:00
Mark DePristo c8086a79e3 New StratificationManager based VariantEval passes unmodified integration tests
-- Now needs cleanup and optimizations
2012-03-30 15:31:55 -04:00
Mark DePristo d37f31e349 First version of VariantEval that runs (approximately correctly) with new StratificationManager 2012-03-30 15:31:54 -04:00
Mark DePristo 8971b54b21 Phase II of Stratification manager
-- Renamed and reorganized infrastructure
-- StratificationManager now a Map from List<Object> -> V.  All key functions are implemented.  Less commonly used TODO
-- Ready for hookup to VE
2012-03-30 15:31:54 -04:00
Mark DePristo 9f1cd0ff66 Lots of new functionality for StratificationStates manager
-- Really working according to unit tests
-- A nCombination utils
2012-03-30 15:31:54 -04:00
Mark DePristo a3d896d80e Part I of creating a fast state space lookup for VE
-- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates).  This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvalutionContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
2012-03-30 15:31:53 -04:00
Eric Banks 533c283783 Deprecating AlignmentContext.getExtendedEventPileup(). At this point the only walkers left with any relaiance on extended events are Guillermo's pooled code (he'll update soon) and the Pileup walker. David, I'll leave that last one for you (it should be easy). We can now officially rip the extended event code from the engine. 2012-03-30 10:37:14 -04:00
Eric Banks 6b49af253b Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there. 2012-03-30 10:33:30 -04:00
Eric Banks b467cd1dae Removing dependence on extended events for the remaining Variant Annotator modules. 2012-03-30 09:05:26 -04:00
Eric Banks b21889812d Removing some more usages of extended events. Not done yet, but almost there. 2012-03-30 01:51:37 -04:00
Eric Banks ad6ace2439 Resolving merge conflicts 2012-03-30 01:51:09 -04:00
Eric Banks f4d4969f23 Don't ever return null for the list of GL models 2012-03-30 00:22:40 -04:00
Eric Banks 44ac49aa34 Removing dependencies in the annotations on extended events. Some refactoring involved in this. 2012-03-30 00:17:02 -04:00
Mauricio Carneiro cbd21c6339 Nasty, nasty.....
VariantEval is overly abusive of the GATKReport (lack of) spec.

   1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers.
   2. Worse, the stratification modules somehow instead of sending the actual values to the report table, sends a string with the value "unknown" and then abuses the GATKReport spec to convert those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings.

   Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport.

   No wonder, the printing needed to be O(n^2).
2012-03-29 17:49:53 -04:00
Eric Banks c2e27729c7 Renaming PileupElement.isBeforeDeletion() to PileupElement.isBeforeDeletedBase() so that it's more clear that it can still be true while inside a deletion. Added PileupElement.isBeforeDeletionStart() to cover the case that I want where we only trigger before the actual deletion event. Similarly for after a deletion. Updated counting code in ConsensusAlleleCounter accordingly. 2012-03-29 17:08:25 -04:00
Ryan Poplin 6da9571829 resolving merge conflicts. 2012-03-29 16:16:28 -04:00
Ryan Poplin ca96544ed0 All the zero quality N bases in the solid reads are adding lots of extra paths in the assembly graph. We now require a minimum base quality for every base in the kmer before adding it to the graph. The large number of solid reads with unmapped mates was also triggering the active region traversal at every base. We now ignore that check for solid reads. 2012-03-29 16:14:29 -04:00
Eric Banks e4469a83ee First attempt at removing all traces of extended events from UG; integration tests are expected to fail. 2012-03-29 14:59:29 -04:00
Eric Banks e61e162c81 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-29 12:33:13 -04:00
Mauricio Carneiro cf364f26a0 Fixing alignment issue with the GATKReportColumn algorithm
Numeric columns were being left-aligned when they should be right-aligned. Fixed it.
2012-03-29 12:28:49 -04:00
Mauricio Carneiro f80bd4276a fixed estimated Q reported calculation in the gatherer 2012-03-29 12:28:43 -04:00
Mauricio Carneiro 8a9fb514b6 simplifying GATKReportColumn constructor logic 2012-03-29 12:28:37 -04:00
Eric Banks e861106398 Accidentally erased important line 2012-03-29 11:08:54 -04:00
Eric Banks e4a225ed09 Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally. 2012-03-29 11:07:37 -04:00
Guillermo del Angel c9c3f6b0fc Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity 2012-03-29 11:05:42 -04:00
Ryan Poplin 9684a2efb0 HaplotypeCaller: Variants found on the same haplotype are now written out with phased genotypes. There are serious eval issues with MNPs so disabling them for now. 2012-03-29 09:41:29 -04:00
Guillermo del Angel 250adca350 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 21:01:49 -04:00
Guillermo del Angel e0ab4e4b30 Refactoring so that ConsensusAlleleCounter can use regular pileups and can operate correctly. This involved adding utility functions to ReadBackedPileup to count # of insertions/deletions right after current position. Added unit test for IndelGenotypeLikelihoods, esp. ConsensusAlleleCounter logic 2012-03-28 21:01:31 -04:00
Mauricio Carneiro 8f0e9d74ce GATKReportTable output refactor
writing out a GATKReportTable was O(n^2)!!!!!
New implementation is O(n). What a difference, when N = 2^16...
2012-03-28 17:19:12 -04:00
Guillermo del Angel 62ee31afba Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 16:00:38 -04:00
Guillermo del Angel 1eee9d512d Make computeConsensusAlleles protected inside IndelGenotypeLikelihoodsCalculationModel so we can use it in unit tests, b) make ConsensusAlleleCounter work if no extended event pileup is present (necessary for ext. event removal) 2012-03-28 15:41:39 -04:00
Mauricio Carneiro bb36cd4adf Quick fixes to BQSRGatherer and GATKReportTable
* when gathering, be aware that some keys will be missing from some tables.
   * when a gatktable has no elements, it should still output the header so we know it had no records
2012-03-28 09:07:54 -04:00
Roger Zurawicki 63cf7ec7ec Added more primitives to GATK Report Column Type
- The Integer column type now accepts byte and shorts
 - Updated Unit Tests and added a new testParse() test

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00
Guillermo del Angel 08f7d47d7c Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-28 07:42:09 -04:00
Mark DePristo 12aa72f200 Merged bug fix from Stable into Unstable 2012-03-27 22:43:00 -04:00
Mark DePristo 979a84a252 Bugfix for thread unsafe PL cache
-- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Solution is to use a fixed cache that's never updated on the fly.  My changes limit us to having no more than 500 alleles at a site, which I hope is ok but easy enough to up to a ridiculously large number.
2012-03-27 22:42:30 -04:00
Guillermo del Angel 8f34412fb8 First Pool Caller exact model: silly straightforward math implementation of biallelic pool caller exact likelihood model, no attempt and any smartness or optimization, no support yet for generalized multiallelic form, just hooking up for testing 2012-03-27 20:59:44 -04:00
Guillermo del Angel ed322bd73f Fix again merge issues 2012-03-27 15:03:13 -04:00
Guillermo del Angel b4a7c0d98d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-27 15:01:03 -04:00
Guillermo del Angel 343a061b1c Fix merge issues when incorporating new AF calculations changes 2012-03-27 15:00:44 -04:00
Mauricio Carneiro 1b75663178 BQSR Gatherer implementation and integration tests
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
   * optmized empirical qual calculation when merging recalibration reports
   * centralized the quality score quantization functionalities
   * unified the creating/loading of all the key manager/hash table structures.
   * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
   * added integration tests for BQSR and on-the-fly recalibration
2012-03-27 13:50:22 -05:00
Ryan Poplin 5dbd3625cd Initial algorithm for choosing best alternate haplotypes to genotype based on the likelihoods from all samples instead of choosing for each sample independently. Simple tradeoff of penalty for increasing model complexity and likelihood of the data. 2012-03-27 13:38:52 -04:00
Eric Banks c112e0824a I was adding verbose output to the Pileup output for a one-off and decided that I might as well commit it as an option. Updated deprecated calls while I was in there. 2012-03-27 11:09:03 -05:00
Mark DePristo a638996fe2 Cleanup of VariantEval, diatribe about performance problems with StateKey
-- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear
-- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.
2012-03-27 11:56:24 -04:00
Mark DePristo 679bb03014 Simple utility function for converting an Iterable<T> to Collection<T> 2012-03-27 11:54:58 -04:00
Mark DePristo 1f5f737c8b Optimizing the GATKReportTable.write
-- Better iteration, caching of strings, better printf calls, to improve the writing performance of GATKReportTables
2012-03-27 11:54:35 -04:00
Mark DePristo 913c8b231f Fix ErrorRatePerCycle to overload equals and hashcode
-- Fixes failing integration tests
2012-03-27 10:35:32 -04:00
Eric Banks c07a577ba3 Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously). 2012-03-27 00:27:44 -05:00
Mark DePristo 34ea443cdb Better algorithm for choosing which indel alleles are present in samples
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be including in the counting towards 5.  This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc.  If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general.  We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
2012-03-26 16:28:49 -04:00
Mark DePristo 11b6fd990a GATKReportColumn optimizations
-- Was TreeMap even though the sorting wasn't used.  Replaced with LinkedHashMap.
2012-03-26 16:28:49 -04:00
Mark DePristo 6be5e82860 VariantEval scalability optimizations
-- StateKey no longer extends TreeMap.  It's now a final immutable data structure that caches it's toString and hashcode values.  TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted tostring function.
-- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils
-- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects.  Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE
-- VariantEvaluator base class has a cached getSimpleName() function
-- VEUtils: general cleanup due to refactoring of StateKey
-- VEWalker: much better iteration of map data structures.  If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet.  This is far better than iterating over the keys and calling get() on each key.
2012-03-26 16:28:48 -04:00
Guillermo del Angel 1c424c0daf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-26 15:15:50 -04:00
Ryan Poplin 019145175b Major optimizations to graph construction through better use of built in graph.containsVertex and vertex.equals methods. Minor optimizations to MathUtils.approximateLog10SumLog10 method 2012-03-26 11:32:44 -04:00
Ryan Poplin 1fa66f76c9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-25 23:04:47 -04:00
Guillermo del Angel ce617b2dfc Bug fix to previous UnifiedGenotyperEngine refactoring, removed debug code 2012-03-25 10:20:21 -04:00
Guillermo del Angel db54c2625f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-25 09:53:35 -04:00
Guillermo del Angel deb4586559 Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGentotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1 2012-03-24 21:49:43 -04:00
Mark DePristo b063bcd38d Removing update0 support in VariantEval
-- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations.
-- This allows us to avoid calling update0 are every genomic base in 100ks of evaluates when there are a lot of stratifications.
-- No need to modify the integration tests, this optimization doesn't change the result of the calculation
2012-03-23 21:02:21 -04:00
Mauricio Carneiro 0509d316d9 More information in the recalibration report
* added empirical quality counts to allow quantization during on-the-fly recalibration to any level
   * added number of observations and errors to all tables to enable plotting of all covariates
2012-03-23 16:15:19 -04:00
Mauricio Carneiro 9f74969e3a BQSR with GATKReport implementation
* restructured BQSR to report recalibrated tables.
   * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration)
   * linked quality score quantization to the BQSR stage, outputting a quantization histogram
   * included the arguments used in BQSR to the GATK Report
   * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities

On-the-fly recalibration with GATK Report

   * loads all tables from the GATKReport using existing infrastructure (with minor updates)
   * implemented initialiazation of the covariates using BQSR's argument list
   * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key
   * applied quality quantization to the base recalibration
   * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions
2012-03-23 15:42:32 -04:00
Mauricio Carneiro f421062b55 Updated read group covariate to use sample.lane instead of the id
Added Unit test.
2012-03-23 15:24:07 -04:00
Mauricio Carneiro 539da9e3e1 Fixing GATKReport exception handling when loading a report
* allowing tables with no description to go through
   * GATKReportTable should be more lenient with the format requirements (added to-dos for roger)
2012-03-23 15:23:13 -04:00
Eric Banks 2511839068 Merged bug fix from Stable into Unstable 2012-03-23 13:51:33 -04:00
Eric Banks d3f2bc4361 Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles. 2012-03-23 13:51:00 -04:00
Mark DePristo e4ec90cfce Merged bug fix from Stable into Unstable 2012-03-23 11:27:34 -04:00
Mark DePristo ff26f2bf68 HierarchicalMicroScheduler no longer attempts to wrap exceptions
-- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors.  Now HMS simply checks whether the throwable it received on error was a RuntimeException.  If so, it is stored and rethrow without wrapping later.  If it isn't, only in this case is the exception wrapped in a ReviewedStingException.
-- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line
-- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled.
-- I believe this will finally put to rest all of these annoying HMS captures.
2012-03-23 11:27:21 -04:00
Ryan Poplin 9d22471b79 Merged bug fix from Stable into Unstable 2012-03-23 10:48:34 -04:00
Ryan Poplin ab288354e9 Better error message for malformed input recal file. 2012-03-23 10:47:01 -04:00
Mark DePristo fee8d86f63 VariantEval optimization
-- Use a LinkedHashMap not a TreeMap so iteration is faster.
-- Note that with a lot of stratifications the update0 is taking up a lot of time.  For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call
2012-03-22 22:13:24 -04:00
Mark DePristo 6df96644d9 Unified, standard IndelSummary metrics for VariantEval
-- Now you always get SNP and indel metrics with VariantEval!
--   Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions
-- Updated VE integration tests as appropriate
2012-03-22 21:24:37 -04:00
Mark DePristo bcf80cc7b3 Cleanup in VariantEval. Example of molten VariantEval output
-- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvalator.java so everyone can share.  Code updated to use these routines where appropriate
-- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF
-- TableType, which used to be an interface, is now an abstract class, allowing us to implement some generally functionality and avoid duplication.
-- This included creating a getRowName() function that used to be hardcoded as "row" but how can be overridden.
-- #### This allows us implement molten tables, which are vastly easier to use than multi-row data sets.  See IndelHistogram class (in later commit) for example of molten VE output
2012-03-22 21:24:37 -04:00
Mark DePristo 9ddd5aec93 More eval modules being removed from VariantEval
-- IndelStatistics is superceded by IndelStatistics
2012-03-22 21:24:36 -04:00
Mark DePristo bd5b6d1aba Remove no longer in use Eval modules from VariantEval
-- No more IndelLengthHistogram (superceded by IndelSummary in subsequent commit)
-- No more SamplePreviousGenotypes or PhaseStats
-- No more MultiallelicAFs
2012-03-22 21:24:36 -04:00
Menachem Fromer 7faa9938b1 Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-22 17:43:44 -04:00
Menachem Fromer b9b9219ac7 Added respectPhaseInInput flag to RBP and integration tests 2012-03-22 17:40:21 -04:00
Guillermo del Angel f198cec5e2 Temp commit: new structure for pool caller, now all work is in the same framework as in UG. There's a new genotype calculation model, PoolGenotypeCalculationModel, that does all the work and plugs into UnifiedGenotyperEngine. A new AF module for pools is upcoming. Old pool caller will be removed once all work is migrated 2012-03-22 15:46:39 -04:00
Menachem Fromer 1dfaacfeb5 Check for consistency of the BAM and VCF sample names, with a command line disable to throw if you know what you are doing 2012-03-22 12:40:15 -04:00
Guillermo del Angel b02ef95bcf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-22 12:14:12 -04:00
Guillermo del Angel 92676c63ca Make constructor of IndelGenotypeLikelihoodsCalculationModel public so it can be used in unit tests 2012-03-22 12:13:59 -04:00
Guillermo del Angel 58965d6a6e Merged bug fix from Stable into Unstable 2012-03-22 11:04:11 -04:00
Guillermo del Angel b8cd959461 Potential corner condition bug fix: protect against null pointer exceptions when computing consensus indel bases when UG is discovering alt alleles. If an alt allele has non-standard bases, skip allele gracefully instead of adding null object into list 2012-03-22 10:06:22 -04:00
Ryan Poplin a29fc6311a New debug option to output the assembly graph in dot format. Merge nodes in assembly graph when possible. 2012-03-21 15:48:55 -04:00
Eric Banks 8c09ff9459 Merged bug fix from Stable into Unstable 2012-03-21 12:44:43 -04:00
Eric Banks 58245bfa2f Bug fix: check to see whether there's a BasePileup before asking for one. 2012-03-21 12:44:09 -04:00
Eric Banks 07c3bd32b3 Bug fix: merge NO_VARIATION records with those of another type. The sad part is that this WAS covered by integration tests but someone updated the MD5s without actually paying attention... 2012-03-21 12:42:13 -04:00
Eric Banks dcf2fa361d Minor cleanup 2012-03-21 12:14:31 -04:00
Eric Banks ab1c48745b Need to catch RuntimeExceptions coming out of Picard too so that they show up as UserErrors (some BAM errors are thrown as REs). 2012-03-21 12:13:52 -04:00
Ryan Poplin 9e10779fa7 Caching log calculations cut the non-Map runtime of HaplotypeCaller in half. Moved the qual log cache used in HC and PairHMM into a common place and added unit tests. 2012-03-21 08:45:42 -04:00
Mauricio Carneiro 0e93cf5297 Taking care of bad cigars in the GATK
* fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events.
   * added Unit tests for BadCigarFilter
   * updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar
   * added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)
2012-03-20 14:32:57 -04:00
Eric Banks 5e79046c98 Minor change but I realized from Mark's commit that the code I stole it from was flawed 2012-03-20 08:55:56 -04:00
Eric Banks ade1971581 Since we allow any generic header types, there's no longer any reason to check for supported types 2012-03-20 00:12:17 -04:00
Eric Banks 2324c5a74f Simplified the interface for simple VCF header lines by making the VCFSimpleHeaderLine not abstract anymore - now any arbitrary header line with an ID (e.g. the contig and ALT lines) can be part of this class without having to define new classes. Also, renamed the 'named' header line to 'id' since that's more accurate. 2012-03-19 21:29:24 -04:00
Roger Zurawicki 7afb333811 GATK Report code cleanup
- Updated the documentation on the code
 - Made the table.write() method private and updated necessary files.
 - Added a constructor to GATKReport that takes GATKReportTables
 - Optimized my code

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-19 11:53:57 -04:00
Mauricio Carneiro 0d4ea30d6d Updating the BQSR Gatherer to the new file format
This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.
2012-03-19 09:02:27 -04:00
Eric Banks 9223e451a3 Merged bug fix from Stable into Unstable 2012-03-18 00:54:19 -04:00
Eric Banks 344a938a70 When checking to make sure that we have cached enough data in the PL array, use the converted index value since that's what will be used as an index into the array. 2012-03-18 00:36:30 -04:00
Eric Banks be9e48ba29 Merged bug fix from Stable into Unstable 2012-03-16 14:33:53 -04:00
Mauricio Carneiro ec4a870a0f Added @PG tag to ReduceReads
Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's life's easier if they want to include the PG tag in their walkers.
2012-03-16 14:09:07 -04:00
Mauricio Carneiro 3bfca0ccfd BitSet implementation of the on-the-fly recalibration using the CSV format file.
Infrastructure:
   * Added static interface to all different clipping algorithms of low quality tail clipping
   * Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
   * Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
   * EventType is now an independent enum with added capabilities. All functionality is now centralized.

 BQSR and RecalibrateBases:
   * On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
   * Refactored the object creation to take advantage of the compact key structure
   * Replaced nested hash maps with single hash maps indexed by bitsets
   * Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
   * Excluded contexts with N's from the output file.
   * Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
   * Redfined error for indels to look at the previous base in negative strand reads (using new PE functionality)
   * Added the covariate ID (for optional covariates) to the output for disambiguation purposes
   * Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
   * Reduced memory usage of the BQSR script to 4

 Tests:
   * Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
   * Added tests for keys without optional covariates
   * Added tests for on-the-fly recalibration (but more tests are necessary)
2012-03-16 13:02:15 -04:00
Mauricio Carneiro ca11ab39e7 BitSets keys to lower BQSR's memory footprint
Infrastructure:
	* Generic BitSet implementation with any precision (up to long)
	* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
	* Memoized implementation of the BitSet utils for better performance.
	* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
	* Replace log/sqrt with bitwise logic to get rid of numerical issues

 BQSR:
	* All covariates output BitSets and have the functionality to decode them back into Object values.
	* Covariates are responsible for determining the size of the key they will use (number of bits).
	* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
	* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead

 Tests:
	* Unit tests added to every method of BitSetUtils
	* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
	* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
2012-03-16 13:01:48 -04:00
Eric Banks dce6b91f7d Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appriopriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing. 2012-03-16 11:14:37 -04:00
Eric Banks 41068b6985 The commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code. 2012-03-15 16:08:58 -04:00
Ryan Poplin 0c6b34e9df Fixing a bug identified by the ActivityProfile unit tests 2012-03-15 14:24:30 -04:00
Ryan Poplin 252b830aa8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-15 11:56:04 -04:00
Ryan Poplin 1429ddcf55 Adding contracts and unit tests for HaplotypeCaller LikelihoodCalculationEngine 2012-03-14 21:25:43 -04:00
Mark DePristo 7c5cdb51c2 UnitTests for ActivityProfile and minor ART cleanup
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
2012-03-14 17:26:37 -04:00
Mark DePristo e440c9be98 Clean up logic for adding reads to ART cache
-- No longer has duplicate code
2012-03-14 17:26:37 -04:00
Mark DePristo 5bcb5c7433 Preliminary refactoring of ART
-- Refactored ART into clearer, simpler procedures.  Attempted to merge shared code into utility classes.
-- Added some docs
-- Created a new, testable ActivityProfile that represents as a class the probability of a base being active or inactive
-- Separated band-pass filtering from creation of active regions.  Now you can band pass filter a profile to make another profile, and then that is explicitly converted to active regions
-- Misc. utility functions in ActiveRegionWalker such as hasPresetActiveRegions()
-- Many TODOs in ActivityProfile.
2012-03-14 17:26:37 -04:00
Ryan Poplin 1da8928407 HC GenotypingEngine marginalizes over haplotypes when outputing events that were found on a subset of the called haplotypes. 2012-03-14 15:22:21 -04:00
Guillermo del Angel eca055ccad Add option in ValidationAmplicons to only output SNPs and INDELs, ignoring complex variants (or SVs, etc.) 2012-03-14 14:26:40 -04:00
Eric Banks f7c2c818fe Exact model memory optimization: instead of having a later matrix column pull in data from earlier ones (requiring us to keep them around until all dependencies are hit), the earlier columns push data into their dependents immediately and then are removed. This does trade off speed a little bit (because we need to call approximateLog10Sum each time we add to a dependent instead of once in an array at the end). Note that this commit would normally not get pushed into stable, but I'm about to make a very disruptive push into stable that would make merging this from unstable a nightmare. 2012-03-14 14:02:36 -04:00
Mark DePristo 6a40ca6bec Merged bug fix from Stable into Unstable 2012-03-14 12:19:33 -04:00
Mark DePristo bb2c10b785 Capture the class of the exception in GATKRunReport
-- As suggested by David.
2012-03-14 12:16:22 -04:00
Ryan Poplin 78a4e7e45e Major restructuring of HaplotypeCaller's LikelihoodCalculationEngine and GenotypingEngine. We no longer create an ugly event dictionary and genotype events found on haplotypes independently by finding the haplotype with the max likelihood. Lots of code has been rewritten to be much cleaner. 2012-03-14 12:05:05 -04:00
Eric Banks 77243d0df1 Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me 2012-03-13 16:31:51 -04:00
Eric Banks 568a1362f5 Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me 2012-03-13 16:19:15 -04:00
Eric Banks 5d7c761784 Merged bug fix from Stable into Unstable 2012-03-13 11:01:03 -04:00
Eric Banks 5200f7f919 When creating a synthetic VC based on the passed in alleles, set the reference base for indel. 2012-03-13 10:59:58 -04:00
Eric Banks 1675bd4dd7 When creating a synthetic VC based on the passed in alleles, set the length correctly. 2012-03-13 10:55:52 -04:00
Roger Zurawicki 7887a06703 GATKReport v1.0
GATKReport format changes:

 - All non-data header lines are preceeded with a single pound ( #:)
 - Every report now has a report header containing the version number and number of tables
 - Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description.
 - This new format will allow reports in the future to be gatherable.
 - Changed the header format to include an end-of-line string ":;"

Added features:

 - Simplified GATK Reports:

	The constructor for a simplified GATK Report. Simplified GATK report are designed for reports that do not need the advanced functionality of a full GATK Report.

	A simple GATK Report consists of:
		- A single table
		- No primary key ( it is hidden )
	    Optional:
		- Only untyped columns. As long as the data is an Object, it will be accepted.
		- Default column values being empty strings.
	Limitations:
		- A simple GATK report cannot contain multiple tables.
		- It cannot contain typed columns, which prevents arithmetic gathering.

       - Added a constructor to generate simplified GATK reports.
       - Added a method to easily add data to simple GATK reports.

 - Upgraded the input parser take advantage of the new file format (v1).
 - Added the GATKReportGatherer, more usability cmoing in next versionof GATK Report. Curently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports, It is very conservative and will only gather if the table columns, as well as everything else matches. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
 - Made some GATKReport methods public, and added more setters and getters.
 - Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
 - The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.*)
 - Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map.
 - Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered.

Test changes:

 - Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1.
 - Updated the MD5 hashes in integration tests throughout the GATK.

Other changes:

 - Added the gatherer functions to CoverageByRG
 - Also added the scatterCount parameter in the Interval Coverage script
 - Dropped support for reading in legacy GATKReport formats ( v0.*)
 - Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints.
 - Rewrote the read file method for GATK report files.
 - Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-12 23:09:19 -04:00
Eric Banks 10995d349e Fix old error message 2012-03-12 22:56:08 -04:00
Eric Banks 2314787767 Generalizing to avoid JDK 1.7 incompatibilities 2012-03-12 22:50:59 -04:00
Ryan Poplin 03223029e3 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-12 09:42:37 -04:00
Eric Banks b4749757f8 Fixes for SLOD: 1) didn't work properly for multi-allelics (randomly chose an allele, possibly one that wasn't genotyped in the full context); 2) in cases when there were more alt alleles than the max allowed and the user is calculating SB, we would recompute the best alt alleles(s); 3) for some reason, we were recomputing the LOD for the full context when we'd already done that. Given that this passes integration tests on my end, this should be the last commit before the release. 2012-03-12 01:07:07 -04:00
Ryan Poplin 2836c161ee Moving trimToVariableRegion out of reduced reads and into a public static ReadClipper function. HaplotypeCaller clips reads to the active region boundries before passing to the HMM. The philosophy of the HC is moving towards genotyping the entire haplotype sequence contained within the active region as a single allele. 2012-03-11 14:45:59 -04:00
Mark DePristo 1ee46e5c06 Collect only the bare essentials in the GATKRunReport
Now looks like:
<GATK-run-report>
   <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
   <start-time>2012/03/10 20.21.19</start-time>
   <end-time>2012/03/10 20.21.19</end-time>
   <run-time>0</run-time>
   <walker-name>CountReads</walker-name>
   <svn-version>1.4-483-g63ecdb2</svn-version>
   <total-memory>85000192</total-memory>
   <max-memory>129957888</max-memory>
   <user-name>depristo</user-name>
   <host-name>10.0.1.10</host-name>
   <java>Apple Inc.-1.6.0_26</java>
   <machine>Mac OS X-x86_64</machine>
   <iterations>105</iterations>
</GATK-run-report>

No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy
2012-03-10 20:27:14 -05:00
Mark DePristo 3ba2e5667c CalibrateGenotypesLikelihoods include pOfDGivenD now 2012-03-09 16:00:07 -05:00
David Roazen 91d10431d3 BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.

Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
2012-03-09 15:20:16 -05:00
David Roazen bc65f6326f Detect incomplete reads from BAM schedule file in BAMSchedule before they become buffer underflows
This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.

Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
2012-03-09 12:33:48 -05:00
David Roazen 32dee7ed9b Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.

Added a unit test for this case as well.
2012-03-08 15:30:44 -05:00
Guillermo del Angel c04853eae6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-08 12:30:04 -05:00
Guillermo del Angel 858acf8616 Hidden mode in ValidationAmplicons to support ILMN output format (same as Sequenom, with just shuffled columns) 2012-03-08 12:29:44 -05:00
Andrey Sivachenko 56f074b520 docs updated 2012-03-07 18:47:15 -05:00
Andrey Sivachenko 117ea605ac Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-07 18:35:07 -05:00
Andrey Sivachenko 497a1b059e transition to JEXL completed, old parameters setting individual cutoffs now deprecated 2012-03-07 18:34:11 -05:00
Andrey Sivachenko fbd2f04a04 JEXL support added; intermediate commit, not yet functional 2012-03-07 17:29:42 -05:00
Mark DePristo 0376d73ece Improved, public version of ErrorRateByCycle
-- A cleaner table output (molten).  For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
2012-03-07 13:10:08 -05:00
Christopher Hartl a6a8fc0521 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-07 10:05:43 -05:00
Mark DePristo 569be953b9 Bugfix for VariantEval
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp.  These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicously as incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
2012-03-06 16:56:59 -05:00
Christopher Hartl 67def6acc8 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-06 14:23:14 -05:00
Christopher Hartl 20c1fbaf0f Fixing a merge (turning off downsampling on DoC) 2012-03-06 14:22:45 -05:00
David Roazen 0702ee1587 Public-key authorization scheme to restrict use of NO_ET
-Running the GATK with the -et NO_ET or -et STDOUT options now
 requires a key issued by us. Our reasons for doing this, and the
 procedure for our users to request keys, are documented here:
 http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home

-A GATK user key is an email address plus a cryptographic signature
 signed using our private key, all wrapped in a GZIP container.
 User keys are validated using the public key we now distribute with
 the GATK. Our private key is kept in a secure location.

-Keys are cryptographically secure in that valid keys definitely
 came from us and keys cannot be fabricated, however keys are not
 "copy-protected" in any way.

-Includes private, standalone utilities to create a new GATK user key
 (GenerateGATKUserKey) and to create a new master public/private key
 pair (GenerateKeyPair). Usage of these tools will be documented on
 the internal wiki shortly.

-Comprehensive unit/integration tests, including tests to ensure the
 continued integrity of the GATK master public/private key pair.

-Generation of new user keys and the new unit/integration tests both
 require access to the GATK private key, which can only be read by
 members of the group "gsagit".
2012-03-06 00:09:43 -05:00
Lechu 027843d791 I've simply added a "library(grid)" call at the beginning of the R script generation since R 2.14.2 doesn't seem to load the "grid" package as default. I haven't tested it on previous R versions (you may edit the R version comment to be more precise if desired), but I'm almost certain that this library call shouldn't do any harm on them.
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2012-03-05 21:27:03 -05:00
Ryan Poplin 9b53250bef Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode. 2012-03-05 21:07:36 -05:00
Ryan Poplin b37461587d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-05 17:54:59 -05:00
Ryan Poplin c6ded4d23c Bug fix for hard clipping reads when base insertion and base deletion qualities are present in the read. Updating HaplotypeCaller integration tests to reflect all the recent changes. 2012-03-05 17:54:42 -05:00
Ryan Poplin 14a77b1e71 Getting rid of redundant methods in MathUtils. Adding unit tests for approximateLog10SumLog10 and normalizeFromLog10. Increasing the precision of the Jacobian approximation used by approximateLog10SumLog which changes the UG+HC integration tests ever so slightly. 2012-03-05 12:28:32 -05:00
Mauricio Carneiro e9ad382e74 unifying the BQSR argument collection 2012-03-05 10:48:26 -05:00
Ryan Poplin f879daa7d0 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-05 08:29:08 -05:00
Ryan Poplin d6871967ae Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype there is no reason to ensure all haplotypes have the save length which simplifies the code considerably. 2012-03-05 08:28:42 -05:00
Guillermo del Angel 3b5a7c34d7 Added argument to ValidationAmplicons to only output valid sequences - useful for not having to post-filter or grep resulting files before delivering downstream 2012-03-04 10:24:29 -05:00
Mark DePristo 69611af7d3 Workaround for bug in Picard in ReadGroupProperties
-- NPE caused when you call getRunDate on a read group without a date.
2012-03-02 18:53:45 -05:00
Mark DePristo ba71b0aee4 ReadGroupProperties mk3
-- Includes sequencing date
2012-03-02 16:12:42 -05:00
Eric Banks 1e07e97b58 Optimization: create allele list just once, not for each genotype 2012-03-02 13:30:17 -05:00
Ryan Poplin 0ad7d5fbc1 Standalone common Pair HMM utility class with associated unit tests. 2012-03-01 22:41:13 -05:00
Mark DePristo 2f334a57c2 ReadGroupProperties mk2
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
2012-03-01 18:43:53 -05:00
Mauricio Carneiro 486712bfc2 ugly RG encoding 2012-03-01 17:56:45 -05:00
Mark DePristo aff508e091 ReadGroupProperties walker and associated infrastructure
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses and override name more easily
2012-03-01 15:01:11 -05:00
Mauricio Carneiro 9e95b10789 Context covariate now operates as a highly compressed bitset
* All contexts with 'N' bases are now collapsed as uninformative
   * Context size is now represented internally as a BitSet but output as a dna string
   * Temporarily disabled sorted outputs because of null objects
2012-02-29 19:25:21 -05:00
Mauricio Carneiro d379c3763a DNA Sequence to BitSet and vice-versa conversion tools
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
  * Allows variable context size representation guaranteeing uniqueness.
  * Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
  * Unit Tests added
2012-02-29 19:25:20 -05:00
Eric Banks 129b5e7f6b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-28 10:09:34 -05:00
Eric Banks a4a279ce80 Damn you, Mark 2012-02-28 10:09:09 -05:00
Khalid Shakir 0681bea5a5 Changed DoC from PartitionType.INTERVAL to PartitionType.NONE since it doesn't have a way to gather scattered outputs.
Added MultiallelicSummary to HSP eval.
2012-02-28 09:27:27 -05:00
Eric Banks bd398e30fd Another quick optimization 2012-02-28 09:25:35 -05:00
Eric Banks 40bdadbda5 Minor optimization as per Mark 2012-02-28 09:24:07 -05:00
Eric Banks d7928ad669 Drat, missed one: handle null alleles being passed in. 2012-02-27 21:31:54 -05:00
Mark DePristo 24356f11b7 Merged bug fix from Stable into Unstable
-- Resolved conflict

Conflicts:
	public/java/src/org/broadinstitute/sting/gatk/datasources/reads/SAMDataSource.java
2012-02-27 17:13:17 -05:00
Mark DePristo 0b29d54937 Changed most BAMSchedule ReviewedStingExceptions to UserExceptions
-- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.
2012-02-27 17:08:41 -05:00
Mark DePristo f9e8e82e33 Removed unused class variable from VCFHeaderLineTranslator 2012-02-27 17:07:19 -05:00
Mark DePristo 100ddef930 Fix typo in VariantContextBuilder 2012-02-27 17:06:45 -05:00
Mark DePristo 5f7ccdcc01 Avoid calling getBasePileup when there's no pileup in NBaseCount annotation 2012-02-27 15:12:25 -05:00
Mark DePristo 729bb954e2 Throws ReviewedStingException for a bug when parent VariantContext argument is null 2012-02-27 15:09:00 -05:00
Eric Banks 998ed8fff3 Bug fix to deal with VCF records that don't have GTs. While in there, optimized a bunch of related functions (including removing a copy of the method calculateChromosomeCounts(); why did we have 2 copies? very dangerous). 2012-02-27 14:56:10 -05:00
Mark DePristo 4d9582de77 More general catching of Exceptions in interval reading to throw MalformedFile exception in all cases
-- Now throws UserException no matter what happens during the reading of the intervals file.
2012-02-27 14:02:26 -05:00
Mark DePristo 9712fed7a5 Trap SAMFormatException and rethrow as MalformatedBAM exception
-- Trap errors in header and rethrow
-- Wrap underlying iterator in MalformatedBAMErrorReformattingIterator
2012-02-27 13:52:50 -05:00
Eric Banks 64754e7870 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-27 11:31:41 -05:00
Eric Banks 850c5d0db2 Enabling Rank Sum Tests for multi-allelics: use ref vs any alt allele. 2012-02-27 09:59:36 -05:00
Eric Banks dfdf4f989b Enabling Fisher Strand for multi-allelics: use the alt allele with max AC. Added minor optimization to the method in the VC. 2012-02-27 09:50:09 -05:00
Guillermo del Angel 16122bea8d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-25 13:57:54 -05:00
Guillermo del Angel dea35943d1 a) Bug fix in calling new functions that give indel bases and length from regular pileup in LocusIteratorByState, b) Added unit test to cover these. 2012-02-25 13:57:28 -05:00
Mark DePristo c8a06e53c1 DoC now properly handles reference N bases + misc. additional cleanups
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensures the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
2012-02-25 11:32:50 -05:00
Guillermo del Angel c9a4c74f7a a) Bug fixes for last commit related to PileupElements (unit tests are forthcoming). b) Changes needed to make pool caller work in GENOTYPE_GIVEN_ALLELES mode c) Bug fix (yet again) for UG when GENOTYPE_GIVEN_ALLELES and EMIT_ALL_SITES are on, when there's no coverage at site and when input vcf has genotypes: output vcf would still inherit genotypes from input vcf. Now, we just build vc from scratch instead of initializing from input vc. We just take location and alleles from vc 2012-02-24 10:27:59 -05:00
Mauricio Carneiro ee9a56ad27 Fix subtle bug in the ReduceReads stash reported by Adam
* The tailSet generated every time we flush the reads stash is still being affected by subsequent clears because it is just a pointer to the parent element in the original TreeSet. This is dangerous, and there is a weird  condition where the clear will affects it.
   * Fix by creating a new set, given the tailSet instead of trying to do magic with just the pointer.
2012-02-23 18:35:25 -05:00
Mark DePristo e0c189909f Added support for breakpoint alleles
-- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Added integrationtest to ensure that we can parse and write out breakpoint example
2012-02-23 12:14:48 -05:00
Guillermo del Angel 6866a41914 Added functionality in pileups to not only determine whether there's an insertion or deletion following the current position, but to also get the indel length and involved bases - definitely needed for extended event removal, and needed for pool caller indel functionality. 2012-02-23 09:45:47 -05:00
Eric Banks d34f07dba0 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-22 20:41:03 -05:00
Ryan Poplin 2b6c0939ab Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-22 19:00:38 -05:00
Ryan Poplin 8695738400 Bug fix in HaplotypeCaller's GENOTYPE_GIVEN_ALLELES mode for insertions greater than length 1. The allele being genotyped was off by one base pair. 2012-02-22 19:00:04 -05:00
Christopher Hartl 2c1b14d35e Mostly small changes to my own scala scripts: .vcf.gz compatibility for output files, smarter beagle generation, simple script to scatter-gather combine variants. Whole genome indel calling now uses the gold standard indel set. 2012-02-22 17:20:04 -05:00
Mauricio Carneiro 75783af6fc int <-> BitSet conversion utils for MathUtils
* added unit tests.
2012-02-21 14:10:36 -05:00
Guillermo del Angel 0f5674b95e Redid fix for corner case when forming consensus with reads that start/end with insertions and that don't agree with each other in inserted bases: since I can't iterate over the elements of a HashMap because keys might change during iteration, and since I can't use ConcurrentHashMaps, the code now copies structure of (bases, number of times seen) into ArrayList, which can be addressed by element index in order to iterate on it. 2012-02-20 09:12:51 -05:00
Ryan Poplin 3d9eee4942 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-18 10:55:29 -05:00
Ryan Poplin a8be96f63d This caching in the BQSR seems to be too slow now that there are so many keys 2012-02-18 10:54:39 -05:00
Ryan Poplin 78718b8d6a Adding Genotype Given Alleles mode to the HaplotypeCaller. It constructs the possible haplotypes via assembly and then injects the desired allele to be genotyped. 2012-02-18 10:31:26 -05:00
Guillermo del Angel e724c63f2b Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a kep from a HashMap that's being iterated on 2012-02-17 17:18:43 -05:00
Guillermo del Angel f2ef8d1d23 Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a kep from a HashMap that's being iterated on 2012-02-17 17:15:53 -05:00
Guillermo del Angel 3e031a540f Solve merge conflict 2012-02-17 10:56:03 -05:00
Guillermo del Angel cd352f502d Corner case bug fix: if a read starts with an insertion, when computing the consensus allele for calling the insertion was only added to the last element in the consensus key hash map. Now, an insertion that partially overlaps with several candidate alleles will have their respective count increased for all of them 2012-02-17 10:21:37 -05:00
Eric Banks 2f33c57060 No reason to restrict HaplotypeScore to bi-allelic SNPs when the plumbing for multi-allelic events is already present. 2012-02-16 13:58:00 -05:00
Guillermo del Angel 2f08846d82 Merged bug fix from Stable into Unstable 2012-02-14 21:26:25 -05:00
Guillermo del Angel 7dc6f73399 Bug fix for validation site selector: records with AC=0 in them were always being thrown out if input vcf was sites-only, even when -ignorePolymorphicStatus flag was set 2012-02-14 21:11:24 -05:00
Ryan Poplin 30085781cf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-14 14:01:20 -05:00
Ryan Poplin ae5b42c884 Put base insertion and base deletions in the SAMRecord as a string of quality scores instead of an array of bytes. Start of a proper genotype given alleles mode in HaplotypeCaller 2012-02-14 14:01:04 -05:00
David Roazen 85d31f80a2 Merged bug fix from Stable into Unstable 2012-02-13 16:37:11 -05:00
David Roazen 03e5184741 Fix serious engine bug that could cause reads to be dropped under certain circumstances
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.

Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.

Thanks to Khalid, who did at least as much work on this bug as I did.
2012-02-13 16:25:21 -05:00
Eric Banks ad90af94ed Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-13 15:10:10 -05:00
Eric Banks 0920a1921e Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics. 2012-02-13 15:09:53 -05:00
Eric Banks 14981bed10 Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records. 2012-02-13 14:32:03 -05:00
Ryan Poplin e9338e2c20 Context covariate needs to look in the reverse direction for negative stranded reads. 2012-02-13 13:40:41 -05:00
Ryan Poplin 41ffd08d53 On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live. 2012-02-13 12:35:09 -05:00
Ryan Poplin 3caa1b83bb Updating HC integration tests 2012-02-11 11:48:32 -05:00
Ryan Poplin 9b8fd4c2ff Updating the half of the code that makes use of the recalibration information to work with the new refactoring of the bqsr. Reverting the covariate interface change in the original bqsr because the error model enum was moved to a different class and didn't make sense any more. 2012-02-11 10:57:20 -05:00
Eric Banks f52f1f659f Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1. 2012-02-10 14:15:59 -05:00
Mauricio Carneiro 1fb19a0f98 Moving the covariates and shared functionality to public
so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.
2012-02-10 11:44:01 -05:00
Eric Banks 5e18020a5f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-10 11:08:33 -05:00
Eric Banks f53cd3de1b Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent. 2012-02-10 11:07:32 -05:00
Mauricio Carneiro 5af373a3a1 BQSR with indels integrated!
* added support to base before deletion in the pileup
   * refactored covariates to operate on mismatches, insertions and deletions at the same time
   * all code is in private so original BQSR is still working as usual in public
   * outputs a molten CSV with mismatches, insertions and deletions, time to play!
   * barely tested, passes my very simple tests... haven't tested edge cases.
2012-02-09 18:46:45 -05:00
Eric Banks 7a937dd1eb Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly. 2012-02-09 16:14:22 -05:00
Eric Banks 0f728a0604 The Exact model now subsets the VC to the first N alleles when the VC contains more than the maximum number of alleles (instead of throwing it out completely as it did previously). [Perhaps the culling should be done by the UG engine? But theoretically the Exact model can be called outside of the UG and we'd still want the context subsetted.] 2012-02-09 14:02:34 -05:00
Matt Hanna aa097a83d5 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-09 11:26:48 -05:00
Matt Hanna b57d4250bf Documentation request by Eric. At each stage of the GATK where filtering occurs, added documentation suggesting the goal of the filtering along with examples of suggested inputs and outputs. 2012-02-09 11:24:52 -05:00
Mauricio Carneiro d561914d4f Revert "First implementation of GATKReportGatherer"
premature push from my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.

This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
2012-02-08 23:28:55 -05:00
Eric Banks 2f800b078c Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again. 2012-02-08 15:27:16 -05:00
Matt Hanna 51ac87b28c Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-08 08:43:55 -05:00
Matt Hanna 5b58fe741a Retiring Picard customizations for async I/O and cleaning up parts of the code to use common Picard utilities I recently discovered.
Also embedded bug fix for issues reading sparse shards and did some cleanup based on comments during BAM reading code transition meetings.
2012-02-08 08:34:37 -05:00
Roger Zurawicki c0c676590b First implementation of GATKReportGatherer
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns, match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG

Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public

The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer form working

Working GATKReportGatherer
Has only the functional to addLines
The input file parser assumes that the first column is the primary key

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-02-07 18:14:47 -05:00
Mauricio Carneiro e89887cd8e laying groundwork to have insertions and deletions going through the system. 2012-02-07 18:11:53 -05:00
Mauricio Carneiro 0d3ea0401c BQSR Parameter cleanup
* get rid of 320C argument that nobody uses.
   * get rid of DEFAULT_READ_GROUP parameter and functionality (later to become an engine argument).
2012-02-07 14:42:11 -05:00
Eric Banks 717cd4b912 Document -L unmapped 2012-02-07 13:30:54 -05:00
Eric Banks 718da7757e Fixes to ValidateVariants as per GS post: ref base of mixed alleles were sometimes wrong, error print out of bad ACs was throwing a RuntimeException, don't validate ACs if there are no genotypes. 2012-02-07 13:15:58 -05:00
Eric Banks 9d1a19bbaa Multi-allelic indels were not being printed out correctly in VariantsToTable; fixed. 2012-02-06 22:49:29 -05:00
Mauricio Carneiro 5961868a7f fixup for BQSR (HC integration tests)
In the new BQSR implementation, covariates do depend on the RecalibrationArgumentCollection.
2012-02-06 22:47:27 -05:00
Mauricio Carneiro 6e6f0f10e1 BaseQualityScoreRecalibration walker (bqsr v2) first commit includes
* Adding the context covariate standard in both modes (including old CountCovariates) with parameters
   * Updating all covariates and modules to use GATKSAMRecord throughout the code.
   * BQSR now processes indels in the pileup (but doesn't do anything with them yet)
2012-02-06 17:38:29 -05:00
Eric Banks 0717c79901 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-06 16:23:36 -05:00
Eric Banks 91897f5fe7 Transpose rows/cols in AF table to make it molten (so I can plot easily in R) 2012-02-06 16:23:32 -05:00
Guillermo del Angel fb5786385c Merged bug fix from Stable into Unstable 2012-02-06 13:22:56 -05:00
Guillermo del Angel 6ec686b877 Complement to previous commit: make sure we also don't inherit filter from input VCF when genotyping at an empty site 2012-02-06 13:19:26 -05:00
Guillermo del Angel 93ffca1e3a Merged bug fix from Stable into Unstable 2012-02-06 11:58:58 -05:00
Guillermo del Angel 827be878b4 Bug fix when running UG in GenotypeGivenAlleles mode: if an input site to genotype had no coverage, the output VCF had AC,AF and AN inherited from input VCF, which could have nothing to do with given BAM so numbers could be non-sensical. Now new vc has clear attributes instead of attributes inherited from input VCF. 2012-02-06 11:58:13 -05:00
Eric Banks fbbd04621d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-06 11:53:31 -05:00
Eric Banks edb4edc08f Commented out unused metrics for now 2012-02-06 11:53:15 -05:00
Ryan Poplin 096c23a473 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-06 11:10:38 -05:00
Ryan Poplin dc05b71e39 Updating Covariate interface with Mauricio to include an errorModel parameter. On the fly recalibration of base insertion and base deletion quals is live for the HaplotypeCaller 2012-02-06 11:10:24 -05:00
Guillermo del Angel 1e11408f8b Merged bug fix from Stable into Unstable 2012-02-06 10:34:26 -05:00
Guillermo del Angel 090d87b48b Bug fix in ValidationSiteSelector: when input vcf had genotypes and was multiallelic, the parsing of the AF/AC fields was wrong. Better logic to unify parsing of field 2012-02-06 10:33:12 -05:00
Eric Banks 9d94f310f1 Break AF histogram into max and min AFs 2012-02-06 09:01:19 -05:00
Ryan Poplin b7ffd144e8 Cleaning up the covariate classes and removing unused code from the bqsr optimizations in 2009. 2012-02-06 08:54:42 -05:00
Eric Banks cef550903e Minor optimization 2012-02-06 00:48:00 -05:00
Ryan Poplin 5343f8ba67 Initial version of on-the-fly, lazy loading base quality score recalibration. It isn't completely hooked up yet but I'm committing so Mauricio and Mark can see how I envision it will fit together. Look it over and give any feedback. With the exception of the Solid specific code we are very very close to being able to remove TableRecalibrationWalker from the code base and just replace it with PrintReads -BQSR recal.csv 2012-02-05 13:09:03 -05:00
Ryan Poplin f94d547e97 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-03 17:14:20 -05:00
Ryan Poplin 894d3340be Active Region Traversal should use GATKSAMRecords everywhere instead of SAMRecords. misc cleanup. 2012-02-03 17:13:52 -05:00
Mauricio Carneiro 4a57add6d0 First implementation of DiagnoseTargets
* calculates and interprets the coverage of a given interval track
   * allows to expand intervals by specified number of bases
   * classifies targets as CALLABLE, LOW_COVERAGE, EXCESSIVE_COVERAGE and POOR_QUALITY.
   * outputs text file for now (testing purposes only), soon to be VCF.
   * filters are overly aggressive for now.
2012-02-03 17:12:43 -05:00
Mauricio Carneiro 3dd6a1f962 Adding some generic sum and average functions to MathUtils 2012-02-03 17:12:43 -05:00
Mauricio Carneiro e1d69e4060 make the size of a GenomeLoc int instead of long
it will never be bigger than an int and it's actually useful to be an int so we can use it as parameters to array/list/hash size creation.
2012-02-03 17:12:42 -05:00
Ryan Poplin 0e44430e47 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-03 13:45:11 -05:00
Christopher Hartl aa3638ecb3 Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-03 13:42:09 -05:00
Eric Banks 3abfbcbcf2 Generalized the TDT for multi-allelic events 2012-02-03 12:23:21 -05:00
Ryan Poplin 601e53d633 Fix when specifying preset active regions with -AR argument 2012-02-02 16:34:26 -05:00
Christopher Hartl 0111505ea9 Terrible. Swapping the paternal and sample ids. 2012-02-02 11:41:16 -05:00
Ryan Poplin 1f50f6970b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-02 10:17:13 -05:00
Ryan Poplin 4ed06801a7 Updating HaplotypeCaller's HMM calc to use GOP as a function of the read instead of a function of the haplotype in preparation for IQSR 2012-02-02 10:17:04 -05:00
Matt Hanna 8adfc79123 Merged bug fix from Stable into Unstable 2012-02-01 16:07:41 -05:00
Matt Hanna 30b937d2af Fix bug discovered in FGTP branch in which BlockInputStream returns -1 in cases where some data could be read,
but not all the data requested by the caller.
2012-02-01 16:06:22 -05:00
Mauricio Carneiro 45da892ecc Better exceptions to catch malformed reads
* throw exceptions in LocusIteratorByState when hitting reads starting or ending with deletions
2012-02-01 11:56:19 -05:00
Christopher Hartl 810996cfca Introducing: VariantsToPed, the world's most annoying walker! And also a busted QScript to run it that I need Khalid's help debugging ( frownie face ). Note that VariantsToPed and PlinkSeq generate the same binary file (up to strand flips...thanks PlinkSeq), so I know it's working properly. Hooray! 2012-02-01 10:39:03 -05:00
Christopher Hartl 25d943f706 Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-01 10:32:11 -05:00
Ryan Poplin 056b24ccd6 Resolving merge conflicts with LocusIteratorByState 2012-01-31 16:13:32 -05:00
Ryan Poplin febc634557 Changing PileupElement's isSoftClipped to isNextToSoftClip since soft clipped bases aren't actually added to pileups, oops. Removing the intrinsic clustered variants filter from the HaplotypeCaller 2012-01-31 16:06:14 -05:00
Matt Hanna 7f70612beb Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-31 11:59:25 -05:00
Matt Hanna a630db1703 Oops...HierarchicalMicroScheduler was transforming any exception from the walker level into a ReviewedStingException.
Thanks to Ryan for pointing this out.
2012-01-31 11:58:21 -05:00
Christopher Hartl faba3dd530 Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-31 10:25:29 -05:00
Mauricio Carneiro 17dbe9a95d A few cleanups in the LocusIteratorByState
* No more N's in the extended event pileups
   * Only add to the pileup MQ0 counter if the read actually goes into the pileup
2012-01-31 09:40:51 -05:00
Ryan Poplin f9162ea705 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-30 19:45:19 -05:00
Ryan Poplin abb91cf26b Increasing the size of the active regions that are produced by the active probability integrator, more context is needed to call more complex events 2012-01-30 15:36:12 -05:00
Mauricio Carneiro d5d4fa8a88 Fixed discordance bug reported by Brad Chapman
discordance now reports discordance between genotypes as well (just like concordance)
2012-01-30 09:50:45 -05:00
Mark DePristo 3164c8dee5 S3 upload now directly creates the XML report in memory and puts that in S3
-- This is a partial fix for the problem with uploading S3 logs reported by Mauricio.  There the problem is that the java.io.tmpdir is not accessible (network just hangs).  Because of that the s3 upload fails because the underlying system uses tmpdir for caching, etc.  As far as I can tell there's no way around this bug -- you cannot overload the java.io.tmpdir programmatically and even if I could what value would we use?  The only solution seems to me is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.
2012-01-29 15:14:58 -05:00
Menachem Fromer 0e17cbbce9 Merged bug fix from Stable into Unstable 2012-01-27 16:03:16 -05:00
Menachem Fromer a9671b73ca Fix to permit proper handling of mapping qualities between 128 to 255 (which get converted to byte values of -128 to -1) 2012-01-27 16:01:30 -05:00
Ryan Poplin f7ac1f4a69 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-27 15:12:55 -05:00
Ryan Poplin fc08235ff3 Bug fix in active region traversal, locusView.getNext() skips over pileups with zero coverage but still need to count them in the active probability integrator 2012-01-27 15:12:37 -05:00
Mark DePristo 0f2e8400b5 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-27 10:12:50 -05:00
Mauricio Carneiro ec9920b04f Updating the SAM TAG for Original Alignment Start to "OP"
per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.
2012-01-27 08:51:39 -05:00
Mark DePristo 13d1626f51 Minor improvements in ref QC walker. Unfortunately this doesn't actually catch Chris's error 2012-01-27 08:24:22 -05:00
Mauricio Carneiro 0d4027104f Reduced reads are now aware of their original alignments
* Added annotations for reads that had been soft clipped prior to being reduced so that we can later recuperate their original alignments (start and end).
   * Tags keep the alignment shifts, not real alignment, for better compression
   * Tags are defined in the GATKSAMRecord
   * GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd()
   * Updated ReduceReads MD5s accordingly
2012-01-26 17:06:36 -05:00
Eric Banks 07f72516ae Unsupported platform should be a user error 2012-01-26 16:14:25 -05:00
Ryan Poplin cdff23269d HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close. 2012-01-26 15:56:33 -05:00
Christopher Hartl 673ceadd11 While this fix worked for the evaluator module, it could potentially have bad effects in the phasing walkers. Special-case nocalls in the PhasingEvaluator and return AllelePair to previous state. 2012-01-26 13:06:36 -05:00
Christopher Hartl 9c6fda7e15 Yup. I was right. 2012-01-26 12:54:11 -05:00
Christopher Hartl 7d059540a4 Allow segments of genome to be excluded in generating a reference panel. Occasionally targets would contain no variation (typically, in the middle of the centromere), which beagle doesn't particularly like, and errors out rather than producing empty output files. The best way to deal with these is to just exclude the regions on a second-pass, and the remaining bits will be gathered with no additional work.
AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but i suspect it's a no-call (".") rather than a no call ("./.").
2012-01-26 12:43:52 -05:00
Ryan Poplin 25532bdc37 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-26 11:43:32 -05:00
Ryan Poplin 390d493049 Updating ActiveRegionWalker interface to output a probability of active status instead of a boolean. Integrator runs a band-pass filter over this probability to produce actual active regions. First version of HaplotypeCaller which decides for itself where to trigger and assembles those regions. 2012-01-26 11:37:08 -05:00
Eric Banks 859dd882c9 Don't make it standard for now 2012-01-26 00:38:16 -05:00
Eric Banks c5e81be978 Adding pairwise AF table. Not polished at all, but usable none-the-less. 2012-01-26 00:37:06 -05:00
Eric Banks 702a2d768f Initial version of multi-allelic summary module in VariantEval 2012-01-25 19:42:55 -05:00
Eric Banks 9a60887567 Lost an import in the merge 2012-01-25 19:41:41 -05:00
Eric Banks cba5f1a8b1 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-25 19:19:03 -05:00
Eric Banks add6918f32 Cleaner, more efficient way of determining the last dependent set in the queue. 2012-01-25 16:21:10 -05:00
Menachem Fromer db645a94ca Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES 2012-01-25 16:10:59 -05:00
Eric Banks ef335a5812 Better implementation of the fix; PL index is now traversed in order. 2012-01-25 15:15:42 -05:00
Eric Banks 8e2d372ab0 Use remove instead of setting the value to null 2012-01-25 14:41:34 -05:00
Eric Banks 05816955aa It was possible that we'd clean up a matrix column too early when a dependent column aborted early (with not enough probability mass) because we weren't being smart about the order in which we created dependencies. Fixed. 2012-01-25 14:28:21 -05:00
Eric Banks 2799a1b686 Catch exception for bad type and throw as a TribbleException 2012-01-25 12:15:51 -05:00
Eric Banks 96b62daff3 Minor tweak to the warning message. 2012-01-25 11:55:33 -05:00
Eric Banks fb863dc6a7 Warn user when trying to run with EMIT_ALL_SITES with indels; better docs for that option. 2012-01-25 11:50:12 -05:00
Eric Banks e349b4b14b Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod. 2012-01-25 11:35:54 -05:00
Eric Banks ea3d4d60f2 This annotation requires rods and should be annotated as such 2012-01-25 11:35:13 -05:00
Ryan Poplin bbefe4a272 Added option to be able to write out the active regions to an interval list file 2012-01-25 09:47:06 -05:00
Ryan Poplin 9818c69df6 Can now specify active regions to process at the command line, mainly for debugging purposes 2012-01-25 09:32:52 -05:00
Mauricio Carneiro ffd61f4c1c Refactor the Pileup Element with regards to indels
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.

   * Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
   * Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
   * Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
   * Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
   * Every deletion now has an offset (start of the event)
   * Fixed bug when LocusITeratorByState found a read starting with insertion that happened to be a reduced read.
   * Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
   * Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
   * Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
   * The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!

 phew!
2012-01-24 16:07:21 -05:00
Matt Hanna c312bd5960 Weirdly, PicardException inherits from SAMException, which means that our specialty code for
reporting malformed BAMs was actually misreporting any error that happened in the Picard layer
as a BAM ERROR.

Specifically changing PicardException to report as a ReviewedStingException; we might want to
change it in the future.  I'll followup with the Picard team to make sure they really, really
want PicardException to inherit from SAMException.
2012-01-24 15:30:04 -05:00
Mark DePristo 0a3172a9f1 Fix for ref 0 bases for Chris
-- Disturbingly, fixing this bug doesn't actually cause an test failures.
-- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly.
-- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt
2012-01-24 10:55:09 -05:00
Khalid Shakir c18beadbdb Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Mark DePristo 02450e4b12 Merged bug fix from Stable into Unstable 2012-01-23 12:08:39 -05:00
Christopher Hartl 798596257b Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module. 2012-01-23 10:50:16 -05:00
Mark DePristo 80a4ce0edf Bugfix for incorrect error messages for missing BAMs and VCFs
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
Guillermo del Angel 31d2f04368 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-23 09:23:03 -05:00
Guillermo del Angel 966387ca0b Next intermediate commit in the pool caller. Lots of bug fixes and now we can emit true vcf's with calls in discovery mode (still of unknown quality) - old validation mode is temporarily broken,will be fixed in next refactoring. 2012-01-23 09:22:31 -05:00
Ryan Poplin 4d6312d4ea HaplotypeCaller is now an ActiveRegionWalker. 2012-01-22 14:31:01 -05:00
Christopher Hartl 3b1aad4f17 After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call. 2012-01-20 23:43:51 -05:00
Christopher Hartl 9b4f6afa21 Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles apear to be set to null, so this should not affect the actual phasing in the output VCF. 2012-01-20 23:07:59 -05:00
Ryan Poplin 4b18786b5d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-19 22:05:20 -05:00
Ryan Poplin ace9333068 Active region walkers can now see the reads in a buffer around thier active reigons. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal. 2012-01-19 22:05:08 -05:00
Menachem Fromer 066da80a3d Added KEEP_UNCONDTIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output 2012-01-19 18:19:58 -05:00
Christopher Hartl 7f3ad25b01 Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF.
T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.
2012-01-19 10:54:48 -05:00
Ryan Poplin 7e082c7750 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-19 09:11:23 -05:00
Eric Banks ab8f499bc3 Annotate with FS even for filtered sites 2012-01-18 22:04:51 -05:00
Guillermo del Angel b123416c4c Resolve stale merge changes 2012-01-18 20:56:36 -05:00
Guillermo del Angel 2eb45340e1 Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine optimal alt allele at each computable site and will compute AC distribution on it. Current implementation is not working yet if there's more than one pool and it will only output biallelic sites, no functionality for true multi-allelics yet 2012-01-18 20:54:10 -05:00
Ryan Poplin 0268da7560 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-18 09:53:00 -05:00
Ryan Poplin 11982b5a34 We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN. 2012-01-18 09:42:41 -05:00
Mark DePristo 763c81d520 No longer enforce MAX_ALLELE_SIZE in VCF codec
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()
2012-01-18 07:35:11 -05:00
Mark DePristo 0c7865fdb5 UnitTest for reverseAlleleClipping
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mark DePristo 62801e430a Bugfix for unnecessary optimization
-- don't cache the ref bytes
2012-01-17 16:40:26 -05:00
Mark DePristo f2b0575dee Detect unreasonably large allele strings (>2^16) and throw an error
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele.  Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
2012-01-17 16:40:26 -05:00
Ryan Poplin 8b0ddf0aaf Adding notes to CountCovariates docs about using interval lists as database of known variation 2012-01-17 16:13:13 -05:00
Matt Hanna 40ebc17437 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-17 14:49:17 -05:00
Matt Hanna 41d70abe4e At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings. 2012-01-17 14:47:53 -05:00
Ryan Poplin ae259f81cc Bug fixing for merging of read fragments when one fragment contained an indel 2012-01-17 14:39:27 -05:00
Christopher Hartl cde224746f Bait Redesign supports baits that overlap, by picking only the start of intervals.
CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used.

Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.
2012-01-17 13:51:05 -05:00
Ryan Poplin 8e23c98dd9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-17 13:46:28 -05:00
Matt Hanna 32ccde374b Merged bug fix from Stable into Unstable 2012-01-17 11:08:35 -05:00
Matt Hanna 3ba918aff1 Error message cleanup in BAM indexing code. 2012-01-17 11:05:42 -05:00
Mauricio Carneiro cec7107762 Better location for the downsampling of reads in PrintReads
* using the filter() instead of map() makes for a cleaner walker.
   * renaming the unit tests to make more sense with the other unit and integration tests
2012-01-14 14:06:09 -05:00
Mark DePristo b06074d6e7 Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe
-- Uses PriorityBlockingQueue instead of PriorityQueue
-- synchronized keywords added to all key functions that modify internal state

Note that this hasn't been tested extensivesly.  Based on report:

http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic
2012-01-13 09:33:16 -05:00
Mauricio Carneiro 28aa353501 Added "unbiased" downsampling parameter to PrintReads
* also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.
2012-01-12 16:33:55 -05:00
Matt Hanna 2c3176eb80 Merged bug fix from Stable into Unstable 2012-01-12 13:31:10 -05:00
Matt Hanna cd43f016ce Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior. 2012-01-12 13:29:11 -05:00
Eric Banks ed34b4f088 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-12 10:27:26 -05:00
Eric Banks e7fe9910f7 Create the temp storage for calculating cell values just once as per Mark's TODO 2012-01-12 10:27:10 -05:00
Eric Banks f5f5ed5dcd Don't initialize the cell conformation values (use an else in the loop instead) as per Mark's TODO 2012-01-12 08:50:03 -05:00
Eric Banks 410a340ef5 Swapping the iteration order to run over AF conformations and then samples instead of the reverse minimizes calls to HashMap.get; instead of it being O(n) since we called it for each sample it's now O(1). Runtime on T2D GENES test set is reduced by 5-10%. More optimizations to follow. 2012-01-12 02:04:03 -05:00
Mauricio Carneiro 77a03c9709 Patching special case in the adaptor clipping
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
   * updated md5's accordingly
2012-01-11 17:47:44 -05:00
Eric Banks 25d0d53d88 Moving the approximate summing of log10 vals to MathUtils; keeping the more efficient implementation of fast rounding. 2012-01-10 12:38:47 -05:00
Eric Banks 589397d611 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-10 12:36:48 -05:00
Matt Hanna e923a2e512 Revving Picard to incorporate final version of ReadWalker performance improvements. 2012-01-10 12:12:33 -05:00
Eric Banks 0f36f6947e Resolving merge conflicts 2012-01-10 11:44:16 -05:00
Eric Banks f2cecce10f Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously). 2012-01-10 11:34:23 -05:00
Matt Hanna 509c3d87b0 Merged bug fix from Stable into Unstable 2012-01-09 23:08:46 -05:00
Matt Hanna dc60757b68 Eliminate unnecessary strong references (and therefore memory held) by tree reduce entries that have already been processed.
Thanks to Tim Fennell for the bug report.
2012-01-09 23:04:53 -05:00
Matt Hanna fda1795791 Merged bug fix from Stable into Unstable 2012-01-08 22:04:44 -05:00
Matt Hanna 1f1233b669 Fix for a rare but insidious bug in position tracking during async BAM file reading.
Thanks to Khalid for spotting and reporting the issue.
2012-01-08 22:03:35 -05:00
Khalid Shakir 5793625592 No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice.
QScript accessor to QSettings to specify a default runName and other default function settings.
Because log files are no longer pseudo-random their presense can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs.
Gathered log files concatenate all log files together into the stdout.
InProcessFunctions now have PrintStreams for stdout and stderr.
Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml.
During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file.
In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope.
Added more detailed output when running with -l DEBUG.
Simplified graphviz visualization for additional debugging.
Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List)
Minor cleanup to build including sending ant gsalib to R's default libloc.
2012-01-08 12:11:55 -05:00
Guillermo del Angel d4e7655d14 Added ability to call multiallelic indels, if -multiallelic is included in UG arguments. Simple idea: we genotype all alleles with count >= minIndelCnt.
To support this, refactored code that computes consensus alleles. To ease merging of mulitple alt alleles, we create a single vc for each alt alleles and then use VariantContextUtils.simpleMerge to carry out merging, which takes care of handling all corner conditions already. In order to use this, interface to GenotypeLikelihoodsCalculationModel changed to pass in a GenomeLocParser object (why are these objects to hard to handle??).
More testing is required and feature turned off my default.
2012-01-06 11:24:38 -05:00
Ryan Poplin 616ff8ea01 fixed typo in help text 2012-01-06 10:36:11 -05:00
Mark DePristo dd80ffbbbe Merged bug fix from Stable into Unstable 2012-01-05 21:51:48 -05:00
Mark DePristo c96fee477c Bug fix for VariantSummary
-- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional.  Fixed.  Has the annoying side effect that indels > 50bp in size are not counted as indels, and so are substrated from both the novel and known counts for indels.  C'est la vie
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels.  Using this more recent and representative file probably a good idea for more future tests in VE and other tools.  File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data
2012-01-05 21:51:06 -05:00
Eric Banks f5e10e9879 Merged bug fix from Stable into Unstable 2012-01-05 15:35:09 -05:00
Eric Banks 18ed954741 Compute Ti/Tv only if bi-allelic 2012-01-05 15:33:26 -05:00
Ryan Poplin a6886a4cc0 Initial commit of the Active Region Traversal. Not ready to be used by anyone yet. 2012-01-04 17:03:21 -05:00
Guillermo del Angel 58d4539304 Enabled banded indel computation by default. Reversed logic in input UG argument so that we can still disable it if required. Minor changes to integration tests due to minor differences in GL's and in annotations 2012-01-04 15:28:26 -05:00
Mauricio Carneiro 9ff8a01da2 Merged bug fix from Stable into Unstable 2012-01-03 18:10:39 -05:00
Mauricio Carneiro 9b55505c03 Fixing PairHMMIndelErrorModel array out of bounds
This error was due to the ReadClipper change of contract. Before the read utils would return null if a read was entirely clipped, now it returns an empty (safe) GATKSAMRecord.
2012-01-03 18:08:46 -05:00
Christopher Hartl 2c3a9ce02f Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-01-03 17:25:56 -05:00
David Roazen 621ee2b613 Merged bug fix from Stable into Unstable 2012-01-03 16:56:49 -05:00
Christopher Hartl 9093de1132 Cleanup: remove code to calculate the MLE AC in the UGE. 2012-01-03 15:58:51 -05:00
Christopher Hartl 2d093828a4 Final changes to Junky (been frozen for a while, but uncommitted) and the qscript for it. A first cursory implementation of the trellis-based Exact AC-constrained genotyping algorithm in UGE. Nothing calls into it, so this should be entirely safe (and, no surprise, it passes UG integration tests). 2012-01-03 15:33:04 -05:00
David Roazen ea6e718cb8 SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline.
For now, we recommend only running with the GRCh37.64 database.
2012-01-03 15:18:36 -05:00
Christopher Hartl 93e1417b6e Update to the VSS GATK documentation. 2012-01-03 13:39:31 -05:00
Eric Banks ab8d47d9a5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-03 09:38:49 -05:00
Mauricio Carneiro 3d4bf273de Added getPileupForReadGroups to ReadBackPileup
* returns a pileup for all the read groups provided.
   * saves us from multiple calls to getPileup (which is very inefficient)
2012-01-03 09:35:11 -05:00
Mauricio Carneiro 4a208c7c06 Refactor of the downsampling machinery to accept different strategies
* Implemented Adaptive downsampler
   * Added integration test
   * Added option to RRead scala script to choose downsampling strategy
2012-01-03 09:29:47 -05:00
Mauricio Carneiro 21ae3ef5f9 Added downsampling support to ReduceReads
* Downsampling is now a parameter to the walker with default value of 0 (no downsampling)
    * Downsampling selects reads at random at the variant region window and strives to achieve uniform coverage if possible around the desired downsampling value.
    * Added integration test
2012-01-03 09:29:46 -05:00
Mauricio Carneiro cd68cc239b Added knuth-shuffle (KS) and randomSubset using KS to MathUtils
* Knuth-shuffle is a simple, yet effective array permutator (hope this is good english).
         * added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation.
         * added unit tests to both functions
2012-01-03 09:29:46 -05:00
Mauricio Carneiro 94791a2a75 Add support for reads starting with insertion
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
      * Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
      * Updated all ReadClipper unit tests
      * ReduceReads does not hard clip leading insertions by default anymore
      * SlidingWindow adjusts start location if read starts with insertion
      * SlidingWindow creates an empty element with insertions to the right
      * Fixed all potential divide by zero with totalCount() (from BaseCounts)
      * Updated all Integration tests
      * Added new integration test for multiple interval reducing
2012-01-03 09:29:45 -05:00
Mark DePristo d05f0c2318 GATKPerformanceOverTime script update
-- Automatic detection of most recent version of GATK release (just tell the script now to use 1.2, 1.3, and 1.4)
-- Uses 1.4 now
-- By default we do 9 runs of each non-parallel test
-- In PathUtils added convenience utility to find most recent release GATK jar with a specific release number
2012-01-02 09:58:46 -05:00
Eric Banks 393993e0c7 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-31 20:42:46 -05:00
Mauricio Carneiro 55cfa76cf3 Updated integration tests for the new adaptor clipping fix. 2011-12-30 18:47:14 -05:00
Mauricio Carneiro c7d0a9ebee Forgot to test for inter-chromosomal mates in the adaptor clipping
* Fixing bug caught by Eric (and Kristian)
2011-12-30 00:19:53 -05:00
Matt Hanna a259bfefd4 First commit addressing problems running RTC in parallel.
Turns out that because the RTC is the first walker to 'correctly' tree reduce according to functional programming
standards, the RTC has revealed a few problems with the tree reducer holding on to too much data.  This is the first
and smaller of two commits to reduce memory consumption.  The second commit will likely be pushed after GATK1.4 is
released.
2011-12-29 16:22:14 -05:00
Eric Banks 1a45ea5a05 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-29 11:37:15 -05:00
Mauricio Carneiro f692911903 GATKSAMRecord emptyRead static constructor
* Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking.
 * All ReadClipper utilities now emit empty reads for fully clipped reads
2011-12-27 17:01:17 -05:00
Mauricio Carneiro 8259c748f2 No more Filtered Reads tag.
All synthetic reads are marked with the reduced read tag.
2011-12-27 17:01:17 -05:00
Eric Banks d20a25d681 A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently). 2011-12-27 16:50:38 -05:00
Eric Banks adff40ff58 Minor optimizations to avoid extra processing (esp. for reduced reads) 2011-12-27 13:16:25 -05:00
Mauricio Carneiro 17bfe48d5e Made all class methods private in the ReadClipper
* ReadClipperUnitTest now uses static methods
 * Haplotype caller now uses static methods
 * Exon Junction Genotyper now uses static methods
2011-12-27 02:11:32 -05:00
Eric Banks dd990061f6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-26 14:45:35 -05:00
Eric Banks 2130b39f33 Found the bug in the engine: RodLocusView was using the wrong seek method so that it would only move to the first locus of a shard (and with multi-locus shards, this meant that we never processed RODs from the other positions). In fact, because the seek(Shard) method is extremely misleading and now no longer used, I think it's safer to delete it and make everyone use the much more transparent seek(GenomeLoc). Note that I have not re-enabled my improvements to the intervals accumulation of ReferenceDataSource because that inefficiency is still present downstream in RodLocusView; need to discuss those changes with Matt. 2011-12-26 14:45:19 -05:00
Mauricio Carneiro 35c41409a1 Better contracts and docs for the ReadClipper
* Described the ReadClipper contract in the top of the class
  * Added contracts where applicable
  * Added descriptive information to all tools in the read clipper
  * Organized public members and static methods together with the same javadoc
2011-12-23 19:36:57 -05:00
David Roazen 506c0e9c97 Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline
SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released
and we've verified the quality of its annotations.
2011-12-23 19:12:57 -05:00
Eric Banks 24c84da60d 'Fixing' the changes in ReferenceDataSource so that a shard properly contains a list of GenomeLocs instead of a single merged one. However, that uncovered a probable bug in the engine, so instead of letting this code fester unfixed in the build (affecting everyone in the group) I've decided to revert the previous (slow, but working) version and fix the engine in my own branch. 2011-12-23 15:39:12 -05:00
Eric Banks 8762313a0d Better TODO message 2011-12-22 20:54:35 -05:00
Eric Banks a815e875a8 Removing debugging output 2011-12-22 15:49:11 -05:00
Eric Banks deef542a38 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-22 15:44:58 -05:00
Eric Banks 6d260ec6ae Start printing traversal stats after 30 seconds. I can't stand waiting 2 minutes. 2011-12-22 15:40:59 -05:00
Mauricio Carneiro cadff40247 getRefCoordSoftUnclippedStart and End refactor
These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips.

* Removed from ReadUtils
* Added to GATKSAMRecord
* Changed name to getSoftStart() and getSoftEnd
* Updated third party code accordingly.
2011-12-20 17:48:51 -05:00
Mauricio Carneiro 07128a2ad2 ReadUtils cleanup
* Removed all clipping functionality from ReadUtils (it should all be done using the ReadClipper now)
 * Cleaned up functionality that wasn't being used or had been superseded by other code (in an effort to reduce multiple unsupported implementations)
 * Made all meaningful functions public and added better comments/explanation to the headers
2011-12-20 17:48:40 -05:00
Mauricio Carneiro 1c4774c475 Static versions of the hard clipping utilities
For simplified access to the hard clipping utilities. No need to create a ReadClipper object if you are not doing multiple complicated clipping operations, just use the static methods.

 examples:
   ReadClipper.hardClipLowQualEnds(2);
   ReadClipper.hardClipAdaptorSequence();
2011-12-20 17:48:39 -05:00
Mauricio Carneiro f73ad1c2e2 Bugfix/Rewrite: Algorithm to determine adaptor boundaries
The algorithm wasn't accounting for the case where the read is the reverse strand and the insert size is negative.

    * Fixed and rewrote for more clarity (with Ryan, Mark and Eric).
    * Restructured the code to handle GATKSAMRecords only
    * Cleaned up the other structures and functions around it to minimize clutter and potential for error.
    * Added unit tests for all 4 cases of adaptor boundaries.
2011-12-20 17:48:39 -05:00
Mark DePristo 0cc5c3d799 General improvements to Queue
-- Support for collecting resources info from DRMAA runners
-- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4
-- NCoresRequest is a testing queue script for this.
-- Added two command line arguments:
  -- multiCoreJerk: don't request multiple cores for jobs with nt > 1.  This was the old behavior but it's really not the best way to run parallel jobs.  Now with queue if you run nt = 4 the system requests 4 cores on your host.  If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk
  -- job_parallel_env: parallel environment named used with SGE to request multicore jobs.  Equivalent to -pe job_parallel_env NT for NT > 1 jobs
2011-12-20 14:05:09 -05:00
Eric Banks 7204fcc2c3 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-20 12:59:11 -05:00
Eric Banks 8ade2d6ac2 max_alternate_alleles also ready to be made public 2011-12-20 12:59:02 -05:00
Eric Banks 6f52bd580b --multiallelic mode is not hidden anymore (but it is annotated as advanced); added docs 2011-12-20 12:47:38 -05:00
Mauricio Carneiro 37e0044c48 Removing unclipSoftClipBases from ReadUtils
* it was buggy and dangerous.
 * Updated Chris' code to use the ReadClipper.
2011-12-20 00:11:26 -05:00
Mauricio Carneiro 78d9bf7196 Added REVERT_SOFTCLIPPED_BASES capability to ReadClipper
* New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches.
    * Added functionality to clipping op to revert all soft clip bases in a read into matches
    * Added revertSoftClipBases function to the ReadClipper for public use
    * Wrote systematic unit tests
2011-12-20 00:04:30 -05:00
Christopher Hartl 24585062f8 Merge branch 'incoming' 2011-12-19 23:16:36 -05:00
Christopher Hartl 67298f8a11 AFCR made public (for use in VSS)
Minor changes to ValidationSiteSelector logic (SampleSelectors determine whether a site is valid for output, no actual subset context need be operated on beyond that determination). Implementation of GL-based site selection. Minor changes to EJG.
2011-12-19 23:14:26 -05:00
Eric Banks 06d385e619 Simplifying the interface a bit 2011-12-19 15:29:46 -05:00
Christopher Hartl 339ef92eac Goodbye SW by default. Now aligned reads that overlap intron-exon junctions are scored where they are by default, but warns the user (and flags the record in the VCF) if there's evidence to suggest that there is an indel throwing off the scoring (e.g. if the best score of a realigned unmapped read is >5 log orders better than the best score of a scored mapped read). Unmapped reads are still SW-aligned to the junction-junction sequence. This should result in a rather massive speedup, so far untested.
UGBoundAF has to go in at some point. In the process of rewriting the math for bounding the allele frequency (it was assuming uniform tails, which is silly since i derived the posterior distribution in closed form sometime back, just need to find it)
2011-12-19 12:18:18 -05:00
Christopher Hartl 418d22b67e Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable
Conflicts:
	private/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IntronLossGenotyperV2.java
2011-12-19 10:59:18 -05:00
Christopher Hartl 69661da37d Moving ValidationSiteSelector to validation package in public under my ownership. JunctionGenotyper added and modified several times, this commit is due to merging conflix fixes. 2011-12-19 10:57:28 -05:00
Laurent Francioli 16cc2b864e - Corrected bug causing cases where both parents are HET to be accounted twice in the TDT calculation - Adapted TDT Integration test to corrected version of TDT
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2011-12-19 10:30:59 -05:00
Eric Banks 5fd19ae734 Commented exactly how the results are represented from the exact model so developers can know how to use them. 2011-12-19 10:19:00 -05:00
Eric Banks 3069a689fe Bug fix: if there are multiple records at a given position, it turns out that SelectVariants would drop all variants that follow after one that fails filters (instead of dropping just the failing one). Added an integration test to cover this case. 2011-12-19 10:04:33 -05:00
Mauricio Carneiro 5b678e3b94 Remove ClippingOp UnitTests
* all testing functionality is in the ReadClipperUnitTest, no need to double test.
* class and package naming cleanup
2011-12-19 07:49:26 -05:00
Matt Hanna 1ead00cac5 New fork of SamFileHeaderMerger should be cached at the thread level to enable fast (and valid) thread lookups. 2011-12-18 19:04:26 -05:00
Ryan Poplin bc842ab3a5 Adding option to VariantAnnotator to do strict allele matching when annotating with comp track concordance. 2011-12-18 15:27:23 -05:00
Ryan Poplin 953998dcd0 Now that getSampleDB is public in the walker base class this override in VariantAnnotator isn't necessary. 2011-12-18 14:38:59 -05:00
Eric Banks 07f9d14d9f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-18 00:43:15 -05:00
Eric Banks c5ffe0ab04 No reason to sum the normalized posteriors array to get Pr(AF>0) given that we can just compute 1.0 - array[0]. Integration tests change only because of trivial precision artifacts for reference calls using EMIT_ALL_SITES. 2011-12-18 00:31:47 -05:00
Eric Banks 6dc52d42bf Implemented the proper QUAL calculation for multi-allelic calls. Integration tests pass except for the ones making multi-allelic calls (duh) and one of the SLOD tests (which used to print 0 when one of the LODs was NaN but now we just don't print the SB annotation for that record). 2011-12-18 00:01:42 -05:00
Khalid Shakir 7486696c07 When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter.
Creating a single temporary directory per ant test run instead of a putting temp files across all runs in the same directory.
Updated various tests for above items and other small fixes.
2011-12-16 18:09:25 -05:00
Mauricio Carneiro fcc21180e8 Added hardClipLeadingInsertions UnitTest for the ReadClipper
fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case.

note: test is turned off until contract changes to allow hanging insertions (left/right).
2011-12-16 18:02:47 -05:00
Mauricio Carneiro 5bba44d693 Added hardClipByReferenceCoordinates UnitTest for the ReadClipper
* fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail.
* fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail.
* fixed AlignmentStart of a clipped read that results in only hard clips and soft clips

note: added tests to all these beautiful cases...
2011-12-16 18:01:33 -05:00
Mark DePristo 1994c3e3bc Only print warning about allele incompatibility when running there are genotypes in the file in CombineVariants 2011-12-16 16:50:51 -05:00
Mark DePristo b6067be952 Support for selecting only variants with specific IDs from a file in SelectVariants
-- Cleaned up unused variables as well
2011-12-16 16:50:39 -05:00
Mark DePristo d6d2f49c88 Don't print log if there are no BAMs 2011-12-16 16:50:36 -05:00
Mark DePristo 78e0950a77 Minor bug fix for printing in SAMDataSource 2011-12-16 11:45:40 -05:00
Mark DePristo 7bc0d18418 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-16 11:42:42 -05:00
Ryan Poplin 5aa79dacfc Changing hidden optimization argument to advanced. 2011-12-16 10:29:20 -05:00
Matt Hanna 3642a73c07 Performance improvements for dynamically merging BAMs in read walkers.
This change and my previous change have dropped runtime when dynamically merging 2k BAM files from 72.6min/1M reads to 46.8sec/1M reads.
Note that many of these changes are stopgaps -- the real problem is the way ReadWalkers interface with Picard, and I'll have to work with
Tim&Co to produce a more maintainable patch.
2011-12-16 09:37:44 -05:00
Mark DePristo 3414ecfe2e Restored serial version of reader initialization. Serial mode is default, as the performance gains aren't so huge.
-- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version

-- Comparison of serial and parallel reader with cached and uncached files:

Initialization time: serial   with 500 fully cached BAMs: 8.20 seconds
Initialization time: serial   with 500 uncached BAMs    : 197.02 seconds
Initialization time: parallel with 500 fully cached BAMs: 30.12 seconds
Initialization time: parallel with 500 uncached BAMs    : 75.47 seconds
2011-12-16 09:22:10 -05:00
Mark DePristo fb1c9d2abc Restored serial version of reader initialization. Parallel mode is default.
-- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version
2011-12-16 09:05:28 -05:00
Mauricio Carneiro e61e5c7589 Refactor of ReadClipper unit tests
* expanded the systematic cigar string space test framework Roger wrote to all tests
* moved utility functions into Utils and ReadUtils
* cleaned up unused classes
2011-12-15 19:05:43 -05:00
Mauricio Carneiro 4748ae0a14 Bugfix: Softclips before Hardclips weren't being accounted for
caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
2011-12-15 12:17:25 -05:00
Mauricio Carneiro 62a2e335bc Changing HardClipper contract to allow UNMAPPED reads
shifted the contract to functions that operate on reference based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.
2011-12-15 11:08:19 -05:00
Matt Hanna 9333b678b5 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 18:05:44 -05:00
Matt Hanna 6fb4be1a09 Cache header merger. 2011-12-14 18:05:31 -05:00
Mauricio Carneiro 128bdf9c09 Create artificial reads with "default" parameters
* added functions to create synthetic reads for unit testing with reasonable default parameters
* added more functions to create synthetic reads based on cigar string + bases and quals.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro c85100ce9c Fix ClippingOp bug when performing multiple hardclip ops
bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds.

fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).
2011-12-14 16:57:47 -05:00
Eric Banks de5928ac5a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 16:24:56 -05:00
Eric Banks 4fddac9f22 Updating busted integration tests 2011-12-14 16:24:43 -05:00
Mark DePristo 01e547eed3 Parallel SAMDataSource initialization
-- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount
-- Added logger.info message noting progress and cost of reading low-level BAM data.
2011-12-14 16:14:26 -05:00
Mark DePristo 71b4bb12b7 Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Eric Banks 35fc2e13c3 Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all. 2011-12-14 15:31:09 -05:00
Eric Banks 1e90d602a4 Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles. 2011-12-14 13:38:20 -05:00
Eric Banks 988d60091f Forgot to add in the new result class 2011-12-14 13:37:15 -05:00
Eric Banks 106bf13056 Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays. 2011-12-14 12:05:50 -05:00
Eric Banks 7648521718 Add check for mixed genotype so that we don't exception out for a valid record 2011-12-14 11:26:43 -05:00
Eric Banks 9497e9492c Bug fix for complex records: do not ever reverse clip out a complete allele. 2011-12-14 11:21:28 -05:00
Eric Banks 09a5a9eac0 Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number. 2011-12-14 10:43:52 -05:00
Eric Banks d3f4a5a901 Fail gracefully when encountering malformed VCFs without enough data columns 2011-12-14 10:37:38 -05:00
Eric Banks 079932ba2a The log10cache needs to be larger if we want to handle 10K samples in the UG. 2011-12-13 23:36:10 -05:00
Ryan Poplin 7fa1ab1bae Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test 2011-12-13 17:19:40 -05:00
Eric Banks e47a113c9f Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right? 2011-12-12 23:02:45 -05:00
Mauricio Carneiro 5cc1e72fdb Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro a70a0f25fb Better debug output for SAMDataSource
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo d03425df2f TODO optimization targets 2011-12-12 17:39:51 -05:00
Laurent Francioli 025bdfe2cc Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-12 12:19:44 +01:00
Eric Banks 7b6338c742 Merge branch 'master' into trialleles 2011-12-11 00:28:46 -05:00
Eric Banks 7c4b9338ad The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now. 2011-12-11 00:23:33 -05:00
Eric Banks 044f211a30 Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly. 2011-12-10 23:57:14 -05:00
Eric Banks 364f1a030b Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone. 2011-12-09 14:25:28 -05:00
Eric Banks 64dad13e2d Don't carry around an extra copy of the code for the Haplotype Caller 2011-12-09 11:09:40 -05:00
Eric Banks 442ceb6ad9 The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors. 2011-12-09 10:16:44 -05:00
Laurent Francioli a79144f7db Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-09 15:57:24 +01:00
Laurent Francioli 5a06170804 Corrected bug causing getChildrenWithParents() to not take the last family member into consideration. 2011-12-09 14:51:34 +01:00
Eric Banks aa4a8c5303 No dynamic programming solution for assignning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes). 2011-12-09 02:25:06 -05:00
Eric Banks 8777288a9f Don't throw a UserException if too many alt alleles are trying to be genotyped. Instead, I've added an argument that allows the user to set the max number of alt alleles to genotype and the UG warns and skips any sites with more than that number. 2011-12-09 00:00:20 -05:00
Eric Banks 3e7714629f Scrapped the whole idea of an int/long as an index into the ACset: with lots of alternate alleles we run into overflow issues. Instead, simply use the ACcounts array as the hash key since it is unique for each AC conformation. To do this, it needed to be wrapped inside an object so hashcode() would work. 2011-12-08 23:50:54 -05:00
Eric Banks 4aebe99445 Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation. 2011-12-08 15:31:02 -05:00
Eric Banks 7750bafb12 Fixed bug where last dependent set index wasn't properly being transferred for sites with many alleles. Adding debugging output. 2011-12-08 13:50:50 -05:00
Guillermo del Angel 252e0f3d0a Merged bug fix from Stable into Unstable 2011-12-08 13:11:39 -05:00
Guillermo del Angel 1bfe28067f Don't try to genotype an indel even bigger than the reference window size, or else we'll be out of bounds. Necessary to handle Phase 1 integrated callset with large deletions. Better error indication when validating a GenomeLoc. 2011-12-08 12:54:08 -05:00
Mark DePristo 9def841275 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-07 13:36:16 -05:00
Mark DePristo 4055877708 Prints 0.0 TiTv not NaN when there are no variants
-- Updated md5
2011-12-07 12:07:54 -05:00
Matt Hanna 15533e08df Fixed issue with RODWalker parallelization.
Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making
each ROD shard larger than the size of chr20.  Why didn't we see this in Stable?  Because the
ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification.
When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part
of the async I/O project, I inadvertently reenabled this specifier.
2011-12-07 11:55:42 -05:00
Mark DePristo 5d2212bc8e Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-07 09:03:17 -05:00
Mark DePristo 6bf18899df Fix for variant summary -- now treats all 50 bp deletions or insertions as CNVs 2011-12-07 09:02:49 -05:00
Matt Hanna c9b2cd8ba5 Fix for chartl's stale null representation issue. 2011-12-06 18:05:17 -05:00
Eric Banks 79d18dc078 Fixing indexing bug on the ACsets. Added unit tests for the Exact model code. 2011-12-06 16:17:18 -05:00
Matt Hanna f5b977fc88 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-06 10:11:35 -05:00
Matt Hanna 4001c22a11 Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup. 2011-12-06 10:10:38 -05:00
Khalid Shakir 677bea0abd Right aligning GATKReport numeric columns and updated MD5s in tests.
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
2011-12-05 23:22:15 -05:00
Eric Banks 7a0f6feda4 Make sure that too many alternate alleles aren't being passed to the genotyper (10 for now) and exit with a UserError if there are. 2011-12-05 16:18:52 -05:00
Eric Banks 7fac4afab3 Fixed priors (now initialized upon engine startup in a multi-dimensional array) and cell coefficients (properly handles the generalized closed form representation for multiple alleles). 2011-12-05 15:57:25 -05:00
Eric Banks a7cb941417 The posteriors vector is now 2 dimensional so that it supports multiple alleles (although the UG is still hard-coded to use only array[0] for now); the exact model now collapses probabilities for all conformations over a given AC into the posteriors array (in the appropriate dimension). Fixed a bug where the priors and posteriors were being passed in swapped. 2011-12-04 13:02:53 -05:00
Eric Banks eab2b76c9b Added loads of comments for future reference 2011-12-03 23:54:42 -05:00
Eric Banks 29662be3d7 Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them). 2011-12-03 23:12:04 -05:00
Eric Banks 71f793b71b First partially working version of the multi-allelic version of the Exact AF calculation 2011-12-02 14:13:14 -05:00
David Roazen d014c7faf9 Queue now properly escapes all shell arguments in generated shell scripts
This has implications for both Qscript authors and CommandLineFunction authors.

Qscript authors:
You no longer need to (and in fact must not) manually escape String values to
avoid interpretation by the shell when setting up Walker parameters. Queue will
safely escape all of your Strings for you so that they'll be interpreted literally. Eg.,

Old way:
filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")

New way:
filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")

CommandLineFunction authors:
If you're writing a one-off CommandLineFunction in a Qscript and don't really
care about quoting issues, just keep doing things the direct, simple way:

def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)

If you're writing a CommandLineFunction that will become part of Queue and
will be used by other QScripts, however, it's advisable to do things the
newer, safer way, ie.:

When you construct your commandLine, you should do so ONLY using the API methods
required(), optional(), conditional(), and repeat(). These will manage quoting
and whitespace separation for you, so you shouldn't insert quotes/extraneous
whitespace in your Strings. By default you get both (quoting and whitespace
separation), but you can disable either of these via parameters. Eg.,

override def commandLine = super.commandLine +
                           required("eff") +
                           conditional(verbose, "-v") +
                           optional("-c", config) +
                           required("-i", "vcf") +
                           required("-o", "vcf") +
                           required(genomeVersion) +
                           required(inVcf) +
                           required(">", escape=false) +  // This will be shell-interpreted
                           required(outVcf)

I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new
system, so you'll get free shell escaping when you use those in Qscripts just
like with walkers.
2011-12-01 18:13:44 -05:00
Mark DePristo 3060a4a15e Support for list of known CNVs in VariantEval
-- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument
-- Genericizes loading for intervals into interval tree by chromosome
-- GenomeLoc methods for reciprocal overlap detection, with unit tests
2011-11-30 17:05:16 -05:00
Matt Hanna b65db6a854 First draft of a test script for I/O performance with the new asynchronous I/O processing.
Also includes convenience parameters for specifying the IO/CPU threading balance outside of a tag.  Will be killed when
Queue gets better support for tagged arguments (hopefully soon).
2011-11-30 13:13:16 -05:00
Laurent Francioli 1d5d200790 Cleaned up unused import statements 2011-11-30 15:30:30 +01:00
Mark DePristo 28b286ad39 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-30 09:11:53 -05:00
Laurent Francioli 20bffe0430 Adapted for the new version of MendelianViolation 2011-11-30 14:46:38 +01:00
Laurent Francioli 1cb5e9e149 Removed outdated (and unused) -familyStr commandline argument 2011-11-30 14:45:04 +01:00
Laurent Francioli f49dc5c067 Added functionality to get all children that have both parents (useful when trios are needed) 2011-11-30 14:43:37 +01:00
Laurent Francioli a4606f9cfe Merge branch 'MendelianViolation'
Conflicts:
	public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java
2011-11-30 11:13:15 +01:00
Laurent Francioli b279ae4ead Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-30 10:10:21 +01:00
Ryan Poplin 91413cf0d9 Merged bug fix from Stable into Unstable 2011-11-29 14:01:23 -05:00
Ryan Poplin cb284eebde Further updating VQSR tutorial wiki docs to reflect the bundle 2011-11-29 14:00:57 -05:00
Ryan Poplin dcb889665d Merged bug fix from Stable into Unstable 2011-11-29 09:58:49 -05:00
Ryan Poplin 447e9bff9e Updating VQSR tutorial wiki docs to reflect the bundle 2011-11-29 09:57:45 -05:00
Ryan Poplin 110298322c Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it. 2011-11-29 09:29:18 -05:00
Laurent Francioli ab67011791 Corrected bug introduced in the last update and causing no families to be returned by getFamilies in case the samples were not specified 2011-11-29 11:18:15 +01:00
Eric Banks d7d8b8e380 Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now. 2011-11-28 14:18:28 -05:00
Laurent Francioli a09c01fcec Removed walker argument FamilyStructure as this is now supported by the engine (ped file) 2011-11-28 17:18:11 +01:00
Laurent Francioli 795c99d693 Adapted MendelianViolation to the new ped family representation. Adapted all classes using MendelianViolation too.
MendelianViolationEvaluator was added a number of useful metrics on allele transmission and MVs
2011-11-28 17:13:14 +01:00
Laurent Francioli e877db8f42 Changed visibility of getSampleDB from protected to public as the sampleDB needs to be accessible from Annotators and Evaluators too. 2011-11-28 17:11:30 +01:00
Laurent Francioli 5c2595701c Added a function to get families only for a given list of samples. 2011-11-28 17:10:33 +01:00
Mark DePristo 3c36428a20 Bug fix for TiTv calculation -- shouldn't be rounding 2011-11-28 10:20:34 -05:00
Eric Banks 436b4dc855 Updated docs 2011-11-28 08:59:48 -05:00
Laurent Francioli b1dd632d5d Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
Conflicts:
	public/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
2011-11-25 16:16:44 +01:00
Mark DePristo e319079c32 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-23 13:02:11 -05:00
Mark DePristo 4107636144 VariantEval updates
-- Performance optimizations
-- Tables now are cleanly formatted (floats are %.2f printed)
-- VariantSummary is a standard report now
-- Removed CompEvalGenotypes (it didn't do anything)
-- Deleted unused classes in GenotypeConcordance
-- Updates integration tests as appropriate
2011-11-23 13:02:07 -05:00
David Roazen e5b85f0a78 A toString() method for IntervalBindings
Necessary since we're currently writing things like this to our VCF headers:
intervals=[org.broadinstitute.sting.commandline.IntervalBinding@4ce66f56]
2011-11-23 11:56:12 -05:00
Mark DePristo 5a4856b82e GATKReports now support a format field per column
-- You can tell the table to format your object with "%.2f" for example.
2011-11-23 11:31:04 -05:00
Mark DePristo c8bf7d2099 Check for null comment 2011-11-23 10:47:21 -05:00
Mark DePristo 6c2555885c Caching getSimpleName() in VariantEval is a big performance improvement
-- Removed the SimpleMetricsByAC table, as one should just use the AlleleCount Stratefication and the upcoming VariantSummary table
2011-11-23 08:34:05 -05:00
Guillermo del Angel 32adbd614f Solve merge conflict 2011-11-22 22:48:46 -05:00
Guillermo del Angel 941f3784dc Solve merge conflict 2011-11-22 22:48:03 -05:00
Guillermo del Angel 75d93e6335 Another corner condition fix: skip likelihood computation in case we cut so many bases there's no haplotype or read left 2011-11-22 22:46:12 -05:00
Mark DePristo a3aef8fa53 Final performance optimization for GenotypesContext 2011-11-22 17:19:30 -05:00
Mark DePristo 990c02e4de Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-22 17:19:11 -05:00
Guillermo del Angel 38a90da92c Fixed merge conflict to Unstable 2011-11-22 14:39:45 -05:00
Guillermo del Angel 32a77a8a56 Prevent out of bound error in case read span > reference context + indel length. Can happen in RNAseq reads with long N CIGAR operators in the middle. 2011-11-22 13:57:24 -05:00
Eric Banks 5821c11fad For BAM and Reviewed errors we now check the error message to see if it's actually a 'too many open files' problem and, if so, we generate a User Error instead. 2011-11-22 10:50:22 -05:00
Mark DePristo 7087310373 Embarassing bug fixed 2011-11-22 10:16:36 -05:00
Mark DePristo e484625594 GenotypesContext now updates cached data for add, set, replace operations when possible
-- Involved separately managing the sample -> offset and sample sorted list operations.  This should improve performance throughout the system
2011-11-22 08:40:48 -05:00
Mark DePristo 2b51c01df4 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-21 19:16:06 -05:00
Mark DePristo 5443d3634a Again, fixing the add call when we really mean replace
-- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./.  Eric will fix long-standing bug in QD observed from this change
-- VFW MD5s restored to their old correct values.  There was a bug in my implementation to caused the genotypes to not be parsed from the lazy output even through the header was incorrect.
2011-11-21 19:15:56 -05:00
Mauricio Carneiro 5ad3dfcd62 BugFix: byte overflow in SyntheticRead compressed base counts
* fixed and added unit test
2011-11-21 17:11:50 -05:00
Mark DePristo 9ea7b70a02 Added decode method to LazyGenotypesContext
-- AbstractVCFCodec calls this if the samples are not sorted.  Previously called getGenotypes() which didn't actually trigger the decode
2011-11-21 16:21:23 -05:00
Mark DePristo ab2efe3bd3 Reverting bad exact model changes 2011-11-21 16:14:40 -05:00
Eric Banks 44554b2bfd Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-21 15:01:45 -05:00
Eric Banks 022832bd74 Very bad use of the == operator with Strings was ensuring that validating GenomeLocs was very inefficient. This fix resulted in a significant speedup for a simple RodWalker. 2011-11-21 14:49:47 -05:00
Mark DePristo 1561af22af Exact model code cleanup
-- Fixed up code when fixing a bug detected by aggressive contracts in GenotypesContext.
2011-11-21 14:35:15 -05:00
Mark DePristo 2c501364b8 GenotypesContext no longer have immutability in constructor
-- additional bug fixes throughout VariantContext and GenotypesContext objects
2011-11-21 14:34:31 -05:00
David Roazen 1296dd41be Removing the legacy -L "interval1;interval2" syntax
This syntax predates the ability to have multiple -L arguments, is
inconsistent with the syntax of all other GATK arguments, requires
quoting to avoid interpretation by the shell, and was causing
problems in Queue.

A UserException is now thrown if someone tries to use this syntax.
2011-11-21 13:18:53 -05:00
Mark DePristo e467b8e1ae More contracts on LazyGenotypesContext 2011-11-21 09:34:57 -05:00
Mark DePristo 2e9ecf639e Generalized interface to LazyGenotypesContext
-- Now you provide a LazyParsing object
-- LazyGenotypesContext now knows nothing about the VCF parser itself.  The parser holds all of the necessary data to parse the VCF genotypes when necessarily, and the LGC only has a pointer to this object
-- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version
-- Deleted VCFParser interface, as it was no longer necessary
2011-11-21 09:30:40 -05:00
Mark DePristo bc44f6fd9e Utility function Collection<Genotype> -> Collection<String> 2011-11-20 18:26:56 -05:00
Mark DePristo 9445326c6c Genotype is Comparable via sampleName 2011-11-20 18:26:27 -05:00
Mark DePristo f9e25081ab Completed documented LazyGenotypesContext 2011-11-20 08:35:52 -05:00
Mark DePristo 9cb3fe3a59 Vastly better way of doing on-demand genotyping loading
-- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading.
-- This new class was replaced all of the old, complex functionality
-- Better still, there were many cases were the genotypes were being loaded unnecessarily, resulting in efficiency.  This was detected because some of the integration tests changed as the genotypes were no longer being parsing unnecessarily
-- Misc. bug fixes throughout the system
-- Bug fixes for PhaseByTransmission with new GenotypesContext
2011-11-20 08:23:09 -05:00
Mark DePristo f392d330c3 Proper use of builder. Previous conversion attempt was flawed 2011-11-19 22:09:56 -05:00
Mark DePristo 7d09c0064b Bug fixes and code cleanup throughout
-- chromosomeCounts now takes builder as well, cleaning up a lot of code throughout the codebase.
2011-11-19 18:40:15 -05:00
Mark DePristo 8f7eebbaaf Bugfix for pError not being checked correctly in CommonInfo
-- UnitTests to ensure correct behavior
-- UnitTests to ensure correct behavior for pass filters vs. failed filters vs. unfiltered
2011-11-19 15:58:59 -05:00
Mark DePristo b7b57ef39a Updating MD5 to reflect canonical ordering of calculation
-- We should no longer have md5s changing because of hashmaps changing their sort order on us
-- Added GenotypeLikelihoodsUnitTests
-- Refactored ExactAFCaclculation to put the PL -> QUAL calculation in the GenotypeLikelihoods class to avoid the code copy.
2011-11-19 15:57:33 -05:00
Mark DePristo 73119c8e3c Merge with master
-- A few bug fixes
2011-11-19 09:56:06 -05:00
Mark DePristo f685fff79b Killing the final versions of old new VariantContext interface 2011-11-18 21:32:43 -05:00
Mark DePristo 6cf315e17b Change interface to getNegLog10PError to getLog10PError 2011-11-18 21:07:30 -05:00
Mark DePristo c7f2d5c7c7 Final minor fix to contract 2011-11-18 19:40:05 -05:00
Mauricio Carneiro b5de182014 isEmpty now checks if mReadBases is null
Since newly created reads have mReadBases == null. This is an effort to centralize the place to check for empty GATKSAMRecords.
2011-11-18 18:34:05 -05:00
Mauricio Carneiro 8ab3ee9c65 Merge remote-tracking branch 'unstable/master' into rr 2011-11-18 16:50:25 -05:00
Mauricio Carneiro 333e5de812 returning read instead of GATKSAMRecord
Do not create new GATKSAMRecord when read has been fully clipped, because it is essentially the same as returning the currently fully clipped read.
2011-11-18 16:49:59 -05:00
Matt Hanna 8bb4d4dca3 First pass of the asynchronous block loader.
Block loads are only triggered on queue empty at this point.  Disabled by
default (enable with nt:io=?).
2011-11-18 15:02:59 -05:00
Mark DePristo a2e79fbe8a Fixes to contracts 2011-11-18 14:18:53 -05:00
Mark DePristo 660d6009a2 Documentation and contracts for GenotypesContext and VariantContextBuilder 2011-11-18 13:59:30 -05:00
Mark DePristo f54afc19b4 VariantContextBuilder
-- New approach to making VariantContexts modeled on StringBuilder
-- No more modify routines -- use VariantContextBuilder
-- Renamed isPolymorphic to isPolymorphicInSamples.   Same for mono
-- getChromosomeCount -> getCalledChrCount
-- Walkers changed to use new VariantContext.  Some deprecated new VariantContext calls remain
-- VCFCodec now uses optimized cached information to create GenotypesContext.
2011-11-18 12:39:10 -05:00
Eric Banks 6459784351 Merged bug fix from Stable into Unstable 2011-11-18 12:34:57 -05:00
Eric Banks c62082ba1b Making this class public again as per request from Cancer folks 2011-11-18 12:34:27 -05:00
Eric Banks 8710673a97 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-18 12:29:33 -05:00
Eric Banks 768b27322b I figured out why we were getting tons of hom var genotype calls with Mauricio's low quality (synthetic) reduced reads: the RR implementation in the UG was not capping the base quality by the mapping quality, so all the low quality reads were used to generate GLs. Fixed. 2011-11-18 12:29:15 -05:00
Mark DePristo 7490dbb6eb First version of VariantContextBuilder 2011-11-18 11:06:15 -05:00
Roger Zurawicki f48d4cfa79 Bug fix: fully clipping GATKSAMRecords and flushing ops
Reads that are emptied after clipping become new GATKSAMRecords.
When applying ClippingOps, the ops are cleared after the clipping
2011-11-18 00:24:39 -05:00
Mark DePristo fa454c88bb UnitTests for VariantContext for chrCount, getSampleNames, Order function
-- Major change to how chromosomeCounts is computed.  Now NO_CALL alleles are always excluded.  So ChromosomeCounts(A/.) is 1, the previous result would have been 2.
-- Naming changes for getSamplesNameInOrder()
2011-11-17 20:37:22 -05:00
Mark DePristo 23359d1c6c Bugfix for pruneVariantContext, which was dropping the ref base for padding 2011-11-17 15:32:52 -05:00
Mark DePristo 473b860312 Major determinism fix for UG and RankSumTest
-- Now these routines all iterate in sample name order (genotypes.iterateInSampleNameOrder) so that the results of UG and the annotator do not depend on the particular order of samples we see for the exact model and the RankSumTest
2011-11-17 15:31:45 -05:00
Khalid Shakir c50274e02e During flanking interval creation merging overlapping flanks so that on scatter the list doesn't accidentally genotype the same site twice.
Moved flanking interval utilies to IntervalUtils with UnitTests.
2011-11-17 13:56:42 -05:00
Eric Banks 16a021992b Updated header description for the INFO and FORMAT DP fields to be more accurate. 2011-11-17 13:17:53 -05:00
Eric Banks e7d41d8d33 Minor cleanup 2011-11-17 12:00:28 -05:00
Mark DePristo 7e66677769 Expanded UnitTests for VariantContext
Tests for
-- getGenotype and getGenotypes
-- subContextBySample
-- modify routines
2011-11-16 20:45:15 -05:00
Mark DePristo aa0610ea92 GenotypeCollection renamed to GenotypesContext 2011-11-16 16:24:05 -05:00
Mark DePristo caf6080402 Better algorithm for merging genotypes in CombineVariants 2011-11-16 15:17:33 -05:00
Mark DePristo e56d52006a Continuing bugfixes to get new VC working 2011-11-16 10:39:17 -05:00
Matt Hanna eb8e031f75 Merged bug fix from Stable into Unstable 2011-11-16 09:57:37 -05:00
Matt Hanna 6a5d5e7ac9 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/stable 2011-11-16 09:57:13 -05:00
Matt Hanna 7ac5cf8430 Getting rid of unsupported CountReadPairs walker in stable. Removal of
remainder of pairs processing framework to follow in unstable.
2011-11-16 09:53:59 -05:00
Eric Banks c2ebe58712 Merge remote-tracking branch 'Laurent/master' 2011-11-16 09:34:47 -05:00
Laurent Francioli 0dc3d20d58 Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type 2011-11-16 09:33:13 +01:00
Laurent Francioli 7d77fc51f5 Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type 2011-11-16 03:32:43 -05:00
David Roazen 0d163e3f52 SnpEff 2.0.4 support
-Modified the SnpEff parser to work with the SnpEff 2.0.4 VCF output format
-Assigning functional classes and effect impacts now handled directly
 by SnpEff rather than the GATK
-Removed support for SnpEff 2.0.2, as we no longer trust the output of that
 version since it doesn't exclude effects associated with certain nonsensical
 transcripts. These effects are excluded as of 2.0.4.
-Updated unit and integration tests

This support is based on a *release-candidate* of SnpEff 2.0.4, and so is subject
to change between now and the next GATK release.
2011-11-15 18:36:22 -05:00
Mark DePristo df415da4ab More bug fixes on the way to passing all tests 2011-11-15 17:38:12 -05:00
Mark DePristo 0be23aae4e Bugfixes on way to a working refactored VariantContext 2011-11-15 17:20:14 -05:00
Mark DePristo 231c47c039 Bugfixes on way to a working refactored VariantContext 2011-11-15 16:42:50 -05:00
Laurent Francioli fb685f88ec Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-15 16:23:53 -05:00
Mark DePristo 2b2514dad2 Moved many unused phasing walkers and utilities to archive 2011-11-15 16:14:50 -05:00
Mark DePristo 460a51f473 ID field now stored in the VariantContext itself, not the attributes 2011-11-15 14:56:33 -05:00
Eric Banks b45d10e6f1 The DP in the FORMAT field (per sample) must also use the representative count or else it's always 1 for reduced reads. 2011-11-15 10:23:59 -05:00
Mark DePristo 233e581828 Merging in Master 2011-11-15 09:28:24 -05:00
Eric Banks b66556f4a0 Update error message so that it's clear ReadPair Walkers are exceptions 2011-11-15 09:22:57 -05:00
Mark DePristo 6e1a86bc3e Bug fixes to VariantContext and GenotypeCollection 2011-11-15 09:21:30 -05:00
Mauricio Carneiro cde829899d compress Reduce Read counts bytes by offset
compressed the representation of the reduce reads counts by offset results in 17% average compression in final BAM file size.

Example compression -->

from : 10, 10, 11, 11, 12, 12, 12, 11, 10
to:      10, 0, 1, 1,2, 2, 2, 1, 0
2011-11-14 18:30:24 -05:00
Mark DePristo f0234ab67f GenotypeMap -> GenotypeCollection part 2
-- Code actually builds
2011-11-14 17:42:55 -05:00
David Roazen ab0ee9b847 Perform only necessary validation in VariantContext modify methods 2011-11-14 16:49:59 -05:00
Mark DePristo 2e9d5363e7 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-14 15:32:06 -05:00
Mark DePristo 1fbdcb4f43 GenotypeMap -> GenotypeCollection 2011-11-14 15:32:03 -05:00
Eric Banks 4dc9dbe890 One quick fix to previous commit 2011-11-14 14:42:12 -05:00
Eric Banks 7b2a7cfbe7 Transfer headers from the resource VCF when possible when using expressions. While there, VA was modified so that it didn't assume that the ID field was present in the VC's info map in preparation for Mark's upcoming changes. 2011-11-14 14:31:27 -05:00
Mark DePristo 9b5c79b49d Renamed InferredGeneticContext to CommonInfo
-- I have no idea why I named this InferredGeneticContext, a totally meaningless term
-- Renamed to CommonInfo.
-- Made package protected, as no one should use this outside of VariantContext and Genotype
-- UGEngine was using IGC constant, but it's now using the public one in VariantContext.
2011-11-14 14:28:52 -05:00
Mark DePristo 077397cb4b Deleted MutableVariantContext
-- All methods that used this capable now use VariantContext directly instead
2011-11-14 14:19:06 -05:00
Mark DePristo b11c535527 Deleted MutableGenotype
-- This class wasn't really used anywhere, and so removed to control code bloat.
2011-11-14 13:16:36 -05:00
Mark DePristo 79987d685c GenotypeMap contains a Map, not extends it
-- On path to replacing it with GenotypeCollection
2011-11-14 12:55:03 -05:00
Eric Banks 7aee80cd3b Fix to deal with reduced reads containing a deletion 2011-11-14 12:23:46 -05:00
Eric Banks 3d2970453b Misc minor cleanup 2011-11-14 09:41:54 -05:00
Laurent Francioli 1347beef40 Merge branch 'PhaseByTransmission' 2011-11-14 11:31:28 +01:00
Eric Banks b7c33116af Minor docs update 2011-11-12 23:21:07 -05:00
Eric Banks 76d357be40 Updating docs example to use -L since that's best practice 2011-11-12 23:20:05 -05:00
Mark DePristo fee9b367e4 VariantContext genotypes are now stored as GenotypeMap objects
-- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples
-- All instances in the gatk that used Map<String, Genotype> now use GenotypeMap type.
-- Amazingly, there were many places where HashMap<String, Genotype> is used, so that the order of the genotypes is technically undefined and could be dangerous.  Now everything uses GenotypeMap with a specific ordering of samples (by name)
-- Integrationtests updated and all pass
2011-11-11 15:00:35 -05:00
Guillermo del Angel cd3146f4cf Add hidden option to ValidationAmplicons to output slightly modified format to make file work with downstream SQNM tools more seamlessly at request of GAP: one line per record, keep probe identifier to 20 characters, no * in ref allele. 2011-11-11 14:07:07 -05:00
Ryan Poplin 40fbeafa37 VQSR will now detect if the negative model failed to converge properly because of having too few data points and automatically retry with more appropriate clustering parameters. 2011-11-11 11:52:30 -05:00
Mark DePristo ef9f8b5d46 Added subContextOfSamples to VariantContext
-- This is a more convenient accesssor than subContextOfGenotypes, represents nearly all of the use cases of the former function, and potentially can be implemented more efficiently.
2011-11-11 10:07:11 -05:00
Mark DePristo ee40791776 Attributes are now Map<String,Object> not Map<String,?>
-- Allows us to avoid an unnecessary copy when creating InferredGeneticContext (whose name really needs to change).
2011-11-11 09:55:42 -05:00
Mark DePristo dc9b351b5e Meaningful error message when an IntervalArg file fails to parse correctly 2011-11-10 17:10:26 -05:00
Mark DePristo bb7bf74aa8 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-10 16:05:43 -05:00
Mauricio Carneiro 060c7ce8ae It wouldn't harm integrationtests if we had our logic right... :-) 2011-11-10 14:03:22 -05:00
Eric Banks 39678b6a20 Check for reads with missing read groups and throw a UserException when encountered. Mauricio said this wouldn't break integration tests. 2011-11-10 13:34:45 -05:00
Mark DePristo dd1810140f -stratIntervals is optional 2011-11-10 13:27:32 -05:00
Mark DePristo 67b022c34b Cleanup for new SampleUtils function
-- getVCFHeadersFromRods(rods) is now available so that you don't have getVCFHeadersFromRods(rods, null) throughout the codebase
2011-11-10 13:27:13 -05:00
Mark DePristo 35fe9c8a06 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-10 11:11:33 -05:00
Mark DePristo dc4932f93d VariantEval module to stratify the variants by whether they overlap an interval set
The primary use of this stratification is to provide a mechanism to divide asssessment of a call set up by whether a variant overlaps an interval or not.  I use this to differentiate between variants occurring in CCDS exons vs. those in non-coding regions, in the 1000G call set, using a command line that looks like:

-T VariantEval -R human_g1k_v37.fasta -eval 1000G.vcf -stratIntervals:BED ccds.bed -ST IntervalStratification

Note that the overlap algorithm properly handles symbolic alleles with an INFO field END value.  In order to safely use this module you should provide entire contigs worth of variants, and let the interval strat decide overlap, as opposed to using -L which will not properly work with symbolic variants.

Minor improvements to create() interval in GenomeLocParser.
2011-11-10 10:58:40 -05:00
Mauricio Carneiro 0d8983feee outputting the RG information
setReadGroup now sets the read group attribute for the GATKSAMRecord
2011-11-09 23:35:00 -05:00
Eric Banks 315ac68b0b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-09 22:37:36 -05:00
Eric Banks 6313aae2c4 Adding checks for hasBasePileup() before calling getBasePileup() as per GS thread 2011-11-09 22:37:26 -05:00
Ryan Poplin 74a18d3de8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-09 22:29:40 -05:00
Ryan Poplin 24712c0221 Merged bug fix from Stable into Unstable 2011-11-09 22:28:27 -05:00
Ryan Poplin 8942406aa2 Use MathUtils to compare doubles instead of testing for equality 2011-11-09 22:05:21 -05:00
Ryan Poplin 348f2db7fd Fix for HMM optimization. If the two penalty arrays match exactly the function should return the end of the array instead of 0. 2011-11-09 22:00:52 -05:00
Eric Banks 82bf09edf3 Mark Standard Annotations with an asterisk 2011-11-09 20:42:31 -05:00
Eric Banks 04b122be29 Fix for bug reported on GetSatisfaction 2011-11-09 20:33:36 -05:00
Mauricio Carneiro d00b2c6599 Adding a synthetic read for filtered data
* Generalized the concept of a synthetic read to cread both running consensus and a synthetic reads of filtered data.
* Synthetic reads can now have deletions (but not insertions)
* New reduced read tag for filtered data synthetic reads *(RF)*
* Sliding window header now keeps information of consensus and filtered data
* Synthetic reads are created simultaneously, new functionality is controlled internally by addToSyntheticReads
2011-11-09 20:16:22 -05:00
Eric Banks 21bf43f3bb Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-09 15:34:40 -05:00
Christopher Hartl 85bffe1dca Merged bug fix from Stable into Unstable 2011-11-09 15:29:14 -05:00
Christopher Hartl d828eba7f4 Allow comments in a table-formatted file to precede the header line. 2011-11-09 15:27:38 -05:00
Eric Banks 8205efbb29 Merge branch 'master' into intervals 2011-11-09 15:27:15 -05:00
Eric Banks d64f8a89a9 Instead of the SelfScopingFeatureCodec interface, pushed this functionality into Tribble itself. Now we can e.g. determine that a file can be parsed by the BedCodec on the fly. 2011-11-09 15:24:29 -05:00
Mauricio Carneiro f080f64f99 Preserve RG information on new GATKSAMRecord from SAMRecord 2011-11-09 14:39:20 -05:00
Mauricio Carneiro f9530e0768 Clean unnecessary attributes from the read
this gives on average 40% file size reduction.
2011-11-09 14:39:20 -05:00
Mauricio Carneiro 9427ada498 Fixing no cigar bug
empty GATKSAMRecords will have a null cigar. Treat them accordingly.
2011-11-09 14:39:20 -05:00
Mark DePristo e639f0798e mergeEvals allows you to treat -eval 1.vcf -eval 2.vcf as a single call set
-- A bit of code cleanup in VCFUtils
-- VariantEval table to create 1000G Phase I variant summary table
-- First version of 1000G Phase I summary table Qscript
2011-11-09 14:35:50 -05:00
Christopher Hartl 149b79eaad Merged bug fix from Stable into Unstable 2011-11-09 11:26:30 -05:00
Christopher Hartl 11abb4f9d1 Better error message. 2011-11-09 11:25:28 -05:00
Christopher Hartl d3a533b82e Revert "a"
This reverts commit 1175f50ddbf389f5da74d27dc725596582ae15af.
2011-11-09 11:22:26 -05:00
Christopher Hartl 5eaf800281 a 2011-11-09 11:22:20 -05:00
Christopher Hartl 5451fbc2b2 Merged bug fix from Stable into Unstable 2011-11-09 11:06:15 -05:00
Christopher Hartl 091229e4db MVLikelihoodRatio now checks if the family string is provided before attempting to instantiate. Also check that variant contexts have both genotypes and genotype likelihoods.
Table codec now yells at users for not providing a HEADER with the table - parsing tables without a header line was causing the first line of the file to be eaten.
Table feature now has a toString method.

These are minor bug fixes.
2011-11-09 11:03:29 -05:00
Mauricio Carneiro e1b4c3968f Fixing GATKSAMRecord bug
when constructing a GATKSAMRecord from scratch, we should set "mRestOfBinaryData" to null so the BAMRecord doesn't try to retrieve missing information from the non-existent bam file.
2011-11-08 16:50:36 -05:00
Ryan Poplin e973ca2010 fixing merge conflict. 2011-11-08 14:55:05 -05:00
Ryan Poplin b0e6afec48 Bug fix for HMM optimization. Need to also check the gap continuation penalty array for the index with the first discrepancy. 2011-11-08 14:51:25 -05:00
Laurent Francioli 571c724cfd Added reporting of the number of genotypes updated. 2011-11-08 15:15:51 +01:00
Ryan Poplin 94dc447a70 Merged bug fix from Stable into Unstable 2011-11-07 15:26:35 -05:00
Ryan Poplin 0b181be61f Bug fix in SelectVariants when using a discordance track but no sample specifications. Added integration test to test this. 2011-11-07 15:25:16 -05:00
Ryan Poplin 0534149708 Merged bug fix from Stable into Unstable 2011-11-07 14:07:08 -05:00
Ryan Poplin 2d1e385ca4 Adding note to VQSR docs about Rscript being needed in the environment PATH. 2011-11-07 14:04:13 -05:00
Eric Banks 759f4fe6b8 Moving unclaimed walker with bad integration test to archive 2011-11-07 13:16:38 -05:00
Eric Banks c1986b6335 Add notes to the GATKdocs as to when a particular annotation can/cannot be calculated. 2011-11-07 11:06:19 -05:00
Eric Banks 724e3f3b0d Merged bug fix from Stable into Unstable 2011-11-06 22:23:22 -05:00
Eric Banks cdd40d1222 Removing contracts for the SimpleTimer 2011-11-06 22:22:49 -05:00
Ryan Poplin 5c565d28b9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-06 10:26:19 -05:00
Eric Banks 1c4e429a1c Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-06 00:05:56 -04:00
Eric Banks a12bc63e5c Get rid of support for bams without sample information in the read groups. This hidden option wasn't being used anyways because it wasn't hooked up properly in the AlignmentContext. 2011-11-05 23:54:28 -04:00
Eric Banks 90a053ea93 Don't change the mapping quality of MQ=255 reads in IR 2011-11-05 22:40:45 -04:00
Ryan Poplin 611a395783 Now properly extending candidate haplotypes with bases from the reference context instead of filling with padding bases. Functionality in the private Haplotype class is no longer necessary so removing it. No need to have four different Haplotype classes in the GATK. 2011-11-05 12:18:56 -04:00
Mark DePristo e99871f587 Bug fix for decode loc
-- decodeLoc() wasn't skipping input header lines, so the system blew up when there was an = line being split.
2011-11-04 13:20:54 -04:00
Mark DePristo a340a1aeac Bug fix. decodeLoc() should update lineNo so you get meaningful line no when indexing
due to malformed VCF files.
2011-11-04 11:44:24 -04:00
Mark DePristo 9f260c0dc1 Zero byte index bug fix for RandomlySplitVariants + cleanup
-- vcfWriter2 was never being closed in onTraversalDone(), so the on the fly index file was being created but never actually properly written to the file.

-- This bug is ultimately due to the inability of the GATK to allow multiple VCF output writers as @Output arguments, though

-- Removed the unnecessary local variable iFraction, = 1000 * the input fraction argument.  Now the system just uses a double random number and compares to the input fraction at all.  Is there some subtle reason I don't appreciate for this programming construct?
2011-11-04 09:45:20 -04:00
Mauricio Carneiro e89ff063fc GATKSAMRecord refactor
The GATK engine will now provide a GATKSAMRecord to all tools which incorporates the functionality used by the GATK to the bam file (ReadGroups, Reduced Reads, ...).

* No tools should create SAMRecord anymore, use GATKSAMRecord instead *
2011-11-03 15:43:26 -04:00
Laurent Francioli 385a6abec1 Fixed a bug that wrongly swapped the mother and father genotypes in case the child genotype missing. 2011-11-03 13:04:53 +01:00
Laurent Francioli 893787de53 Functions getAsMap and getNegLog10GQ now handle missing genotype case. 2011-11-03 13:04:11 +01:00
Eric Banks e8bceb1eaa Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-02 21:13:54 -04:00
Eric Banks 52b16bf739 Must check whether there's a normal vs. extended pileup before asking for it. 2011-11-02 20:45:24 -04:00
Eric Banks e1edd6bd12 Removing the min mapping quality argument since it wasn't being used in the normal processing of the pileups in UG - only for indel pileups. Instead, we apply the min base quality to the reads in the pileup for indels and define it to be the min 'confidence' of the base. Docs are updated but I didn't rename the argument as I don't want people to complain. 2011-11-02 20:32:58 -04:00
Ryan Poplin e94fcf537b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-02 16:29:19 -04:00
Ryan Poplin 4d35272916 Bug fixes with Mauricio to functions in ReadUtils used by reduced reads and the haplotype caller. 2011-11-02 16:29:10 -04:00
Mark DePristo 8a2929c1dd Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-02 16:21:00 -04:00
Laurent Francioli 19ad5b635a - Calculation of parent/child pairs corrected
- Separated the reporting of single and double mendelian violations in trios
2011-11-02 18:35:31 +01:00
Eric Banks 967ff647b8 Reduced reads shouldn't contribute to Fisher Strand calculations 2011-11-02 13:07:20 -04:00
Eric Banks cf0e699226 QualByDepth was inefficiently iterating over the pileup 2 times for some reason. Removed non-useful annotation classes. 2011-11-02 12:58:38 -04:00
Eric Banks 4501dce58d Fixing merge conflict 2011-11-02 12:50:32 -04:00
Eric Banks 54331b44e9 New way of looking at the size of a pileup: there's a physical number of elements in the data structure and there's a representative depth of coverage (since a reduced read represents depth >= 1). The size() method has been removed because its meaning is ambiguous. Updated several annotations and the UG engine to make use of the representative depths. 2011-11-02 12:47:30 -04:00
Mark DePristo c2b97030a4 IntervalUtils for completely balanced locus-based scatter/gather
-- scatterLocusIntervals master utility
-- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc
-- Util function for reversing a list (List<T> -> List<T>, unlike Collections version)
-- DoC is PartitionType.INTERVAL
-- Significant unit tests on new functionality (all passing)
-- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work
2011-11-02 10:49:40 -04:00
Laurent Francioli 119ca7d742 Fixed a bug in parent/child pairs reporting causing a crash in case the -mvf option was used and mother was not provided 2011-11-02 08:22:33 +01:00
Laurent Francioli b91a9c4711 - Fixed parent/child pairs handling (was crashing before)
- Added parent/child pair reporting
2011-11-02 08:04:01 +01:00
Mark DePristo 5fc613f972 Better default partition types for walkers
-- Added PartitionType.READ, and associated ReadScatterFunction.  ReadScatterFunction is literally just ContigScatterFunction until someone wants to implement something better
-- LocusWalkers (and subclasses RodWalkers and RefWalkers) are by default PartitionType.LOCUS.
2011-11-01 19:47:10 -04:00
Mauricio Carneiro 36600fd8e9 added MQ of low MQ/BQ to consensus RMS
Bases that were excluded for MQ and BQ filters are now contributing to the MQ RMS (but not to consensus base counts and variant/not variant region triggers).
2011-11-01 17:46:12 -04:00
Mauricio Carneiro b004489c6d Moving ReduceRead TAG to GATKSAMRecord
ReduceReads are now a feature of a GATKSAMRecord, so the tag and the special methods needed to use it will now be housed by the GATKSAMRecord.
2011-11-01 17:12:09 -04:00
Mauricio Carneiro 17cc484dbd Revert "ReduceReads ref bases are now output as '='
Reducing the reference bases to '=' results in an extra compression of 13% on average. The GATK is not ready to handle files with '=' bases, and the decision was to implement this a an engine support, not a part of ReduceReads.
2011-11-01 16:35:07 -04:00
Eric Banks 0839c75c8d More minor fixes to docs 2011-10-31 21:49:27 -04:00
Eric Banks 74b018a1f3 Minor fixes to docs 2011-10-31 21:41:43 -04:00
Eric Banks 31ee5432c5 Merged bug fix from Stable into Unstable 2011-10-31 14:56:59 -04:00
David Roazen cdde32acbd Merged bug fix from Stable into Unstable 2011-10-31 14:21:15 -04:00
Eric Banks f62af0291b Check for invalid VCF records (not enough tokens) instead of assuming they are there. 2011-10-31 14:09:51 -04:00
Andrey Sivachenko bed0acaed4 nWayOut now adds PG tag to the header as it should. Also, additional hidden option added: keepPGTags. If invoked, IndelRealigner PG tags from previous runs (if any) are kept in the header and the new PG tag is simply added, instead of overriding them 2011-10-31 12:28:28 -04:00
Mauricio Carneiro 389380a590 ReduceReads ref bases are now output as '=' to save space
Restructured the sliding window framework to manipulate a wrapped version of the SAMRecord that contains information about the reference.
2011-10-30 12:04:39 -04:00
Eric Banks 0ca7428e76 Allow processing of empty intervals, but warn user when this case is encountered. 2011-10-28 12:12:14 -04:00
Eric Banks 649dfe98f0 Add VCF header for any expressions that are requested 2011-10-28 10:22:19 -04:00
Eric Banks 057a79f598 This argument should be annotated as @Input 2011-10-28 09:44:49 -04:00
Eric Banks 4ba7c0cecd Moving to private 2011-10-28 09:29:28 -04:00
Eric Banks 1bdd76c2f2 These tools now use the IntervalBinding system to handle intervals instead of doing it all manually 2011-10-28 09:28:12 -04:00
Eric Banks 6ba08a103d Empty ROD files should generate an exception when used for creating intervals. Moved some now obsolete files to the archive as the realigner will now read all target intervals into memory. 2011-10-28 09:23:25 -04:00
Eric Banks 3d04bb5608 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-27 23:55:18 -04:00
Eric Banks 19e27d4568 Removing all instances of -BTI (in tests and in GATKdocs) and replacing them with the appropriate alternative. 2011-10-27 23:55:11 -04:00
Eric Banks cafc245a43 For some reason, a class of Codecs (including TableCodec) require that a GenomeLocParser be passed in to do the position processing. Why can't they just return a Feature with chr, start, stop? Isn't that the right thing? 2011-10-27 23:54:28 -04:00
Guillermo del Angel cbc43683ee Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-27 20:54:18 -04:00
Guillermo del Angel 8907e42007 First fully functional implementation of ValidationSiteSelectorWalker. User gives a) a set of input variants, b) a desired number of output variants, b) Optionally, a set of samples which will restrict sites to be polymorphic in those samples, c) a frequency selection mode: either uniform (no AF matching), or matching AF so that output sites mirror the input AF spectrum as closely as possible.
More testing is needed and docs need improving but so far all functionality seems up and running
2011-10-27 20:53:48 -04:00
Eric Banks ccfd853b34 Added further integration tests for rod-based intervals that deal with more complex cases. Good call by Mark to test the empty VCF example because we were failing on it; fixed. 2011-10-27 20:43:50 -04:00
Eric Banks c2f343773e Oops, working too quickly last time. This is the proper fix for the potential NPE in the equals() test. 2011-10-27 15:32:08 -04:00
Khalid Shakir b80d407dc7 No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path.
Other minor cleanup.
2011-10-27 14:17:07 -04:00
Eric Banks 8c4dbce6d8 Don't serialize the GATKArgumentCollection for the GATKRunReports (which would have meant dealing with the new IntervalBindings). Also, forgot to remove a test that's no longer relevant to BED parsing. 2011-10-27 13:58:19 -04:00
Eric Banks 4a7e6fee3f Remove support for BED file interval parsing in the GATK; it should all go through Tribble now. IndelRealigner no longer supports unordered interval input (which shouldn't have been used anyways). Temporarily commenting out serialization of arguments so that tests pass; this whole piece will be deleted soon anyways. 2011-10-27 13:38:08 -04:00
Matt Hanna f7df8bdecc Merged bug fix from Stable into Unstable 2011-10-27 11:31:17 -04:00
Matt Hanna 41ddc7bce7 Make sure we output a full stack trace when we encounter Tribble error messages on VCF header merge. 2011-10-27 11:30:04 -04:00
Eric Banks 44f905b5e5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-26 23:31:11 -04:00
Eric Banks 68283b1651 Fixing docs and adding GATKdocs for the new interval functionality 2011-10-26 22:14:43 -04:00
Mark DePristo c9978316a3 Merge branch 'FragmentUtils' 2011-10-26 19:51:49 -04:00
Mauricio Carneiro add9ad97ec No scatter gather for VQSR or ApplyVQSR.
These walkers should not be scatter gatherable. Annotating them accordingly so that Queue doesn't allow a less than knowledgeable user to try and scatter/gather VQSR.
2011-10-26 16:35:44 -04:00
Ryan Poplin 74aeb22eeb Merged bug fix from Stable into Unstable 2011-10-26 15:57:30 -04:00
Ryan Poplin 86871bd1e3 Throw a UserException in the BQSR when there is no data instead of creating an empty csv file 2011-10-26 15:56:41 -04:00
Mark DePristo 034a997d07 Generalized Reads -> Fragment calculation
-- Supports ReadBackedPileup -> FragmentCollection as before
-- Added support for List<SAMRecord> -> FragmentCollection for Ryan's haplotype caller
-- General cleanup, renaming, move to separate package, more extensive unit tests, etc.
-- Added toFragment() function to ReadBackedPileup interface
2011-10-26 15:54:38 -04:00
Eric Banks 2f21b6ecfb Removed debugging output 2011-10-26 15:50:20 -04:00
Eric Banks b39fcb1bea Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-26 15:44:25 -04:00
Eric Banks b6ce6ed3f8 Go around the ROD system for now so that we can just call decodeLoc() for efficiency. Noted that we should go through the ROD system once it gets cleaned up. This means that currently gzipped files are not supported with -L. 2011-10-26 15:42:53 -04:00
Eric Banks 9424e8b2ca Initial working version of new interval system in which the argument for -L (and -XL) is allowed to be a rod file (e.g. VCF). Old samtools-style intervals still behave as before. BTI is no longer supported. The merging (union or intersection) of intervals is now consistently applied to all -L (or -XL) intervals, which is nice. More testing needed. 2011-10-26 14:11:49 -04:00
Mark DePristo 7fa943aef1 Renamed FragmentPileup to FragmentUtils 2011-10-26 14:01:45 -04:00
Laurent Francioli 1f044faedd - Genotype assignment in case of equally likeli combination is now random
- Genotype combinations with 0 confidence are now left unphased
2011-10-26 19:57:09 +02:00
Laurent Francioli 81b163ff4d Indentation 2011-10-26 14:49:12 +02:00
Laurent Francioli 62cff266d4 GQ calculation corrected for most likely genotype 2011-10-26 14:40:04 +02:00
Mark DePristo af3613cc5f GATKSAMRecord commit branch summary
First, I'm sure there's a better way to do this, but I wanted to create a single commit summarizing the changes from my branch SamRecordFactory.  What's the best way to do this?  Rebase?

Now, on to the changes here:

-- Picard added a SamRecordFactory that is used to create instances the subclass SamRecord or BAMRecord.  This factory allows us to have low-level picard readers (SamFileReader) create objects of type GATKSamRecord.  The abomination of the extends and contains GATKSamRecord is now gone.  GATKSamRecords are now produced by this factory, the GATK provides this factory to our SamFileReaders, and everything works with GATKSamRecord just extending BAMRecord.  This results in up to a 2x performance improvement in writing BAMs and a ~10% improvement when reading BAMs files.

-- As a consequence of this, we no longer officially support SAM records.  Attempting to create SAMRecord objects with the factory will throw a user exception.

-- Created a standard NGSPlatform enum, and GATKSamRecords support efficiently obtaining this value.  The real BQSR (not the copy indel version) got the efficient code to use this.  Please add all future platforms to this enum.

-- GATKSamRecord no longer supports using the OQ or defaultBaseQuality.  This is performed in a wrapper iterator that's only added when these command line options are used.

-- ReducedRead code has been moved from ReadUtils until efficiency caching assessors in GATKSamRecord.

-- ArtificialSamUtils creates GATKSamRecords now, just SAMRecords.  Added code here to create artifical pairs and using that code to create artificial ReadBackedPileups with specific properties

-- New smarter algorithm for FragmentPileup.  This new code is up to 3x faster than the previous version, and is lazy so is more efficient when no overlapping pairs are actually in the pileup.  Created extensive DataProvider driven UnitTest.  Added Caliper-based benchmarking system to characterize the performance differences between the old and new algorithms.  TODO still remains to make a efficient version that works for non-pileups for the HaplotypeCaller
2011-10-25 20:52:56 -04:00
Mark DePristo 2822f0dc27 Merge branch 'SamRecordFactory' 2011-10-25 20:34:47 -04:00
Mark DePristo 1b722c21cf merge master 2011-10-25 16:08:39 -04:00
Ryan Poplin 56fdf0b865 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-25 15:58:56 -04:00
Ryan Poplin 4a34c1862e misc cleanup. We now filter out haplotypes when it is obvious that the assembly has failed to find a parsimonious event rather than use haplotypes with large numbers of SNPs and small indels on them. 2011-10-25 15:22:28 -04:00
Guillermo del Angel b559936b7a a)New variant eval stratification module for indel size. b) Next iteration on indel caller runtime optimization: when computing likelihood of each haplotype for a given read, many computations will be redundant since pieces of haplotypes will be common to both REF and ALT haplotypes. So, we keep HMM matrices from one haplotype to the next one and recompute starting at the part where either haplotype is different or GOP/GCP are different. 2011-10-25 09:56:43 -04:00
Khalid Shakir fac9932938 Embedding gsalib source and queueJobReport R scripts in the dist and package jars.
Moved gsalib and queueJobReport.R to embeddable namespaced locations.
Updated packager dependencies/dir to add an @includes which filters the embedded fileset.
RScriptExecutor can now JIT compiles the gsalib.
RScriptExecutor uses ProcessController and sends the Rscript output to java's stdout when run under -l DEBUG.
Refactored ProcessController and IOUtils from Queue to Sting Utils.
Added more unit tests to ProcessController along with a utility class to hard stop OutputStreams at a specified byte count.
Replaced uses of some IOUtils with Apache Commons IO.
ShellJobRunner refactored to use direct ProcessController and now kills jobs on shutdown.
Better QGraph responsiveness on shutdown by using Object.wait() instead of Thread.sleep().
2011-10-24 15:58:34 -04:00
Khalid Shakir 89a581a66f Added ability to specify arguments in files via -args/--arg_file
Pushing back downsample and read filter args so they show up in getApproximateCommandLineArgs()
2011-10-24 15:58:34 -04:00
Mark DePristo 502592671d Cleanup FragmentPileup before main repo commit
-- removed intermiate functions.  Now only original version and best optimized new version remain
-- Moved general artificial read backed pileup creation code into ArtificialSamUtils
2011-10-24 14:40:05 -04:00
Mark DePristo 166174a551 Google caliper example execution script
-- FragmentPileup with final performance testing
2011-10-24 14:04:53 -04:00
Laurent Francioli 62477a0810 Added documentation and comments 2011-10-24 13:45:21 +02:00
Laurent Francioli 38ebf3141a - Now supports parent/child pairs
- Sites with missing genotypes in pairs/trios are handled as follows:
-- Missing child -> Homozygous parents are phased, no transmission probability is emitted
-- Two individuals missing -> Phase if homozygous, no transmission probability is emitted
-- One parent missing -> Phased / transmission probability emitted
- Mutation prior set as argument
2011-10-24 12:30:04 +02:00
Laurent Francioli 7312e35c71 Now makes use of standard Allele and Genotype classes. This allowed quite some code cleaning. 2011-10-24 10:25:53 +02:00
Laurent Francioli 01b16abc8d Genotype quality calculation modified to handle all genotypes the same way. This is inconsistent with GQ output by the UG but is correct even for cases of poor quality genotypes. 2011-10-24 10:24:41 +02:00
Mark DePristo f6ccac889b Merged bug fix from Stable into Unstable 2011-10-23 16:37:12 -04:00
Mark DePristo 585a45b7a3 Bug fix for ClipReadsWalker when stats output isn't provided
-- See http://getsatisfaction.com/gsa/topics/clipreadswalker?utm_content=topic_link&utm_medium=email&utm_source=reply_notification
2011-10-23 16:36:48 -04:00
Ryan Poplin f5d910b8a5 Haplotype caller now sends genotype likelihoods to the exact model to genotype the events found in the best haplotypes. 2011-10-23 13:29:08 -04:00
Mark DePristo 42bf9adede Initial version of "fast" FragmentPileup code
-- Uses mayOverlapRoutine in ReadUtils
-- Attempts to be smart when doing overlap calculation, to avoid unnecessary allocations
-- PileupElement now comparable (sorts on offset than on start)
-- Caliper microbenchmark to assess performance
2011-10-22 21:36:37 -04:00
Mauricio Carneiro 4913f8a60f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-21 17:45:07 -04:00
Mauricio Carneiro 102dafdcbc Validation of GATKSamRecord in read filters
Moved the validation of the GATKSamRecord to the MalformedReadFilter with the intent to make the read filter the ultimate validation location for sam records. This way we can opt to filter out malformed reads if we know what we are doing or blow up otherwise.
2011-10-21 17:40:43 -04:00
Guillermo del Angel f4b409fa0d CombineVariants bug fix: when merging records with disparate alleles we were leaving AC,AF fields intact. This had as a consequence that we could end up with a record with 3 alt alleles but only 2 values in AC,AF fields. Now, if alleles in combined vc are different from original, and if AC,AF fields can't be recomputed from genotypes, we remove attributes from vc map since they'll be invalid anyway. Integration test md5 changed since there were several badly merged records in result 2011-10-21 14:07:20 -04:00
Mark DePristo b863390cb1 Moving reduced read functionality into GATKSAMRecord
-- More functions take / produce GATKSAMRecords instead of SAMRecord
2011-10-21 13:28:05 -04:00
Mark DePristo 2403e96062 Renamed GATKSamRecord -> GATKSAMRecord for consistency. Better docs. 2011-10-21 09:59:24 -04:00
Mark DePristo 110e13bc1e Merge branch 'master' into SamRecordFactory 2011-10-21 09:43:52 -04:00
Mark DePristo be797a8a1f Recalibrator now uses the much more efficient NGSPlatform in the cycle covariates system 2011-10-21 09:39:21 -04:00
Mark DePristo ed74ebcfa1 GATKSamRecords with efficiency NGSPlatform method 2011-10-21 09:38:41 -04:00
Mark DePristo 94e1898d8f A canonical set of NGS platforms as enums with convenient manipulation methods 2011-10-21 09:37:45 -04:00
Laurent Francioli edea90786a Genotype quality is now recalculated for each of the phased Genotypes. Small problem is that we unnecessarily loose a little precision on the genotypes that do not change after assignment. 2011-10-20 17:04:19 +02:00
Laurent Francioli 1c61a57329 Original rewrite of PhaseByTransmission:
- Adapted to get the trio information from the SampleDB (i.e. from Pedigree file (ped)) => Multiple trios can be passed as argument
- Mendelian violations and trio phasing possibilities are pre-calculated and stored in Maps. => Runtime is ~3x faster
- Genotype combinations possible only given two MVs are now given a squared MV prior (e.g. 0/0+0/0=>1/1 is given 10^-16 prior if the MV prior is 10^-8)
- Corrected bug: In case the best genotype combination is Het/Het/Het, the genotypes are now set appropriately (before original genotypes were left even if they weren't Het/Het/Het)
- Basic reporting added:
-- mvf argument let the user specify a file to report remaining MVs
-- When the walker ends, some basic stats about the genotype reconfiguration and phasing are output

Known problems:
- GQ is not recalculated even if the genotype changes

Possible improvements:
- Phase partially typed trios
- Use standard Allele/Genotype Classes for the storage of the pre-calculated phase
2011-10-20 13:06:44 +02:00
Laurent Francioli ef6a6fdfe4 Added getAsMap -> returns the likelihoods as an EnumMap with Genotypes as keys and likelihoods as values. 2011-10-20 12:49:18 +02:00
Laurent Francioli 76dd816e70 Added getParents() -> returns an arrayList containing the sample's parent(s) if available 2011-10-20 12:47:27 +02:00
Mark DePristo 999a8998ae Constructor for GATKSamRecord with header only, for unit testing 2011-10-19 17:51:48 -04:00
Mark DePristo bba69701b5 Now creates GATKSamRecords now SamRecords 2011-10-19 17:49:17 -04:00
Christopher Hartl cd8a6d62bb You know how the wiki has a big section on commiting local changes to BRANCHES of the repository you clone it from? Yeah. It sucks if you don't do that.
This commit contains:
 - IntronLossGenotyper is brought into its current incarnation
 - A couple of simple new filters (ReadName is super useful for debugging, MateUnmapped is useful for selecting out reads that may have a relevant unaligned mate)
 - RFA now matches my current local repository. It's in flux since I'm transitioning to the new traversal type.
   + the triggering read stash pilot required me to change the scope of some of the variables in the ReadClipping code, private -> protected. Those are all the changes there.
 - MendelianViolation restored to its former glory (and an annotator module that uses the likelihood calculation has been added)
   + use this rather than a hard GQ threshold if you're doing MV analyses.
 - Some miscellaneous QScripts
2011-10-19 17:42:37 -04:00
Mark DePristo 52345f0aec Meaningful documentation string 2011-10-19 15:47:36 -04:00
Mark DePristo 1b38aa1a7e Cleaning up reduced read code accessors 2011-10-19 15:46:44 -04:00
Eric Banks d8d73fe4f2 Treat ./X genotypes as MIXED so that isHet, isHom, etc. still return the expected and correct values. Added docs to these accessors with contracts explicitly mentioned. Fixed case where NPE could be thrown. 2011-10-19 15:11:13 -04:00
Mark DePristo 7928b287fc GATKSamRecord now produced by SAMFileReaders by default
-- Removed all of the unnecessary caching operations in GATKSAMRecord
-- GATKSAMRecord renamed to GATKSamRecord for consistency
2011-10-19 13:15:27 -04:00
Eric Banks 5a6468c11e Allowing ./X genotypes and adding a unit test to ensure that this case is covered from now on (especially given that we may want to revert in the future). Reverting this change is really easy and entails uncommenting a few lines of code. But for now, despite Mark's objections, this case is allowed in the VCF spec and we are wrong not to allow it. 2011-10-19 11:52:05 -04:00
Eric Banks 48c4a8cb33 Make error messages clearer (even I was confused) 2011-10-19 11:49:16 -04:00
Eric Banks 6cadaa84c9 Just use validate() from super class since it does the same thing 2011-10-19 11:48:23 -04:00
Mark DePristo df3e4e1abd First working code to use SamRecordFactory to produce objects of our own design in SAMFileReader 2011-10-19 11:22:35 -04:00
Mauricio Carneiro c27e2fb676 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-18 15:23:05 -04:00
Mark DePristo f77f2eeb7d Fix for new ID structure 2011-10-18 13:04:43 -04:00
Mark DePristo 1a92ee3593 No longer adds a binding of ID -> . when the ID field is dot in the VCF
-- Really we should make ID a primary key in VariantContext.  Putting it into the attributes is just annoying now
2011-10-18 10:57:02 -04:00
Ryan Poplin e45fcb66eb Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-17 15:56:19 -04:00
Ryan Poplin 1e6794c539 fixing typo in VariantsToTable docs 2011-10-17 15:56:02 -04:00
Mark DePristo 0de8550f17 Merged bug fix from Stable into Unstable 2011-10-17 15:29:53 -04:00
Mark DePristo c1329c4dde Fixing a binary to logical or 2011-10-17 15:29:45 -04:00
Mark DePristo 9e4963efc8 Merged bug fix from Stable into Unstable 2011-10-17 15:27:38 -04:00
Mark DePristo ec911ce5bb Even better error messages 2011-10-17 15:27:22 -04:00
Mark DePristo d065bf1715 Merged bug fix from Stable into Unstable 2011-10-17 15:25:47 -04:00
Mark DePristo a7cf9cdc67 Fixing error message typo 2011-10-17 15:25:35 -04:00
Ryan Poplin 589df6b7cf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-17 14:35:14 -04:00
Ryan Poplin 6b02354d84 Adding a new getter in VariantsToTable to extract the indel event length. 2011-10-17 14:34:52 -04:00
Mark DePristo 3550798c4c Merged bug fix from Stable into Unstable 2011-10-17 13:58:56 -04:00
Mark DePristo 4108a294f7 Better error message when a RodBinding file doesn't exist 2011-10-17 13:58:46 -04:00
Mark DePristo cc76826f78 Merged bug fix from Stable into Unstable 2011-10-17 13:38:11 -04:00
Mark DePristo fd4540cd32 Fixed extraordinarily subtle race condition with contracts invariant
-- all of the methods in the class must be synchronized or the internal state can be inconsistent with the contract invariant when entering the class in a non-synchronized method, even when that method doesn't care about the object's internal state
2011-10-17 13:37:55 -04:00
Mark DePristo 5a881360df Merged bug fix from Stable into Unstable 2011-10-13 15:54:43 -04:00
Mark DePristo 7cab6f6bb0 Bug fixes for thread unsafe simple timer and bad Ns treatment in AlignmentUtils
-- SimpleTimer is now threadsafe using synchronized method keywords
-- Bug fix for alignmentToByteArray() where the N case was refPos++ not the now correct refPos += elementLength
2011-10-13 15:53:12 -04:00
Mauricio Carneiro e12ffb6547 Updating docs for GCContentByInterval
This walker does not take any BAMs. It only walks over the reference.
2011-10-13 13:27:00 -04:00
Eric Banks 9aecd50473 Adding ability to exclude annotations from the VA and UG lists. As described in the docs, this argument trumps all others (including -all) so that we can get around the SnpEff issue brought up by Menachem. Added integration test for it. 2011-10-12 15:44:54 -04:00
Mauricio Carneiro e53a952aeb Added ION Torrent support to CountCovariates. 2011-10-12 01:57:02 -04:00
Mauricio Carneiro a2733a451f Added NotCalled feature to GAV
Added "not called" and "no status" to the truth table. Very useful.
2011-10-11 19:31:45 -04:00
David Roazen ae83420637 Merged bug fix from Stable into Unstable 2011-10-11 12:26:08 -04:00
David Roazen 794f275871 SnpEff is now marked as a RodRequiringAnnotation instead of an ExperimentalAnnotation.
Having SnpEff grouped with the Experimental annotations was proving problematic, since it
requires a rod. Placing it in its own group should improve the situation somewhat, making it
easier to request "all annotations except for SnpEff".
2011-10-11 12:08:56 -04:00
David Roazen cfd0ac8410 Merged bug fix from Stable into Unstable
Conflicts:
	public/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java
2011-10-11 12:03:51 -04:00
David Roazen 24b72334b3 UnifiedGenotyper now correctly initializes the VariantAnnotator engine.
This allows the annotation classes to perform any necessary initialization/validation.
For example, it allows the SnpEff annotator to (among other things) validate its rod binding.
This will prevent a NullPointerException when SnpEff annotation is requested but no rod binding
is present.

Added an integration test to cover this case so that it doesn't break again.
2011-10-11 12:02:05 -04:00
Guillermo del Angel 0429b38021 Merged bug fix from Stable into Unstable 2011-10-11 11:19:38 -04:00
Guillermo del Angel 1c485d8b5e Forgot that no matter how trivial a change it's a good idea to compile first 2011-10-11 11:18:41 -04:00
Guillermo del Angel 6418f4d69b Merged bug fix from Stable into Unstable 2011-10-11 11:13:18 -04:00
Guillermo del Angel 1975de1b32 Second try: hide --do_indel_quality in AnalyzeCovariates 2011-10-11 11:11:29 -04:00
Guillermo del Angel 6506ea83e8 Revert "Hide --do_indel_quality argument in AnalyzeCovariates. This shouldn't be documented nor used by external users"... a hidden passenger change made it through.
This reverts commit 70e10ccb1be90dcff8f4485ae6ee036db2d1ac86.
2011-10-11 11:03:12 -04:00
Guillermo del Angel 4c1d8c8d44 Hide --do_indel_quality argument in AnalyzeCovariates. This shouldn't be documented nor used by external users 2011-10-11 11:01:06 -04:00
Eric Banks 77c983c5b5 No one claimed this walker and it doesn't have integration tests or GATKdocs so it doesn't belong in public. 2011-10-10 15:17:54 -04:00
Mark DePristo fb72bcf732 DiffObjects no longer prints out the file name in the status so MD5 are stable 2011-10-10 15:10:57 -04:00
Mark DePristo 46e7370128 this.allele, getAlleles(), and getAltAlleles() now return List not set
-- Changes associated code throughout the codebase
-- Updated necessary (but minimal) UnitTests to reflect new behavior
-- Much better makealleles() function in VC.java that enforces a lot of key constraints in VC
2011-10-09 11:45:55 -07:00
Mark DePristo c67f6c076b simpleMerge now preserves allele order
-- UnitTests for dangerous PL merging cases in the multi-allelic case.  The new behavior is correct
2011-10-08 17:39:53 -07:00
Mark DePristo ec14a4a606 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-07 08:38:50 -07:00
Eric Banks ca9cd9b688 Minor fix for merging intervals which hadn't been necessary when only merging from the left to right. Added integration tests to cover the parallelization of RTC. 2011-10-06 22:38:44 -04:00
Mark DePristo c7864c7256 Filter application order is now deterministic, in the order defined by the walker
-- For no apparent reason we were using a HashSet to store the ReadFilters, so the order of operations was really arbitrarily applied.  The order now is

(1) the order of the walker intrinsic filters
(2) read group black list (if provided)
(3) command line filters (if provided)
2011-10-06 18:51:40 -07:00
Mark DePristo 0b88af4af9 Counts of records failing filters are displayed sorted
-- Stops random ordering of the output, as the counts are returned sorted by string name of the class
-- Deleted now unused sh*tty assessors in Utils
2011-10-06 18:42:26 -07:00
Mark DePristo d1e70d6ec2 Removed Nx counting of reads in metrics with -nt > 1 2011-10-06 18:29:26 -07:00
Eric Banks c61804a450 Rename the long version of the argument name to more accurately reflect its purpose. 2011-10-06 16:14:04 -04:00
Eric Banks 61a3dfae24 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-06 15:58:04 -04:00
Eric Banks 6eb87bf58a RTC now caches all intervals as GenomeLocs (which is expected to take < 1Gb whole genome based on back of the envelope calculations with Matt) so that 1) we don't have to worry about emitting outside of the leaves in the hierarchical reductions and 2) we can emit the intervals in sorted order which is a big performance plus for the realigner. Integration tests change only because intervals whose start=stop are now printed as chr:start instead of chr:start-stop. 2011-10-06 15:57:49 -04:00
Eric Banks 1b0735f0a3 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-06 13:41:45 -04:00
Eric Banks c4dfc1fb8b Temporary commit of parallelization support for RealignerTargetCreator. Tim begged us for this and I got assurances from Khalid/Matt that this would also be extremely helpful for the whole genome calling pipeline, so I spent a while working on this. Needs to be fixed up though because apparently only the leaves in the hierarchical reduce get their output aggregated. Worked out a better solution with Matt. 2011-10-06 13:41:36 -04:00
Mark DePristo 73f9d1f217 GATK read group requirement iron hand
-- The GATK will now throw a user exception if it opens a SAM/BAM file that doesn't have at least one RG defined
-- LIBS again throws an error if the complete list of samples isn't provided
-- Updating ExmpleCountLociPipeline test to use the well-formated versions of the exampleBAM and exampleFASTA files in testdata, instead of the old broken ones in validation_data.
-- Convenience constructors for UserExceptions.MalformedBAM
2011-10-06 08:40:35 -07:00
Mark DePristo 23845ac798 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-06 08:17:08 -07:00
Mark DePristo daa5999489 Fixed typo in argument description 2011-10-06 08:16:25 -07:00
Guillermo del Angel 8a474e38ff Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-06 10:08:39 -04:00
Guillermo del Angel 93f7e632bd Minor fix/enhancement for VariantEval: if a vcf has symbolic alleles, program would crash ungracefully - now we'll just skip record without processing. This is a big issue since we can't process 1000G integration files with code as is. 2011-10-06 10:07:46 -04:00
Mark DePristo 190be4d0d1 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-05 21:27:11 -07:00
Mark DePristo 8e6845806a Allowing empty samples list in LIBS
-- Right now we cannot process BAM files without read groups because we enforce the samples list to not be empty when there's a SAM record.  Now if there are reads and there are no samples we add the "null" sample so that LIBS walks the reads properly
2011-10-05 21:26:21 -07:00
Matt Hanna 180c8f286f Merged bug fix from Stable into Unstable 2011-10-05 20:37:43 -04:00
Matt Hanna 55b9f06527 Ensure that IndelRealigner n-way out option supports MD5 generation. 2011-10-05 20:36:28 -04:00
Mark DePristo be2d29ce69 Final PED documentation 2011-10-05 15:17:41 -07:00
Mark DePristo 3226d5dc0d Merge branch 'master' into ped 2011-10-05 15:03:09 -07:00
Mark DePristo 6a573437af Details documentation arguments for -ped 2011-10-05 15:00:58 -07:00
Mark DePristo e7c80f7c45 Renaming quantitative trait to OtherPhenotype which is now a String not a double
-- we can now use PED file to represent population data or other arbitrary phenotype data, not just doubles
2011-10-05 12:26:33 -07:00
Mark DePristo 51ecc20867 getFamily() and associated methods implemented and tested
-- Sample no longer serializable
-- Sample now implements Comparable
2011-10-05 09:55:05 -07:00
Mark DePristo a45d985818 TODO method stubs 2011-10-04 15:54:09 -07:00
Mark DePristo fee89e47ff Only throws an error when there are no samples but there are reads
-- Handles the case when you are running a ROD traversal and yet the LIBS is still used to return null everywhere.
2011-10-04 06:50:54 -07:00
Mark DePristo f552aede42 Only provide the sample names in the BAM file for efficiency 2011-10-04 06:50:12 -07:00
Mark DePristo a27641e1fc Cleaned up imports 2011-10-04 06:28:36 -07:00
Mark DePristo b20689ff55 No longer supports extraProperties
-- the underlying data structure is still present, but until I decide what to do for the extensible system I've completely disabled the subsystem
-- Added code to merge Samples, so that a mostly full record can be merged with a consistent empty record.  If the two records are inconsistent, an error is thrown
-- addSample() in Sample.class now invokes mergeSample() when appropriate
-- Validation types are now only STRICT or SILENT
-- Validation code implemented in SampleDBBuilder
-- Extensive unit tests for SampleDBBuilder
2011-10-03 19:20:33 -07:00
Mauricio Carneiro 3837aa45b4 Fixing conflicts
Conflicts:
	public/java/test/org/broadinstitute/sting/utils/clipreads/ReadClipperUnitTest.java
2011-10-03 19:07:59 -07:00
Mark DePristo 2e3dc52088 Minor function renaming 2011-10-03 14:41:13 -07:00
Mark DePristo dd71884b0c On path to SampleDB engine integration
-- PedReader tag parser
-- Separation of SampleDBBuilder from SampleDB (now immutable)
-- Removed old sample engine arguments
2011-10-03 12:08:07 -07:00
Eric Banks c3eff7451a Found a small inefficiency while profiling: we were still using String.split instead of ParsingUtils.split to break up array values in the INFO field. There was a noticeable (albeit not big) difference in the change when reading sites only files. 2011-10-03 14:20:39 -04:00
Mark DePristo 8ee0f91904 Remove residual processing tracker arguments 2011-10-03 09:50:01 -07:00
Mark DePristo 89ac50e86e SampleDataSource -> SampleDB 2011-10-03 09:33:30 -07:00
Mark DePristo 93fba06cb5 Support for whitespace only lines 2011-10-03 09:30:10 -07:00
Mark DePristo 0604ce55d1 PedReader support for ; separated lines, not only newline 2011-10-03 09:19:58 -07:00
Mark DePristo 52f670c8b8 100% version of PedReader
-- Passes all unit tests
-- Added unit tests for missing fields
2011-10-03 06:12:58 -07:00
Mark DePristo dd75ad9f49 95% PedReader
-- Passes significiant unit tests
-- Implicit sample creation for mom / dad when you create single samples
-- Continuing cleanup of Sample and SampleDataSource
2011-09-30 18:03:34 -04:00
Andrey Sivachenko c7898a9be7 inconsequential change in string constants printed into the vcf which noone uses anyway... 2011-09-30 16:40:21 -04:00
Mark DePristo 010899f886 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-09-30 15:51:09 -04:00
Mark DePristo 84160bd83f Reorganization of Sample
-- Moved Gender and Afflication to separate public enums
-- PedReader 90% implemented
-- Improve interface cleanup to XReadLines and UserException
2011-09-30 15:50:54 -04:00
Mauricio Carneiro 05fba6f23a Clipping ends inside deletion and before insertion
fixed.
2011-09-30 15:44:43 -04:00
Mark DePristo c1cf6bc45a PEDReader should be in samples 2011-09-30 14:22:19 -04:00
Mark DePristo 56f10b40a8 Fixing test bugs for WindowMaker that required empty sample list 2011-09-30 14:18:27 -04:00
Ryan Poplin af6c053435 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-09-30 13:33:31 -04:00
Mark DePristo 810e8ad011 Removed getXByReaders() function from the engine
-- These could be simplied in their downstream uses
-- Or they could be replaced with a generic getSAMFileHeaders() function and then apply the getSamples(header) as desired downstream
2011-09-30 10:43:51 -04:00
Mark DePristo 178ba24c27 Move getSamplesForSamFile to SampleUtils
-- A nearly identical piece of code already lived in SampleUtils.  Now there are two functions, one taking a regular header and another grabbing the merged header from the GATK engine itself.  Much cleaner
2011-09-30 10:28:18 -04:00
Mark DePristo 30d23942b1 Renamed ReadBackedPileup getXSampleName() functions to getXSample
-- now that we don't have Sample objects floating around we don't have to have all of the Name extensions on our functions
2011-09-30 10:02:57 -04:00
Mark DePristo 3289a325fc Removed final use of Sample in RBP 2011-09-30 09:57:39 -04:00
Mark DePristo a69a4dda2f SamplesDB no longer has null sample
-- Updated getSamples().size() == 2 test in CallableLociWalker that really ensured there was one sample in the system
2011-09-30 09:56:23 -04:00
Mark DePristo e055a78f6e LIBS now requires at least one sample be present
-- UnitTest provides a "null" sample for matching the reads without read groups
2011-09-30 09:49:35 -04:00
Mark DePristo 9860a2c989 Merge branch 'master' into ped 2011-09-30 09:28:18 -04:00
Mark DePristo d901fed617 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-09-30 08:41:44 -04:00
Mauricio Carneiro cabacf028d Intermediate commit to fix interval skipping
may need additional testing.
2011-09-29 18:45:12 -04:00
Mark DePristo 1765fbeb6b Merge branch 'master' into ped 2011-09-29 17:18:51 -04:00
Mark DePristo 98ecaf8aa0 Support for ReducedReads with reduced counts and average quals
-- ReadUtils and UnitTest updated to support new byte[] style
-- Removed unnecessary read transformer in PairHMM
2011-09-29 17:18:39 -04:00
Mauricio Carneiro 9508220157 fixed hard clipping both ends inside deletion
If both ends of the interval falls within a deletion in the read then hardClipBothEnds would cut the right tail first including the entire deletion, then fail to cut the left tail because there would not be any bases there anymore. Fixed.
2011-09-29 15:36:49 -04:00
Mark DePristo 625ffb6a07 LocusIteratorByState and ReadBackedPileups no long use Sample 2011-09-29 14:52:11 -04:00
Mark DePristo b3a2371925 Merge branch 'master' into ped 2011-09-29 14:32:17 -04:00
Mark DePristo 68761a6e28 Removed sample from header 2011-09-29 14:13:05 -04:00
Mauricio Carneiro a5e75cd14c Outputting both consensus base qualities and counts
The base qualities of a consensus reads are now the average quality of the bases forming the consensus base (most common base) and the consensus quality tag now carry an array with the counts of each base in the consensus. This should increase file size but improve calling sensitivity/specificity.
2011-09-29 12:54:41 -04:00
Mark DePristo 505416b6c0 Merge branch 'master' into ped 2011-09-29 12:22:39 -04:00
Mark DePristo 9536845e35 Cleaning up unused code in MV 2011-09-29 12:20:07 -04:00
Mark DePristo 5043d76c3d Removing more bad uses of SampleDataSource creation 2011-09-29 12:16:34 -04:00
Mark DePristo 5c9227cf5e Further cleanup of Sample database
-- Removing more and more unnecessary code
-- Partial removal of type safe Sample usage.  On the road to SampleDB only
2011-09-29 11:50:05 -04:00
Mark DePristo 2a0cd556d3 Further cleanup of Sample
-- Cleaned up interface functions in GAE
-- Added Walker.getSampleDB() function which is an easier option for tools to get the samples db
2011-09-29 10:34:51 -04:00
Mark DePristo e76f381628 Moved sample package from DataSources to gatk, and renamed it samples
-- All associated changes to the codebase are just header updates
2011-09-29 09:57:15 -04:00
Mark DePristo e197dcd1f3 Pre-cleanup commit of Sample and SampleDataSource
-- SampleDataSource has all reader functionality disabled
2011-09-29 09:44:18 -04:00
Mark DePristo 4d31673cc5 No longer supporting YAML file allows us to delete 75% of the sample's codebase 2011-09-29 09:43:31 -04:00
Ryan Poplin e366ee18bc Adding ability to read in and make use of kmer quality tables during HMM evaluation 2011-09-29 07:46:19 -04:00
Mauricio Carneiro fc86cd6fd8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/carneiro/gatk/RR into rr 2011-09-29 00:12:15 -04:00
Roger Zurawicki 4fd5630f6a Added ReadClipper Unit Test
* Includes tests that include HardClip to Read and Reference Coords.
* Changed ReadUtils.HardClipByReferenceCoordinates from private to protected to allow for testing
2011-09-28 23:13:50 -04:00
Matt Hanna 9272ed03b5 Merged bug fix from Stable into Unstable 2011-09-28 21:26:43 -04:00
Matt Hanna 0acaf2df65 Fix an embarrassing issue where a specific configuration of minimal coverage
over small intervals could cause reads to be dropped from the pileup.  Nothing
to see here...
2011-09-28 21:23:01 -04:00
Guillermo del Angel c8d3a720f9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-09-28 18:17:34 -04:00
Guillermo del Angel 7e3cb45093 Further performance optim in banded hmm, about 60% speed improvement over current implementation now 2011-09-28 16:27:28 -04:00
Ryan Poplin 1b1ca80df2 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-09-28 16:17:39 -04:00
Ryan Poplin 3b73dc89fe Making several esoteric arguments in the BQSR @Hidden. Adding basic support for Complete Genomics machine cycle. 2011-09-28 16:17:31 -04:00
Mauricio Carneiro ff2f4df043 Fixed hardclipping inside indel (right tail)
when hard clipping the right tail of a read falls inside a deletion, clipping should fall back to the last base before the deletion to follow the ReadClipper's contract.
2011-09-28 16:07:34 -04:00
Mauricio Carneiro 3c7b7f74ef Optimized interval iteration
Using a TreedSet to manipulate getToolkit.getIntervals() and being smart about which intervals to test makes interval clipping O(1) instead of O(n).
2011-09-28 16:07:34 -04:00
Mauricio Carneiro 5c9b659c02 clipping both ends of the reads was modifying the original read
This goes against the ReadClipper contract, and was affecting the second part of the read that spans over multiple intervals. Fixed.
2011-09-28 16:07:34 -04:00
Guillermo del Angel fe23e4d10c Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-09-28 15:53:11 -04:00
Guillermo del Angel e2b9030e93 First mostly fully functional implementation of banded pair HMM likelihood computation for indel caller. More experimentation to follow but it right now works in small data sets and at least it doesn't break existing things. Disabled by default at this point 2011-09-28 15:51:48 -04:00
Eric Banks 1b45f21774 Removing this command-line tool. Purposely not doing this in stable so that users who may still use it have time to find other options. But the docs are no longer on the wiki. 2011-09-28 13:18:32 -04:00
Eric Banks 1f0e354fae Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-09-28 13:13:21 -04:00
Eric Banks bb619a9a3c Fixing docs 2011-09-28 13:13:03 -04:00
Mark DePristo 5812004e06 Merge branch 'stable' 2011-09-28 11:36:40 -04:00
Mark DePristo a5006831d7 Shows "" not empty space when default string value is "" 2011-09-28 11:35:52 -04:00
Mark DePristo 1e32281a15 Fix to not show -null when missing short name argument 2011-09-28 11:31:20 -04:00
Mauricio Carneiro 89544c209c Fixing contracts
changed return type to Pair, changing contracts accordingly.
2011-09-28 11:19:17 -04:00
Eric Banks eacbee3fe5 Merged bug fix from Stable into Unstable 2011-09-27 20:35:18 -04:00
Eric Banks 43b0c98298 Fix docs 2011-09-27 20:34:46 -04:00
Eric Banks 232a6df11c Add longhand form to the error message. 2011-09-27 20:29:31 -04:00
Eric Banks 1d6fcb6eb1 Revert "Add longhand form to the error message to prevent users from posting borderline dumb posts to GS."
This reverts commit 75b2600527cfce05ae683cb394290ff2a80e8552.
2011-09-27 20:27:00 -04:00
Eric Banks 269b9826b6 Add longhand form to the error message to prevent users from posting borderline dumb posts to GS. 2011-09-27 20:26:36 -04:00
Mauricio Carneiro 3b6e43b7c4 Use reads that span multiple intervals
* RR will now compress reads that span across multiple intervals correctly and output them in the correct order.
* Fixed bug in getReadCoordinateForReferenceCoordinate where if the requested reference coordinate fell inside a deletion in the read the read would be clipped up to one element past the deletion.
2011-09-27 18:39:06 -04:00
Khalid Shakir 84bd355690 Merged bug fix from Stable into Unstable 2011-09-27 14:34:39 -04:00
Khalid Shakir b090751f62 Fixed Ant / PluginManager issue where reflections was picking up all class files under current working directory due to "." in jar manifest classpaths.
Updates to HybridSelectionPipeline:
- Added annotations back via snpEff
- Minor updates to VQSR paths and lowered memory
2011-09-27 14:33:57 -04:00
Eric Banks 26e71f6688 The Omni files have multiple records (with the same ALT) at a particular location, with one PASSing and the other(s) filtered. Chris, this is why using this file as both eval and comp leads to ref/no-call cells in the GenotypeConcordance table. However, this led to non-determinism in VE because the VCs were placed in a HashSet; we use a LinkedHashMap instead to bring back determinism. 2011-09-27 11:03:17 -04:00
Guillermo del Angel ceffefa6a6 Intermediate version with banded pair HMM 2011-09-27 10:18:58 -04:00
Mark DePristo e99ff3caae Removed lots of old, and not to be used, HMM options
-- resulted in massive code cleanup
-- GdA will integrate his new banded algorithm here
-- Removed: DO_CONTEXT_DEPENDENT_PENALTIES, GET_GAP_PENALTIES_FROM_DATA, INDEL_RECAL_FILE, dovit, GSA_PRODUCTION_ONLY
2011-09-27 10:08:40 -04:00