Commit Graph

9593 Commits (c8ed0bfc4cf49c5edd27ef2e0f9e92593cfa375c)

Author SHA1 Message Date
Mark DePristo c8ed0bfc4c Edge case fixes for BCF2
--handle entirely missing GT in a sample in decodeGenotypeAlleles
--Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code
-- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples
-- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys
-- Removed restriction on getPloidy requiring ploidy > 1.  It's logically find to return 0 for a no called sample
-- getMaxPloidy() in VC that does what it says
-- Support for padding / depadding of generic genotype fields
2012-05-24 10:57:06 -04:00
Mark DePristo 40431890be -- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it
-- BCF2 writer can now work without the contig lines being in the header
-- Made GenomeLocParser a final class
2012-05-24 10:57:06 -04:00
Mark DePristo 6301572009 GenotypeLikelihood PLs are capped at Short.MAX_INT now
-- UserExceptions in BCF2 now where appropriate
-- Asserts for code safety
-- Public -> protected encode(Object v) method is for testing only
2012-05-24 10:57:06 -04:00
Mark DePristo d52bc31a47 Bugfix for doNotWriteGenotypes mode
-- Was outputing GT ./. in sites only mode.  Fixed
2012-05-24 10:57:05 -04:00
Mark DePristo 64d4238e2f 99% working version of BCF2 encoder / decoder
-- fixed final bugs with PL encoding / decoding
-- Ready for testing by other members of the group
-- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations
-- Fixed a nasty bug in the filter field
-- Not that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types

Read 1500 genotypes file in VCF -> VCF : 11 seconds
Read 1500 genotypes file in VCF -> BCF : 9.5 seconds

VariantEval 1500 genotypes file in VCF : 3 seconds
VariantEval 1500 genotypes file in BCF : 3 seconds
2012-05-24 10:57:05 -04:00
Mark DePristo b5bce8d3f9 AD should be UNBOUNDED, actually
-- Pass in # alt alleles as appropriate for getCount in VCF header line
2012-05-24 10:57:05 -04:00
Mark DePristo ac7460ef8c New complex VCF files for testing 2012-05-24 10:57:05 -04:00
Mark DePristo aaf11f00e3 Near final BCF2 implementation
-- Trivial import changes in some walkers
-- SelectVariants has a new hidden mode to fully decode a VCF file
-- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED by A type, which is actually the right type
-- GenotypeLikelihoods now implements List<Double> for convenience.  The PL duality here is going to be removed in a subsequent commit
-- BugFixes in BCF2Writer.  Proper handling of padding.  Bugfix for nFields for a field
-- padAllele function in VariantContextUtils
-- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.
2012-05-24 10:57:02 -04:00
Mark DePristo dfee17a672 Generalize / unify code for handling strings
-- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder.
-- Unified the type conversion code in BCFWriter to simply the mapping from VCF type => BCF type and special value recoding
-- Code cleanup and renaming
2012-05-24 10:57:02 -04:00
Mark DePristo b4a5acd6f4 Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't 2012-05-24 10:57:01 -04:00
Mark DePristo 373ae39e86 Testing of BCF codec
-- Rev.d tribble
-- Minor code cleanup
-- BCF2 encoder / decoder use Double not Float internally everywhere
-- Generalized VC testing framework
2012-05-24 10:57:01 -04:00
Mark DePristo fb1911a1b6 -- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder
-- Convenience routine for creating alleles from strings of bases
-- Convenience constructor for VCFFilterHeader line whose description is the same as name
-- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes.  Can be reused throughtout code for BCF, VCF, etc.
-- Created basic BCF2WriterCodec tests that consumes VariantContextTestProvider contexts, writes them to disk with BCF2 writer, and checks that they come back equals to the original VariantContexts. Actually worked for some complex tests in the first go
2012-05-24 10:57:01 -04:00
Mark DePristo 4968dcd36a Throw an error when genotype fields with mixed vector lengths are encountered 2012-05-24 10:57:00 -04:00
Mark DePristo afd2f1a3f9 Individual VariantContextWriters are now package protected
-- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it
-- Update build.xml to build vcf.jar with updated paths and bcf2 support.
2012-05-24 10:57:00 -04:00
Mark DePristo 24864fd5b0 GATK now writes BCF output to any file with .bcf extension
-- Moved VCF and BCF writers to variantcontext.writers
-- Updated vcf.jar build path
-- Refactored VCFWriter and other code.  Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory.  Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.
2012-05-24 10:57:00 -04:00
Mark DePristo e2311294c0 Removed unused ManualSortingVCFWriter 2012-05-24 10:56:59 -04:00
Mark DePristo 93cef82637 BCF2 header encoding decoding at final spec 2012-05-24 10:56:58 -04:00
Mark DePristo ce9e9eebb1 No dictionary in header. Now built dynamically from the header in the writer and codec
-- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there
2012-05-24 10:56:58 -04:00
Mark DePristo f0b081a85f Update VCF.jar loading test
-- to reflect new path to VCFWriter
2012-05-24 10:56:58 -04:00
Mark DePristo c3b8048e2e Moving around classes in VCF and BCF2
-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
2012-05-24 10:56:58 -04:00
Mark DePristo 679ffdd333 Move BCF2 from private utils to public codecs 2012-05-24 10:56:56 -04:00
Mark DePristo d13cda6b6f Update encoding / decoding of genotypes to final spec version 2012-05-24 10:56:56 -04:00
Mark DePristo 0921c3096c -- Disable genotype filtering since it doesn't work
-- Update code to name new, more general decodeSingleValue
-- Update MISSING_VALUE constants to be 0xFFFFFF80 vs. 0x00000080 as these are equivalent for a byte and handle the two complement cast from byte to int
-- Fix decoding of byte and short values which were screwing up missing values
-- Code cleanup in decoder
-- Generalize bestIntegerType function
-- Handle the encoding of boolean FLAG fields
-- Test the encoding of vectors of values
2012-05-24 10:56:56 -04:00
Mark DePristo c0c4599fe1 Cleanup naming of encoder functions for clarity 2012-05-24 10:56:55 -04:00
Mark DePristo 1d39a9227b Low-level encoder / decoder unit tests and code cleanup
-- BCF2 encoder and BCF2 decoder are now fully tested, and are working correctly.
-- Code cleanup and reorganization to fix bugs encountered during testing.
2012-05-24 10:56:55 -04:00
Mark DePristo 443c83d4a7 Removed old STRING_BY_REF types
-- No more String by ref.  Everything is encoded as base datatypes, and codec looks up ints as dictionary strings as it likes
2012-05-24 10:56:55 -04:00
Mark DePristo 450f098a61 BCF2 encoder / decoder implement new site / genotype block organization
-- Supports final organization of data blocks into sites data and genotypes data
2012-05-24 10:56:55 -04:00
Mark DePristo 27b51d4dea Enable on the fly indexing of BCF2 2012-05-24 10:56:54 -04:00
Mark DePristo 81ab0dd051 Implement separate data blocks for sites and genotypes data 2012-05-24 10:56:54 -04:00
Mark DePristo fd988274c1 Separate the BCF2 codec from the BCF2 decoder
-- Decoder is a low-level reader of underlying data from a BCF2 encoded stream
-- Codec uses the decoder to build a VC from the stream
-- Separation key for upcoming UnitTest framework that will ensure correctness of low-level decoder / encoder before optimization of the encoder / decoder starts
2012-05-24 10:56:54 -04:00
Mark DePristo 81bd7646d6 Fix for MISSING floats
-- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int).
-- Now handles correctly MISSING qual fields
2012-05-24 10:56:53 -04:00
Mark DePristo 931b575748 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Support for string dictionary at standard positions in the VCF
2012-05-24 10:56:53 -04:00
Mark DePristo 3afbc50511 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from Intellij
2012-05-24 10:56:52 -04:00
Mark DePristo 7541b523e1 BCF improvements
-- now reads / writes untyped chrom offset, start, refLength, and qual at start of records
-- canDecode() works for BCF2
2012-05-24 10:56:51 -04:00
Mark DePristo 0799855479 Archiving GCF
-- Rider update to CramByPiece.scala
2012-05-24 10:56:51 -04:00
Joel Thibault 085588cb04 Not Nexus. Need new name. Navel? 2012-05-24 10:11:58 -04:00
Guillermo del Angel 43919078cd Merged bug fix from Stable into Unstable 2012-05-23 21:21:01 -04:00
Guillermo del Angel 4bc04e2a9e Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler. 2012-05-23 21:19:30 -04:00
Guillermo del Angel 7fe07a4ae6 Bug fix: prevent index out of bounds error if reference sample in pool caller has a call present at a site but genotype is a no-call allele 2012-05-22 21:06:53 -04:00
Joel Thibault dad75babf1 Increase Queue memory limits to 16 GB 2012-05-22 10:50:47 -04:00
Joel Thibault af3d73b884 Re-enable partitioning for Mongo reads (but not writes) 2012-05-22 10:50:47 -04:00
Ryan Poplin 692addb498 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-22 10:25:03 -04:00
Ryan Poplin c3fb321014 Minor updates to pacbio data processing script to make it work with the latest bwa version/settings. 2012-05-22 10:24:45 -04:00
Christopher Hartl d366cce714 Initial commit of a burden testing framework. Currently tests against only one phenotype and only one weighting function, but computes robust weighted dosages and calls into an R script that calculates both a direct glm LRT and an asymptotic normal p-values. Weights currently read in from external file (beta-values). Future work is to let these be calculated on the fly from e.g. annotation, potential impact, conservation, etc, and enable multiple weighting schemes tested jointly for association against multiple phenotypes. 2012-05-21 16:56:32 -04:00
Ryan Poplin 08dfd6cab6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 16:47:07 -04:00
Ryan Poplin 04000d920c Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads. 2012-05-21 16:46:59 -04:00
Khalid Shakir 94cd4e6a7d Updated WGP min confidence from 4 to 10 based on recommendations from depristo and ebanks. 2012-05-21 16:41:45 -04:00
Eric Banks 666862af19 Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs 2012-05-21 16:03:29 -04:00
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
Eric Banks 7f5ec17d22 Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table. 2012-05-21 14:16:13 -04:00