9737 Commits (7fbca7013e7b9da220396d9bcaec209efdf67cb7)

Author SHA1 Message Date
Mark DePristo ac7460ef8c New complex VCF files for testing 2012-05-24 10:57:05 -04:00
Mark DePristo aaf11f00e3 Near final BCF2 implementation
-- Trivial import changes in some walkers
-- SelectVariants has a new hidden mode to fully decode a VCF file
-- DepthPerAlleleBySample (AD) changed to have the A type rather than UNBOUNDED, which is actually the right type
-- GenotypeLikelihoods now implements List<Double> for convenience.  The PL duality here is going to be removed in a subsequent commit
-- Bug fixes in BCF2Writer.  Proper handling of padding.  Bug fix for nFields for a field
-- padAllele function in VariantContextUtils
-- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.
2012-05-24 10:57:02 -04:00
Mark DePristo dfee17a672 Generalize / unify code for handling strings
-- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder.
-- Unified the type conversion code in BCFWriter to simplify the mapping from VCF type => BCF type and special value recoding
-- Code cleanup and renaming
2012-05-24 10:57:02 -04:00
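The collapse / explode round trip described in the commit above can be sketched as follows. This is only an illustration: the class name, the method names, and the choice of a leading-comma layout (",a,b") are assumptions made here, not details confirmed by the commit. The sketch also assumes that no value contains a comma.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not the actual codec: collapses a List<String> into
// one string and explodes it back. The leading-comma layout is an assumption.
final class StringListCodecDemo {
    private StringListCodecDemo() {}

    // Prefix every element with ',' so an empty list ("") and a list holding
    // one empty string (",") remain distinguishable after collapsing.
    static String collapse(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            sb.append(',').append(v);
        }
        return sb.toString();
    }

    // Reverse of collapse: drop the leading ',' and split on the rest.
    static List<String> explode(String collapsed) {
        if (collapsed.isEmpty()) {
            return Collections.emptyList();
        }
        if (collapsed.charAt(0) != ',') {
            throw new IllegalArgumentException("expected a leading ',': " + collapsed);
        }
        // The -1 limit keeps trailing empty strings instead of discarding them.
        return Arrays.asList(collapsed.substring(1).split(",", -1));
    }
}
```

For example, `StringListCodecDemo.explode(StringListCodecDemo.collapse(Arrays.asList("a", "b")))` gives back a list equal to the original.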
Mark DePristo b4a5acd6f4 Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't 2012-05-24 10:57:01 -04:00
Mark DePristo 373ae39e86 Testing of BCF codec
-- Rev'd tribble
-- Minor code cleanup
-- BCF2 encoder / decoder use Double not Float internally everywhere
-- Generalized VC testing framework
2012-05-24 10:57:01 -04:00
Mark DePristo fb1911a1b6 -- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder
-- Convenience routine for creating alleles from strings of bases
-- Convenience constructor for VCFFilterHeader line whose description is the same as name
-- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes.  Can be reused throughout the code for BCF, VCF, etc.
-- Created basic BCF2WriterCodec tests that consume VariantContextTestProvider contexts, write them to disk with the BCF2 writer, and check that they come back equal to the original VariantContexts. Actually worked for some complex tests on the first go
2012-05-24 10:57:01 -04:00
Mark DePristo 4968dcd36a Throw an error when genotype fields with mixed vector lengths are encountered 2012-05-24 10:57:00 -04:00
Mark DePristo afd2f1a3f9 Individual VariantContextWriters are now package protected
-- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it
-- Update build.xml to build vcf.jar with updated paths and bcf2 support.
2012-05-24 10:57:00 -04:00
Mark DePristo 24864fd5b0 GATK now writes BCF output to any file with .bcf extension
-- Moved VCF and BCF writers to variantcontext.writers
-- Updated vcf.jar build path
-- Refactored VCFWriter and other code.  Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory.  Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.
2012-05-24 10:57:00 -04:00
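The commit above routes writer creation through a factory method and selects BCF output by the .bcf extension. A minimal sketch of that pattern follows. Every type and method name in it is hypothetical and stands in for the real classes, whose signatures are not shown in this log. The case-sensitive extension check is also an assumption.

```java
// Hypothetical sketch of extension-based writer selection.
// None of these types are the GATK classes named in the commit message.
interface RecordWriter {
    String format();
}

final class TextRecordWriter implements RecordWriter {
    @Override
    public String format() {
        return "VCF";
    }
}

final class BinaryRecordWriter implements RecordWriter {
    @Override
    public String format() {
        return "BCF2";
    }
}

final class RecordWriterFactory {
    private RecordWriterFactory() {} // static factory only, never instantiated

    // Returns the binary writer for paths ending in ".bcf" (case-sensitive,
    // by assumption) and the text writer for everything else.
    static RecordWriter create(String path) {
        if (path.endsWith(".bcf")) {
            return new BinaryRecordWriter();
        }
        return new TextRecordWriter();
    }
}
```

Callers depend only on the interface, so `RecordWriterFactory.create("calls.bcf")` and `RecordWriterFactory.create("calls.vcf")` can be used interchangeably by the code that writes records.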
Mark DePristo e2311294c0 Removed unused ManualSortingVCFWriter 2012-05-24 10:56:59 -04:00
Mark DePristo 93cef82637 BCF2 header encoding decoding at final spec 2012-05-24 10:56:58 -04:00
Mark DePristo ce9e9eebb1 No dictionary in header. Now built dynamically from the header in the writer and codec
-- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there
2012-05-24 10:56:58 -04:00
Mark DePristo f0b081a85f Update VCF.jar loading test
-- to reflect new path to VCFWriter
2012-05-24 10:56:58 -04:00
Mark DePristo c3b8048e2e Moving around classes in VCF and BCF2
-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
2012-05-24 10:56:58 -04:00
Mark DePristo 679ffdd333 Move BCF2 from private utils to public codecs 2012-05-24 10:56:56 -04:00
Mark DePristo d13cda6b6f Update encoding / decoding of genotypes to final spec version 2012-05-24 10:56:56 -04:00
Mark DePristo 0921c3096c -- Disable genotype filtering since it doesn't work
-- Update code to name new, more general decodeSingleValue
-- Update MISSING_VALUE constants to be 0xFFFFFF80 vs. 0x00000080, as these are equivalent for a byte and handle the two's complement cast from byte to int
-- Fix decoding of byte and short values which were screwing up missing values
-- Code cleanup in decoder
-- Generalize bestIntegerType function
-- Handle the encoding of boolean FLAG fields
-- Test the encoding of vectors of values
2012-05-24 10:56:56 -04:00
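The MISSING_VALUE change in the commit above relies on how Java widens a signed byte to an int. The sketch below shows only that language behavior. The class and constant names are hypothetical and are not taken from the BCF2 code.

```java
// Illustrative only: shows Java's byte-to-int widening, not the BCF2 classes.
final class MissingByteDemo {
    private MissingByteDemo() {}

    // The 8-bit pattern 0x80 stored in a (signed) Java byte is -128.
    static final byte MISSING_BYTE = (byte) 0x80;

    // Widening a byte to int sign-extends, so 0x80 becomes 0xFFFFFF80.
    static int widen(byte b) {
        return b;
    }

    // Masking with 0xFF gives the unsigned view 0x00000080 instead.
    static int widenUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(widen(MISSING_BYTE)));         // prints ffffff80
        System.out.println(Integer.toHexString(widenUnsigned(MISSING_BYTE))); // prints 80
    }
}
```

A constant defined as 0x00000080 would therefore never compare equal to a missing byte that has been cast to int without masking, which is consistent with the constant being changed to 0xFFFFFF80.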
Mark DePristo c0c4599fe1 Cleanup naming of encoder functions for clarity 2012-05-24 10:56:55 -04:00
Mark DePristo 1d39a9227b Low-level encoder / decoder unit tests and code cleanup
-- BCF2 encoder and BCF2 decoder are now fully tested, and are working correctly.
-- Code cleanup and reorganization to fix bugs encountered during testing.
2012-05-24 10:56:55 -04:00
Mark DePristo 443c83d4a7 Removed old STRING_BY_REF types
-- No more String by ref.  Everything is encoded as base datatypes, and codec looks up ints as dictionary strings as it likes
2012-05-24 10:56:55 -04:00
Mark DePristo 450f098a61 BCF2 encoder / decoder implement new site / genotype block organization
-- Supports final organization of data blocks into sites data and genotypes data
2012-05-24 10:56:55 -04:00
Mark DePristo 27b51d4dea Enable on the fly indexing of BCF2 2012-05-24 10:56:54 -04:00
Mark DePristo 81ab0dd051 Implement separate data blocks for sites and genotypes data 2012-05-24 10:56:54 -04:00
Mark DePristo fd988274c1 Separate the BCF2 codec from the BCF2 decoder
-- Decoder is a low-level reader of underlying data from a BCF2 encoded stream
-- Codec uses the decoder to build a VC from the stream
-- Separation is key for the upcoming UnitTest framework that will ensure correctness of the low-level decoder / encoder before optimization of the encoder / decoder starts
2012-05-24 10:56:54 -04:00
Mark DePristo 81bd7646d6 Fix for MISSING floats
-- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int).
-- Now correctly handles MISSING qual fields
2012-05-24 10:56:53 -04:00
Mark DePristo 931b575748 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Support for string dictionary at standard positions in the VCF
2012-05-24 10:56:53 -04:00
Mark DePristo 3afbc50511 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from Intellij
2012-05-24 10:56:52 -04:00
Mark DePristo 7541b523e1 BCF improvements
-- now reads / writes untyped chrom offset, start, refLength, and qual at start of records
-- canDecode() works for BCF2
2012-05-24 10:56:51 -04:00
Mark DePristo 0799855479 Archiving GCF
-- Rider update to CramByPiece.scala
2012-05-24 10:56:51 -04:00
Joel Thibault 085588cb04 Not Nexus. Need new name. Navel? 2012-05-24 10:11:58 -04:00
Guillermo del Angel 43919078cd Merged bug fix from Stable into Unstable 2012-05-23 21:21:01 -04:00
Guillermo del Angel 4bc04e2a9e Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler. 2012-05-23 21:19:30 -04:00
Guillermo del Angel 7fe07a4ae6 Bug fix: prevent index out of bounds error if reference sample in pool caller has a call present at a site but genotype is a no-call allele 2012-05-22 21:06:53 -04:00
Joel Thibault dad75babf1 Increase Queue memory limits to 16 GB 2012-05-22 10:50:47 -04:00
Joel Thibault af3d73b884 Re-enable partitioning for Mongo reads (but not writes) 2012-05-22 10:50:47 -04:00
Ryan Poplin 692addb498 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-22 10:25:03 -04:00
Ryan Poplin c3fb321014 Minor updates to pacbio data processing script to make it work with the latest bwa version/settings. 2012-05-22 10:24:45 -04:00
Christopher Hartl d366cce714 Initial commit of a burden testing framework. Currently tests against only one phenotype and only one weighting function, but computes robust weighted dosages and calls into an R script that calculates both a direct glm LRT and asymptotic normal p-values. Weights are currently read in from an external file (beta-values). Future work is to let these be calculated on the fly from e.g. annotation, potential impact, conservation, etc., and to enable multiple weighting schemes to be tested jointly for association against multiple phenotypes. 2012-05-21 16:56:32 -04:00
Ryan Poplin 08dfd6cab6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 16:47:07 -04:00
Ryan Poplin 04000d920c Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads. 2012-05-21 16:46:59 -04:00
Khalid Shakir 94cd4e6a7d Updated WGP min confidence from 4 to 10 based on recommendations from depristo and ebanks. 2012-05-21 16:41:45 -04:00
Eric Banks 666862af19 Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs 2012-05-21 16:03:29 -04:00
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
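A wrapper that avoids the leak described in the commit above forwards close() to the resource it wraps. The sketch below is one way to do that, assuming the wrapper owns the resource. Both class names are hypothetical, and TrackedResource merely stands in for a real file stream so that the effect can be observed.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical adapter, not a GATK class: it owns the resource it wraps.
final class ClosingWrapper implements Closeable {
    private final Closeable wrapped;

    ClosingWrapper(Closeable wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public void close() throws IOException {
        // Forward the request; otherwise nobody else can release the resource.
        wrapped.close();
    }
}

// Hypothetical stand-in for a file stream that records whether it was closed.
final class TrackedResource implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}
```

A unit test in the spirit of the ones the commit mentions would wrap a TrackedResource, close the wrapper, and then check that `closed` is true.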
Eric Banks 7f5ec17d22 Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table. 2012-05-21 14:16:13 -04:00
Joel Thibault 27c46b8071 Better matching and searching between sites and samples 2012-05-21 09:50:49 -04:00
Joel Thibault 8fb6fc9ff9 Contigs as blocks are too large for MongoDB documents 2012-05-21 09:50:49 -04:00
Eric Banks c1c70f3b41 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 09:39:08 -04:00
Eric Banks 92d8aa3d4c Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels 2012-05-21 09:38:52 -04:00
Guillermo del Angel 5cc9a12fbb Fixed definition in VCF header for pool caller genotype parameters MLAC and MLAF 2012-05-19 14:53:37 -04:00
Eric Banks 3af3834d50 Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian):
* For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE.  Switched over to booleans.
* We'd also generate an NPE if SAMRecord-writing-specific arguments (e.g. --simplifyBAM) were used while writing to stdout.
2012-05-18 11:55:41 -04:00
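The first bug in the commit above comes from auto-unboxing a null Boolean. The sketch below reproduces just that language behavior. The class and field names are hypothetical and are not the argument descriptor code itself.

```java
// Illustrative only: names are hypothetical, not the GATK argument code.
final class BoxedFlagDemo {
    private BoxedFlagDemo() {}

    static Boolean boxedFlag;      // object field, defaults to null
    static boolean primitiveFlag;  // primitive field, defaults to false

    // Auto-unboxing a null Boolean throws NullPointerException here.
    static boolean readBoxed() {
        return boxedFlag;
    }

    // A primitive boolean can never be null, so this read is always safe.
    static boolean readPrimitive() {
        return primitiveFlag;
    }
}
```

Switching the field to a primitive, as the commit does, removes the need for a null check at every read.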