Commit Graph

9580 Commits (afd2f1a3f95b7174be5686a3ed663669858d89ee)

Author SHA1 Message Date
Mark DePristo afd2f1a3f9 Individual VariantContextWriters are now package protected
-- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it
-- Update build.xml to build vcf.jar with updated paths and bcf2 support.
2012-05-24 10:57:00 -04:00
Mark DePristo 24864fd5b0 GATK now writes BCF output to any file with .bcf extension
-- Moved VCF and BCF writers to variantcontext.writers
-- Updated vcf.jar build path
-- Refactored VCFWriter and other code.  Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory.  Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.
2012-05-24 10:57:00 -04:00
Mark DePristo e2311294c0 Removed unused ManualSortingVCFWriter 2012-05-24 10:56:59 -04:00
Mark DePristo 93cef82637 BCF2 header encoding decoding at final spec 2012-05-24 10:56:58 -04:00
Mark DePristo ce9e9eebb1 No dictionary in header. Now built dynamically from the header in the writer and codec
-- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there
2012-05-24 10:56:58 -04:00
Mark DePristo f0b081a85f Update VCF.jar loading test
-- to reflect new path to VCFWriter
2012-05-24 10:56:58 -04:00
Mark DePristo c3b8048e2e Moving around classes in VCF and BCF2
-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
2012-05-24 10:56:58 -04:00
Mark DePristo 679ffdd333 Move BCF2 from private utils to public codecs 2012-05-24 10:56:56 -04:00
Mark DePristo d13cda6b6f Update encoding / decoding of genotypes to final spec version 2012-05-24 10:56:56 -04:00
Mark DePristo 0921c3096c -- Disable genotype filtering since it doesn't work
-- Update code to name new, more general decodeSingleValue
-- Update MISSING_VALUE constants to be 0xFFFFFF80 rather than 0x00000080; the two are equivalent as a byte, and the new value handles the two's complement cast from byte to int
-- Fix decoding of byte and short values which were screwing up missing values
-- Code cleanup in decoder
-- Generalize bestIntegerType function
-- Handle the encoding of boolean FLAG fields
-- Test the encoding of vectors of values
2012-05-24 10:56:56 -04:00
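
The MISSING_VALUE change in 0921c3096c follows from how Java widens a byte to an int. The sketch below is not the GATK code; the class, constant, and method names are hypothetical, and only the two hex values come from the commit message. It shows that a byte holding 0x80 widens to 0xFFFFFF80 unless it is masked first.

```java
public class MissingByteDemo {
    // Hypothetical constant names; only the two values come from the commit message.
    static final int MISSING_SIGN_EXTENDED = 0xFFFFFF80;
    static final int MISSING_LOW_BYTE_ONLY = 0x00000080;

    /** Plain widening from byte to int sign-extends the value. */
    static int widen(byte b) {
        return b;
    }

    /** Masking with 0xFF keeps only the low eight bits, giving the unsigned value. */
    static int widenUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte onDisk = (byte) 0x80; // the bit pattern as it would be read from a stream

        System.out.printf("widen:         0x%08X%n", widen(onDisk));
        System.out.printf("widenUnsigned: 0x%08X%n", widenUnsigned(onDisk));

        // Both constants truncate to the same byte, so they agree on disk.
        System.out.println((byte) MISSING_SIGN_EXTENDED == (byte) MISSING_LOW_BYTE_ONLY);
    }
}
```

Comparing a decoded byte against 0x00000080 without masking would never match, which is presumably the kind of missing-value bug the commit describes for byte and short decoding.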
Mark DePristo c0c4599fe1 Cleanup naming of encoder functions for clarity 2012-05-24 10:56:55 -04:00
Mark DePristo 1d39a9227b Low-level encoder / decoder unit tests and code cleanup
-- BCF2 encoder and BCF2 decoder are now fully tested, and are working correctly.
-- Code cleanup and reorganization to fix bugs encountered during testing.
2012-05-24 10:56:55 -04:00
Mark DePristo 443c83d4a7 Removed old STRING_BY_REF types
-- No more String by ref.  Everything is encoded as base datatypes, and codec looks up ints as dictionary strings as it likes
2012-05-24 10:56:55 -04:00
Mark DePristo 450f098a61 BCF2 encoder / decoder implement new site / genotype block organization
-- Supports final organization of data blocks into sites data and genotypes data
2012-05-24 10:56:55 -04:00
Mark DePristo 27b51d4dea Enable on the fly indexing of BCF2 2012-05-24 10:56:54 -04:00
Mark DePristo 81ab0dd051 Implement separate data blocks for sites and genotypes data 2012-05-24 10:56:54 -04:00
Mark DePristo fd988274c1 Separate the BCF2 codec from the BCF2 decoder
-- Decoder is a low-level reader of underlying data from a BCF2 encoded stream
-- Codec uses the decoder to build a VC from the stream
-- Separation key for upcoming UnitTest framework that will ensure correctness of low-level decoder / encoder before optimization of the encoder / decoder starts
2012-05-24 10:56:54 -04:00
Mark DePristo 81bd7646d6 Fix for MISSING floats
-- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int).
-- Now correctly handles MISSING qual fields
2012-05-24 10:56:53 -04:00
Mark DePristo 931b575748 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Support for string dictionary at standard positions in the VCF
2012-05-24 10:56:53 -04:00
Mark DePristo 3afbc50511 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from Intellij
2012-05-24 10:56:52 -04:00
Mark DePristo 7541b523e1 BCF improvements
-- now reads / writes untyped chrom offset, start, refLength, and qual at start of records
-- canDecode() works for BCF2
2012-05-24 10:56:51 -04:00
Mark DePristo 0799855479 Archiving GCF
-- Rider update to CramByPiece.scala
2012-05-24 10:56:51 -04:00
Joel Thibault 085588cb04 Not Nexus. Need new name. Navel? 2012-05-24 10:11:58 -04:00
Guillermo del Angel 43919078cd Merged bug fix from Stable into Unstable 2012-05-23 21:21:01 -04:00
Guillermo del Angel 4bc04e2a9e Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler. 2012-05-23 21:19:30 -04:00
Guillermo del Angel 7fe07a4ae6 Bug fix: prevent index out of bounds error if reference sample in pool caller has a call present at a site but genotype is a no-call allele 2012-05-22 21:06:53 -04:00
Joel Thibault dad75babf1 Increase Queue memory limits to 16 GB 2012-05-22 10:50:47 -04:00
Joel Thibault af3d73b884 Re-enable partitioning for Mongo reads (but not writes) 2012-05-22 10:50:47 -04:00
Ryan Poplin 692addb498 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-22 10:25:03 -04:00
Ryan Poplin c3fb321014 Minor updates to pacbio data processing script to make it work with the latest bwa version/settings. 2012-05-22 10:24:45 -04:00
Christopher Hartl d366cce714 Initial commit of a burden testing framework. Currently tests against only one phenotype and only one weighting function, but computes robust weighted dosages and calls into an R script that calculates both a direct GLM LRT and asymptotic normal p-values. Weights are currently read in from an external file (beta-values). Future work is to let these be calculated on the fly from e.g. annotation, potential impact, conservation, etc., and to enable multiple weighting schemes tested jointly for association against multiple phenotypes. 2012-05-21 16:56:32 -04:00
Ryan Poplin 08dfd6cab6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 16:47:07 -04:00
Ryan Poplin 04000d920c Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads. 2012-05-21 16:46:59 -04:00
Khalid Shakir 94cd4e6a7d Updated WGP min confidence from 4 to 10 based on recommendations from depristo and ebanks. 2012-05-21 16:41:45 -04:00
Eric Banks 666862af19 Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs 2012-05-21 16:03:29 -04:00
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
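
A minimal runnable sketch of the fix described in e57cd78bba, assuming the adapter owns the resource it is handed. `TrackedResource` and `Wrapper` are hypothetical names for illustration, not classes from the GATK code base; the point is only that `close()` on the adapter forwards to the wrapped resource.

```java
import java.io.Closeable;
import java.io.IOException;

public class ClosingWrapperDemo {
    /** Hypothetical resource that records whether close() was called on it. */
    static class TrackedResource implements Closeable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Hypothetical adapter that owns the wrapped resource and forwards close() to it. */
    static class Wrapper implements Closeable {
        private final Closeable wrapped;

        Wrapper(Closeable wrapped) {
            this.wrapped = wrapped;
        }

        @Override
        public void close() throws IOException {
            // Forward the request: the caller may hold no other reference to the resource.
            wrapped.close();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackedResource resource = new TrackedResource();
        try (Wrapper wrapper = new Wrapper(resource)) {
            // Work with the wrapper here; it is closed when the block exits.
        }
        System.out.println("resource closed: " + resource.closed);
    }
}
```

A unit test for such an adapter can follow the same shape: wrap a tracking resource, close the adapter, and check the flag.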
Eric Banks 7f5ec17d22 Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table. 2012-05-21 14:16:13 -04:00
Joel Thibault 27c46b8071 Better matching and searching between sites and samples 2012-05-21 09:50:49 -04:00
Joel Thibault 8fb6fc9ff9 Contigs as blocks are too large for MongoDB documents 2012-05-21 09:50:49 -04:00
Eric Banks c1c70f3b41 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 09:39:08 -04:00
Eric Banks 92d8aa3d4c Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels 2012-05-21 09:38:52 -04:00
Guillermo del Angel 5cc9a12fbb Fixed definition in VCF header for pool caller genotype parameters MLAC and MLAF 2012-05-19 14:53:37 -04:00
Eric Banks 3af3834d50 Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian):
* For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE.  Switched over to booleans.
* We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to stdout.
2012-05-18 11:55:41 -04:00
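
The first bug in 3af3834d50 is a general Java hazard: using a null `Boolean` in a condition auto-unboxes it and throws a NullPointerException. The sketch below illustrates this with hypothetical names; it is not the argument descriptor code itself.

```java
public class BoxedBooleanDemo {
    /** Returns true if using the boxed flag in a condition throws a NullPointerException. */
    static boolean unboxingThrows(Boolean flag) {
        try {
            if (flag) {
                // Reaching this point means auto-unboxing (flag.booleanValue()) succeeded.
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Null-safe check for a boxed flag: null is treated as "not set". */
    static boolean isSet(Boolean flag) {
        return Boolean.TRUE.equals(flag);
    }

    public static void main(String[] args) {
        System.out.println("null Boolean throws on unboxing: " + unboxingThrows(null));
        System.out.println("null treated as unset: " + !isSet(null));
    }
}
```

Switching the field to a primitive `boolean`, as the commit does, removes the null state altogether, so no check is needed.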
Eric Banks 26968ae8eb Forgot that the VCFStreamingIntegrationTest uses VE 2012-05-18 02:51:53 -04:00
Eric Banks 52c206d5db Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports. 2012-05-18 02:32:20 -04:00
Eric Banks 03d40272c8 Removed old GATKReport code and moved the new stuff in its place. 2012-05-18 01:44:31 -04:00
Eric Banks a26b04ba17 Extensive refactoring of the GATKReports. This was a beast.
The practical differences between version 1.0 and this one (v1.1) are:

* the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables.
* no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table.
* no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables.

Integration tests change because table headers are different.
Old classes are still lying around.  Will clean those up in a subsequent commit.
2012-05-18 01:11:26 -04:00
Guillermo del Angel 5189b06468 New annotation for indels that describes whether they're STRs and their characteristics. If an indel is an STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality. 2012-05-17 15:28:19 -04:00
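
The RU/RPA example in 5189b06468 can be reproduced with a small sketch. It assumes each allele's sequence context is available as a plain string and that the repeat unit is already known. The class and method names are hypothetical, and this is not how the GATK annotation is implemented.

```java
public class RepeatUnitDemo {
    /** Counts how many back-to-back copies of unit appear at the start of sequence. */
    static int countLeadingRepeats(String sequence, String unit) {
        if (unit.isEmpty()) {
            throw new IllegalArgumentException("repeat unit must be non-empty");
        }
        int count = 0;
        int pos = 0;
        while (sequence.startsWith(unit, pos)) {
            count++;
            pos += unit.length();
        }
        return count;
    }

    /** Builds a comma-separated RPA value, one count per allele, in the order given. */
    static String rpaField(String unit, String... alleles) {
        StringBuilder sb = new StringBuilder();
        for (String allele : alleles) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(countLeadingRepeats(allele, unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Reference context first, then the two alternate contexts from the commit message.
        String rpa = rpaField("AT", "ATATAT", "ATAT", "ATATATAT");
        System.out.println("RU=AT;RPA=" + rpa); // prints RU=AT;RPA=3,2,4
    }
}
```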
David Roazen 9c6bccfd8b build system overhaul
* Added support for a protected directory whose contents are only made public in binary form

* Simplified and reorganized build.xml to improve readability and maintainability

* build.xml now autodetects most build properties:
    -Includes private/protected if they exist
    -No more STING_BUILD_TYPE or specialized targets for public-only, etc.

* Build targets have changed! There are now two main build options:

"ant"       build everything (GATK and Queue)
"ant gatk"  build just the GATK

It was too hard to build everything before -- now it is the default.

* To run tests with debugging, use -Dtest.debug=true -Dtest.debug.port=XXXX on the command line.
  Much better than the old comment/uncomment method!
2012-05-17 15:16:29 -04:00
Eric Banks 0f7c917e7a Better error checking and messages for bad alleles 2012-05-17 13:36:42 -04:00