Commit Graph

48 Commits (ee7d85c6e608fc58ede79f08f7dbc8de9ebd3019)

Author SHA1 Message Date
David Roazen 07b369ca7e Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.

-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)

-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)

-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
2012-12-19 10:25:22 -05:00
Eric Banks 525cf331f4 Don't catch a User Error and re-throw as a Reviewed Exception. That makes Eric unhappy. 2012-11-13 11:52:47 -05:00
Khalid Shakir 97dc3664c9 Fixed yet another NPE related to the ArgumentTypeDescriptor vs. ArgumentMatchValue. Added integration test based on GSA-621. 2012-10-22 12:05:32 -04:00
Mauricio Carneiro 3504f71b6b Fixing a null pointer exception bug for DEV-10 2012-10-18 13:58:38 -04:00
kshakir 213cc00abe Refactored argument matching to support other plugins in addition to file lists.
Added plugin support for sending Queue status messages.
Argument parsing can store subclasses of java.io.File, for example RemoteFile.
2012-10-15 15:10:45 -04:00
Mauricio Carneiro 80d92e0c63 Allowing the GATK to have non-required outputs
Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation).

This change enables the nWayOut parameter for the IndeRealigner and ReduceReads to operate optionally while maintaining the original single way out.

[#DEV-10 transition:31 resolution:1]
2012-10-15 13:49:08 -04:00
Eric Banks 85525d9e6e Make Geraldine's life easier: from now on we treat problems where a temp file cannot be found when running the GATK with multiple threads as User Errors (since they are 99.9% of the time). This is an extremely large class of errors in Tableau and on the forums. Helpful error message tells users exactly what we tell them on the forums anyways (Geraldine: feel free to edit). 2012-10-12 09:19:50 -04:00
Mark DePristo 9fd30d6f1c When writing the initial commit for nt + nct I realized this class was really just a ThreadGroupOutputTracker
-- The code is cleaner and the logical more obvious now.
2012-09-24 14:15:36 -04:00
Mark DePristo a6b3497eac Fixes GSA-515 Nanoscheduler GSA-577 -nt and -nct together appear to not close resources properly
-- Fixes monster bug in the way that traversal engines interacted with the NanoScheduler via the output tracker.
-- ThreadLocalOutputTracker is now a ThreadBasedOutputTracker that associates via a map from a master thread -> the storage map.  Lookups occur by walking through threads in the same thread group, not just the thread itself (TBD -- should have a map from ThreadGroup instead)
-- Removed unnecessary debug statement in GenomeLocParser
-- nt and nct officially work together now
2012-09-24 14:15:35 -04:00
Mark DePristo b5fa848255 Fix GSA-515 Nanoscheduler GSA-573 -nt and -nct interact badly w.r.t. output
-- See https://jira.broadinstitute.org/browse/GSA-573
-- Uses InheritedThreadLocal storage so that children threads created by the NanoScheduler see the parent stubs in the main thread.
-- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.
2012-09-20 18:45:16 -04:00
Eric Banks 4bb7a99f08 Given that all classes implementing output stubs already have getters for the underlying OutputStream and File, it makes sense to unify that functionality into the Stub interface. Now it is possible to have an Engine utility method that iterates over all registered stubs to find the one representing a given OutputStream and return the File associated with it. 2012-09-12 11:51:44 -04:00
Mark DePristo 9bf1d138d9 New GATK argument interface for data and cpu threads
-- Closes GSA-515 Nanoscheduler GSA-542 Good interface to nanoScheduler
-- Old -nt means dataThreads
-- New -cnt (--num_cpu_threads_per_data_thread) gives you n cpu threads for each data thread in the system
-- Cleanup logic for handling data and cpu threading in HMS, LMS, and MS
-- GATKRunReport reports the total number of threads in use by the GATK, not just the nt value
-- Removed the io,cpu tags for nt.  Stupid system if you ask me.  Cleaned up the GenomeAnalysisEngine and ThreadAllocation handling to be totally straightforward now
2012-09-05 15:45:24 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, of handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR is excepting in parallel, which subsequent commit with fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Mark DePristo 52bfe8db8a Make sure the storage writer is closed before running mergeInfo in multi-threaded output management
-- It's not clear this is cause of GSA-484 but it will help confirm that it's not the cause
2012-08-16 13:03:13 -04:00
Mark DePristo 7a247df922 Added -bcf argument to VCFWriter output to force BCF regardless of file extension
-- Now possible to do -o /dev/stdout -bcf -l DEBUG > tmp.bcf and create a valid BCF2 file
-- Cleanup code to make sure extensions easier by moving to a setX model in VariantContextWriterStub
2012-08-16 13:03:13 -04:00
Mark DePristo 9dc694b2e9 Meaningful error message and keeping tmp file when mergeInfo fails
-- BCF2 is failing for some reason when merging tmp. files with parallel combine variants.  ThreadLocalOutputTracker no longer sets deleteOnExit on the tmp file, as this prevents debugging.  And it's unnecessary because each mergeInto was deleting files as appropriate
-- MergeInfo in VariantContextWriterStorage only deletes the intermediate output if an error occurs
2012-08-16 13:03:13 -04:00
Mark DePristo bd7ed0d028 Enable efficient parallel output of BCF2
-- Previous IO stub was hardcoded to write VCF.  So when you ran -nt 2 -o my.bcf you actually created intermediate VCF files that were then encoded single threaded as BCF.  Now we emit natively per thread BCF, and use the fast mergeInfo code to read BCF -> write BCF.  Upcoming optimizations to avoid decoding genotype data unnecessarily will enable us to really quickly process BCF2 in parallel
-- VariantContextWriterStub forces BCF output for intermediate files
-- Nicer debug log message in BCF2Codec
-- Turn off debug logging of BCF2LazyGenotypesDecoder
-- BCF2FieldWriterManager now uses .debug not .info, so you won't see all of that field manager debugging info with BCF2 any longer
-- VariantContextWriterFactory.isBCFOutput now has version that accepts just a file path, not path + options
2012-08-15 21:13:15 -04:00
Eric Banks 6cb10cef96 Fixed older GS reported bug. Actually, the problem really lies in Picard (can't set max records in RAM without it throwing an exception, reported on their JIRA) so I just masked out the problem by removing this never-used argument from this rarely-used tool. 2012-07-31 16:00:36 -04:00
Mark DePristo c1ac0e2760 BCF2 cleanup
-- allowMissingVCFHeaders is now part of -U argument.  If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING.  Updated lots of files to use this
-- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup.  This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.
2012-06-26 15:28:33 -04:00
Mark DePristo 7dbba465ee Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf 2012-06-26 15:28:32 -04:00
Mark DePristo 9c81f45c9f Phase I commit to get shadowBCFs passing tests
-- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header.  This helps avoid some of the low-level errors I saw in SelectVariants.  This behavior can be disable in the engine with the --allowMissingVCFHeaders argument
-- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA
-- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample
-- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original
-- Performance optimizations in BCF2 codec and writer
    -- using arrays not lists for intermediate data structures
    -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over
-- VCFHeader (which needs a complete rewrite, FYI Eric)
    -- Warn and fix on the way flag values with counts > 0
    -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow.  Duplicates are detected once at header creation.
    -- Explicitly track FILTER fields for efficient lookup in their own hashmap
    -- Automatically add PL field when we see a GL field and no PL field
    -- Added get and has methods for INFO, FILTER, and FORMAT fields
-- No longer add AC and AF values to the INFO field when there's no ALT allele
-- Memory efficient comparison of VCF and BCF files for shadow BCF testing.  Now there's no (memory) constraint on the size of the files we can compare
-- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF
2012-06-21 15:16:26 -04:00
Mark DePristo 8b01969762 More code cleanup and optimizations to BCF2 writer
-- Cleanup a few contracts
-- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields
-- A few simple optimizations
    -- VCF header samples stored in String[] in the writer for fast access
    -- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset
    -- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files
-- A few editorial comments in VCFHeader
2012-06-14 16:42:38 -04:00
Mark DePristo 2a86b81a3f Initial version of clean, fast formatting routines built dynamically from a VCF header
-- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level.
-- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values
-- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder
-- Allowed us to naturally support encoding of lists of strings
-- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion
-- Fixes for integration test failures
-- Enabling contig updates
-- WalkerTest now prints out relative paths where possible to make cut/paste/run easier
2012-06-14 16:42:30 -04:00
Mark DePristo a648b5e65e First step towards an efficient Genotype object
-- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness.  Tested utility of this approach by rewritting -- and then commenting out -- a path in BCF2Codec that could use this new code.  Much cleaner interface now, but not yet hooked up to anything
-- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class
-- Code cleanup.  Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS.  Uses in code replaced with constant
2012-06-14 16:42:23 -04:00
Mark DePristo 9eb83a0771 Enable adding contigs to VariantContextWriters on output 2012-06-14 16:42:23 -04:00
Mark DePristo 7280cdf937 Bugfixes and testdata cleanup
-- Cut down the size of a few large files in public/testdata that were only used in part
-- Refactor vcf Filename => shadow BCF filename to BCF2Utils.  Fix bug in WalkerTest due to the way this was handled previously
2012-05-24 13:26:05 -04:00
Mark DePristo e9c22b9aad Final updates to integration tests for BCF2
-- Fully working version
-- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf
-- Moved MedianUnitTest to its proper home in Utils
-- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests.  From this website it's easy to see md5 diffs, etc.  This is a vastly better way to manage unit and integration test output
2012-05-24 10:58:59 -04:00
Mark DePristo 6ca71fe3b4 GATK tests use public/testdata not /humgen/ as much as possible 2012-05-24 10:58:58 -04:00
Mark DePristo f77d2e6965 Renamed NO_HEADER to the more accurate no_cmdline_in_header
-- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well
2012-05-24 10:57:08 -04:00
Mark DePristo 4846bf5c8e @Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf
-- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically
2012-05-24 10:57:07 -04:00
Mark DePristo 40431890be -- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it
-- BCF2 writer can now work without the contig lines being in the header
-- Made GenomeLocParser a final class
2012-05-24 10:57:06 -04:00
Mark DePristo afd2f1a3f9 Individual VariantContextWriters are now package protected
-- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it
-- Update build.xml to build vcf.jar with updated paths and bcf2 support.
2012-05-24 10:57:00 -04:00
Mark DePristo 24864fd5b0 GATK now writes BCF output to any file with .bcf extension
-- Moved VCF and BCF writers to variantcontext.writers
-- Updated vcf.jar build path
-- Refactored VCFWriter and other code.  Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory.  Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.
2012-05-24 10:57:00 -04:00
Mark DePristo c3b8048e2e Moving around classes in VCF and BCF2
-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
2012-05-24 10:56:58 -04:00
Mark DePristo 3afbc50511 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from Intellij
2012-05-24 10:56:52 -04:00
Eric Banks 3af3834d50 Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian):
* For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE.  Switched over to booleans.
* We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to sdout.
2012-05-18 11:55:41 -04:00
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Khalid Shakir 91cb654791 AggregateMetrics:
- By porting from jython to java now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
2012-04-17 11:45:32 -04:00
Mark DePristo b7511c5ff3 Fixed long-standing bug in tribble index creation
-- Previously, on the fly indices didn't have dictionary set on the fly, so the GATK would read, add dictionary, and rewrite the index.  This is now fixed, so that the on the fly index contains the reference dictionary when first written, avoiding the unnecessary read and write
-- Added a GenomeAnalysisEngine and Walker function called getMasterSequenceDictionary() that fetches the reference sequence dictionary.  This can be used conveniently everywhere, and is what's written into the Tribble index
-- Refactored tribble index utilities from RMDTrackBuilder into IndexDictionaryUtils
-- VCFWriter now requires the master sequence dictionary
-- Updated walkers that create VCFWriters to provide the master sequence dictionary
2011-09-20 10:53:18 -04:00
Mark DePristo 40e06f9afb Fixed broken RodBinding defaults.
-- Verified now to be correct at runtime
-- UnitTest covers this
-- createTypeDefault now takes a Type, not a Class, so that parameterized classes can have their parameter fetched in the defaults.
2011-08-11 08:58:30 -04:00
Mark DePristo 2007d2fcad Better documentation for default value fields
-- DocString function for types that create default outputs "stdout"
-- RodBinding now creates a makeUnbound default value automatically for you if your RodBinding isn't required
-- Removed warning about sparse help from TextFormattingUtils
2011-08-10 22:16:22 -04:00
Eric Banks 78aa6db076 added the 'reference' header line too. We are now header-compliant for vcf4.1. 2011-08-09 11:45:54 -04:00
Eric Banks ec76bf6d4a VCF headers now include 'contig' lines describing the name, length, and assembly (when easily parsable) for each contig in the reference. 2011-08-09 11:24:48 -04:00
Mark DePristo 79e4a8f6d3 Merge
Conflicts:
	private/java/src/org/broadinstitute/sting/gatk/walkers/qc/TestVariantContextWalker.java
	public/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
	public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantDataManager.java
	public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
	public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantValidationAssessor.java
	public/java/test/org/broadinstitute/sting/gatk/walkers/recalibration/RecalibrationWalkersIntegrationTest.java
	public/java/test/org/broadinstitute/sting/gatk/walkers/recalibration/RecalibrationWalkersPerformanceTest.java
	public/java/test/org/broadinstitute/sting/gatk/walkers/varianteval/VariantEvalIntegrationTest.java
	public/java/test/org/broadinstitute/sting/utils/variantcontext/VariantContextIntegrationTest.java
2011-08-03 15:09:47 -04:00
Eric Banks 7c89fe01b3 Instead of having the padded reference base be some hackish attribute it is now an actual variable in the Variant Context class. More importantly, we now always require that it be present when padding is necessary - and validate as such upon construction of the VC. This cleans up the interface significantly because we no longer require that a reference base be passed in when writing a VC/VCF record. 2011-08-03 11:00:36 -04:00
Mark DePristo a6691ab2fd List<RodBinding<T>> now working (sort of).
At least the argument parsing system tolerates it.
2011-07-29 16:11:22 -04:00
Mark DePristo 9992c373be Optimize imports run on the whole project, public and private. I just got too tired of all of the unused imports floating around. Confirmed that the system builds after the changes. 2011-07-17 20:29:58 -04:00
David Roazen 3c9497788e Reorganized the codebase beneath top-level public and private directories,
removing the playground and oneoffprojects directories in the process. Updated
build.xml accordingly.
2011-06-28 06:55:19 -04:00