gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	567dba0f76	Cleanup of VCF header lines and constants, BCF2 bugfixes -- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller -- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers -- VCF parsers now automatically repair standard VCF header lines when reading the header -- Updating integration tests to reflect header changes -- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test -- SelectHeaders now always updates the header to include the contig lines -- SelectVariants add UG header lines when in regenotype mode -- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY -- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs) -- Throw error when VCF has unbounded non-flag values that don't have = value bindings -- By default we no longer allow writing of BCF2 files without contig lines in the header	2012-06-21 15:16:31 -04:00
Mark DePristo	fba7dafa0e	Finalizing BCF2 mark III commit -- Moved GENOTYPE_KEY vcf header line to VCFConstants. This general migration and cleanup is on Eric's plate now -- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header. Still doesn't work... -- Updating integration test files. Moved many more files into public/testdata. Updated their headers to all work correctly with new strict VCF header checking. -- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt -- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine. DB = 0 is never seen in the output VCFs now -- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status -- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be -- VariantsToVCF now properly writes out the GT FORMAT field -- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code. Eric said he and Ami will clean up this whole piece of infrastructure -- Fixed bug in BCF2Codec that wasn't setting the phase field correctly. UnitTested now -- PASS string now added at the end of the BCF2 dictionary after discussion with Heng -- Fixed bug where I was writing out all field values as BigEndian. Now everything is LittleEndian. -- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException -- Cleaned up unused code -- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding -- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples -- We always write the number of genotype samples into the BCF2 nSamples header. How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names... -- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally -- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele -- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values. Genotype objects are all PASS implicitly, or explicitly filtered. We only write out the FT values if at least one sample is filtered. Removed interface functions and cleaned up code -- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works. In general, ** NEVER COPY CODE ** if you need to share functionality make a function, that's why they were invented! -- Increased the default number of records to read for DiffObjects to 1M	2012-06-21 15:16:27 -04:00
Mark DePristo	9c81f45c9f	Phase I commit to get shadowBCFs passing tests -- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header. This helps avoid some of the low-level errors I saw in SelectVariants. This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument -- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA -- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample -- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original -- Performance optimizations in BCF2 codec and writer -- using arrays not lists for intermediate data structures -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over -- VCFHeader (which needs a complete rewrite, FYI Eric) -- Warn and fix on the way flag values with counts > 0 -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow. Duplicates are detected once at header creation. -- Explicitly track FILTER fields for efficient lookup in their own hashmap -- Automatically add PL field when we see a GL field and no PL field -- Added get and has methods for INFO, FILTER, and FORMAT fields -- No longer add AC and AF values to the INFO field when there's no ALT allele -- Memory efficient comparison of VCF and BCF files for shadow BCF testing. Now there's no (memory) constraint on the size of the files we can compare -- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF	2012-06-21 15:16:26 -04:00
Mauricio Carneiro	ab53220635	Refactor on how RR treats soft clips * Sites with more soft clipped bases than regular will force-trigger a variant region * No more unclipping/reclipping, RR machinery now handles soft clips natively. * implemented support for base insertion and base deletion quality scores in synthetic and regular reads. * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present. note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!	2012-06-21 14:02:03 -04:00
Ryan Poplin	769e190202	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-20 09:59:55 -04:00
Christopher Hartl	fe1d6e3953	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-19 08:02:00 -04:00
Christopher Hartl	79ef3325bd	Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations.	2012-06-19 07:50:13 -04:00
Eric Banks	62cee2fb5b	Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature.	2012-06-18 21:36:27 -04:00
Eric Banks	4393adf9e7	If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it.	2012-06-18 13:36:14 -04:00
Ryan Poplin	707151f0a4	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 12:55:58 -04:00
Eric Banks	82a2c40338	Emit the MLE AC and AF in the INFO field of the UG output	2012-06-18 12:19:36 -04:00
Ryan Poplin	5ec737f008	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 08:51:48 -04:00
Ryan Poplin	e3147969d9	Smith Waterman parameters have somehow diverged too far from what is used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled.	2012-06-18 08:51:41 -04:00
Eric Banks	677babf546	Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort.	2012-06-15 15:55:03 -04:00
Eric Banks	783b7f6899	Misc cleanup	2012-06-15 10:39:19 -04:00
Eric Banks	0c218e4822	Refactoring mostly for readability (and small performance improvement)	2012-06-15 10:36:41 -04:00
Eric Banks	c54e84e739	Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations.	2012-06-15 09:28:56 -04:00
Eric Banks	61fcbcb190	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-15 02:45:57 -04:00
Eric Banks	4895fe2289	No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java)	2012-06-15 02:32:00 -04:00
Mark DePristo	0384ce5d34	Simple optimizations for BCF2Encoder -- Inline encodeString that doesn't go via List<Byte> intermediate -- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2 -- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation -- Final UG integration test update	2012-06-14 16:42:39 -04:00
Mark DePristo	68eed7b313	Optimizations for VCF and BCF2 -- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking. encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version. Lots of contracts. User code updated to use specialized versions where possible -- Misc code refactoring -- Updated VCF float formatting to always include 3 sig digits for values < 1, and 2 for > 1. Updating MD5s accordingly -- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations	2012-06-14 16:42:39 -04:00
Mark DePristo	09df584788	Fixed nasty bug where we weren't closing the underlying PositionalOutputStream in IndexingVariantContextWriter	2012-06-14 16:42:39 -04:00
Mark DePristo	fbc45e14d3	Cleanup formatting of VCF floats -- Final integrationtest update before commit (and fixing new formatting changes)	2012-06-14 16:42:38 -04:00
Mark DePristo	8b01969762	More code cleanup and optimizations to BCF2 writer -- Cleanup a few contracts -- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields -- A few simple optimizations -- VCF header samples stored in String[] in the writer for fast access -- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset -- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files -- A few editorial comments in VCFHeader	2012-06-14 16:42:38 -04:00
Mark DePristo	e34ca0acb1	Passing all unittests -- Final merge conflicts resolved -- BCF2Writer now supports case where a sample is present in the header but the sample isn't in the VC, in which case we create an empty sample and encode that	2012-06-14 16:42:38 -04:00
Mark DePristo	71da76039e	Final support for variable length lists of strings in BCF2 -- Updating many MD5s as well.	2012-06-14 16:42:38 -04:00
Mark DePristo	bd9d40fb84	Code cleanup and more documentation for BCFFieldWriters -- Update integration tests where appropriate	2012-06-14 16:42:37 -04:00
Mark DePristo	856905ee5b	Cleanup Genotypes -- Renamed getAttribute to getExtendedAttribute, as this is really what this function does -- Added a few more genotype tests	2012-06-14 16:42:36 -04:00
Mark DePristo	31997f8092	Bugfixes on the way to passing integration tests -- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate -- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it. Very expensive but convenient. -- Fixed nasty subsetting bug in SelectVariants with excluding samples -- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT -- Bugfix for dropping old style GL field values -- Added test to VCFWriter to ensure that we have the same number of samples in the VC as in the header -- Bugfix for Allele.getBaseString to properly show NO_CALL alleles -- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes	2012-06-14 16:42:33 -04:00
Mark DePristo	ea1b699778	Cleanup the interface for BCF2FieldEncoder -- Now uses a much clearer approach. Update all user classes to new interface	2012-06-14 16:42:33 -04:00
Mark DePristo	dd6aee347a	Genotype encoding uses the BCF2FieldEncoder system	2012-06-14 16:42:33 -04:00
Mark DePristo	9ac4203254	GenotypeAnnotations now accept a GenotypeBuilder and directly update the builder with their value -- Cleans up interface and avoids significant amounts of gross typing code	2012-06-14 16:42:32 -04:00
Mark DePristo	7506994d09	Nearing final BCF commit -- Cleanup some (but not all) VCF3 files. Turns out there are lots so... -- Refactored genotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec. Now VCF3 properly handles the new GenotypeBuilder interface -- Misc. bugfixes in GenotypeBuilder	2012-06-14 16:42:32 -04:00
Mark DePristo	6272612808	Testing utility to perform diffs N times	2012-06-14 16:42:32 -04:00
Mark DePristo	8014178f2f	Algorithmically faster version of DiffEngine -- Now only includes leaf nodes in the summary, i.e., summaries of the form ".....*.X", which are really the most valuable to see. This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm -- Now computes the max number of elements to read correctly. Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size. -- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well. -- Added integration test for new leaf and old pairwise calculations -- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields	2012-06-14 16:42:30 -04:00
Mark DePristo	2a86b81a3f	Initial version of clean, fast formatting routines built dynamically from a VCF header -- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level. -- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values -- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder -- Allowed us to naturally support encoding of lists of strings -- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion -- Fixes for integration test failures -- Enabling contig updates -- WalkerTest now prints out relative paths where possible to make cut/paste/run easier	2012-06-14 16:42:30 -04:00
Mark DePristo	51a3b6e25e	No more makePrecisionFormatStringFromDenominatorValue -- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formatting. -- Created a smart float formatter in VCFWriter, with unit tests -- Removed makePrecisionFormatStringFromDenominatorValue and its uses -- Fix broken contract -- Refactored some code from the encoder to utils in BCF2 -- HaplotypeCaller's GenotypingEngine was using old version of subset to context. Replaced with a faster call that I think is correct. Ryan, please confirm.	2012-06-14 16:42:30 -04:00
Mark DePristo	43ad890fcc	Finalizing BCF2 v2 -- FastGenotypes are the default in the engine. Use --useSlowGenotypes engine argument to return to old representation -- Cleanup of BCF2Codec. Good error handling. Added contracts and docs. -- Added a few more contracts and docs to BCF2Decoder -- Optimized encodePrimitive in BCF2Encoder -- Removed genotype filter field exceptions -- Docs and cleanup of BCF2GenotypeFieldDecoders -- Deleted unused BCF2TestWalker -- Docs and cleanup of BCF2Types -- Faster version of decodeInts in VCFCodec -- BCF2Writer -- Support for writing a sites only file -- Lots of TODOs for future optimizations -- Removed lack of filter field support -- No longer uses the alleleMap from VCFWriter, which was an Allele -> String, now uses Allele -> Integer which is faster and more natural -- Lots of docs and contracts -- Docs for GenotypeBuilder. More filter creation routines (unfiltered, for example) -- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters. Better genotype comparisons	2012-06-14 16:42:29 -04:00
Mark DePristo	37e5d32019	Remove logger.info statement	2012-06-14 16:42:29 -04:00
Mark DePristo	01ddf9555a	Performance optimizations for Genotype field decoding for GT field -- Fast path decoder for biallelic diploid GT fields that avoids allocating the same genotypes over and over -- Contracts -- final classes	2012-06-14 16:42:28 -04:00
Mark DePristo	7fbca7013e	Don't add missing value binding from field to Genotype object in VCF3Codec	2012-06-14 16:42:28 -04:00
Mark DePristo	4a4d3cde3d	UnitTests for decodeIntArray method	2012-06-14 16:42:27 -04:00
Mark DePristo	5b8bd81991	An option to not actually write out the results of select variants -- Useful for performance testing of the SV operations themselves.	2012-06-14 16:42:26 -04:00
Mark DePristo	6f7a01e00d	Bugfix for BCF2 reader / writer for > 0x0FFF samples :-) -- Should be 0x00FFFFFF in the mask	2012-06-14 16:42:26 -04:00
Mark DePristo	1d4eb46606	Efficient reading of genotype fields v1 -- decodeIntArray in BCF2 decoder allows us to more efficiently read ints and int[] from stream directly into Genotype object -- Code cleanup / contracts added where appropriate -- V2 will have a yet more optimized path...	2012-06-14 16:42:26 -04:00
Mark DePristo	37b8d70321	Hidden option to SelectVariants to force the genotypes information to be decoded by computing AC	2012-06-14 16:42:25 -04:00
Mark DePristo	17fbd103d0	Smarter infrastructure to decode genotypes in BCF -- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder. The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2 -- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder. In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form -- Reduced the amount of data further to be computed in the DiffEngine. The DiffEngine algorithm needs to be rethought to be efficient...	2012-06-14 16:42:25 -04:00
Mark DePristo	889e3c4583	Code cleanup before major refactor	2012-06-14 16:42:25 -04:00
Mark DePristo	cebd37609c	Finalizing new Genotype object and associated routines -- Builder now provides a deprecated log10pError function to make a new GQ value -- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions -- Lots of contracts -- Bugfixes throughout	2012-06-14 16:42:25 -04:00
Mark DePristo	8b0a629a31	Terrible bugfix -- The way I was handling the contig offset ordering wasn't correct. Now the contigs are always indexed in the order in which their corresponding populate() functions are called, so that the order of the contigs is given by the order in which they are in the file, or in our refDict. It has nothing to do with the contig index itself. -- SelectVariants no longer prints all samples to the screen if you aren't selecting any explicitly	2012-06-14 16:42:24 -04:00
Mark DePristo	d37a8a0bc8	Efficient Genotype object Intermediate commit -- Created a new Genotype interface with a more limited set of operations -- Old genotype object is now SlowGenotype. New genotype object is FastGenotype. They can be used interchangeably -- There's no way to create Genotypes directly any longer. You have to use GenotypeBuilder just like VariantContextBuilder -- Modified lots and lots of code to use GenotypeBuilder -- Added a temporary hidden argument to engine to use FastGenotype by default. Current default is SlowGenotype -- Lots of bug fixes to BCF2 codec and encoder. -- Feature additions -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes -- Cleaned up semantics of subContextFromSamples. There's one function that either rederives or not the alleles from the subsetted genotypes -- MASSIVE BUGFIX in SelectVariants. The code has been decoding genotypes always, even if you were not subsetting down samples. Fixed!	2012-06-14 16:42:24 -04:00
Mark DePristo	a648b5e65e	First step towards an efficient Genotype object -- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness. Tested utility of this approach by rewriting -- and then commenting out -- a path in BCF2Codec that could use this new code. Much cleaner interface now, but not yet hooked up to anything -- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code base passes integration tests, before switching the code to new Genotype class -- Code cleanup. Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS. Uses in code replaced with constant	2012-06-14 16:42:23 -04:00
Mark DePristo	ff9ac4b5f8	BCF2 genotype decoding is now lazy -- Refactored BCF2Codec into a LazyGenotypesDecoder object that provides on-demand genotype decoding of BCF2 data blocks a la VCFCodec. -- VCFHeader has getters for sampleNamesInOrder and sampleNameToOffset instead of protected variables directly accessed by vcfcodec	2012-06-14 16:42:23 -04:00
Mark DePristo	9eb83a0771	Enable adding contigs to VariantContextWriters on output	2012-06-14 16:42:23 -04:00
Mark DePristo	b0ea14ef0f	VCFHeader getMetaData returns 4.1 version not 4.0	2012-06-14 16:42:22 -04:00
Mauricio Carneiro	7d12429917	First step towards indel qualities in RR Let the BI's and BD's pass through the reduce reads machinery	2012-06-14 15:37:39 -04:00
Mauricio Carneiro	e68038c5d8	Refactor post-processing downsampling using David's generic downsampler interface	2012-06-14 15:37:32 -04:00
Eric Banks	de5508fcea	Bug fixes for cycle and context covariates	2012-06-14 13:01:14 -04:00
Eric Banks	5c3c6cbc40	Long -> long conversions in BQSR	2012-06-14 09:07:02 -04:00
Eric Banks	29a74908bb	The next round of BQSR optimizations: no more Long[] array creation	2012-06-14 00:05:42 -04:00
Guillermo del Angel	cd2074b1dc	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 20:59:30 -04:00
Guillermo del Angel	92669a0468	Second intermediate commit for indel pool caller - now works (more or less) in reference sample-free mode. Still needs a lot of cleanups/add more tests and not done w/refactoring quite yet	2012-06-13 20:59:17 -04:00
David Roazen	0550b27799	Make downsampler classes themselves generic (instead of just the Downsampler interface) This is in response to a request from Mauricio to make it easier to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords) without having to do any cumbersome typecasting. Sadly, Java language limitations make this sort of solution the best choice. Thanks to Khalid for his feedback on this issue. Also: -added a unit test to verify GATKSAMRecord support with no typecasting required -added some unit tests for the FractionalDownsampler that Mauricio will/might be using -moved classes from private to public to better sync up with my local development branch for engine integration	2012-06-13 16:43:39 -04:00
Guillermo del Angel	67c0569f9c	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 11:50:00 -04:00
Eric Banks	81993b08e2	Don't put null entries into the key array	2012-06-13 11:43:44 -04:00
Roger Zurawicki	bdf5945dcc	Fixed bugs in DiagnoseTargets DT would not report bad mates! that has been fixed Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:26 -04:00
Roger Zurawicki	538cdf9210	Created the FindCoveredIntervals Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class Minor tweaks FindCoveredIntervals supports Gathering FindCoveredIntervals outputs an interval list instead of GATKReport Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:25 -04:00
Guillermo del Angel	aee66ab157	Big UG refactoring and intermediate commit to support indels in pool caller (not done yet). Lots of code pulled out of long spaghetti-like functions and modularized to be easily shareable. Add functionality in ErrorModel to count indel matches/mismatches (but left part disabled as not to change integration tests in this commit), add computation of pool genotype likelihoods for indels (not fully working yet in more realistic cases, only working in artificial nice pools). Lots of TBDs still but existing UG and pool SNP functionality should be intact	2012-06-13 11:14:44 -04:00
Eric Banks	37f56ce8fd	A couple of minor updates to BQSR	2012-06-12 16:12:13 -04:00
Eric Banks	277493dd83	Yet more instances of Lists changed over to native arrays	2012-06-12 15:56:09 -04:00
Eric Banks	613badc835	Very minor optimizations for the context covariate	2012-06-12 15:47:32 -04:00
Eric Banks	0f79adb2aa	Changing more Java Lists to native arrays in BQSR for performance optimization.	2012-06-12 15:41:01 -04:00
Eric Banks	1da3e43679	Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come.	2012-06-12 13:32:56 -04:00
Eric Banks	fec0bd5e11	Fixing UG argument docs	2012-06-12 09:46:16 -04:00
Eric Banks	a4defdfb29	Adding a GT header line to SomaticIndelDetector output	2012-06-12 09:39:17 -04:00
Eric Banks	891ce51908	Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements.	2012-06-12 09:19:36 -04:00
Eric Banks	ff5749599d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 15:46:17 -04:00
Eric Banks	fea625632f	Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one	2012-06-11 15:45:58 -04:00
Ryan Poplin	e4d371dc80	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 10:38:50 -04:00
Ryan Poplin	683d4b508e	Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller.	2012-06-11 10:38:35 -04:00
Mauricio Carneiro	4aad7e23ef	New ReduceReads v2 with unclipped variant regions and soft-clipped bases * Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it. * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads * Updated all integration tests	2012-06-08 14:58:31 -04:00
Eric Banks	afa9b2718a	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-08 13:54:48 -04:00
Eric Banks	92280b4068	BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime.	2012-06-08 13:54:37 -04:00
Eric Banks	898a0e6161	Minor optimizations	2012-06-08 12:07:58 -04:00
Ryan Poplin	0a37e19998	Bug fix in VQSR so that the VCF index will be created for the recalFile.	2012-06-08 11:51:28 -04:00
Eric Banks	d463ab2cbf	BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible.	2012-06-08 10:42:42 -04:00
Eric Banks	2bd48a7351	Bad comments made it into the previous commit	2012-06-07 23:12:56 -04:00
Eric Banks	31c3a6be48	BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR.	2012-06-07 20:04:10 -04:00
Eric Banks	0fb9179f76	BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array	2012-06-07 19:41:03 -04:00
Ryan Poplin	d449f169d3	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-07 10:56:55 -04:00
Ryan Poplin	0b4281fdd0	misc minor update to HC debug output for when there are a lot of samples	2012-06-07 10:56:41 -04:00
Eric Banks	bad50a1b05	Fix docs	2012-06-06 22:45:38 -04:00
Eric Banks	b093ba9dcc	Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records)	2012-06-06 15:17:30 -04:00
Eric Banks	54f682a99c	Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX.	2012-06-06 11:44:37 -04:00
Eric Banks	dd46d843fb	IR should skip Ion reads just like it does with 454 reads; Tim has confirmed that official platform name for Ion.	2012-06-06 11:04:55 -04:00
Guillermo del Angel	2cbd6e5f90	Merged bug fix from Stable into Unstable	2012-06-05 15:58:23 -04:00
Guillermo del Angel	ce4dc2128d	Adding minor clarification to -mbq argument documentation	2012-06-05 15:17:56 -04:00
Eric Banks	e02ec8c8b6	Don't update the record ID unless we are actually going to emit the record	2012-06-04 14:58:50 -04:00
Eric Banks	8405156ae1	Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities.	2012-06-04 14:28:32 -04:00
Ryan Poplin	f11e7ebc3a	Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA.	2012-06-04 12:49:36 -04:00
Ryan Poplin	320956ee4b	Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage.	2012-06-04 10:55:36 -04:00
Guillermo del Angel	7a54baf08c	Merged bug fix from Stable into Unstable	2012-06-03 08:42:08 -04:00
Guillermo del Angel	47df7bbc14	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-06-03 08:38:54 -04:00
Guillermo del Angel	2ddbdee3bc	Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow	2012-06-03 08:38:38 -04:00
Mauricio Carneiro	12a8c54f9a	Fixing VCF header for filter elements (thanks Eric)	2012-06-01 15:45:15 -04:00
Eric Banks	3a15ba2102	Malformed VCF headers should be User Errors	2012-05-31 16:05:53 -04:00
Khalid Shakir	c4f7df4dce	When an underlying exception occurs because of the user error, if the exception instance does not include a message instead of telling the user "because null", tell them "because <exception class name>".	2012-05-30 16:39:06 -04:00
Ryan Poplin	421d0d1435	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-30 15:21:35 -04:00
Ryan Poplin	5dd811f84a	Adding genotype given alleles mode to the HaplotypeCaller.	2012-05-30 15:07:01 -04:00
Eric Banks	d09b8d5584	Fixing docs	2012-05-30 13:24:08 -04:00
Mauricio Carneiro	d6e1205310	Updating default values for DiagnoseTargets	2012-05-30 12:43:07 -04:00
Khalid Shakir	c3c7f17d90	Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000. Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the nonexistent dir, now checking to see if the network directory is available and throwing a SkipException to bypass the test when it cannot be run. TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors pop up later when the networked fastas aren't found.	2012-05-29 11:18:22 -04:00
Roger Zurawicki	b8b139841d	DiagnoseTargets with working Q1,Median,Q3 - Merged Roger's metrics with Mauricio's optimizations - Added Stats for DiagnoseTargets - now has functions to find the median depth, and upper/lower quartile - the REF_N callable status is implemented - The walker now runs efficiently - Diagnose Targets accepts overlapping intervals - Diagnose Targets now checks for bad mates - The read mates are checked in a memory efficient manner - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker. - Fixed some bugs - Removed rod binding Added more Unit tests - Test callable statuses on the locus level - Test bad mates - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-05-29 10:16:45 -04:00
Eric Banks	50031b63c5	Fix possible NPE from NBaseCount annotation module	2012-05-29 09:46:00 -04:00
Mark DePristo	454c8e63e6	Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s -- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec. As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)	2012-05-28 20:20:05 -04:00
Mark DePristo	7ce24a96f1	PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present	2012-05-28 20:18:16 -04:00
Mark DePristo	1818c29371	Fixed long-standing bug in beagle codec that was passing on the header record for decoding	2012-05-28 20:17:26 -04:00
Mark DePristo	5894d045cb	Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests -- This version of BCF should actually work properly for most files, assuming headers are properly defined. -- Lots of bug fixes to BCF2 codec -- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL. NOTE THIS SEMANTICS change -- Equals() method for GenotypeLikelihoods, using PLs. -- VCFCodec no longer adds empty bindings to missing input field values. NOTE THIS CHANGE -- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work. The BCF2 codec now makes VCs marked as fully decoded -- stringToBytes returns empty list for null or "" string in BCF2Encoder -- Proper handling of genotype ordering in BCF2 reader / writer -- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily -- Many failing MD5s now due to double -> int change in GQ, will update later	2012-05-27 11:17:17 -04:00
Mark DePristo	86e5a066fc	Even more conservative limit on number of differences to summarize at 1000	2012-05-27 11:17:13 -04:00
Mark DePristo	31f4e5b52e	Stop unlimited runtimes in DiffEngine when you have lots of differences -- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000	2012-05-27 11:17:13 -04:00
Mauricio Carneiro	4109fcbb08	Merged bug fix from Stable into Unstable	2012-05-25 13:03:05 -04:00
Mauricio Carneiro	2be5704a25	Fixed haplotype boundary bug in PairHMMIndelErrorModel haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception). * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate. * removed unnecessary code to remove hard clips after processing. * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).	2012-05-25 13:00:45 -04:00
Guillermo del Angel	175bb35e70	Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former)	2012-05-25 12:56:23 -04:00
Mark DePristo	7280cdf937	Bugfixes and testdata cleanup -- Cut down the size of a few large files in public/testdata that were only used in part -- Refactor vcf Filename => shadow BCF filename to BCF2Utils. Fix bug in WalkerTest due to the way this was handled previously	2012-05-24 13:26:05 -04:00
Mark DePristo	e9c22b9aad	Final updates to integration tests for BCF2 -- Fully working version -- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf -- Moved MedianUnitTest to its proper home in Utils -- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests. From this website it's easy to see md5 diffs, etc. This is a vastly better way to manage unit and integration test output	2012-05-24 10:58:59 -04:00
Mark DePristo	ade1843818	Bugfix for not setting header in AbstractVCFCodec	2012-05-24 10:58:58 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Mark DePristo	69ee4d0454	Moved getMetaDataForField to VariantContextUtils	2012-05-24 10:57:09 -04:00
Mark DePristo	f77d2e6965	Renamed NO_HEADER to the more accurate no_cmdline_in_header -- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well	2012-05-24 10:57:08 -04:00
Mark DePristo	4bde24f020	Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header	2012-05-24 10:57:08 -04:00
Mark DePristo	4846bf5c8e	@Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf -- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically	2012-05-24 10:57:07 -04:00
Mark DePristo	bb0d87666a	Finally just deleted equals() method in GATKArgumentCollection. -- We never compare these things in the codebase anyway...	2012-05-24 10:57:07 -04:00
Mark DePristo	c8ed0bfc4c	Edge case fixes for BCF2 --handle entirely missing GT in a sample in decodeGenotypeAlleles --Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code -- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples -- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys -- Removed restriction on getPloidy requiring ploidy > 1. It's logically fine to return 0 for a no called sample -- getMaxPloidy() in VC that does what it says -- Support for padding / depadding of generic genotype fields	2012-05-24 10:57:06 -04:00
Mark DePristo	40431890be	-- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it -- BCF2 writer can now work without the contig lines being in the header -- Made GenomeLocParser a final class	2012-05-24 10:57:06 -04:00
Mark DePristo	6301572009	GenotypeLikelihood PLs are capped at Short.MAX_INT now -- UserExceptions in BCF2 now where appropriate -- Asserts for code safety -- Public -> protected encode(Object v) method is for testing only	2012-05-24 10:57:06 -04:00
Mark DePristo	d52bc31a47	Bugfix for doNotWriteGenotypes mode -- Was outputing GT ./. in sites only mode. Fixed	2012-05-24 10:57:05 -04:00
Mark DePristo	64d4238e2f	99% working version of BCF2 encoder / decoder -- fixed final bugs with PL encoding / decoding -- Ready for testing by other members of the group -- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations -- Fixed a nasty bug in the filter field -- Note that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types Read 1500 genotypes file in VCF -> VCF : 11 seconds Read 1500 genotypes file in VCF -> BCF : 9.5 seconds VariantEval 1500 genotypes file in VCF : 3 seconds VariantEval 1500 genotypes file in BCF : 3 seconds	2012-05-24 10:57:05 -04:00
Mark DePristo	b5bce8d3f9	AD should be UNBOUNDED, actually -- Pass in # alt alleles as appropriate for getCount in VCF header line	2012-05-24 10:57:05 -04:00
Mark DePristo	aaf11f00e3	Near final BCF2 implementation -- Trivial import changes in some walkers -- SelectVariants has a new hidden mode to fully decode a VCF file -- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED but A type, which is actually the right type -- GenotypeLikelihoods now implements List<Double> for convenience. The PL duality here is going to be removed in a subsequent commit -- BugFixes in BCF2Writer. Proper handling of padding. Bugfix for nFields for a field -- padAllele function in VariantContextUtils -- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.	2012-05-24 10:57:02 -04:00
Mark DePristo	dfee17a672	Generalize / unify code for handling strings -- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder. -- Unified the type conversion code in BCFWriter to simplify the mapping from VCF type => BCF type and special value recoding -- Code cleanup and renaming	2012-05-24 10:57:02 -04:00
Mark DePristo	b4a5acd6f4	Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't	2012-05-24 10:57:01 -04:00
Mark DePristo	373ae39e86	Testing of BCF codec -- Rev.d tribble -- Minor code cleanup -- BCF2 encoder / decoder use Double not Float internally everywhere -- Generalized VC testing framework	2012-05-24 10:57:01 -04:00
Mark DePristo	fb1911a1b6	-- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder -- Convenience routine for creating alleles from strings of bases -- Convenience constructor for VCFFilterHeader line whose description is the same as name -- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes. Can be reused throughout code for BCF, VCF, etc. -- Created basic BCF2WriterCodec tests that consumes VariantContextTestProvider contexts, writes them to disk with BCF2 writer, and checks that they come back equal to the original VariantContexts. Actually worked for some complex tests in the first go	2012-05-24 10:57:01 -04:00
Mark DePristo	4968dcd36a	Throw an error when genotype fields with mixed vector lengths are encountered	2012-05-24 10:57:00 -04:00
Mark DePristo	afd2f1a3f9	Individual VariantContextWriters are now package protected -- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it -- Update build.xml to build vcf.jar with updated paths and bcf2 support.	2012-05-24 10:57:00 -04:00
Mark DePristo	24864fd5b0	GATK now writes BCF output to any file with .bcf extension -- Moved VCF and BCF writers to variantcontext.writers -- Updated vcf.jar build path -- Refactored VCFWriter and other code. Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory. Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.	2012-05-24 10:57:00 -04:00
Mark DePristo	e2311294c0	Removed unused ManualSortingVCFWriter	2012-05-24 10:56:59 -04:00
Mark DePristo	93cef82637	BCF2 header encoding decoding at final spec	2012-05-24 10:56:58 -04:00
Mark DePristo	ce9e9eebb1	No dictionary in header. Now built dynamically from the header in the writer and codec -- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there	2012-05-24 10:56:58 -04:00
Mark DePristo	c3b8048e2e	Moving around classes in VCF and BCF2 -- Refactored VCF writers into vcf.writers package -- Moved BCF2Writer to bcf2.writer -- Updates to all of the walkers using VCFWriter to reflect new packages -- A large number of files had their headers cleaned up because of this as well	2012-05-24 10:56:58 -04:00
Mark DePristo	679ffdd333	Move BCF2 from private utils to public codecs	2012-05-24 10:56:56 -04:00
Mark DePristo	450f098a61	BCF2 encoder / decoder implement new site / genotype block organization -- Supports final organization of data blocks into sites data and genotypes data	2012-05-24 10:56:55 -04:00
Mark DePristo	27b51d4dea	Enable on the fly indexing of BCF2	2012-05-24 10:56:54 -04:00
Mark DePristo	81bd7646d6	Fix for MISSING floats -- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int). -- Now handles correctly MISSING qual fields	2012-05-24 10:56:53 -04:00
Mark DePristo	3afbc50511	More BCF2 improvements -- Refactored setting of contigs from VCFWriterStub to VCFUtils. Necessary for proper BCF working -- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order. -- Cleaned up VCFHeader operations -- BCF now uses the right header files correctly when encoding / decoding contigs -- Clean up unused tools -- Refactored header parsing routines to make them more accessible -- More minor header changes from Intellij	2012-05-24 10:56:52 -04:00
Mark DePristo	0799855479	Archiving GCF -- Rider update to CramByPiece.scala	2012-05-24 10:56:51 -04:00
Guillermo del Angel	43919078cd	Merged bug fix from Stable into Unstable	2012-05-23 21:21:01 -04:00
Guillermo del Angel	4bc04e2a9e	Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler.	2012-05-23 21:19:30 -04:00
Ryan Poplin	08dfd6cab6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-21 16:47:07 -04:00
Ryan Poplin	04000d920c	Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads.	2012-05-21 16:46:59 -04:00
Eric Banks	666862af19	Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs	2012-05-21 16:03:29 -04:00
Khalid Shakir	e57cd78bba	Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each. This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource. Ex: public Wrapper getNewWrapper(File path) { FileStream myStream = new FileStream(path); // This stream must be eventually closed. return new Wrapper(myStream); } public void close(Wrapper wrapper) { wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream. }	2012-05-21 15:41:56 -04:00
Eric Banks	7f5ec17d22	Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table.	2012-05-21 14:16:13 -04:00
Eric Banks	92d8aa3d4c	Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels	2012-05-21 09:38:52 -04:00
Eric Banks	3af3834d50	Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian): * For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE. Switched over to booleans. * We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to stdout.	2012-05-18 11:55:41 -04:00
Eric Banks	52c206d5db	Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports.	2012-05-18 02:32:20 -04:00
Eric Banks	03d40272c8	Removed old GATKReport code and moved the new stuff in its place.	2012-05-18 01:44:31 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Guillermo del Angel	5189b06468	New annotation for indels that describe if they're STR's and their characteristics. If an indel is a STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality.	2012-05-17 15:28:19 -04:00
Eric Banks	0f7c917e7a	Better error checking and messages for bad alleles	2012-05-17 13:36:42 -04:00
Eric Banks	d44886d9e8	Very naughty bug: VE output is not at all gatherable but no one told this to Queue. Fixed.	2012-05-15 10:29:04 -04:00
Eric Banks	819c3d0c15	Adding to the Hrun docs	2012-05-15 10:27:52 -04:00
Guillermo del Angel	5fc3adbb04	One more VariantsToTable bug fix	2012-05-14 14:10:07 -04:00
Guillermo del Angel	04d691f04a	Forgot to update MD5's due to new Exact AF model in pool caller (all changes legit, minor QUAL/QD/SB differences). Fixed bug in VariantsToTable from previous commit	2012-05-14 14:01:29 -04:00
Guillermo del Angel	ae26f0fe14	a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing	2012-05-14 10:55:35 -04:00
Ryan Poplin	c9dd0f3173	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-10 13:09:10 -04:00
Ryan Poplin	0cdadffe14	Committing the best of the frantic pre-CSHL experiments: Better algorithm for partitioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes.	2012-05-10 13:09:03 -04:00
Guillermo del Angel	27b1aa5dd3	Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's	2012-05-10 10:29:19 -04:00
Eric Banks	4f37d6d399	Fixing docs	2012-05-10 00:56:00 -04:00
Mark DePristo	c81acfc15d	Working implementation of BCF2 -- Nearly complete on spec implementation. Slow but clean -- Some refactoring of VariantContext to support common functions for BCF and VCF	2012-05-08 19:46:51 -04:00
Mark DePristo	a5193c2399	Mostly complete reference implementation of BCF2 -- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF	2012-05-08 19:46:51 -04:00
Eric Banks	473d07b0c5	fixing up docs from previous Pool Caller commit	2012-05-08 11:02:55 -04:00
Eric Banks	b4999d14c1	updating docs	2012-05-08 10:58:46 -04:00
Guillermo del Angel	33a1dd2048	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 10:42:12 -04:00
Eric Banks	5cf4fd63c2	Catch malformed base qualities and throw as a User Error	2012-05-08 09:34:57 -04:00
Guillermo del Angel	a4f4b5007b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 09:34:33 -04:00
Guillermo del Angel	605984353f	Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model.	2012-05-08 09:33:38 -04:00
Eric Banks	c40cda7e3c	Nope, loads of integration tests had to be changed.	2012-05-07 14:30:42 -04:00
Eric Banks	66838a073e	Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed...	2012-05-07 12:20:11 -04:00
David Roazen	6b769e91d8	BCF2: third checkpoint * writer mostly implemented * walkers to convert BCF2 <-> VCF * almost working for sites-only files; genotypes still need work * initial performance tests this afternoon will be on sites-only files	2012-05-04 13:00:15 -04:00
Eric Banks	f3433201b1	Merged bug fix from Stable into Unstable	2012-05-03 11:11:00 -04:00
Eric Banks	557da77a1a	Don't compute QD if there is no QUAL; added integration test for this	2012-05-03 11:02:37 -04:00
Eric Banks	1fc7b5d58b	Merged bug fix from Stable into Unstable	2012-05-03 10:37:58 -04:00
Laurent Francioli	567d01cee8	- Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:49 -04:00
Laurent Francioli	96e5a26223	PED support for Inbreeding Coefficient annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:20 -04:00
Mark DePristo	43d97c2e00	Rev Tribble to r97, adding binary feature support From tribble logs: Binary feature support in tribble -- Massive refactoring and cleanup -- Many bug fixes throughout -- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream as an argument not a String -- See ExampleBinaryCodec for an example binary codec -- AbstractAsciiFeatureCodec provides to its subclass the same String decode, readHeader functionality before. Old ASCII codecs should inherit from this base class, and will work without additional modifications -- Split AsciiLineReader into a position tracking stream (PositionalBufferedStream). The new AsciiLineReader takes as an argument a PositionalBufferedStream and provides the readLine() functionality of before. Could potentially use optimizations (it's a TODO in the code) -- The Positional interface includes some more functionality that's now necessary to support the more general decoding of binary features -- FeatureReaders now work using the general FeatureCodec interface, so they can index binary features -- Bugfixes to LinearIndexCreator off by 1 error in setting the end block position -- Deleted VariantType, since this wasn't used anywhere and it's a particularly clean way of thinking about the problem -- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package -- TabixReader requires an AsciiFeatureCodec as it's currently only implemented to handle line oriented records -- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles Ascii and binary features -- Removed unused functions here and there as encountered -- Fixed build.xml to be truly headless -- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a value and the position in the file where the header ends (not inclusive). 
TribbleReaders now skip the header if the position is set, so it's no longer necessary, if one implements the general readHeader(PositionalBufferedStream) version to see header lines in the decode functions. Necessary for binary codecs but a nice side benefit for ascii codecs as well -- Cleaned up the IndexFactory interface so there's a truly general createIndex function that takes the enumerated index type. Added a writeIndex() function that writes an index to disk. -- Vastly expanded the index unit tests and reader tests to really test linear, interval, and tabix indexed files. Updated test.bed, and created a tabix version of it as well. -- Significant BinaryFeaturesTest suite. -- Some test files have indent changes	2012-05-03 07:31:48 -04:00
Mark DePristo	58c470a6c5	Rev'ing Tribble from 53 to 94 -- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code -- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase	2012-05-03 07:31:47 -04:00
Khalid Shakir	b8b7f28aa9	Revving Picard to pick up new SamFileHeaderMerger. Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut(). In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124. - Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs. - Updated md5s to match new expectations after looking at TLEN diff engine output.	2012-05-02 16:47:28 -04:00
Mauricio Carneiro	f51a1d0d61	Better error message to the BAMScheduler In the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.	2012-05-02 16:10:00 -04:00
Mauricio Carneiro	940029fa5d	Fixing on-the-fly recalibration (caught by Ryan) low quality bases in the tails were being turned to N's in the final read.	2012-05-02 16:06:04 -04:00
Eric Banks	623b36fbc4	Add header lines for AC,AF, and AN tags	2012-05-02 15:33:34 -04:00
Guillermo del Angel	429800a192	Fix corner case rounding issue in MathUtils unit test: 10^logFactorial(4)) was 23.999999... which if cast directly yielded 23 - so, do pre-rounding to ensure correct integer result if caller will cast value.	2012-05-02 09:57:06 -04:00
Guillermo del Angel	76a95fdedf	Full implementation of multiallelic exact model for pools. Still super-linear so not useable at scale but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation which is more expensive (doesn't seem to save much time either in PoolCaller nor in UG though).	2012-05-02 09:24:28 -04:00
Joel Thibault	4d732fa586	Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb	2012-05-01 18:23:51 -04:00
Eric Banks	619a69a5f1	As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?).	2012-05-01 16:18:24 -04:00
Joel Thibault	c255dd5917	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:10:38 -04:00
Ryan Poplin	51af61b5d7	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:07:23 -04:00
Ryan Poplin	fc55dcec3c	Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now.	2012-05-01 16:02:36 -04:00
Ryan Poplin	20a0078f23	Merging active regions across shard boundaries if they are contiguous, have the same active status and don't grow too big.	2012-05-01 15:51:36 -04:00
Eric Banks	0f3af9555b	Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case.	2012-05-01 14:58:06 -04:00
Joel Thibault	aa4d41cce0	Minor cleanup before push	2012-05-01 14:16:44 -04:00
Joel Thibault	b101b9c30b	Add Mongo switch	2012-05-01 14:00:48 -04:00
Joel Thibault	1b609e9075	Move Mongo to server couchdb	2012-05-01 13:59:47 -04:00
Joel Thibault	fd57d27f45	Move MongoDB connection handling to a separate class	2012-05-01 13:59:37 -04:00
Joel Thibault	db3cd1abd5	Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes.	2012-05-01 13:57:23 -04:00
Joel Thibault	04e1be9106	Better handling of Mongo errors + exceptions	2012-05-01 13:57:23 -04:00
Joel Thibault	ca737479cf	Query for stop locations because we don't have that information in the reference	2012-05-01 13:57:23 -04:00
Joel Thibault	1cda87a4ad	Set ROD priority list to input	2012-05-01 13:57:23 -04:00
Joel Thibault	a7fe847faf	Set the priority list and don't bother combining if not needed	2012-05-01 13:57:23 -04:00
Joel Thibault	f739305f43	Combine the variants found at a location	2012-05-01 13:57:23 -04:00
Joel Thibault	020f884d5a	Use new key of source ROD plus alleles	2012-05-01 13:57:23 -04:00
Joel Thibault	221ce9c3d6	Add alleles to the primary key	2012-05-01 13:57:23 -04:00
Joel Thibault	3198ce5471	Can have multiple variants at a location	2012-05-01 13:57:22 -04:00
Joel Thibault	11ed8e61c9	Add referenceBaseForIndel to the Mongo VariantContext objects	2012-05-01 13:53:44 -04:00
Joel Thibault	7ed0ee7ed0	Skip locations with no genotypes instead of throwing a NPE	2012-05-01 13:53:44 -04:00
Joel Thibault	4bdfeacdaa	Handle multiple samples/genotypes per location TODO: sample selection	2012-05-01 13:53:43 -04:00
Joel Thibault	1f7c628796	Insert the ROD filename into MongoDB as part of the primary key	2012-05-01 13:53:43 -04:00
Joel Thibault	bb8a6e9b0a	Initial test of write and read from MongoDB	2012-05-01 13:53:43 -04:00
David Roazen	c0084c741b	Pilot BCF2 Implementation: Checkpointing the code * Not working yet, still very much a work-in-progress with lots of placeholders * Needed to check this in to enable possible collaboration, since it's going slower than anticipated and the conference deadline looms.	2012-05-01 12:23:10 -04:00
Christopher Hartl	7d029b9a28	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-30 12:16:30 -04:00
Christopher Hartl	944a7d815e	Bringing VQSRV3 up to date. Lots of new features (un-classifying the worst-performing training sites, treating the x% best/worst sites as positive/negative points, ability to pass in a monomorphic track to see ROC curves output). Minor changes to AlleleBalance: weighted average was incorrectly specified (using logscale actually biased the average towards the AB of low-quality genotypes), and breaking out AB by het, hom, and diploid to bring it in line with some (private) changes to the indel likelihood model that (correctly) computes these values for indels.	2012-04-28 11:31:03 -04:00
Ryan Poplin	54a9bc2da2	Bug fix in reverse trim alleles for the case of mixed records that become non-mixed after subsetting the alleles.	2012-04-28 09:12:26 -04:00
Ryan Poplin	e332aeaf70	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-27 16:21:21 -04:00
Ryan Poplin	2b5dd28550	Bug fix in reverse trim alleles for the case of mixed records.	2012-04-27 16:21:02 -04:00
Mauricio Carneiro	1db2d1ba82	Do not add the first and last 4 cycles to the recalibration tables.	2012-04-27 15:18:07 -04:00
Mauricio Carneiro	08dbd756f3	Quick QC walkers to look at the error profile of indels in the read	2012-04-27 15:18:07 -04:00
Guillermo del Angel	730208133b	Several fixes and improvements to Pool caller with ancillary test functions (not done yet): a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value. b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time. c) Expand unit tests and add an exhaustive test for ErrorModel class. d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10. e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp; can only specify pileups in one position with a given number of matches and mismatches to ref) but functionality will be expanded in future to cover more test cases. f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done). g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model. h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math	2012-04-27 14:41:17 -04:00
Eric Banks	0439047269	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-27 10:49:45 -04:00
Eric Banks	05b44dd017	The genotypeCounts array wasn't always being initialized before it was accessed, leading to a NPE (which got caught and thrown as a JEXL expression when used in selection). Added unit test to cover all genotype count methods.	2012-04-27 10:49:36 -04:00
Khalid Shakir	9801dd114f	Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag() Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.	2012-04-27 09:58:38 -04:00
Guillermo del Angel	972d6531b6	Corner case fix for indel GL computation: sometimes (depending on surrounding context) reads which are not informative of two candidate haplotypes end up having marginally higher likelihoods with one haplotype as opposed to another, depending on uncertainty on alignments in surrounding regions. So, a sample whose GL is -0.0001,-0.0005,-0.001 may have its genotype set to 1/1 due to this statistical noise. We already have a tolerance comparing max(gl)-min(gl) to avoid genotyping, so this tolerance is now increased from 0.001 to 0.1 (equivalent to 1 PL unit) to avoid genotyping a sample if all PLs are within this threshold. Changed 2 integration test md5s that hit this case.	2012-04-26 10:15:26 -04:00
Laurent Francioli	219b0a128b	PED support for ChromosomeCounts annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-04-25 12:50:04 -04:00
Laurent Francioli	19d5213d5a	Added function to get founders IDs in SampleDB Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-04-25 12:49:36 -04:00
Mauricio Carneiro	902277856e	fix for RBP getPileupsForSamples() do not differentiate per sample pileups from generic pileups. Do the same for both -- it's O(n) either way.	2012-04-24 17:20:30 -04:00
Mauricio Carneiro	82b4798913	CountBasesWalker -- a quick QC walker.	2012-04-24 17:20:30 -04:00
Mauricio Carneiro	e440d0ce69	BQSR triage #4 * fixed queue script plot file names * updated the ReadGroupCovariate to use the platform unit instead of sample + lane. * fixed plotting of marginalized reported qualities	2012-04-24 17:19:54 -04:00
Eric Banks	d6277b70d8	Forgot to consider the optimized case in hasAllele	2012-04-24 11:32:28 -04:00
Eric Banks	91bad244d5	Using a VCF whose ALT is the reference in GGA mode is a User Error	2012-04-24 11:08:37 -04:00
Eric Banks	74ad008163	Adding VariantContext.hasAlternateAllele functionality	2012-04-24 11:07:46 -04:00
Eric Banks	66f3315548	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-24 09:39:55 -04:00

2168 Commits (6f8e7692d4e67d5cfb2d2f9e33549ff4fbf2e5f8)