gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	63f5262e45	mergeInfoWithMaxAC is no longer hidden in CombineVariants	2012-07-08 15:44:32 -07:00
Mark DePristo	66aee613e2	Bugfix for set key in mergeInfoWithMaxAC. -- Previous version was always setting set=source of info with highest AC. Should actually have been set to the set annotation value itself.	2012-07-08 15:44:32 -07:00
Mark DePristo	91f0ed8059	Fixed nasty Rscript typo in VariantRecalibrator when compactPDF is available	2012-07-08 15:44:32 -07:00
Mark DePristo	87b090c362	Update VariantRecalibrator error message to use -resource not old -B syntax	2012-07-08 15:44:31 -07:00
Mauricio Carneiro	125e6c1a47	added BinaryTagCovariate for ancient dna analysis	2012-07-06 15:03:20 -04:00
Mauricio Carneiro	e93b025b39	Fixing unit test with the new clipping behavior for weird cigars, we no longer can assert the final number of bases in the unit test, so I'm taking this bit off the unit test.	2012-07-06 12:08:09 -04:00
Mauricio Carneiro	f603d4c48c	Fixing PairHMMIndelErrorModel boundary issue. When checking the limits of a read to clip, it wasn't considering reads that may already have been clipped before.	2012-07-06 11:48:04 -04:00
Eric Banks	dd571d9aa0	Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags.	2012-07-04 01:22:20 -04:00
Eric Banks	33306d2e20	Changing the logic of the -standard argument; the way it stands currently one can never turn off the cycle or context covariates. Now they are on by default and users must opt out of them to turn them off.	2012-07-04 00:21:21 -04:00
Eric Banks	7d30558e6f	Only 'pad' the cycle covariate for indels, not substitutions	2012-07-03 23:47:01 -04:00
Mauricio Carneiro	17efbbf8b1	Fixed ReadClipperUnitTest. The behavior of the clipping on weird cigar strings such as 1I1S1H and 9S56H has changed, and the test has to change accordingly.	2012-07-03 16:38:51 -04:00
Eric Banks	22f1afddaa	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-03 14:55:59 -04:00
Eric Banks	617eebd204	More misc cleanup	2012-07-03 14:55:37 -04:00
Eric Banks	344c3aeb1d	Cleanup from previous commit	2012-07-03 14:42:44 -04:00
Ryan Poplin	9e8e78de15	Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels.	2012-07-03 14:30:37 -04:00
Eric Banks	0b37d44b0d	Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup.	2012-07-03 13:05:11 -04:00
Eric Banks	031322ff00	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-03 00:12:59 -04:00
Eric Banks	a4670113bd	Refactored/renamed the nested integer array; cleaned up code a bit.	2012-07-03 00:12:33 -04:00
Ryan Poplin	f92139dd82	Ooops, UG VA path for rank sum tests aren't happy with empty lists. Disabling clipping rank sum test for now.	2012-07-02 21:12:42 -04:00
Ryan Poplin	7e7b4cd1b9	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-02 16:37:54 -04:00
Ryan Poplin	b807ff63ef	HaplotypeCaller now creates MNP and complex substitutions by using LD information to decide if events segregate together on haplotypes. Added unit test.	2012-07-02 16:37:39 -04:00
Mauricio Carneiro	3cea080aa8	Cache SoftStart() and SoftEnd() in the GATKSAMRecord; these are costly operations when done repeatedly on the same read.	2012-07-02 16:22:00 -04:00
Mauricio Carneiro	88a02fa2cb	Fixing bug for reads with cigars like 9S54H. When hard-clipping, predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.	2012-07-02 16:22:00 -04:00
Mark DePristo	1b0a775773	Disabling bcf2 reading from samtools because it's 1-based; updating select variants integrationtest	2012-07-02 15:55:42 -04:00
Eric Banks	cac72bce91	Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit.	2012-07-02 14:33:33 -04:00
Mark DePristo	602729c09d	Moved parallel tests from SelectVariants to separate SelectVariantsParallelIntegrationTest -- Enabled previous tests -- all now working -- Added modern test against new VCF as well	2012-07-02 11:40:28 -04:00
Mark DePristo	bcd2e13d8b	Adding duplicate header line keys is a logger.debug not logger.warn message now	2012-07-02 11:39:34 -04:00
Mark DePristo	01e04992f8	Fixed incompatibilities in AbstractVCFCodec that resulted in key=; being parsed and written as key; in VCF output	2012-07-02 11:38:59 -04:00
Eric Banks	c94c8a9c09	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-02 08:53:01 -04:00
Mark DePristo	7aff4446d4	Added unit tests for header repairing capabilities in the GATK engine	2012-07-01 15:38:10 -04:00
Mark DePristo	480b32e759	BCF2 is now officially zero-based open-interval, and that's how the GATK does it now	2012-07-01 14:59:27 -04:00
Ryan Poplin	b6093ff02c	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-01 10:32:37 -04:00
Mark DePristo	9b87dcda4f	Fixing remaining integration test errors. Adding missing ex2.bcf	2012-06-30 16:23:11 -04:00
Mark DePristo	5ad9a98a15	Minor bugfixes / consistency fixes to filter strings of Genotypes and AC/AF annotations -- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order -- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values	2012-06-30 11:22:49 -04:00
Mark DePristo	385a3c630f	Added check in VariantContext.validate to ensure that getEnd() == END value when present -- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward -- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF	2012-06-30 11:22:48 -04:00
Mark DePristo	893630af53	Enabling symbolic alleles in BCF2 -- Bugfix for VCFDiffableReader: don't add null filters to object -- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles -- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication. Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where there's a biallelic symbolic allele -- Brand new home for allele clipping / padding routines in VCFAlleleClipper. Actually documented this code, which results in lots of **** negative comments on the code quality. Eric has promised that he and Ami are going to rethink this code from scratch. Fixed many nasty bugs in here, cleaning up unnecessary branches, etc. Added UnitTests in VCFAlleleClipper that actually test the code fully. In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how these tests need to be fixed. Even introduced some minor optimizations -- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason. Fixed. -- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles --	2012-06-30 11:22:48 -04:00
Mark DePristo	16276f81a1	BCF2 with support symbolic alleles -- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys -- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest	2012-06-30 11:22:48 -04:00
Mark DePristo	86feea917e	Updating MD5s to reflect new FT fixed count of 1 not UNBOUNDED	2012-06-30 11:22:47 -04:00
Mark DePristo	6bea28ae6f	Genotype filters are now just Strings, not Set<String>	2012-06-30 11:22:47 -04:00
Guillermo del Angel	f631be8d80	UnifiedGenotyperEngine.calculateGenotypes() is not only used in UG but in other walkers - vc attributes shouldn't be inherited by default or it may cause undefined behaviour in those walkers, so only inherit attributes from input vc in case of UG calling this function	2012-06-29 23:51:52 -04:00
Guillermo del Angel	65037b87da	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-29 11:08:44 -04:00
Guillermo del Angel	5a9a37ba01	Pool caller improvements: a) Log ref sample depth at every called site (will add more ref-related annotations later), b) Make -glm POOLBOTH work in case we want to genotype snp's and indels together, c) indel bug fix (pool and non-pool): prevent a bad GenomeLoc from being formed if we're running GGA and incoming alleles are larger than ref window size (typically 400 bp)	2012-06-29 11:08:16 -04:00
Eric Banks	96ea334bf2	Disable caching in BQSR for now since it significantly slows down computation; will look into this in a bit.	2012-06-28 15:27:44 -04:00
Ryan Poplin	05791ebf80	Adding the Clipping rank sum test: If alternate-supporting reads have more hard clipping than reference-supporting reads this is evidence for error.	2012-06-28 13:22:56 -04:00
Ryan Poplin	d12ec92a55	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-28 12:57:59 -04:00
Ryan Poplin	5bb0693888	Bug fix for HC GGA mode. Shouldn't try to add an indel into the haplotype if that haplotype already contains the event of interest. Misc minor assembly param changes. Turning off capping of base qualities by base indel qualities until we can evaluate that change.	2012-06-28 12:57:51 -04:00
Khalid Shakir	1ce0b9d519	Throwing UnknownTribbleType exception instead of CommandLineException when an unknown tribble type is specified.	2012-06-28 11:28:04 -04:00
Mark DePristo	734bb5366b	Special case the situation where we have ploidy == 0 (no GT values) to implicitly assume we have diploid samples -- numLikelihoods no longer allows even ploidy == 0 in requires -- VCFCompoundHeaderLine handles the case where ploidy == 0 => implicit ploidy == 2	2012-06-28 10:06:07 -04:00
Mark DePristo	064cc56335	Update integration tests to reflect new FT header line standard and new DiagnoseTargets field names	2012-06-28 10:06:06 -04:00
Mark DePristo	64d7e93209	Massive bugfixes -- Previous version was reading the size of the encoded genotypes vector for each genotype. This only worked because I never wrote out genotype field values with > 15 elements. Mauricio's killer DiagnoseTargets VCF uncovered the bug. Unfortunately since symbolic allele clipping is still busted those tests are still disabled -- GenotypeContext getMaxPloidy was returning -1 in the case where there are no genotypes, but the answer should be 0.	2012-06-28 10:06:06 -04:00
Mark DePristo	7144154f53	VCFWriter and BCFWriter no longer allow missing samples in the VC compared to their header -- They now throw an error, as it's really unsafe to write out ./. as a special case in the VCFWriter as occurred previously. -- Added convenience method in VariantContextUtils.addMissingSamples(vc, allSamples) that returns a complete VC where samples are given ./. Genotype objects -- This allows us to properly pass tests of creating / writing / reading VCFs and BCFs, which previously differed because the VC from the VCF would actually be different from its original VC -- Updated UG, UGEngine, GenotypeAndValidateWalker, CombineVariants, and VariantsToVCF to manage the master list of samples they are writing out and addMissingSamples via the VCU function	2012-06-28 10:06:06 -04:00
Mark DePristo	4811a00891	GENOTYPE_FILTER_KEY is now a VCFStandardHeaderLine	2012-06-28 10:06:05 -04:00
Mark DePristo	93426a44b1	Fixes for DiagnoseTargets to be VCF/BCF2 spec compliant -- Don't use DP for average interval depth but rather AVG_INTERVAL_DP, which is a float now, not an int -- Don't add PASS filter value to genotypes, as this is actually considered failing filters in the GATK. Genotype filters should be empty for PASSing sites	2012-06-28 10:06:05 -04:00
Eric Banks	dc7636b923	Refactor the ContextCovariate to significantly reduce runtime	2012-06-28 02:29:35 -04:00
Eric Banks	1fafd9f6c8	NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation.	2012-06-27 16:55:49 -04:00
Khalid Shakir	746a5e95f3	Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding. Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals. IntervalUtils return unmodifiable lists so that utilities don't mutate the collections. Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus. Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values. JobRunInfo handles the null done times when jobs crash with strange errors.	2012-06-27 01:15:22 -04:00
Mark DePristo	016b25be87	Update annoying md5s in unit tests, also failing because of header fixing	2012-06-26 17:32:42 -04:00
Mark DePristo	cd32b6ae54	CombineVariantsUnitTest was failing because the header repair was fixing the problem it wanted to detect	2012-06-26 17:32:42 -04:00
Mark DePristo	1f45551a15	Bugfixes to G count types in VCF header -- Previously VCF header lines of count type G assumed that the sample would be diploid. -- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class -- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class -- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations -- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples -- Added extensive unit tests that compare A and G type values in genotypes	2012-06-26 15:28:34 -04:00
Mark DePristo	7ef5ce28cc	VariantRecalibrator test currently doesn't work with shadowBCF	2012-06-26 15:28:34 -04:00
Mark DePristo	5f5885ec78	Updating many MD5s to reflect correct fixed headers -- Previous bugfix ensures that header fixing is always on in the GATK by default, even after integration tests that failed and went through the VCFDiffableReader. Updating md5s to reflect this.	2012-06-26 15:28:34 -04:00
Mark DePristo	39c849aced	Bugfix to ensure the DB=1 old files decode properly	2012-06-26 15:28:33 -04:00
Mark DePristo	c1ac0e2760	BCF2 cleanup -- allowMissingVCFHeaders is now part of -U argument. If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING. Updated lots of files to use this -- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup. This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.	2012-06-26 15:28:33 -04:00
Mark DePristo	0b5980d7b3	Added Heng's nasty ex2.vcf to standard tests	2012-06-26 15:28:33 -04:00
Mark DePristo	11dbfc92a7	Horrible bugfix to decodeLoc() in BCF2Codec -- Just completely wrong. -- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem -- Added samtools ex2.bcf file for decoding to our integrationtests	2012-06-26 15:28:32 -04:00
Mark DePristo	fb26c0f054	Update integration tests to reflect header changes	2012-06-26 15:28:32 -04:00
Mark DePristo	7b96263f8b	Disable shadowBCF for VariantRecalibrationWalkers tests because it cannot handle symbolic alleles yet	2012-06-26 15:28:32 -04:00
Mark DePristo	7dbba465ee	Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf	2012-06-26 15:28:32 -04:00
Mark DePristo	6e9a81aabe	Minor bugfix -- now that the testfile is in our testdata regenerate the idx file as needed to pass tests	2012-06-26 15:28:32 -04:00
Roger Zurawicki	7eb3e4da41	Added integration Tests for DiagnoseTargets Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-25 17:02:46 -04:00
Joel Thibault	f0c54d99ed	Account for a null attributes object * field attributesCanBeModified - a null attributes object can't be modified in its current state * method makeAttributesModifiable() - initialize a null attributes object to empty	2012-06-25 12:07:36 -04:00
Joel Thibault	d0cf8bcc80	Add unit tests for VariantContextBuilder.rmAttribute() and .attribute() * These generated NPEs when the attribute object is null	2012-06-25 12:05:04 -04:00
Joel Thibault	fd9effbfe2	Fix Exception typo	2012-06-25 12:05:04 -04:00
Ryan Poplin	429ad44421	Bug fix for read pos rank sum test annotation. Shouldn't be using the un-hardclipped start as the alignment start.	2012-06-22 14:53:29 -04:00
Ryan Poplin	735b59d942	Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero.	2012-06-22 12:38:48 -04:00
Ryan Poplin	0650b349d7	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-22 10:42:49 -04:00
Guillermo del Angel	eed32df30d	a) Sanity check in PoolCaller: if user didn't specify correct -glm or -pnrm models then error out with useful message, b) Have VariantsToTable deal with case where sample names have spaces: technically they're allowed (or at least not explicitly forbidden) but they'll produce R-incompatible tables. TBD which other tools have issues, or whether there's a generic fix for this	2012-06-21 21:19:55 -04:00
Mark DePristo	d17369e0ac	A few misc. residual errors in last commit	2012-06-21 16:04:25 -04:00
Mark DePristo	734756d6b2	Final fixes before BCF2 mark III push -- Added MLEAC and MLEAF format lines to PoolCallerWalker -- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation -- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing	2012-06-21 15:17:22 -04:00
Mark DePristo	31ee8aa01a	JEXL update -- Update to 2.1.1 from 2.0 -- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching. So "AF < 0.5" works even in the presence of multi-allelics now. --	2012-06-21 15:17:21 -04:00
Mark DePristo	549293b6f7	Bugfixes towards final BCF2 implementation -- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF -- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing -- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one -- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing -- Keys whose value is null are put into the VariantContext info attributes now	2012-06-21 15:17:21 -04:00
Mark DePristo	567dba0f76	Cleanup of VCF header lines and constants, BCF2 bugfixes -- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller -- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers -- VCF parsers now automatically repair standard VCF header lines when reading the header -- Updating integration tests to reflect header changes -- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test -- SelectHeaders now always updates the header to include the contig lines -- SelectVariants add UG header lines when in regenotype mode -- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY -- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs) -- Throw error when VCF has unbounded non-flag values that don't have = value bindings -- By default we no longer allow writing of BCF2 files without contig lines in the header	2012-06-21 15:16:31 -04:00
Mark DePristo	fba7dafa0e	Finalizing BCF2 mark III commit -- Moved GENOTYPE_KEY vcf header line to VCFConstants. This general migration and cleanup is on Eric's plate now -- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header. Still doesn't work... -- Updating integration test files. Moved many more files into public/testdata. Updated their headers to all work correctly with new strict VCF header checking. -- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt -- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine. DB = 0 is never seen in the output VCFs now -- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status -- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be -- VariantsToVCF now properly writes out the GT FORMAT field -- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code. Eric said he and Ami will clean up this whole piece of infrastructure -- Fixed bug in BCF2Codec that wasn't setting the phase field correctly. UnitTested now -- PASS string now added at the end of the BCF2 dictionary after discussion with Heng -- Fixed bug where I was writing out all field values as BigEndian. Now everything is LittleEndian. -- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException -- Cleaned up unused code -- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding -- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples -- We always write the number of genotype samples into the BCF2 nSamples header. How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names... -- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally -- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele -- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values. Genotype objects are all PASS implicitly, or explicitly filtered. We only write out the FT values if at least one sample is filtered. Removed interface functions and cleaned up code -- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works. In general, ** NEVER COPY CODE ** if you need to share functionality make a function, that's why they were invented! -- Increased the default number of records to read for DiffObjects to 1M	2012-06-21 15:16:27 -04:00
Mark DePristo	0c8b830db7	Updating MD5s for inclusion of RPA field header	2012-06-21 15:16:26 -04:00
Mark DePristo	d015a5738d	Bugfixes for VCFWriterUnitTest and TestProvider to deal with stricter VCFWriter behavior	2012-06-21 15:16:26 -04:00
Mark DePristo	9c81f45c9f	Phase I commit to get shadowBCFs passing tests -- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header. This helps avoid some of the low-level errors I saw in SelectVariants. This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument -- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA -- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample -- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original -- Performance optimizations in BCF2 codec and writer -- using arrays not lists for intermediate data structures -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over -- VCFHeader (which needs a complete rewrite, FYI Eric) -- Warn and fix on the way flag values with counts > 0 -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow. Duplicates are detected once at header creation. -- Explicitly track FILTER fields for efficient lookup in their own hashmap -- Automatically add PL field when we see a GL field and no PL field -- Added get and has methods for INFO, FILTER, and FORMAT fields -- No longer add AC and AF values to the INFO field when there's no ALT allele -- Memory efficient comparison of VCF and BCF files for shadow BCF testing. Now there's no (memory) constraint on the size of the files we can compare -- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF	2012-06-21 15:16:26 -04:00
Mauricio Carneiro	ab53220635	Refactor on how RR treats soft clips * Sites with more soft clipped bases than regular will force-trigger a variant region * No more unclipping/reclipping, RR machinery now handles soft clips natively. * implemented support for base insertion and base deletion quality scores in synthetic and regular reads. * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present. note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!	2012-06-21 14:02:03 -04:00
Ryan Poplin	769e190202	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-20 09:59:55 -04:00
Christopher Hartl	fe1d6e3953	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-19 08:02:00 -04:00
Christopher Hartl	79ef3325bd	Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations.	2012-06-19 07:50:13 -04:00
Eric Banks	15ae906f32	Once I was playing with integration tests it was simple to fix the ones I left broken from earlier today.	2012-06-18 21:54:58 -04:00
Eric Banks	62cee2fb5b	Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature.	2012-06-18 21:36:27 -04:00
Eric Banks	4393adf9e7	If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it.	2012-06-18 13:36:14 -04:00
Ryan Poplin	707151f0a4	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 12:55:58 -04:00
Eric Banks	82a2c40338	Emit the MLE AC and AF in the INFO field of the UG output	2012-06-18 12:19:36 -04:00
Ryan Poplin	5ec737f008	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 08:51:48 -04:00
Ryan Poplin	e3147969d9	Smith Waterman parameters have somehow gotten too diverged from what is used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled.	2012-06-18 08:51:41 -04:00
Eric Banks	677babf546	Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort.	2012-06-15 15:55:03 -04:00
Eric Banks	783b7f6899	Misc cleanup	2012-06-15 10:39:19 -04:00
Eric Banks	0c218e4822	Refactoring mostly for readability (and small performance improvement)	2012-06-15 10:36:41 -04:00
Eric Banks	c54e84e739	Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations.	2012-06-15 09:28:56 -04:00
Eric Banks	61fcbcb190	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-15 02:45:57 -04:00
Eric Banks	4895fe2289	No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java)	2012-06-15 02:32:00 -04:00
Mark DePristo	5c23ab0817	Final cleanup of VCFWriterUnitTest	2012-06-14 16:42:39 -04:00
Mark DePristo	0384ce5d34	Simple optimizations for BCF2Encoder -- Inline encodeString that doesn't go via List<Byte> intermediate -- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2 -- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation -- Final UG integration test update	2012-06-14 16:42:39 -04:00
Mark DePristo	68eed7b313	Optimizations for VCF and BCF2 -- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking. encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version. Lots of contracts. User code updated to use specialized versions where possible -- Misc code refactoring -- Updated VCF float formatting to always include 3 sig digits for values < 1, and 2 for > 1. Updating MD5s accordingly -- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations	2012-06-14 16:42:39 -04:00
Mark DePristo	09df584788	Fixed nasty bug where we weren't closing the underlying PositionalOutputStream in IndexingVariantContextWriter	2012-06-14 16:42:39 -04:00
Mark DePristo	fbc45e14d3	Cleanup formatting of VCF floats -- Final integrationtest update before commit (and fixing new formatting changes)	2012-06-14 16:42:38 -04:00
Mark DePristo	8b01969762	More code cleanup and optimizations to BCF2 writer -- Cleanup a few contracts -- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields -- A few simple optimizations -- VCF header samples stored in String[] in the writer for fast access -- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset -- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files -- A few editorial comments in VCFHeader	2012-06-14 16:42:38 -04:00
Mark DePristo	e34ca0acb1	Passing all unittests -- Final merge conflicts resolved -- BCF2Writer now supports case where a sample is present in the header but the sample isn't in the VC, in which case we create an empty sample and encode that	2012-06-14 16:42:38 -04:00
Mark DePristo	71da76039e	Final support for variable length lists of strings in BCF2 -- Updating many MD5s as well.	2012-06-14 16:42:38 -04:00
Mark DePristo	bd9d40fb84	Code cleanup and more documentation for BCFFieldWriters -- Update integration tests where appropriate	2012-06-14 16:42:37 -04:00
Mark DePristo	dc07067265	Fix bug in incorrectly reporting relative paths in log	2012-06-14 16:42:37 -04:00
Mark DePristo	856905ee5b	Cleanup Genotypes -- Renamed getAttribute to getExtendedAttribute, as this is really what this function does -- Added a few more genotype tests	2012-06-14 16:42:36 -04:00
Mark DePristo	aa2178cc68	Updating MD5s to latest version to reflect inclusion of contigs in headers	2012-06-14 16:42:36 -04:00
Mark DePristo	31997f8092	Bugfixes on the way to passing integration tests -- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate -- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it. Very expensive but convenient. -- Fixed nasty subsetting bug in SelectVariants with excluding samples -- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT -- Bugfix for dropping old style GL field values -- Added test to VCFWriter to ensure that we have the same number of samples in the VC as in the header -- Bugfix for Allele.getBaseString to properly show NO_CALL alleles -- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes	2012-06-14 16:42:33 -04:00
Mark DePristo	ea1b699778	Cleanup the interface for BCF2FieldEncoder -- Now uses a much clearer approach. Update all user classes to new interface	2012-06-14 16:42:33 -04:00
Mark DePristo	dd6aee347a	Genotype encoding uses the BCF2FieldEncoder system	2012-06-14 16:42:33 -04:00
Mark DePristo	9ac4203254	GenotypeAnnotations now accept a GenotypeBuilder and directly update the builder with their value -- Cleans up interface and avoids significant amounts of gross typing code	2012-06-14 16:42:32 -04:00
Mark DePristo	7506994d09	Nearing final BCF commit -- Cleanup some (but not all) VCF3 files. Turns out there are lots so... -- Refactored genotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec. Now VCF3 properly handles the new GenotypeBuilder interface -- Misc. bugfixes in GenotypeBuilder	2012-06-14 16:42:32 -04:00
Mark DePristo	6272612808	Testing utility to perform diffs N times	2012-06-14 16:42:32 -04:00
Mark DePristo	8014178f2f	Algorithmically faster version of DiffEngine -- Now only includes leaf nodes in the summary, i.e., summaries of the form ".....*.X", which are really the most valuable to see. This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm -- Now computes the max number of elements to read correctly. Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size. -- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well. -- Added integration test for new leaf and old pairwise calculations -- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields	2012-06-14 16:42:30 -04:00
Mark DePristo	2a86b81a3f	Initial version of clean, fast formatting routines built dynamically from a VCF header -- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level. -- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values -- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder -- Allowed us to naturally support encoding of lists of strings -- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion -- Fixes for integration test failures -- Enabling contig updates -- WalkerTest now prints out relative paths where possible to make cut/paste/run easier	2012-06-14 16:42:30 -04:00
Mark DePristo	51a3b6e25e	No more makePrecisionFormatStringFromDenominatorValue -- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formatting. -- Created a smart float formatter in VCFWriter, with unit tests -- Removed makePrecisionFormatStringFromDenominatorValue and its uses -- Fix broken contracted -- Refactored some code from the encoder to utils in BCF2 -- HaplotypeCaller's GenotypingEngine was using old version of subset to context. Replaced with a faster call that I think is correct. Ryan, please confirm.	2012-06-14 16:42:30 -04:00
Mark DePristo	43ad890fcc	Finalizing BCF2 v2 -- FastGenotypes are the default in the engine. Use --useSlowGenotypes engine argument to return to old representation -- Cleanup of BCF2Codec. Good error handling. Added contracts and docs. -- Added a few more contracts and docs to BCF2Decoder -- Optimized encodePrimitive in BCF2Encoder -- Removed genotype filter field exceptions -- Docs and cleanup of BCF2GenotypeFieldDecoders -- Deleted unused BCF2TestWalker -- Docs and cleanup of BCF2Types -- Faster version of decodeInts in VCFCodec -- BCF2Writer -- Support for writing a sites only file -- Lots of TODOs for future optimizations -- Removed lack of filter field support -- No longer uses the alleleMap from VCFWriter, which was a Allele -> String, now uses Allele -> Integer which is faster and more natural -- Lots of docs and contracts -- Docs for GenotypeBuilder. More filter creation routines (unfiltered, for example) -- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters. Better genotype comparisons	2012-06-14 16:42:29 -04:00
Mark DePristo	37e5d32019	Remove logger.info statement	2012-06-14 16:42:29 -04:00
Mark DePristo	6cfb2d1393	Restoring SelectVariantsIntegrationTest	2012-06-14 16:42:28 -04:00
Mark DePristo	01ddf9555a	Performance optimizations for Genotype field decoding for GT field -- Fast path decoder for biallelic diploid GT fields that avoids allocating the same genotypes over and over -- Contracts -- final classes	2012-06-14 16:42:28 -04:00
Mark DePristo	7fbca7013e	Don't add missing value binding from field to Genotype object in VCF3Codec	2012-06-14 16:42:28 -04:00
Mark DePristo	cfd1e50068	Minor updates to test code	2012-06-14 16:42:28 -04:00
Mark DePristo	54817f8d16	VCFHeaderUnitTest needed to be updated to reflect that we are doing VCF4.1 not VCF4.0	2012-06-14 16:42:28 -04:00
Mark DePristo	982192e2e4	MD5DB for integrationtest management now writes out a md5mismatches files for clean analysis -- This file is in integrationtests/md5mismatches.txt, and looks like: expected observed test 7fd0d0c2d1af3b16378339c181e40611 2339d841d3c3c7233ebba9a6ace895fd test BeagleOutputToVCF 43865f3f0d975ee2c5912b31393842f8 1b9c4734274edd3142a05033e520beac testBeagleChangesSitesToRef daead9bfab1a5df72c5e3a239366118e 27be14f9fc951c4e714b4540b045c2df testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e -- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods	2012-06-14 16:42:27 -04:00
Mark DePristo	249d5e5533	Better tests for Genotype parsing	2012-06-14 16:42:27 -04:00
Mark DePristo	4a4d3cde3d	UnitTests for decodeIntArray method	2012-06-14 16:42:27 -04:00
Mark DePristo	5b8bd81991	An option to not actually write out the results of select variants -- Useful for performance testing of the SV operations themselves.	2012-06-14 16:42:26 -04:00
Mark DePristo	6f7a01e00d	Bugfix for BCF2 reader / writer for > 0x0FFF samples :-) -- Should be 0x00FFFFFF in the mask	2012-06-14 16:42:26 -04:00
Mark DePristo	1d4eb46606	Efficient reading of genotype fields v1 -- decodeIntArray in BCF2 decoder allows us to more efficiently read ints and int[] from stream directly into Genotype object -- Code cleanup / contracts added where appropriate -- V2 will have a yet more optimized path...	2012-06-14 16:42:26 -04:00
Mark DePristo	37b8d70321	Hidden option to SelectVariants to force the genotypes information to be decoded by computing AC	2012-06-14 16:42:25 -04:00
Mark DePristo	17fbd103d0	Smarter infrastructure to decode genotypes in BCF -- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder. The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2 -- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder. In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form -- Reduced the amount of data further to be computed in the DiffEngine. The DiffEngine algorithm needs to be rethought to be efficient...	2012-06-14 16:42:25 -04:00
Mark DePristo	889e3c4583	Code cleanup before major refactor	2012-06-14 16:42:25 -04:00
Mark DePristo	cebd37609c	Finalizing new Genotype object and associated routines -- Builder now provides a deprecated log10pError function to make a new GQ value -- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions -- Lots of contracts -- Bugfixes throughout	2012-06-14 16:42:25 -04:00
Mark DePristo	8b0a629a31	Terrible bugfix -- The way I was handling the contig offset ordering wasn't correct. Now the contigs are always indexed in the order in which their corresponding populate() functions are called, so that the order of the contigs is given by the order in which they are in the file, or in our refDict. It has nothing to do with the contig index itself. -- SelectVariants no longer prints all samples to the screen if you aren't selecting any explicitly	2012-06-14 16:42:24 -04:00
Mark DePristo	d37a8a0bc8	Efficient Genotype object Intermediate commit -- Created a new Genotype interface with a more limited set of operations -- Old genotype object is now SlowGenotype. New genotype object is FastGenotype. They can be used interchangeably -- There's no way to create Genotypes directly any longer. You have to use GenotypeBuilder just like VariantContextBuilder -- Modified lots and lots of code to use GenotypeBuilder -- Added a temporary hidden argument to engine to use FastGenotype by default. Current default is SlowGenotype -- Lots of bug fixes to BCF2 codec and encoder. -- Feature additions -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes -- Cleaned up semantics of subContextFromSamples. There's one function that either rederives or not the alleles from the subsetted genotypes -- MASSIVE BUGFIX in SelectVariants. The code has been decoding genotypes always, even if you were not subsetting down samples. Fixed!	2012-06-14 16:42:24 -04:00
Mark DePristo	a648b5e65e	First step towards an efficient Genotype object -- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness. Tested utility of this approach by rewriting -- and then commenting out -- a path in BCF2Codec that could use this new code. Much cleaner interface now, but not yet hooked up to anything -- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class -- Code cleanup. Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS. Uses in code replaced with constant	2012-06-14 16:42:23 -04:00
Mark DePristo	ff9ac4b5f8	BCF2 genotype decoding is now lazy -- Refactored BCF2Codec into a LazyGenotypesDecoder object that provides on-demand genotype decoding of BCF2 data blocks a la VCFCodec. -- VCFHeader has getters for sampleNamesInOrder and sampleNameToOffset instead of protected variables directly accessed by vcfcodec	2012-06-14 16:42:23 -04:00
Mark DePristo	9eb83a0771	Enable adding contigs to VariantContextWriters on output	2012-06-14 16:42:23 -04:00
Mark DePristo	8fc1a26ac7	Fixed comparison of VCFHeader as the set.equals() isn't working as expected	2012-06-14 16:42:22 -04:00
Mark DePristo	b0ea14ef0f	VCFHeader getMetaData returns 4.1 version not 4.0	2012-06-14 16:42:22 -04:00
Mark DePristo	5fda16bea9	Enable shadow BCF2	2012-06-14 16:42:22 -04:00
Mauricio Carneiro	7d12429917	First step towards indel qualities in RR Let the BI's and BD's pass through the reduce reads machinery	2012-06-14 15:37:39 -04:00
Mauricio Carneiro	e68038c5d8	Refactor post-processing downsampling using David's generic downsampler interface	2012-06-14 15:37:32 -04:00
Eric Banks	0398ae9695	I hate these disabled unit tests, #2	2012-06-14 15:19:27 -04:00
Eric Banks	676a57de7b	I hate these disabled unit tests	2012-06-14 14:03:58 -04:00
Eric Banks	de5508fcea	Bug fixes for cycle and context covariates	2012-06-14 13:01:14 -04:00
Eric Banks	5c3c6cbc40	Long -> long conversions in BQSR	2012-06-14 09:07:02 -04:00
Eric Banks	29a74908bb	The next round of BQSR optimizations: no more Long[] array creation	2012-06-14 00:05:42 -04:00
Guillermo del Angel	cd2074b1dc	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 20:59:30 -04:00
Guillermo del Angel	92669a0468	Second intermediate commit for indel pool caller - now works (more or less) in reference sample-free mode. Still needs a lot of cleanups/add more tests and not done w/refactoring quite yet	2012-06-13 20:59:17 -04:00
David Roazen	0550b27799	Make downsampler classes themselves generic (instead of just the Downsampler interface) This is in response to a request from Mauricio to make it easier to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords) without having to do any cumbersome typecasting. Sadly, Java language limitations make this sort of solution the best choice. Thanks to Khalid for his feedback on this issue. Also: -added a unit test to verify GATKSAMRecord support with no typecasting required -added some unit tests for the FractionalDownsampler that Mauricio will/might be using -moved classes from private to public to better sync up with my local development branch for engine integration	2012-06-13 16:43:39 -04:00
Guillermo del Angel	67c0569f9c	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 11:50:00 -04:00
Eric Banks	81993b08e2	Don't put null entries into the key array	2012-06-13 11:43:44 -04:00
Roger Zurawicki	bdf5945dcc	Fixed bugs in DiagnoseTargets DT would not report bad mates! that has been fixed Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:26 -04:00
Roger Zurawicki	538cdf9210	Created the FindCoveredIntervals Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class Minor tweaks FindCoveredIntervals supports Gathering FindCoveredIntervals outputs an interval list instead of GATKReport Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:25 -04:00
Guillermo del Angel	aee66ab157	Big UG refactoring and intermediate commit to support indels in pool caller (not done yet). Lots of code pulled out of long spaghetti-like functions and modularized to be easily shareable. Add functionality in ErrorModel to count indel matches/mismatches (but left part disabled as not to change integration tests in this commit), add computation of pool genotype likelihoods for indels (not fully working yet in more realistic cases, only working in artificial nice pools). Lots of TBDs still but existing UG and pool SNP functionality should be intact	2012-06-13 11:14:44 -04:00
Eric Banks	bb77aa88c3	Drat, forgot the unit tests again	2012-06-12 19:00:47 -04:00
Eric Banks	37f56ce8fd	A couple of minor updates to BQSR	2012-06-12 16:12:13 -04:00
Eric Banks	277493dd83	Yet more instances of Lists changed over to native arrays	2012-06-12 15:56:09 -04:00
Eric Banks	613badc835	Very minor optimizations for the context covariate	2012-06-12 15:47:32 -04:00
Eric Banks	0f79adb2aa	Changing more Java Lists to native arrays in BQSR for performance optimization.	2012-06-12 15:41:01 -04:00
Eric Banks	1da3e43679	Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come.	2012-06-12 13:32:56 -04:00
Eric Banks	a96c5da884	Oops, forgot to push the unit tests	2012-06-12 11:38:30 -04:00
Eric Banks	fec0bd5e11	Fixing UG argument docs	2012-06-12 09:46:16 -04:00
Eric Banks	a4defdfb29	Adding a GT header line to SomaticIndelDetector output	2012-06-12 09:39:17 -04:00
Eric Banks	891ce51908	Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements.	2012-06-12 09:19:36 -04:00
Eric Banks	ff5749599d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 15:46:17 -04:00
Eric Banks	fea625632f	Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one	2012-06-11 15:45:58 -04:00
Ryan Poplin	e4d371dc80	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 10:38:50 -04:00
Ryan Poplin	683d4b508e	Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller.	2012-06-11 10:38:35 -04:00
Mauricio Carneiro	4aad7e23ef	New ReduceReads v2 with unclipped variant regions and soft-clipped bases * Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it. * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads * Updated all integration tests	2012-06-08 14:58:31 -04:00
Eric Banks	afa9b2718a	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-08 13:54:48 -04:00
Eric Banks	92280b4068	BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime.	2012-06-08 13:54:37 -04:00
Eric Banks	898a0e6161	Minor optimizations	2012-06-08 12:07:58 -04:00
Ryan Poplin	0a37e19998	Bug fix in VQSR so that the VCF index will be created for the recalFile.	2012-06-08 11:51:28 -04:00
Eric Banks	d463ab2cbf	BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible.	2012-06-08 10:42:42 -04:00
Eric Banks	2bd48a7351	Bad comments made it into the previous commit	2012-06-07 23:12:56 -04:00
Eric Banks	31c3a6be48	BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR.	2012-06-07 20:04:10 -04:00
Eric Banks	0fb9179f76	BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array	2012-06-07 19:41:03 -04:00
Ryan Poplin	d449f169d3	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-07 10:56:55 -04:00
Ryan Poplin	0b4281fdd0	misc minor update to HC debug output for when there are a lot of samples	2012-06-07 10:56:41 -04:00
Eric Banks	bad50a1b05	Fix docs	2012-06-06 22:45:38 -04:00
Eric Banks	b093ba9dcc	Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records)	2012-06-06 15:17:30 -04:00
Eric Banks	54f682a99c	Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX.	2012-06-06 11:44:37 -04:00
Eric Banks	dd46d843fb	IR should skip Ion reads just like it does with 454 reads; Tim has confirmed that official platform name for Ion.	2012-06-06 11:04:55 -04:00
Guillermo del Angel	2cbd6e5f90	Merged bug fix from Stable into Unstable	2012-06-05 15:58:23 -04:00
Guillermo del Angel	ce4dc2128d	Adding minor clarification to -mbq argument documentation	2012-06-05 15:17:56 -04:00
Eric Banks	e02ec8c8b6	Don't update the record ID unless we are actually going to emit the record	2012-06-04 14:58:50 -04:00
Eric Banks	8405156ae1	Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities.	2012-06-04 14:28:32 -04:00
Ryan Poplin	f11e7ebc3a	Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA.	2012-06-04 12:49:36 -04:00
Ryan Poplin	320956ee4b	Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage.	2012-06-04 10:55:36 -04:00
Guillermo del Angel	7a54baf08c	Merged bug fix from Stable into Unstable	2012-06-03 08:42:08 -04:00


2480 Commits (6f8e7692d4e67d5cfb2d2f9e33549ff4fbf2e5f8)