gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	9eb83a0771	Enable adding contigs to VariantContextWriters on output	2012-06-14 16:42:23 -04:00
Mark DePristo	8fc1a26ac7	Fixed comparison of VCFHeader as the set.equals() isn't working as expected	2012-06-14 16:42:22 -04:00
Mark DePristo	b0ea14ef0f	VCFHeader getMetaData returns 4.1 version not 4.0	2012-06-14 16:42:22 -04:00
Mark DePristo	5fda16bea9	Enable shadow BCF2	2012-06-14 16:42:22 -04:00
Mauricio Carneiro	7d12429917	First step towards indel qualities in RR Let the BI's and BD's pass through the reduce reads machinery	2012-06-14 15:37:39 -04:00
Mauricio Carneiro	e68038c5d8	Refactor post-processing downsampling using David's generic downsampler interface	2012-06-14 15:37:32 -04:00
Eric Banks	0398ae9695	I hate these disabled unit tests, #2	2012-06-14 15:19:27 -04:00
Eric Banks	676a57de7b	I hate these disabled unit tests	2012-06-14 14:03:58 -04:00
Eric Banks	de5508fcea	Bug fixes for cycle and context covariates	2012-06-14 13:01:14 -04:00
Eric Banks	5c3c6cbc40	Long -> long conversions in BQSR	2012-06-14 09:07:02 -04:00
Eric Banks	29a74908bb	The next round of BQSR optimizations: no more Long[] array creation	2012-06-14 00:05:42 -04:00
Guillermo del Angel	cd2074b1dc	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 20:59:30 -04:00
Guillermo del Angel	92669a0468	Second intermediate commit for indel pool caller - now works (more or less) in reference sample-free mode. Still needs a lot of cleanups/add more tests and not done w/refactoring quite yet	2012-06-13 20:59:17 -04:00
David Roazen	0550b27799	Make downsampler classes themselves generic (instead of just the Downsampler interface) This is in response to a request from Mauricio to make it easier to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords) without having to do any cumbersome typecasting. Sadly, Java language limitations make this sort of solution the best choice. Thanks to Khalid for his feedback on this issue. Also: -added a unit test to verify GATKSAMRecord support with no typecasting required -added some unit tests for the FractionalDownsampler that Mauricio will/might be using -moved classes from private to public to better sync up with my local development branch for engine integration	2012-06-13 16:43:39 -04:00
Guillermo del Angel	67c0569f9c	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 11:50:00 -04:00
Eric Banks	81993b08e2	Don't put null entries into the key array	2012-06-13 11:43:44 -04:00
Roger Zurawicki	bdf5945dcc	Fixed bugs in DiagnoseTargets DT would not report bad mates! that has been fixed Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:26 -04:00
Roger Zurawicki	538cdf9210	Created the FindCoveredIntervals Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class Minor tweaks FindCoveredIntervals supports Gathering FindCoveredIntervals outputs an interval list instead of GATKReport Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:25 -04:00
Guillermo del Angel	aee66ab157	Big UG refactoring and intermediate commit to support indels in pool caller (not done yet). Lots of code pulled out of long spaghetti-like functions and modularized to be easily shareable. Add functionality in ErrorModel to count indel matches/mismatches (but left part disabled so as not to change integration tests in this commit), add computation of pool genotype likelihoods for indels (not fully working yet in more realistic cases, only working in artificial nice pools). Lots of TBDs still but existing UG and pool SNP functionality should be intact	2012-06-13 11:14:44 -04:00
Eric Banks	bb77aa88c3	Drat, forgot the unit tests again	2012-06-12 19:00:47 -04:00
Eric Banks	37f56ce8fd	A couple of minor updates to BQSR	2012-06-12 16:12:13 -04:00
Eric Banks	277493dd83	Yet more instances of Lists changed over to native arrays	2012-06-12 15:56:09 -04:00
Eric Banks	613badc835	Very minor optimizations for the context covariate	2012-06-12 15:47:32 -04:00
Eric Banks	0f79adb2aa	Changing more Java Lists to native arrays in BQSR for performance optimization.	2012-06-12 15:41:01 -04:00
Eric Banks	1da3e43679	Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come.	2012-06-12 13:32:56 -04:00
Eric Banks	a96c5da884	Oops, forgot to push the unit tests	2012-06-12 11:38:30 -04:00
Eric Banks	fec0bd5e11	Fixing UG argument docs	2012-06-12 09:46:16 -04:00
Eric Banks	a4defdfb29	Adding a GT header line to SomaticIndelDetector output	2012-06-12 09:39:17 -04:00
Eric Banks	891ce51908	Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements.	2012-06-12 09:19:36 -04:00
Eric Banks	ff5749599d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 15:46:17 -04:00
Eric Banks	fea625632f	Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one	2012-06-11 15:45:58 -04:00
Ryan Poplin	e4d371dc80	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 10:38:50 -04:00
Ryan Poplin	683d4b508e	Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller.	2012-06-11 10:38:35 -04:00
Mauricio Carneiro	4aad7e23ef	New ReduceReads v2 with unclipped variant regions and soft-clipped bases * Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it. * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads * Updated all integration tests	2012-06-08 14:58:31 -04:00
Eric Banks	afa9b2718a	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-08 13:54:48 -04:00
Eric Banks	92280b4068	BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime.	2012-06-08 13:54:37 -04:00
Eric Banks	898a0e6161	Minor optimizations	2012-06-08 12:07:58 -04:00
Ryan Poplin	0a37e19998	Bug fix in VQSR so that the VCF index will be created for the recalFile.	2012-06-08 11:51:28 -04:00
Eric Banks	d463ab2cbf	BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible.	2012-06-08 10:42:42 -04:00
Eric Banks	2bd48a7351	Bad comments made it into the previous commit	2012-06-07 23:12:56 -04:00
Eric Banks	31c3a6be48	BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR.	2012-06-07 20:04:10 -04:00
Eric Banks	0fb9179f76	BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array	2012-06-07 19:41:03 -04:00
Ryan Poplin	d449f169d3	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-07 10:56:55 -04:00
Ryan Poplin	0b4281fdd0	misc minor update to HC debug output for when there are a lot of samples	2012-06-07 10:56:41 -04:00
Eric Banks	bad50a1b05	Fix docs	2012-06-06 22:45:38 -04:00
Eric Banks	b093ba9dcc	Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records)	2012-06-06 15:17:30 -04:00
Eric Banks	54f682a99c	Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX.	2012-06-06 11:44:37 -04:00
Eric Banks	dd46d843fb	IR should skip Ion reads just like it does with 454 reads; Tim has confirmed that official platform name for Ion.	2012-06-06 11:04:55 -04:00
Guillermo del Angel	2cbd6e5f90	Merged bug fix from Stable into Unstable	2012-06-05 15:58:23 -04:00
Guillermo del Angel	ce4dc2128d	Adding minor clarification to -mbq argument documentation	2012-06-05 15:17:56 -04:00
Eric Banks	e02ec8c8b6	Don't update the record ID unless we are actually going to emit the record	2012-06-04 14:58:50 -04:00
Eric Banks	8405156ae1	Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities.	2012-06-04 14:28:32 -04:00
Ryan Poplin	f11e7ebc3a	Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA.	2012-06-04 12:49:36 -04:00
Ryan Poplin	320956ee4b	Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage.	2012-06-04 10:55:36 -04:00
Guillermo del Angel	7a54baf08c	Merged bug fix from Stable into Unstable	2012-06-03 08:42:08 -04:00
Guillermo del Angel	47df7bbc14	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-06-03 08:38:54 -04:00
Guillermo del Angel	2ddbdee3bc	Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow	2012-06-03 08:38:38 -04:00
Mauricio Carneiro	12a8c54f9a	Fixing VCF header for filter elements (thanks Eric)	2012-06-01 15:45:15 -04:00
Eric Banks	3a15ba2102	Malformed VCF headers should be User Errors	2012-05-31 16:05:53 -04:00
Khalid Shakir	c4f7df4dce	When an underlying exception occurs because of the user error, if the exception instance does not include a message instead of telling the user "because null", tell them "because <exception class name>".	2012-05-30 16:39:06 -04:00
Ryan Poplin	421d0d1435	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-30 15:21:35 -04:00
Ryan Poplin	5dd811f84a	Adding genotype given alleles mode to the HaplotypeCaller.	2012-05-30 15:07:01 -04:00
Eric Banks	d09b8d5584	Fixing docs	2012-05-30 13:24:08 -04:00
Mauricio Carneiro	d6e1205310	Updating default values for DiagnoseTargets	2012-05-30 12:43:07 -04:00
Khalid Shakir	c3c7f17d90	Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000. Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the nonexistent dir, now checking to see if the network directory is available and throwing a SkipException to bypass the test when it cannot be run. TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors pop up later when the networked fastas aren't found.	2012-05-29 11:18:22 -04:00
Roger Zurawicki	b8b139841d	DiagnoseTargets with working Q1,Median,Q3 - Merged Roger's metrics with Mauricio's optimizations - Added Stats for DiagnoseTargets - now has functions to find the median depth, and upper/lower quartile - the REF_N callable status is implemented - The walker now runs efficiently - Diagnose Targets accepts overlapping intervals - Diagnose Targets now checks for bad mates - The read mates are checked in a memory efficient manner - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker. - Fixed some bugs - Removed rod binding Added more Unit tests - Test callable statuses on the locus level - Test bad mates - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-05-29 10:16:45 -04:00
Eric Banks	50031b63c5	Fix possible NPE from NBaseCount annotation module	2012-05-29 09:46:00 -04:00
Mark DePristo	08de4dfd96	Missed one integration test	2012-05-29 07:23:24 -04:00
Mark DePristo	454c8e63e6	Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s -- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec. As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)	2012-05-28 20:20:05 -04:00
Mark DePristo	7ce24a96f1	PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present	2012-05-28 20:18:16 -04:00
Mark DePristo	1818c29371	Fixed long-standing bug in beagle codec that was passing on the header record for decoding	2012-05-28 20:17:26 -04:00
Mark DePristo	06b02e1b9b	Update MD5s to reflect new limited output of DiffObjectsWalkers -- Also updated GQ change in VCFIntegrationTest	2012-05-27 11:20:47 -04:00
Mark DePristo	5894d045cb	Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests -- This version of BCF should actually work properly for most files, assuming headers are properly defined. -- Lots of bug fixes to BCF2 codec -- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL. NOTE THIS SEMANTICS change -- Equals() method for GenotypeLikelihoods, using PLs. -- VCFCodec no longer adds empty bindings to missing input field values. NOTE THIS CHANGE -- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work. The BCF2 codec now makes VCs marked as fully decoded -- stringToBytes returns empty list for null or "" string in BCF2Encoder -- Proper handling of genotype ordering in BCF2 reader / writer -- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily -- Many failing MD5s now due to double -> int change in GQ, will update later	2012-05-27 11:17:17 -04:00
Mark DePristo	86e5a066fc	Even more conservative limit on number of differences to summarize at 1000	2012-05-27 11:17:13 -04:00
Mark DePristo	31f4e5b52e	Stop unlimited runtimes in DiffEngine when you have lots of differences -- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000	2012-05-27 11:17:13 -04:00
Guillermo del Angel	a6ee4f98b5	Yet More missing md5's	2012-05-25 17:21:47 -04:00
Mauricio Carneiro	4109fcbb08	Merged bug fix from Stable into Unstable	2012-05-25 13:03:05 -04:00
Mauricio Carneiro	2be5704a25	Fixed haplotype boundary bug in PairHMMIndelErrorModel haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception). * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate. * removed unnecessary code to remove hard clips after processing. * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).	2012-05-25 13:00:45 -04:00
Guillermo del Angel	175bb35e70	Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former)	2012-05-25 12:56:23 -04:00
Mark DePristo	d6df817174	Oops, don't enable shadow BCF tests	2012-05-24 13:31:13 -04:00
Mark DePristo	0a86564669	Updated test files didn't make it into last push	2012-05-24 13:29:44 -04:00
Mark DePristo	7280cdf937	Bugfixes and testdata cleanup -- Cut down the size of a few large files in public/testdata that were only used in part -- Refactor vcf Filename => shadow BCF filename to BCF2Utils. Fix bug in WalkerTest due to the way this was handled previously	2012-05-24 13:26:05 -04:00
Mark DePristo	e9c22b9aad	Final updates to integration tests for BCF2 -- Fully working version -- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf -- Moved MedianUnitTest to its proper home in Utils -- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests. From this website it's easy to see md5 diffs, etc. This is a vastly better way to manage unit and integration test output	2012-05-24 10:58:59 -04:00
Mark DePristo	ade1843818	Bugfix for not setting header in AbstractVCFCodec	2012-05-24 10:58:58 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Mark DePristo	69ee4d0454	Moved getMetaDataForField to VariantContextUtils	2012-05-24 10:57:09 -04:00
Mark DePristo	cb13f16e90	WalkerTest infrastructure to generate and test shadowBCF file for every generated VCF file -- Currently disabled	2012-05-24 10:57:09 -04:00
Mark DePristo	f77d2e6965	Renamed NO_HEADER to the more accurate no_cmdline_in_header -- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well	2012-05-24 10:57:08 -04:00
Mark DePristo	4bde24f020	Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header	2012-05-24 10:57:08 -04:00
Mark DePristo	4846bf5c8e	@Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf -- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically	2012-05-24 10:57:07 -04:00
Mark DePristo	bb0d87666a	Finally just deleted equals() method in GATKArgumentCollection. -- We never compare these things in the codebase anyway...	2012-05-24 10:57:07 -04:00
Mark DePristo	6f469305ab	Don't try to share BCF2 yet	2012-05-24 10:57:06 -04:00
Mark DePristo	c8ed0bfc4c	Edge case fixes for BCF2 --handle entirely missing GT in a sample in decodeGenotypeAlleles --Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code -- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples -- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys -- Removed restriction on getPloidy requiring ploidy > 1. It's logically fine to return 0 for a no called sample -- getMaxPloidy() in VC that does what it says -- Support for padding / depadding of generic genotype fields	2012-05-24 10:57:06 -04:00
Mark DePristo	40431890be	-- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it -- BCF2 writer can now work without the contig lines being in the header -- Made GenomeLocParser a final class	2012-05-24 10:57:06 -04:00
Mark DePristo	6301572009	GenotypeLikelihood PLs are capped at Short.MAX_INT now -- UserExceptions in BCF2 now where appropriate -- Asserts for code safety -- Public -> protected encode(Object v) method is for testing only	2012-05-24 10:57:06 -04:00
Mark DePristo	d52bc31a47	Bugfix for doNotWriteGenotypes mode -- Was outputting GT ./. in sites only mode. Fixed	2012-05-24 10:57:05 -04:00
Mark DePristo	64d4238e2f	99% working version of BCF2 encoder / decoder -- fixed final bugs with PL encoding / decoding -- Ready for testing by other members of the group -- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations -- Fixed a nasty bug in the filter field -- Note that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types Read 1500 genotypes file in VCF -> VCF : 11 seconds Read 1500 genotypes file in VCF -> BCF : 9.5 seconds VariantEval 1500 genotypes file in VCF : 3 seconds VariantEval 1500 genotypes file in BCF : 3 seconds	2012-05-24 10:57:05 -04:00
Mark DePristo	b5bce8d3f9	AD should be UNBOUNDED, actually -- Pass in # alt alleles as appropriate for getCount in VCF header line	2012-05-24 10:57:05 -04:00
Mark DePristo	aaf11f00e3	Near final BCF2 implementation -- Trivial import changes in some walkers -- SelectVariants has a new hidden mode to fully decode a VCF file -- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED by A type, which is actually the right type -- GenotypeLikelihoods now implements List<Double> for convenience. The PL duality here is going to be removed in a subsequent commit -- BugFixes in BCF2Writer. Proper handling of padding. Bugfix for nFields for a field -- padAllele function in VariantContextUtils -- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.	2012-05-24 10:57:02 -04:00
Mark DePristo	dfee17a672	Generalize / unify code for handling strings -- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder. -- Unified the type conversion code in BCFWriter to simplify the mapping from VCF type => BCF type and special value recoding -- Code cleanup and renaming	2012-05-24 10:57:02 -04:00

2235 Commits (5ec737f008b8906c5cf7f5e7ddf7fd75e2ae4fbb)