gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	31997f8092	Bugfixes on the way to passing integration tests -- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate -- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it. Very expensive but convenient. -- Fixed nasty subsetting bug in SelectVariants with excluding samples -- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT -- Bugfix for dropping old style GL field values -- Added test to VCFWriter to ensure that we have the sample number of samples in the VC as in the header -- Bugfix for Allele.getBaseString to properly show NO_CALL alleles -- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes	2012-06-14 16:42:33 -04:00
Mark DePristo	dd6aee347a	Genotype encoding uses the BCF2FieldEncoder system	2012-06-14 16:42:33 -04:00
Mark DePristo	7506994d09	Nearing final BCF commit -- Cleanup some (but not all) VCF3 files. Turns out there are lots so... -- Refactored genotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec. Now VCF3 properly handles the new GenotypeBuilder interface -- Misc. bugfixes in GenotypeBuilder	2012-06-14 16:42:32 -04:00
Mark DePristo	8014178f2f	Algorithmically faster version of DiffEngine -- Now only includes leaf nodes in the summary, i.e., summaries of the form ".....*.X", which are really the most valuable to see. This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm -- Now computes the max number of elements to read correctly. Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size. -- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well. -- Added integration test for new leaf and old pairwise calculations -- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields	2012-06-14 16:42:30 -04:00
Mark DePristo	51a3b6e25e	No more makePrecisionFormatStringFromDenominatorValue -- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formatting. -- Created a smart float formatter in VCFWriter, with unit tests -- Removed makePrecisionFormatStringFromDenominatorValue and its uses -- Fix broken contracted -- Refactored some code from the encoder to utils in BCF2 -- HaplotypeCaller's GenotypingEngine was using old version of subset to context. Replaced with a faster call that I think is correct. Ryan, please confirm.	2012-06-14 16:42:30 -04:00
Mark DePristo	6cfb2d1393	Restoring SelectVariantsIntegrationTest	2012-06-14 16:42:28 -04:00
Mark DePristo	982192e2e4	MD5DB for integrationtest management now writes out a md5mismatches files for clean analysis -- This file is in integrationtests/md5mismatches.txt, and looks like: expected observed test 7fd0d0c2d1af3b16378339c181e40611 2339d841d3c3c7233ebba9a6ace895fd test BeagleOutputToVCF 43865f3f0d975ee2c5912b31393842f8 1b9c4734274edd3142a05033e520beac testBeagleChangesSitesToRef daead9bfab1a5df72c5e3a239366118e 27be14f9fc951c4e714b4540b045c2df testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e -- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods	2012-06-14 16:42:27 -04:00
Mark DePristo	d37a8a0bc8	Efficient Genotype object Intermediate commit -- Created a new Genotype interface with a more limited set of operations -- Old genotype object is now SlowGenotype. New genotype object is FastGenotype. They can be used interchangeably -- There's no way to create Genotypes directly any longer. You have to use GenotypeBuilder just like VariantContextBuilder -- Modified lots and lots of code to use GenotypeBuilder -- Added a temporary hidden argument to engine to use FastGenotype by default. Current default is SlowGenotype -- Lots of bug fixes to BCF2 codec and encoder. -- Feature additions -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes -- Cleaned up semantics of subContextFromSamples. There's one function that either rederives or not the alleles from the subsetted genotypes -- MASSIVE BUGFIX in SelectVariants. The code has been decoding genotypes always, even if you were not subsetting down samples. Fixed!	2012-06-14 16:42:24 -04:00
Eric Banks	0398ae9695	I hate these disabled unit tests, #2	2012-06-14 15:19:27 -04:00
Eric Banks	676a57de7b	I hate these disabled unit tests	2012-06-14 14:03:58 -04:00
Eric Banks	29a74908bb	The next round of BQSR optimizations: no more Long[] array creation	2012-06-14 00:05:42 -04:00
Eric Banks	bb77aa88c3	Drat, forgot the unit tests again	2012-06-12 19:00:47 -04:00
Eric Banks	0f79adb2aa	Changing more Java Lists to native arrays in BQSR for performance optimization.	2012-06-12 15:41:01 -04:00
Eric Banks	a96c5da884	Oops, forgot to push the unit tests	2012-06-12 11:38:30 -04:00
Eric Banks	891ce51908	Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements.	2012-06-12 09:19:36 -04:00
Ryan Poplin	0a37e19998	Bug fix in VQSR so that the VCF index will be created for the recalFile.	2012-06-08 11:51:28 -04:00
Eric Banks	8405156ae1	Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities.	2012-06-04 14:28:32 -04:00
Roger Zurawicki	b8b139841d	DiagnoseTargets with working Q1,Median,Q3 - Merged Roger's metrics with Mauricio's optimizations - Added Stats for DiagnoseTargets - now has functions to find the median depth, and upper/lower quartile - the REF_N callable status is implemented - The walker now runs efficiently - Diagnose Targets accepts overlapping intervals - Diagnose Targets now checks for bad mates - The read mates are checked in a memory efficient manner - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker. - Fixed some bugs - Removed rod binding Added more Unit tests - Test callable statuses on the locus level - Test bad mates - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-05-29 10:16:45 -04:00
Mark DePristo	08de4dfd96	Missed one integration test	2012-05-29 07:23:24 -04:00
Mark DePristo	454c8e63e6	Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s -- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec. As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)	2012-05-28 20:20:05 -04:00
Mark DePristo	06b02e1b9b	Update MD5s to reflect new limited output of DiffObjectsWalkers -- Also updated GQ change in VCFIntegrationTest	2012-05-27 11:20:47 -04:00
Guillermo del Angel	a6ee4f98b5	Yet More missing md5's	2012-05-25 17:21:47 -04:00
Guillermo del Angel	175bb35e70	Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former)	2012-05-25 12:56:23 -04:00
Mark DePristo	e9c22b9aad	Final updates to integration tests for BCF2 -- Fully working version -- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf -- Moved MedianUnitTest to its proper home in Utils -- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests. From this website it's easy to see md5 diffs, etc. This is a vastly better way to manage unit and integration test output	2012-05-24 10:58:59 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Mark DePristo	f77d2e6965	Renamed NO_HEADER to the more accurate no_cmdline_in_header -- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well	2012-05-24 10:57:08 -04:00
Eric Banks	26968ae8eb	Forgot that the VCFStreamingIntegrationTest uses VE	2012-05-18 02:51:53 -04:00
Eric Banks	52c206d5db	Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports.	2012-05-18 02:32:20 -04:00
Eric Banks	03d40272c8	Removed old GATKReport code and moved the new stuff in its place.	2012-05-18 01:44:31 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Guillermo del Angel	ae26f0fe14	a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing	2012-05-14 10:55:35 -04:00
Guillermo del Angel	27b1aa5dd3	Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's	2012-05-10 10:29:19 -04:00
Eric Banks	c40cda7e3c	Nope, loads of integration tests had to be changed.	2012-05-07 14:30:42 -04:00
Eric Banks	f3433201b1	Merged bug fix from Stable into Unstable	2012-05-03 11:11:00 -04:00
Eric Banks	557da77a1a	Don't compute QD if there is no QUAL; added integration test for this	2012-05-03 11:02:37 -04:00
Eric Banks	1fc7b5d58b	Merged bug fix from Stable into Unstable	2012-05-03 10:37:58 -04:00
Laurent Francioli	567d01cee8	- Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:49 -04:00
Laurent Francioli	96e5a26223	PED support for Inbreeding Coefficient annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:20 -04:00
Mark DePristo	43d97c2e00	Rev Tribble to r97, adding binary feature support From tribble logs: Binary feature support in tribble -- Massive refactoring and cleanup -- Many bug fixes throughout -- FeatureCodec is now general, with decode etc. taking a PositionBufferedStream as an argument not a String -- See ExampleBinaryCodec for an example binary codec -- AbstractAsciiFeatureCodec provides to its subclass the same String decode, readHeader functionality before. Old ASCII codecs should inherit from this base class, and will work without additional modifications -- Split AsciiLineReader into a position tracking stream (PositionalBufferedStream). The new AsciiLineReader takes as an argument a PositionalBufferedStream and provides the readLine() functionality of before. Could potentially use optimizations (its a TODO in the code) -- The Positional interface includes some more functionality that's now necessary to support the more general decoding of binary features -- FeatureReaders now work using the general FeatureCodec interface, so they can index binary features -- Bugfixes to LinearIndexCreator off by 1 error in setting the end block position -- Deleted VariantType, since this wasn't used anywhere and it's a particularly clean why of thinking about the problem -- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package -- TabixReader requires an AsciiFeatureCodec as it's currently only implemented to handle line oriented records -- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles Ascii and binary features -- Removed unused functions here and there as encountered -- Fixed build.xml to be truly headless -- FeatureCodec readHeader returns a FeatureCodecHeader obtain that contains a value and the position in the file where the header ends (not inclusive). TribbleReaders now skip the header if the position is set, so its no longer necessary, if one implements the general readHeader(PositionalBufferedStream) version to see header lines in the decode functions. Necessary for binary codecs but a nice side benefit for ascii codecs as well -- Cleaned up the IndexFactory interface so there's a truly general createIndex function that takes the enumerated index type. Added a writeIndex() function that writes an index to disk. -- Vastly expanded the index unit tests and reader tests to really test linear, interval, and tabix indexed files. Updated test.bed, and created a tabix version of it as well. -- Significant BinaryFeaturesTest suite. -- Some test files have indent changes	2012-05-03 07:31:48 -04:00
Eric Banks	e448cfcc59	Forgot to update these md5s	2012-05-02 21:09:50 -04:00
Khalid Shakir	b8b7f28aa9	Revving Picard to pick up new SamFileHeaderMerger. Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut(). In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124. - Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs. - Updated md5s to match new expectations after looking at TLEN diff engine output.	2012-05-02 16:47:28 -04:00
Eric Banks	623b36fbc4	Add header lines for AC,AF, and AN tags	2012-05-02 15:33:34 -04:00
Eric Banks	619a69a5f1	As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?).	2012-05-01 16:18:24 -04:00
Eric Banks	0f3af9555b	Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case.	2012-05-01 14:58:06 -04:00
Eric Banks	0c8e801021	Removing public to private dependency	2012-05-01 11:04:11 -04:00
Eric Banks	e964d17518	Removing public to private dependency	2012-05-01 11:02:28 -04:00
Mauricio Carneiro	462450c3e3	disabling all BQSR unit tests with the changes to the cycle covariate, some tests need updates, others need to be completely re-written.	2012-04-30 14:39:55 -04:00
Guillermo del Angel	e185632013	Exhaustive unit tests for Pool SNP genotype likelihoods: a) Add ability for ErrorModel to be specified by external log-probability vector for testing. b) For a given depth and ploidy(=2*samples/pool), create artificial high quality pileup testing from AC=0 to AC=ploidy, and test that pool GL's have expected content.Misc. refactorings and cleanups c) Misc. cleanups and beautification.	2012-04-30 14:29:46 -04:00
Guillermo del Angel	730208133b	Several fixes and improvements to Pool caller with ancillary test functions (not done yet): a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value. b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time. c) Expand unit tests and add an exhaustive test for ErrorModel class. d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10. e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp), can only specify pileups in one position with a given number of matches and mismatches to ref) but functionality will be expanded in future to cover more test cases. f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done). g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model. h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math	2012-04-27 14:41:17 -04:00
Khalid Shakir	9801dd114f	Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag() Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.	2012-04-27 09:58:38 -04:00

423 Commits (7a8e2a8adabbd4e65f48aa85b0214da6f53f3fc4)