gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	f0b081a85f	Update VCF.jar loading test -- to reflect new path to VCFWriter	2012-05-24 10:56:58 -04:00
Mark DePristo	c3b8048e2e	Moving around classes in VCF and BCF2 -- Refactored VCF writers into vcf.writers package -- Moved BCF2Writer to bcf2.writer -- Updates to all of the walkers using VCFWriter to reflect new packages -- A large number of files had their headers cleaned up because of this as well	2012-05-24 10:56:58 -04:00
Mark DePristo	679ffdd333	Move BCF2 from private utils to public codecs	2012-05-24 10:56:56 -04:00
Mark DePristo	450f098a61	BCF2 encoder / decoder implement new site / genotype block organization -- Supports final organization of data blocks into sites data and genotypes data	2012-05-24 10:56:55 -04:00
Mark DePristo	27b51d4dea	Enable on the fly indexing of BCF2	2012-05-24 10:56:54 -04:00
Mark DePristo	81bd7646d6	Fix for MISSING floats -- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int). -- Now handles correctly MISSING qual fields	2012-05-24 10:56:53 -04:00
Mark DePristo	3afbc50511	More BCF2 improvements -- Refactored setting of contigs from VCFWriterStub to VCFUtils. Necessary for proper BCF working -- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order. -- Cleaned up VCFHeader operations -- BCF now uses the right header files correctly when encoding / decoding contigs -- Clean up unused tools -- Refactored header parsing routines to make them more accessible -- More minor header changes from Intellij	2012-05-24 10:56:52 -04:00
Mark DePristo	0799855479	Archiving GCF -- Rider update to CramByPiece.scala	2012-05-24 10:56:51 -04:00
Guillermo del Angel	43919078cd	Merged bug fix from Stable into Unstable	2012-05-23 21:21:01 -04:00
Guillermo del Angel	4bc04e2a9e	Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler.	2012-05-23 21:19:30 -04:00
Ryan Poplin	08dfd6cab6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-21 16:47:07 -04:00
Ryan Poplin	04000d920c	Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads.	2012-05-21 16:46:59 -04:00
Eric Banks	666862af19	Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs	2012-05-21 16:03:29 -04:00
Khalid Shakir	e57cd78bba	Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each. This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource. Ex: public Wrapper getNewWrapper(File path) { FileStream myStream = new FileStream(path); // This stream must be eventually closed. return new Wrapper(myStream); } public void close(Wrapper wrapper) { wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream. }	2012-05-21 15:41:56 -04:00
Eric Banks	7f5ec17d22	Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table.	2012-05-21 14:16:13 -04:00
Eric Banks	92d8aa3d4c	Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels	2012-05-21 09:38:52 -04:00
Eric Banks	3af3834d50	Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian): * For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE. Switched over to booleans. * We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to stdout.	2012-05-18 11:55:41 -04:00
Eric Banks	26968ae8eb	Forgot that the VCFStreamingIntegrationTest uses VE	2012-05-18 02:51:53 -04:00
Eric Banks	52c206d5db	Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports.	2012-05-18 02:32:20 -04:00
Eric Banks	03d40272c8	Removed old GATKReport code and moved the new stuff in its place.	2012-05-18 01:44:31 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Guillermo del Angel	5189b06468	New annotation for indels that describe if they're STR's and their characteristics. If an indel is a STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality.	2012-05-17 15:28:19 -04:00
Eric Banks	0f7c917e7a	Better error checking and messages for bad alleles	2012-05-17 13:36:42 -04:00
Eric Banks	d44886d9e8	Very naughty bug: VE output is not at all gatherable but no one told this to Queue. Fixed.	2012-05-15 10:29:04 -04:00
Eric Banks	819c3d0c15	Adding to the Hrun docs	2012-05-15 10:27:52 -04:00
Guillermo del Angel	5fc3adbb04	One more VariantsToTable bug fix	2012-05-14 14:10:07 -04:00
Guillermo del Angel	04d691f04a	Forgot to update MD5's due to new Exact AF model in pool caller (all changes legit, minor QUAL/QD/SB differences). Fixed bug in VariantsToTable from previous commit	2012-05-14 14:01:29 -04:00
Guillermo del Angel	ae26f0fe14	a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing	2012-05-14 10:55:35 -04:00
Ryan Poplin	c9dd0f3173	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-10 13:09:10 -04:00
Ryan Poplin	0cdadffe14	Committing the best of the frantic pre-CSHL experiments: Better algorithm for partitioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes.	2012-05-10 13:09:03 -04:00
Guillermo del Angel	89f8a6b2e6	Revert bad part of last commit that shouldn't have been pushed	2012-05-10 10:41:08 -04:00
Guillermo del Angel	27b1aa5dd3	Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's	2012-05-10 10:29:19 -04:00
Eric Banks	4f37d6d399	Fixing docs	2012-05-10 00:56:00 -04:00
Mark DePristo	c81acfc15d	Working implementation of BCF2 -- Nearly complete on spec implementation. Slow but clean -- Some refactoring of VariantContext to support common functions for BCF and VCF	2012-05-08 19:46:51 -04:00
Mark DePristo	a5193c2399	Mostly complete reference implementation of BCF2 -- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF	2012-05-08 19:46:51 -04:00
Eric Banks	473d07b0c5	fixing up docs from previous Pool Caller commit	2012-05-08 11:02:55 -04:00
Eric Banks	b4999d14c1	updating docs	2012-05-08 10:58:46 -04:00
Guillermo del Angel	33a1dd2048	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 10:42:12 -04:00
Eric Banks	5cf4fd63c2	Catch malformed base qualities and throw as a User Error	2012-05-08 09:34:57 -04:00
Guillermo del Angel	a4f4b5007b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 09:34:33 -04:00
Guillermo del Angel	605984353f	Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model.	2012-05-08 09:33:38 -04:00
Eric Banks	c40cda7e3c	Nope, loads of integration tests had to be changed.	2012-05-07 14:30:42 -04:00
Eric Banks	66838a073e	Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed...	2012-05-07 12:20:11 -04:00
David Roazen	6b769e91d8	BCF2: third checkpoint * writer mostly implemented * walkers to convert BCF2 <-> VCF * almost working for sites-only files; genotypes still need work * initial performance tests this afternoon will be on sites-only files	2012-05-04 13:00:15 -04:00
Eric Banks	f3433201b1	Merged bug fix from Stable into Unstable	2012-05-03 11:11:00 -04:00
Eric Banks	557da77a1a	Don't compute QD if there is no QUAL; added integration test for this	2012-05-03 11:02:37 -04:00
Eric Banks	1fc7b5d58b	Merged bug fix from Stable into Unstable	2012-05-03 10:37:58 -04:00
Laurent Francioli	567d01cee8	- Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:49 -04:00
Laurent Francioli	96e5a26223	PED support for Inbreeding Coefficient annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:20 -04:00
Mark DePristo	43d97c2e00	Rev Tribble to r97, adding binary feature support From tribble logs: Binary feature support in tribble -- Massive refactoring and cleanup -- Many bug fixes throughout -- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream as an argument not a String -- See ExampleBinaryCodec for an example binary codec -- AbstractAsciiFeatureCodec provides to its subclass the same String decode, readHeader functionality as before. Old ASCII codecs should inherit from this base class, and will work without additional modifications -- Split AsciiLineReader into a position tracking stream (PositionalBufferedStream). The new AsciiLineReader takes as an argument a PositionalBufferedStream and provides the readLine() functionality of before. Could potentially use optimizations (it's a TODO in the code) -- The Positional interface includes some more functionality that's now necessary to support the more general decoding of binary features -- FeatureReaders now work using the general FeatureCodec interface, so they can index binary features -- Bugfixes to LinearIndexCreator off by 1 error in setting the end block position -- Deleted VariantType, since this wasn't used anywhere and it's a particularly clean way of thinking about the problem -- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package -- TabixReader requires an AsciiFeatureCodec as it's currently only implemented to handle line oriented records -- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles Ascii and binary features -- Removed unused functions here and there as encountered -- Fixed build.xml to be truly headless -- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a value and the position in the file where the header ends (not inclusive). TribbleReaders now skip the header if the position is set, so it's no longer necessary, if one implements the general readHeader(PositionalBufferedStream) version, to see header lines in the decode functions. Necessary for binary codecs but a nice side benefit for ascii codecs as well -- Cleaned up the IndexFactory interface so there's a truly general createIndex function that takes the enumerated index type. Added a writeIndex() function that writes an index to disk. -- Vastly expanded the index unit tests and reader tests to really test linear, interval, and tabix indexed files. Updated test.bed, and created a tabix version of it as well. -- Significant BinaryFeaturesTest suite. -- Some test files have indent changes	2012-05-03 07:31:48 -04:00
Mark DePristo	58c470a6c5	Rev'ing Tribble from 53 to 94 -- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code -- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase	2012-05-03 07:31:47 -04:00
Eric Banks	e448cfcc59	Forgot to update these md5s	2012-05-02 21:09:50 -04:00
Khalid Shakir	b8b7f28aa9	Revving Picard to pick up new SamFileHeaderMerger. Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut(). In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124. - Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs. - Updated md5s to match new expectations after looking at TLEN diff engine output.	2012-05-02 16:47:28 -04:00
Mauricio Carneiro	f51a1d0d61	Better error message to the BAMScheduler In the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.	2012-05-02 16:10:00 -04:00
Mauricio Carneiro	940029fa5d	Fixing on-the-fly recalibration (caught by Ryan) low quality bases in the tails were being turned to N's in the final read.	2012-05-02 16:06:04 -04:00
Eric Banks	623b36fbc4	Add header lines for AC,AF, and AN tags	2012-05-02 15:33:34 -04:00
Guillermo del Angel	429800a192	Fix corner case rounding issue in MathUtils unit test: 10^logFactorial(4)) was 23.999999... which if cast directly yielded 23 - so, do pre-rounding to ensure correct integer result if caller will cast value.	2012-05-02 09:57:06 -04:00
Guillermo del Angel	76a95fdedf	Full implementation of multiallelic exact model for pools. Still super-linear so not useable at scale but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation which is more expensive (doesn't seem to save much time either in PoolCaller nor in UG though).	2012-05-02 09:24:28 -04:00
Joel Thibault	4d732fa586	Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb	2012-05-01 18:23:51 -04:00
Eric Banks	619a69a5f1	As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?).	2012-05-01 16:18:24 -04:00
Joel Thibault	c255dd5917	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:10:38 -04:00
Ryan Poplin	51af61b5d7	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:07:23 -04:00
Ryan Poplin	fc55dcec3c	Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now.	2012-05-01 16:02:36 -04:00
Ryan Poplin	20a0078f23	Merging active regions across shard boundaries if they are contiguous, have the same active status and don't grow too big.	2012-05-01 15:51:36 -04:00
Eric Banks	0f3af9555b	Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case.	2012-05-01 14:58:06 -04:00
Joel Thibault	aa4d41cce0	Minor cleanup before push	2012-05-01 14:16:44 -04:00
Joel Thibault	b101b9c30b	Add Mongo switch	2012-05-01 14:00:48 -04:00
Joel Thibault	1b609e9075	Move Mongo to server couchdb	2012-05-01 13:59:47 -04:00
Joel Thibault	fd57d27f45	Move MongoDB connection handling to a separate class	2012-05-01 13:59:37 -04:00
Joel Thibault	db3cd1abd5	Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes.	2012-05-01 13:57:23 -04:00
Joel Thibault	04e1be9106	Better handling of Mongo errors + exceptions	2012-05-01 13:57:23 -04:00
Joel Thibault	ca737479cf	Query for stop locations because we don't have that information in the reference	2012-05-01 13:57:23 -04:00
Joel Thibault	1cda87a4ad	Set ROD priority list to input	2012-05-01 13:57:23 -04:00
Joel Thibault	a7fe847faf	Set the priority list and don't bother combining if not needed	2012-05-01 13:57:23 -04:00
Joel Thibault	f739305f43	Combine the variants found at a location	2012-05-01 13:57:23 -04:00
Joel Thibault	020f884d5a	Use new key of source ROD plus alleles	2012-05-01 13:57:23 -04:00
Joel Thibault	221ce9c3d6	Add alleles to the primary key	2012-05-01 13:57:23 -04:00
Joel Thibault	3198ce5471	Can have multiple variants at a location	2012-05-01 13:57:22 -04:00
Joel Thibault	11ed8e61c9	Add referenceBaseForIndel to the Mongo VariantContext objects	2012-05-01 13:53:44 -04:00
Joel Thibault	7ed0ee7ed0	Skip locations with no genotypes instead of throwing a NPE	2012-05-01 13:53:44 -04:00
Joel Thibault	4bdfeacdaa	Handle multiple samples/genotypes per location TODO: sample selection	2012-05-01 13:53:43 -04:00
Joel Thibault	1f7c628796	Insert the ROD filename into MongoDB as part of the primary key	2012-05-01 13:53:43 -04:00
Joel Thibault	bb8a6e9b0a	Initial test of write and read from MongoDB	2012-05-01 13:53:43 -04:00
David Roazen	c0084c741b	Pilot BCF2 Implementation: Checkpointing the code * Not working yet, still very much a work-in-progress with lots of placeholders * Needed to check this in to enable possible collaboration, since it's going slower than anticipated and the conference deadline looms.	2012-05-01 12:23:10 -04:00
Eric Banks	0c8e801021	Removing public to private dependency	2012-05-01 11:04:11 -04:00
Eric Banks	e964d17518	Removing public to private dependency	2012-05-01 11:02:28 -04:00
Mauricio Carneiro	462450c3e3	disabling all BQSR unit tests with the changes to the cycle covariate, some tests need updates, others need to be completely re-written.	2012-04-30 14:39:55 -04:00
Guillermo del Angel	e185632013	Exhaustive unit tests for Pool SNP genotype likelihoods: a) Add ability for ErrorModel to be specified by external log-probability vector for testing. b) For a given depth and ploidy(=2*samples/pool), create artificial high quality pileup testing from AC=0 to AC=ploidy, and test that pool GL's have expected content. Misc. refactorings and cleanups. c) Misc. cleanups and beautification.	2012-04-30 14:29:46 -04:00
Christopher Hartl	7d029b9a28	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-30 12:16:30 -04:00
Christopher Hartl	944a7d815e	Bringing VQSRV3 up to date. Lots of new features (un-classifying the worst-performing training sites, treating the x% best/worst sites as positive/negative points, ability to pass in a monomorphic track to see ROC curves output). Minor changes to AlleleBalance: weighted average was incorrectly specified (using logscale actually biased the average towards the AB of low-quality genotypes), and breaking out AB by het, hom, and diploid to bring it in line with some (private) changes to the indel likelihood model that (correctly) computes these values for indels.	2012-04-28 11:31:03 -04:00
Ryan Poplin	54a9bc2da2	Bug fix in reverse trim alleles for the case of mixed records that become non-mixed after subsetting the alleles.	2012-04-28 09:12:26 -04:00
Ryan Poplin	e332aeaf70	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-27 16:21:21 -04:00
Ryan Poplin	2b5dd28550	Bug fix in reverse trim alleles for the case of mixed records.	2012-04-27 16:21:02 -04:00
Mauricio Carneiro	1db2d1ba82	Do not add the first and last 4 cycles to the recalibration tables.	2012-04-27 15:18:07 -04:00
Mauricio Carneiro	08dbd756f3	Quick QC walkers to look at the error profile of indels in the read	2012-04-27 15:18:07 -04:00
Guillermo del Angel	730208133b	Several fixes and improvements to Pool caller with ancillary test functions (not done yet): a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value. b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time. c) Expand unit tests and add an exhaustive test for ErrorModel class. d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10. e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp, can only specify pileups in one position with a given number of matches and mismatches to ref) but functionality will be expanded in future to cover more test cases. f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done). g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model. h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math	2012-04-27 14:41:17 -04:00
Eric Banks	0439047269	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-27 10:49:45 -04:00
Eric Banks	05b44dd017	The genotypeCounts array wasn't always being initialized before it was accessed, leading to a NPE (which got caught and thrown as a JEXL expression when used in selection). Added unit test to cover all genotype count methods.	2012-04-27 10:49:36 -04:00
Khalid Shakir	9801dd114f	Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag() Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.	2012-04-27 09:58:38 -04:00
Guillermo del Angel	2f86ccb086	Correct md5's for previous code change	2012-04-26 16:20:41 -04:00
Guillermo del Angel	972d6531b6	Corner case fix for indel GL computation: sometimes (depending on surrounding context) reads which are not informative of two candidate haplotypes end up having marginally higher likelihoods with one haplotype as opposed to another, depending on uncertainty on alignments in surrounding regions. So, a sample whose GL is -0.0001,-0.0005,-0.001 may have its genotype set to 1/1 due to this statistical noise. We already have a tolerance comparing max(gl)-min(gl) to avoid genotyping, so this tolerance is now increased from 0.001 to 0.1 (equivalent to 1 PL unit) to avoid genotyping a sample if all PLs are within this threshold. Changed 2 integration test md5s that hit this case.	2012-04-26 10:15:26 -04:00
Laurent Francioli	ab2a952ad1	PED support for Inbreeding Coefficient annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-04-25 12:56:47 -04:00
Laurent Francioli	219b0a128b	PED support for ChromosomeCounts annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-04-25 12:50:04 -04:00
Laurent Francioli	19d5213d5a	Added function to get founders IDs in SampleDB Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-04-25 12:49:36 -04:00
Mauricio Carneiro	902277856e	fix for RBP getPileupsForSamples() do not differentiate per sample pileups from generic pileups. Do the same for both -- it's O(n) either way.	2012-04-24 17:20:30 -04:00
Mauricio Carneiro	82b4798913	CountBasesWalker -- a quick QC walker.	2012-04-24 17:20:30 -04:00
Mauricio Carneiro	e440d0ce69	BQSR triage #4 * fixed queue script plot file names * updated the ReadGroupCovariate to use the platform unit instead of sample + lane. * fixed plotting of marginalized reported qualities	2012-04-24 17:19:54 -04:00
Eric Banks	d6277b70d8	Forgot to consider the optimized case in hasAllele	2012-04-24 11:32:28 -04:00
Eric Banks	91bad244d5	Using a VCF whose ALT is the reference in GGA mode is a User Error	2012-04-24 11:08:37 -04:00
Eric Banks	74ad008163	Adding VariantContext.hasAlternateAllele functionality	2012-04-24 11:07:46 -04:00
Eric Banks	66f3315548	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-24 09:39:55 -04:00
Eric Banks	bcb93dda5f	Fixing docs (rank sum test values are not phred-scaled)	2012-04-24 09:39:42 -04:00
Mauricio Carneiro	e39a59594a	BQSR triage and test routines * updated BQSR queue script for faster turnaround * implemented plot generation for scatter/gathered runs * adjusted output file names to be cooperative with the queue script * added the recalibration report file to the argument table in the report * added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read * added RecalibrationReport unit test -- guarantees the integrity of the delta tables	2012-04-23 11:23:00 -04:00
Eric Banks	a733723439	Merged bug fix from Stable into Unstable	2012-04-23 10:30:30 -04:00
Eric Banks	2761da975e	Handle null VCs (which can arise when indels are present in the file)	2012-04-23 10:30:00 -04:00
Eric Banks	cd63bcb1b8	Fixing unit tests to register the user exception being thrown (instead of the NumberFormatException)	2012-04-23 10:06:51 -04:00
Eric Banks	63aa79df82	Slightly better error message	2012-04-23 09:37:28 -04:00
Eric Banks	7b5fbf9567	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-23 09:34:08 -04:00
Eric Banks	4edb005411	Catch poorly formatted PL/GL fields	2012-04-23 09:33:50 -04:00
Ryan Poplin	35bb55f562	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-22 13:23:36 -04:00
Ryan Poplin	18e4532d10	Turning down the amount of assembly graph pruning slightly in the case of low coverage.	2012-04-22 13:23:24 -04:00
Eric Banks	1f23d99dfa	If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did.	2012-04-20 17:00:05 -04:00
Eric Banks	4b81c75642	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-20 14:30:19 -04:00
Eric Banks	f1c5510ec0	When running SelectVariants with the excludeNonVariants option, remove alleles from the ALT field that are no longer polymorphic.	2012-04-20 14:30:04 -04:00
Ryan Poplin	a1596791af	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-20 14:03:04 -04:00
Ryan Poplin	a57295eb75	Fixing a bug when breaking up active regions where the resulting regions would overlap by one base. Adding quality score manipulation from the UG into the haplotype caller (qual capped by mapping quality, min qual threshold).	2012-04-20 14:02:55 -04:00
Guillermo del Angel	de68363c23	Removed experimental feature (aka hack) that was meant for 1000G consensus but remained in VQSR data manager - QD was being scaled by indel length. There's no evidence any more that QD is length-dependent, neither in CEU trio data nor in latest 1000G P2 calls	2012-04-20 10:58:34 -04:00
Guillermo del Angel	d2488dfb81	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 19:40:03 -04:00
Guillermo del Angel	c44c7b9a97	Restored optimization in Pair HMM only to compute HMM matrices starting in index where haplotypes start to diverge - saves about 15-20% of runtime which is what we lost by disabling banding in latest version, so runtime should be now about the same as what it was before refactoring. Output is bit-true to previous commit	2012-04-19 19:39:43 -04:00
Mauricio Carneiro	0f8c77391d	BQSR bug triage #3 * fixed context covariate famous "off by one" error * reduced maximum quality score to Q50 (following Eric/Ryan's suggestion) * remove context downsampling in BQSR R script	2012-04-19 17:31:04 -04:00
Khalid Shakir	df5dd841af	AC strat now checks if evals will be merged before throwing an error on multiple eval files. Minor tweaks to WGP script based on new recal VCF format.	2012-04-19 16:08:55 -04:00
Guillermo del Angel	1ae2ab5b63	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 12:50:29 -04:00
Guillermo del Angel	0e6e0cb907	Merging bug fixes	2012-04-19 12:49:30 -04:00
Eric Banks	79272c5e15	Thanks to Menachem for pointing out that the docs for genotyping_mode and output_mode were the same (and unclear). Fixed.	2012-04-19 12:48:09 -04:00
Guillermo del Angel	02ff930f6a	My changes	2012-04-19 12:45:18 -04:00
Eric Banks	2485cef5b8	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 11:46:06 -04:00
Eric Banks	76a6e37f4f	Don't output callability metrics by default anymore; one can still have them output to the 'metrics' file (which is now @Hidden because they are really for GSA use). Added a TODO to move UG from @By reference to reads and rods once LIBS is cleaned up.	2012-04-19 11:45:56 -04:00
Ryan Poplin	1ea4e48a27	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 11:32:32 -04:00
Ryan Poplin	11001ab9a2	Adding option to HaplotypeCaller to genotype the events on the chosen haplotypes as independent events. The filtered reads are now kept around so they can be passed to the variant annotations. Unfortunately the filtered reads aren't assigned a likelihood yet so they are all thrown in the Allele.NO_CALL bin.	2012-04-19 11:32:10 -04:00
Mauricio Carneiro	eb22cd7222	Unit test to guarantee BQSR sequential calculation accuracy This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.	2012-04-19 09:33:40 -04:00
Mauricio Carneiro	68d0211fa1	Improved BQSR plotting and some new parameters * Refactored CycleCovariate to be a fragment covariate instead of a per read covariate * Refactored the CycleCovariateUnitTest to test the pairing information * Updated BQSR Integration tests accordingly * Made quantization levels parameter not hidden anymore * Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted) * Added hidden option not to generate the plots automatically (important for scatter/gathering)	2012-04-19 09:31:41 -04:00
Guillermo del Angel	143e92b797	Rebasing	2012-04-18 20:05:43 -04:00
Guillermo del Angel	960e7e6aaf	Changes to integration tests	2012-04-18 19:53:42 -04:00
Guillermo del Angel	82efd4457e	Revert some bad merge changes	2012-04-18 16:35:09 -04:00
Guillermo del Angel	31c394d588	Resolve merge conflicts	2012-04-18 16:25:03 -04:00
Ryan Poplin	4999ae87ad	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-18 15:02:42 -04:00
Ryan Poplin	dcc4871468	minor misc optimizations to PairHMM	2012-04-18 15:02:26 -04:00
Eric Banks	d3c84e7b1f	This should be a User Error since it's provided from the DoC command-line arguments	2012-04-18 13:09:23 -04:00
Eric Banks	392f1903f7	Handling some of the NumberFormatExceptions seen via Tableau that are really user errors.	2012-04-18 12:57:37 -04:00
Ryan Poplin	8a84456626	Following Eric's awesome update to change the VQSR recal file into a VCF file, the ApplyRecalibration step is now scatter/gather-able and tree reducible.	2012-04-18 11:24:04 -04:00
Eric Banks	4448a3ea76	Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position.	2012-04-17 23:54:10 -04:00
Eric Banks	c1f52b773a	Minor tweaks and updated integration tests MD5s	2012-04-17 23:17:28 -04:00
Eric Banks	6d03bce0d3	Important refactoring of the VQSR recal file format: we now use a VCF instead of a CSV file. The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration. For 1000G calling this was prohibitive in terms of memory requirements. Now we go through the rod system and pull in just the records we need at a given position. As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).	2012-04-17 22:38:18 -04:00
Eric Banks	ea793d8e27	Khalid pressured me into adding an integration test that makes sure we don't fail on reads with adjacent I and D events.	2012-04-17 21:21:29 -04:00
Mauricio Carneiro	46a212d8e9	Added "simplify reads" option to PrintReads.	2012-04-17 19:32:34 -04:00
Mauricio Carneiro	f0c81b59b0	Implementation of the new BQSR plotting infrastructure * removed low quality bases from the recalibration report. * refactored the Datum (Recal and Accuracy) class structure * created a new plotting csv table for optimized performance with the R script * added a datum object that carries the accuracy information (AccuracyDatum) for plotting * added mean reported quality score to all covariates * added QualityScore as a covariate for plotting purposes * added unit test to the key manager to operate with one required covariate and multiple optional covariates * integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)	2012-04-17 19:23:55 -04:00
Ryan Poplin	952280bef1	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-17 17:00:14 -04:00
Ryan Poplin	cf705f6c62	Adding read position rank sum test to the list of annotations that get produced with the HaplotypeCaller	2012-04-17 17:00:00 -04:00
Eric Banks	13c800417e	Handle NPE in UG indel code: deletions immediately preceding insertions were not handled well in the code.	2012-04-17 15:51:23 -04:00
Guillermo del Angel	c78b0eee3a	Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller.	2012-04-17 14:22:48 -04:00
Khalid Shakir	91cb654791	AggregateMetrics: - By porting from jython to java now accessible to Queue via automatic extension generation. - Better handling for problematic sample names by using PicardAggregationUtils. GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name. CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering. Added SelectHeaders walker for filtering headers for dbGAP submission. Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter. Latest WholeGenomePipeline. Other minor cleanup to utility methods.	2012-04-17 11:45:32 -04:00
Ryan Poplin	1a2e92f8db	Merged bug fix from Stable into Unstable	2012-04-17 10:23:05 -04:00
Ryan Poplin	adad76b36f	Fixing NPE in VQSR for the case of very small callsets.	2012-04-17 10:20:43 -04:00
Mark DePristo	3f6b2423d8	Update VE IT to reflect new fields and bugfixes	2012-04-13 17:00:37 -04:00
Mark DePristo	f9190b6fcd	VariantEvalUnitTest is better named VariantEvalWalkerUnitTest	2012-04-13 17:00:37 -04:00
Mark DePristo	23ccf772d4	IndelSummary now emits all of the underlying counts for ratios, percentages, etc it computes	2012-04-13 17:00:36 -04:00
Mark DePristo	84d1e8713a	Infrastructure for combining VariantEvaluations -- Not hooked up yet, so the output of VariantEval should be the same as before -- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines -- Better docs throughout	2012-04-13 17:00:36 -04:00
Mark DePristo	38986e4240	Documentation for StratificationManager	2012-04-13 17:00:36 -04:00
Mark DePristo	ab06d53867	Useful test constructor or Unit tests in RefMetaDataTracker	2012-04-13 17:00:36 -04:00
Mark DePristo	285e61a227	Bugfix for IndelSummary -- multi allelic count should be % not ratio	2012-04-13 17:00:35 -04:00
Mark DePristo	e6d5cb46d2	Improvements and bugfixes to IndelSummary -- Now properly includes both bi and multi-allelic variants. These are actually counted as well, and emitted as counts and % of sites with multiple alleles -- Bug fix for gold standard rate	2012-04-13 17:00:35 -04:00
Mark DePristo	bfa966a4e9	Bugfix for OneBPIndel -- Previously was only including 1 bp insertions in stratification	2012-04-13 17:00:35 -04:00
Mark DePristo	2aa2d9aec0	Merged bug fix from Stable into Unstable	2012-04-13 09:25:43 -04:00
Mark DePristo	27e7e17dc7	New way to handle exceptions in multi-threaded GATK -- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now. -- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer -- Better printing of stack traces in WalkerTest	2012-04-13 09:23:33 -04:00
Mark DePristo	e85e9a8cf5	More extensive testing of type of error thrown in multi-threaded walker test -- Unfortunately the result of the multi-threaded test is non-deterministic so run the test 10 times to see if the right exception is always thrown -- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs	2012-04-13 09:23:33 -04:00
Eric Banks	297afc7911	Added unit test to ensure that we genotype correctly cases with really large GLs	2012-04-12 15:43:14 -04:00
Eric Banks	818e8c2fb9	Resolving merge conflicts	2012-04-12 15:19:44 -04:00
Eric Banks	0dd571928d	Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones).	2012-04-12 15:16:29 -04:00
Eric Banks	f77a6d18b8	Bad conflict merge before	2012-04-12 09:56:49 -04:00
Eric Banks	33a8bdd75f	Resolving merge conflicts	2012-04-12 09:51:55 -04:00
Eric Banks	b659b16b31	Generate User Error for bad POS value	2012-04-12 09:49:35 -04:00
Eric Banks	cc71baf691	Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing.	2012-04-12 09:18:44 -04:00
Eric Banks	5bf9dd2def	A framework to get annotations working in the HaplotypeCaller (and ART walkers in general). Adding support for active-region-based annotation for most standard annotations. I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets) like e.g. the ReadPosRankSumTest. IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller). The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system. When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.	2012-04-11 16:22:12 -04:00
Guillermo del Angel	f9f8589692	Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller.	2012-04-11 13:56:51 -04:00
Eric Banks	5b7da3831f	Not sure why this didn't make it into the last push, but here's a working MD5 for the NDA annotation in UG	2012-04-11 13:49:50 -04:00
Eric Banks	7aa654d13f	New interface for some dev work that Ryan and I are doing; only accessible from private walkers right now	2012-04-11 13:49:09 -04:00
Eric Banks	dc90508104	Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful.	2012-04-11 13:47:10 -04:00
Eric Banks	d2142c3aa7	Adding integration test for Flag Stat	2012-04-10 22:40:38 -04:00
Eric Banks	f560611fe8	Merged bug fix from Stable into Unstable	2012-04-10 22:26:53 -04:00
Eric Banks	f46f7d0590	Fix the stats coming out of FlagStat. I will add an integration test in unstable	2012-04-10 22:26:10 -04:00
Mauricio Carneiro	cd842b650e	Optimizing DiagnoseTargets * Fixed output format to get a valid vcf * Optimized the per sample pileup routine O(n^2) => O(n) pileup for samples * Added support to overlapping intervals * Removed expand target functionality (for now) * Removed total depth (pointless metric)	2012-04-10 17:43:59 -04:00
Ryan Poplin	1df0adf862	Fixing ActivityProfile unit test.	2012-04-10 15:28:27 -04:00
Ryan Poplin	e3cc7cc59c	Resolving merge conflict.	2012-04-10 14:50:27 -04:00
Ryan Poplin	a4634624b7	There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function.	2012-04-10 14:48:23 -04:00
Eric Banks	10e74a71eb	We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior.	2012-04-10 12:30:35 -04:00
Mark DePristo	b43d21056b	Merged bug fix from Stable into Unstable	2012-04-10 09:42:09 -04:00
Mark DePristo	6885e2d065	UserException fixes for GATK_logs recent errors -- SamFileReader.java:525 -- BlockCompressedInputStream:376 These were both instances where we weren't catching and rethrowing picard exceptions as UserExceptions.	2012-04-10 07:37:42 -04:00
Mark DePristo	8507cd7440	Throw UserException for bad dict / chain files	2012-04-10 07:22:43 -04:00
Ryan Poplin	cd9bf1bfc3	Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs.	2012-04-10 00:22:40 -04:00
Roger Zurawicki	9ece93ae9c	DiagnoseTargets now outputs a VCF file - refactored the statistics classes - concurrent callable statuses by sample are now available. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-04-09 16:40:20 -04:00
Guillermo del Angel	719ec9144a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-09 14:53:19 -04:00
Guillermo del Angel	550179a1f7	Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors.	2012-04-09 14:53:05 -04:00
Eric Banks	f82986ee62	Adding unit tests for the very important log10sumLog10 util method.	2012-04-09 14:28:25 -04:00
Eric Banks	ea4300d583	Refactoring so that Unified Argument Collection doesn't use deprecated classes.	2012-04-09 13:45:17 -04:00
Eric Banks	6ddf2170b6	More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice.	2012-04-09 11:46:16 -04:00
Mauricio Carneiro	87e6bea6c1	Adding engine capability to quantize qualities. * Added parameter -qq to quantize qualities using a recalibration report * Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization. * Updated BQSR scripts to make use of the new parameters	2012-04-08 21:07:51 -04:00
Mark DePristo	c22a66870c	Modified UnitTests to respect reference padding	2012-04-06 16:27:20 -04:00
Mark DePristo	45fc0ea98d	Improvements to indel analysis capabilities of VariantEval -- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites -- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly: // - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a // downstream frameshift, if we make the simplifying assumptions that 3 bp ins // and 3bp del (adding/subtracting 1 AA in general) are roughly comparably // selected against, we should see a consistent 1+2 : 3 bp ratio for insertions // as for deletions, and certainly would expect consistency between in/dels that // multiple methods find and in/dels that are unique to one method (since deletions // are more common and the artifacts differ, it is probably worth looking at the totals, // overlaps and ratios for insertions and deletions separately in the methods // comparison and in this case don't even need to make the simplifying in = del functional assumption -- Added a new VEW argument to bind a gold standard track -- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do -- Deleted random unused functions in IndelUtils	2012-04-06 16:07:46 -04:00
Mark DePristo	52ef4a3e26	Function to compute whether a VariantContext indel is part of a TandemRepeat Returns true iff VC is a non-complex indel where every allele represents an expansion or contraction of a series of identical bases in the reference. The logic of this function is pretty simple. Take all of the non-null alleles in VC. For each insertion allele of n bases, check if that allele matches the next n reference bases. For each deletion allele of n bases, check if this matches the reference bases at n - 2 n, as it must necessarily match the first n bases. If this test returns true for all alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the base differences between the ref and alt alleles	2012-04-06 16:07:46 -04:00
Mark DePristo	08fab49d30	Added function to get bases from the current base forward in the window in ReferenceContext	2012-04-06 16:07:46 -04:00
Ryan Poplin	c77104b815	Adding function call in HaplotypeCaller right before the VariantContext gets written out to disk which partitions all the reads by which allele gave the read the highest likelihood. This will allow variants to be annotated by the refactored VariantAnnotator. Uninformative reads are mapped to Allele.NO_CALL	2012-04-06 00:22:52 -04:00
Mauricio Carneiro	a19c27297f	continuing the BQSR triage... * fixed the loading of the new reduced size reports * reduced BQSR scala script memory to 2Gb * removed dcov parameter from BQSR scala script * fixed estimatedQReported calculation from -log10(pe) to -10log10(pe). updated md5's with the proper PHRED scaled EstimatedQReported	2012-04-05 14:34:15 -04:00
Eric Banks	3561056a9c	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-05 10:49:26 -04:00
Eric Banks	5c3ddec4c2	Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update.	2012-04-05 10:49:08 -04:00
Mauricio Carneiro	7c3b3650bb	BQSR bug triage * fixed bug where some keys were using the same recal datum objects * fixed quantization qual calculations when combining multiple reports * fixed rounding error with empirical quality reported when combining reports * fixed combine routine in the gatk reports due to the primary keys being out of order * added auto-recalibration option to BQSR scala script * reduced the size of the recalibration report by ~15% * updated md5's	2012-04-05 09:32:18 -04:00
Eric Banks	2c956efa53	Minor fixups to GenotypeLikelihoods	2012-04-05 09:14:37 -04:00
Mauricio Carneiro	1e65474fec	Added utility to get the reference coordinate given the read coordinate	2012-04-05 09:04:20 -04:00
Guillermo del Angel	6913710e89	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-04 20:17:18 -04:00
Mark DePristo	76e4100d89	By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots -- Updated integration tests as well	2012-04-04 18:48:03 -04:00
Guillermo del Angel	820216dc68	More pool caller cleanups: Move common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation	2012-04-04 16:23:10 -04:00
Ryan Poplin	bfad26353a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-04 16:04:50 -04:00
Ryan Poplin	dda2173c66	Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned.	2012-04-04 16:04:29 -04:00
Mark DePristo	fcdd65a0f4	Bugfix for IndelLengthHistogram -- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.	2012-04-04 15:37:43 -04:00
Mark DePristo	1ccea866d8	VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses -- Updated EvalModules to work with new parameter -- adding test file for keepAC0 to public/testdata and integration tests	2012-04-04 15:37:12 -04:00
Eric Banks	9e32a975f8	Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore.	2012-04-04 13:47:59 -04:00
Eric Banks	337ff7887a	When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.	2012-04-04 10:57:05 -04:00
Guillermo del Angel	05d8400468	Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)	2012-04-03 20:51:24 -04:00
Guillermo del Angel	5a10f173ea	Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow)	2012-04-03 18:55:52 -04:00
Guillermo del Angel	5abb07da5d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-03 17:00:45 -04:00
Christopher Hartl	a6837d31d4	Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting; or at transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it.	2012-04-03 16:13:16 -04:00
Guillermo del Angel	63b1e737c6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-03 15:43:50 -04:00
Guillermo del Angel	9e11b4f9a7	Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.	2012-04-03 15:43:32 -04:00
Eric Banks	f9ce9962c4	Minor changes to verbose mode	2012-04-03 10:53:48 -04:00
Eric Banks	f6aa95685d	OutOfMemory exceptions are User Errors	2012-04-02 22:46:56 -04:00
Eric Banks	659b82e74d	Old -B syntax is long gone at this point. Safe to remove the warning.	2012-04-02 22:25:16 -04:00
Eric Banks	326220c91c	Removing extended event related unit tests	2012-04-02 14:40:36 -04:00
Eric Banks	99d27ddcc4	Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now.	2012-04-02 14:27:36 -04:00
Mark DePristo	6b7a00061a	VariantsToTable now works with multiple input VCFs	2012-04-02 09:13:35 -04:00
Mark DePristo	4f73ea902f	Final update for VE. VCFStreaming wasn't yet updated	2012-03-30 21:52:01 -04:00
Mark DePristo	fbbb8509ad	Final commits to VariantEval -- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to. -- Cleanup code, reorganize a bit more. -- Fix for broken integrationtests	2012-03-30 20:11:06 -04:00
Mark DePristo	4b45a2c99d	Final version of new VariantEval infrastructure. * WAY FASTER * -- 3x performance for multiple sample analysis with 1000 samples -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2 -- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval -- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also. -- No longer allow Evaluators to use private and protected variables at @DataPoints. You get an error if you do. -- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter. -- Commented out GenotypePhasingEvaluator, as it uses the retired TableType -- Stratifications are all fully typed, so it's easy for GATKReports to format them. -- Removed old VE work around from GATKReportColumn -- General code cleanup throughout -- Updated integration tests	2012-03-30 15:31:56 -04:00
Mark DePristo	8c0718a7c9	Fixed missing import	2012-03-30 15:31:55 -04:00
Mark DePristo	976bac0452	BaseTest now has a global variable to turn off network connection requirement	2012-03-30 15:31:55 -04:00
Mark DePristo	097ed4ecc4	Memory usage optimizations and safety improvements to StratNode and StratificationManager -- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable Hashmaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users. -- Added ability of a stratification to specify incompatible evaluation. The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals	2012-03-30 15:31:55 -04:00
Mark DePristo	b335c22f6d	Fully refactored, mostly cleaned up version of VariantEval using StratificationManager	2012-03-30 15:31:55 -04:00
Mark DePristo	c8086a79e3	New StratificationManager based VariantEval passes unmodified integration tests -- Now needs cleanup and optimizations	2012-03-30 15:31:55 -04:00
Mark DePristo	d37f31e349	First version of VariantEval that runs (approximately correctly) with new StratificationManager	2012-03-30 15:31:54 -04:00
Mark DePristo	8971b54b21	Phase II of Stratification manager -- Renamed and reorganized infrastructure -- StratificationManager now a Map from List<Object> -> V. All key functions are implemented. Less commonly used TODO -- Ready for hookup to VE	2012-03-30 15:31:54 -04:00
Mark DePristo	9f1cd0ff66	Lots of new functionality for StratificationStates manager -- Really working according to unit tests -- A nCombination utils	2012-03-30 15:31:54 -04:00
Mark DePristo	a3d896d80e	Part I of creating a fast state space lookup for VE -- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvalutionContexts for update in map. -- Minor code cleanup throughout VE (removing unused headers, for example)	2012-03-30 15:31:53 -04:00

2276 Commits (1fafd9f6c8b33271194fa3aaf6a5b05e73febb3b)