gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	e02ec8c8b6	Don't update the record ID unless we are actually going to emit the record	2012-06-04 14:58:50 -04:00
Eric Banks	8405156ae1	Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities.	2012-06-04 14:28:32 -04:00
Ryan Poplin	f11e7ebc3a	Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA.	2012-06-04 12:49:36 -04:00
Ryan Poplin	320956ee4b	Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage.	2012-06-04 10:55:36 -04:00
Guillermo del Angel	7a54baf08c	Merged bug fix from Stable into Unstable	2012-06-03 08:42:08 -04:00
Guillermo del Angel	47df7bbc14	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-06-03 08:38:54 -04:00
Guillermo del Angel	2ddbdee3bc	Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow	2012-06-03 08:38:38 -04:00
Mauricio Carneiro	12a8c54f9a	Fixing VCF header for filter elements (thanks Eric)	2012-06-01 15:45:15 -04:00
Eric Banks	3a15ba2102	Malformed VCF headers should be User Errors	2012-05-31 16:05:53 -04:00
Khalid Shakir	c4f7df4dce	When an underlying exception occurs because of the user error, if the exception instance does not include a message instead of telling the user "because null", tell them "because <exception class name>".	2012-05-30 16:39:06 -04:00
Ryan Poplin	421d0d1435	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-30 15:21:35 -04:00
Ryan Poplin	5dd811f84a	Adding genotype given alleles mode to the HaplotypeCaller.	2012-05-30 15:07:01 -04:00
Eric Banks	d09b8d5584	Fixing docs	2012-05-30 13:24:08 -04:00
Mauricio Carneiro	d6e1205310	Updating default values for DiagnoseTargets	2012-05-30 12:43:07 -04:00
Khalid Shakir	c3c7f17d90	Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000. Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the non-existent dir, now checking to see if the network directory is available and throwing a SkipException to bypass the test when it cannot be run. TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors pop up later when the networked fastas aren't found.	2012-05-29 11:18:22 -04:00
Roger Zurawicki	b8b139841d	DiagnoseTargets with working Q1,Median,Q3 - Merged Roger's metrics with Mauricio's optimizations - Added Stats for DiagnoseTargets - now has functions to find the median depth, and upper/lower quartile - the REF_N callable status is implemented - The walker now runs efficiently - Diagnose Targets accepts overlapping intervals - Diagnose Targets now checks for bad mates - The read mates are checked in a memory efficient manner - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker. - Fixed some bugs - Removed rod binding - Added more Unit tests - Test callable statuses on the locus level - Test bad mates - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-05-29 10:16:45 -04:00
Eric Banks	50031b63c5	Fix possible NPE from NBaseCount annotation module	2012-05-29 09:46:00 -04:00
Mark DePristo	08de4dfd96	Missed one integration test	2012-05-29 07:23:24 -04:00
Mark DePristo	454c8e63e6	Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s -- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec. As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)	2012-05-28 20:20:05 -04:00
Mark DePristo	7ce24a96f1	PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present	2012-05-28 20:18:16 -04:00
Mark DePristo	1818c29371	Fixed long-standing bug in beagle codec that was passing on the header record for decoding	2012-05-28 20:17:26 -04:00
Mark DePristo	06b02e1b9b	Update MD5s to reflect new limited output of DiffObjectsWalkers -- Also updated GQ change in VCFIntegrationTest	2012-05-27 11:20:47 -04:00
Mark DePristo	5894d045cb	Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests -- This version of BCF should actually work properly for most files, assuming headers are properly defined. -- Lots of bug fixes to BCF2 codec -- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL. NOTE THIS SEMANTICS CHANGE -- Equals() method for GenotypeLikelihoods, using PLs. -- VCFCodec no longer adds empty bindings to missing input field values. NOTE THIS CHANGE -- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work. The BCF2 codec now makes VCs marked as fully decoded -- stringToBytes returns empty list for null or "" string in BCF2Encoder -- Proper handling of genotype ordering in BCF2 reader / writer -- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily -- Many failing MD5s now due to double -> int change in GQ, will update later	2012-05-27 11:17:17 -04:00
Mark DePristo	86e5a066fc	Even more conservative limit on number of differences to summarize at 1000	2012-05-27 11:17:13 -04:00
Mark DePristo	31f4e5b52e	Stop unlimited runtimes in DiffEngine when you have lots of differences -- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in an n^2 algorithm running with n > 1,000,000	2012-05-27 11:17:13 -04:00
Guillermo del Angel	a6ee4f98b5	Yet More missing md5's	2012-05-25 17:21:47 -04:00
Mauricio Carneiro	4109fcbb08	Merged bug fix from Stable into Unstable	2012-05-25 13:03:05 -04:00
Mauricio Carneiro	2be5704a25	Fixed haplotype boundary bug in PairHMMIndelErrorModel: haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception). * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate. * removed unnecessary code to remove hard clips after processing. * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).	2012-05-25 13:00:45 -04:00
Guillermo del Angel	175bb35e70	Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former)	2012-05-25 12:56:23 -04:00
Mark DePristo	d6df817174	Oops, don't enable shadow BCF tests	2012-05-24 13:31:13 -04:00
Mark DePristo	0a86564669	Updated test files didn't make it into last push	2012-05-24 13:29:44 -04:00
Mark DePristo	7280cdf937	Bugfixes and testdata cleanup -- Cut down the size of a few large files in public/testdata that were only used in part -- Refactor vcf Filename => shadow BCF filename to BCF2Utils. Fix bug in WalkerTest due to the way this was handled previously	2012-05-24 13:26:05 -04:00
Mark DePristo	e9c22b9aad	Final updates to integration tests for BCF2 -- Fully working version -- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf -- Moved MedianUnitTest to its proper home in Utils -- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests. From this website it's easy to see md5 diffs, etc. This is a vastly better way to manage unit and integration test output	2012-05-24 10:58:59 -04:00
Mark DePristo	ade1843818	Bugfix for not setting header in AbstractVCFCodec	2012-05-24 10:58:58 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Mark DePristo	69ee4d0454	Moved getMetaDataForField to VariantContextUtils	2012-05-24 10:57:09 -04:00
Mark DePristo	cb13f16e90	WalkerTest infrastructure to generate and test shadowBCF file for every generated VCF file -- Currently disabled	2012-05-24 10:57:09 -04:00
Mark DePristo	f77d2e6965	Renamed NO_HEADER to the more accurate no_cmdline_in_header -- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well	2012-05-24 10:57:08 -04:00
Mark DePristo	4bde24f020	Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header	2012-05-24 10:57:08 -04:00
Mark DePristo	4846bf5c8e	@Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf -- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically	2012-05-24 10:57:07 -04:00
Mark DePristo	bb0d87666a	Finally just deleted equals() method in GATKArgumentCollection. -- We never compare these things in the codebase anyway...	2012-05-24 10:57:07 -04:00
Mark DePristo	6f469305ab	Don't try to share BCF2 yet	2012-05-24 10:57:06 -04:00
Mark DePristo	c8ed0bfc4c	Edge case fixes for BCF2 -- handle entirely missing GT in a sample in decodeGenotypeAlleles -- Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code -- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples -- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys -- Removed restriction on getPloidy requiring ploidy > 1. It's logically fine to return 0 for a no called sample -- getMaxPloidy() in VC that does what it says -- Support for padding / depadding of generic genotype fields	2012-05-24 10:57:06 -04:00
Mark DePristo	40431890be	-- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it -- BCF2 writer can now work without the contig lines being in the header -- Made GenomeLocParser a final class	2012-05-24 10:57:06 -04:00
Mark DePristo	6301572009	GenotypeLikelihood PLs are capped at Short.MAX_INT now -- UserExceptions in BCF2 now where appropriate -- Asserts for code safety -- Public -> protected encode(Object v) method is for testing only	2012-05-24 10:57:06 -04:00
Mark DePristo	d52bc31a47	Bugfix for doNotWriteGenotypes mode -- Was outputting GT ./. in sites only mode. Fixed	2012-05-24 10:57:05 -04:00
Mark DePristo	64d4238e2f	99% working version of BCF2 encoder / decoder -- fixed final bugs with PL encoding / decoding -- Ready for testing by other members of the group -- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations -- Fixed a nasty bug in the filter field -- Note that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types -- Read 1500 genotypes file in VCF -> VCF: 11 seconds; Read 1500 genotypes file in VCF -> BCF: 9.5 seconds; VariantEval 1500 genotypes file in VCF: 3 seconds; VariantEval 1500 genotypes file in BCF: 3 seconds	2012-05-24 10:57:05 -04:00
Mark DePristo	b5bce8d3f9	AD should be UNBOUNDED, actually -- Pass in # alt alleles as appropriate for getCount in VCF header line	2012-05-24 10:57:05 -04:00
Mark DePristo	aaf11f00e3	Near final BCF2 implementation -- Trivial import changes in some walkers -- SelectVariants has a new hidden mode to fully decode a VCF file -- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED but A type, which is actually the right type -- GenotypeLikelihoods now implements List<Double> for convenience. The PL duality here is going to be removed in a subsequent commit -- BugFixes in BCF2Writer. Proper handling of padding. Bugfix for nFields for a field -- padAllele function in VariantContextUtils -- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.	2012-05-24 10:57:02 -04:00
Mark DePristo	dfee17a672	Generalize / unify code for handling strings -- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder. -- Unified the type conversion code in BCFWriter to simplify the mapping from VCF type => BCF type and special value recoding -- Code cleanup and renaming	2012-05-24 10:57:02 -04:00
Mark DePristo	b4a5acd6f4	Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't	2012-05-24 10:57:01 -04:00
Mark DePristo	373ae39e86	Testing of BCF codec -- Rev.d tribble -- Minor code cleanup -- BCF2 encoder / decoder use Double not Float internally everywhere -- Generalized VC testing framework	2012-05-24 10:57:01 -04:00
Mark DePristo	fb1911a1b6	-- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder -- Convenience routine for creating alleles from strings of bases -- Convenience constructor for VCFFilterHeader line whose description is the same as name -- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes. Can be reused throughout code for BCF, VCF, etc. -- Created basic BCF2WriterCodec tests that consume VariantContextTestProvider contexts, write them to disk with BCF2 writer, and check that they come back equal to the original VariantContexts. Actually worked for some complex tests in the first go	2012-05-24 10:57:01 -04:00
Mark DePristo	4968dcd36a	Throw an error when genotype fields with mixed vector lengths are encountered	2012-05-24 10:57:00 -04:00
Mark DePristo	afd2f1a3f9	Individual VariantContextWriters are now package protected -- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it -- Update build.xml to build vcf.jar with updated paths and bcf2 support.	2012-05-24 10:57:00 -04:00
Mark DePristo	24864fd5b0	GATK now writes BCF output to any file with .bcf extension -- Moved VCF and BCF writers to variantcontext.writers -- Updated vcf.jar build path -- Refactored VCFWriter and other code. Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory. Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.	2012-05-24 10:57:00 -04:00
Mark DePristo	e2311294c0	Removed unused ManualSortingVCFWriter	2012-05-24 10:56:59 -04:00
Mark DePristo	93cef82637	BCF2 header encoding decoding at final spec	2012-05-24 10:56:58 -04:00
Mark DePristo	ce9e9eebb1	No dictionary in header. Now built dynamically from the header in the writer and codec -- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there	2012-05-24 10:56:58 -04:00
Mark DePristo	f0b081a85f	Update VCF.jar loading test -- to reflect new path to VCFWriter	2012-05-24 10:56:58 -04:00
Mark DePristo	c3b8048e2e	Moving around classes in VCF and BCF2 -- Refactored VCF writers into vcf.writers package -- Moved BCF2Writer to bcf2.writer -- Updates to all of the walkers using VCFWriter to reflect new packages -- A large number of files had their headers cleaned up because of this as well	2012-05-24 10:56:58 -04:00
Mark DePristo	679ffdd333	Move BCF2 from private utils to public codecs	2012-05-24 10:56:56 -04:00
Mark DePristo	450f098a61	BCF2 encoder / decoder implement new site / genotype block organization -- Supports final organization of data blocks into sites data and genotypes data	2012-05-24 10:56:55 -04:00
Mark DePristo	27b51d4dea	Enable on the fly indexing of BCF2	2012-05-24 10:56:54 -04:00
Mark DePristo	81bd7646d6	Fix for MISSING floats -- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int). -- Now handles correctly MISSING qual fields	2012-05-24 10:56:53 -04:00
Mark DePristo	3afbc50511	More BCF2 improvements -- Refactored setting of contigs from VCFWriterStub to VCFUtils. Necessary for proper BCF working -- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order. -- Cleaned up VCFHeader operations -- BCF now uses the right header files correctly when encoding / decoding contigs -- Clean up unused tools -- Refactored header parsing routines to make them more accessible -- More minor header changes from Intellij	2012-05-24 10:56:52 -04:00
Mark DePristo	0799855479	Archiving GCF -- Rider update to CramByPiece.scala	2012-05-24 10:56:51 -04:00
Guillermo del Angel	43919078cd	Merged bug fix from Stable into Unstable	2012-05-23 21:21:01 -04:00
Guillermo del Angel	4bc04e2a9e	Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler.	2012-05-23 21:19:30 -04:00
Ryan Poplin	08dfd6cab6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-21 16:47:07 -04:00
Ryan Poplin	04000d920c	Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads.	2012-05-21 16:46:59 -04:00
Eric Banks	666862af19	Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs	2012-05-21 16:03:29 -04:00
Khalid Shakir	e57cd78bba	Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each. This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource. Ex: public Wrapper getNewWrapper(File path) { FileStream myStream = new FileStream(path); /* This stream must be eventually closed. */ return new Wrapper(myStream); } public void close(Wrapper wrapper) { wrapper.close(); /* If wrapper.close() does nothing, NO ONE else has a reference to close myStream. */ }	2012-05-21 15:41:56 -04:00
Eric Banks	7f5ec17d22	Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table.	2012-05-21 14:16:13 -04:00
Eric Banks	92d8aa3d4c	Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels	2012-05-21 09:38:52 -04:00
Eric Banks	3af3834d50	Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian): * For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE. Switched over to booleans. * We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to stdout.	2012-05-18 11:55:41 -04:00
Eric Banks	26968ae8eb	Forgot that the VCFStreamingIntegrationTest uses VE	2012-05-18 02:51:53 -04:00
Eric Banks	52c206d5db	Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports.	2012-05-18 02:32:20 -04:00
Eric Banks	03d40272c8	Removed old GATKReport code and moved the new stuff in its place.	2012-05-18 01:44:31 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Guillermo del Angel	5189b06468	New annotation for indels that describe if they're STR's and their characteristics. If an indel is a STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality.	2012-05-17 15:28:19 -04:00
Eric Banks	0f7c917e7a	Better error checking and messages for bad alleles	2012-05-17 13:36:42 -04:00
Eric Banks	d44886d9e8	Very naughty bug: VE output is not at all gatherable but no one told this to Queue. Fixed.	2012-05-15 10:29:04 -04:00
Eric Banks	819c3d0c15	Adding to the Hrun docs	2012-05-15 10:27:52 -04:00
Guillermo del Angel	5fc3adbb04	One more VariantsToTable bug fix	2012-05-14 14:10:07 -04:00
Guillermo del Angel	04d691f04a	Forgot to update MD5's due to new Exact AF model in pool caller (all changes legit, minor QUAL/QD/SB differences). Fixed bug in VariantsToTable from previous commit	2012-05-14 14:01:29 -04:00
Guillermo del Angel	ae26f0fe14	a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing	2012-05-14 10:55:35 -04:00
Ryan Poplin	c9dd0f3173	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-10 13:09:10 -04:00
Ryan Poplin	0cdadffe14	Committing the best of the frantic pre-CSHL experiments: Better algorithm for partitioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes.	2012-05-10 13:09:03 -04:00
Guillermo del Angel	89f8a6b2e6	Revert bad part of last commit that shouldn't have been pushed	2012-05-10 10:41:08 -04:00
Guillermo del Angel	27b1aa5dd3	Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.acceptableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's	2012-05-10 10:29:19 -04:00
Eric Banks	4f37d6d399	Fixing docs	2012-05-10 00:56:00 -04:00
Mark DePristo	c81acfc15d	Working implementation of BCF2 -- Nearly complete on spec implementation. Slow but clean -- Some refactoring of VariantContext to support common functions for BCF and VCF	2012-05-08 19:46:51 -04:00
Mark DePristo	a5193c2399	Mostly complete reference implementation of BCF2 -- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF	2012-05-08 19:46:51 -04:00
Eric Banks	473d07b0c5	fixing up docs from previous Pool Caller commit	2012-05-08 11:02:55 -04:00
Eric Banks	b4999d14c1	updating docs	2012-05-08 10:58:46 -04:00
Guillermo del Angel	33a1dd2048	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 10:42:12 -04:00
Eric Banks	5cf4fd63c2	Catch malformed base qualities and throw as a User Error	2012-05-08 09:34:57 -04:00
Guillermo del Angel	a4f4b5007b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 09:34:33 -04:00
Guillermo del Angel	605984353f	Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model.	2012-05-08 09:33:38 -04:00
Eric Banks	c40cda7e3c	Nope, loads of integration tests had to be changed.	2012-05-07 14:30:42 -04:00
Eric Banks	66838a073e	Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed...	2012-05-07 12:20:11 -04:00
David Roazen	6b769e91d8	BCF2: third checkpoint * writer mostly implemented * walkers to convert BCF2 <-> VCF * almost working for sites-only files; genotypes still need work * initial performance tests this afternoon will be on sites-only files	2012-05-04 13:00:15 -04:00
Eric Banks	f3433201b1	Merged bug fix from Stable into Unstable	2012-05-03 11:11:00 -04:00
Eric Banks	557da77a1a	Don't compute QD if there is no QUAL; added integration test for this	2012-05-03 11:02:37 -04:00
Eric Banks	1fc7b5d58b	Merged bug fix from Stable into Unstable	2012-05-03 10:37:58 -04:00
Laurent Francioli	567d01cee8	- Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:49 -04:00
Laurent Francioli	96e5a26223	PED support for Inbreeding Coefficient annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:20 -04:00
Mark DePristo	43d97c2e00	Rev Tribble to r97, adding binary feature support. From tribble logs: Binary feature support in tribble -- Massive refactoring and cleanup -- Many bug fixes throughout -- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream as an argument not a String -- See ExampleBinaryCodec for an example binary codec -- AbstractAsciiFeatureCodec provides to its subclass the same String decode, readHeader functionality before. Old ASCII codecs should inherit from this base class, and will work without additional modifications -- Split AsciiLineReader into a position tracking stream (PositionalBufferedStream). The new AsciiLineReader takes as an argument a PositionalBufferedStream and provides the readLine() functionality of before. Could potentially use optimizations (it's a TODO in the code) -- The Positional interface includes some more functionality that's now necessary to support the more general decoding of binary features -- FeatureReaders now work using the general FeatureCodec interface, so they can index binary features -- Bugfixes to LinearIndexCreator off by 1 error in setting the end block position -- Deleted VariantType, since this wasn't used anywhere and it's a particularly clean way of thinking about the problem -- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package -- TabixReader requires an AsciiFeatureCodec as it's currently only implemented to handle line oriented records -- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles Ascii and binary features -- Removed unused functions here and there as encountered -- Fixed build.xml to be truly headless -- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a value and the position in the file where the header ends (not inclusive). TribbleReaders now skip the header if the position is set, so it's no longer necessary, if one implements the general readHeader(PositionalBufferedStream) version, to see header lines in the decode functions. Necessary for binary codecs but a nice side benefit for ascii codecs as well -- Cleaned up the IndexFactory interface so there's a truly general createIndex function that takes the enumerated index type. Added a writeIndex() function that writes an index to disk. -- Vastly expanded the index unit tests and reader tests to really test linear, interval, and tabix indexed files. Updated test.bed, and created a tabix version of it as well. -- Significant BinaryFeaturesTest suite. -- Some test files have indent changes	2012-05-03 07:31:48 -04:00
Mark DePristo	58c470a6c5	Rev'ing Tribble from 53 to 94 -- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code -- Integration tests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase	2012-05-03 07:31:47 -04:00
Eric Banks	e448cfcc59	Forgot to update these md5s	2012-05-02 21:09:50 -04:00
Khalid Shakir	b8b7f28aa9	Revving Picard to pick up new SamFileHeaderMerger. Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut(). In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124. - Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs. - Updated md5s to match new expectations after looking at TLEN diff engine output.	2012-05-02 16:47:28 -04:00
Mauricio Carneiro	f51a1d0d61	Better error message in the BAMScheduler for the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.	2012-05-02 16:10:00 -04:00
Mauricio Carneiro	940029fa5d	Fixing on-the-fly recalibration (caught by Ryan): low quality bases in the tails were being turned to N's in the final read.	2012-05-02 16:06:04 -04:00
Eric Banks	623b36fbc4	Add header lines for AC,AF, and AN tags	2012-05-02 15:33:34 -04:00
Guillermo del Angel	429800a192	Fix corner case rounding issue in MathUtils unit test: 10^(logFactorial(4)) was 23.999999... which if cast directly yielded 23 - so, do pre-rounding to ensure correct integer result if caller will cast value.	2012-05-02 09:57:06 -04:00
Guillermo del Angel	76a95fdedf	Full implementation of multiallelic exact model for pools. Still super-linear so not usable at scale but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation which is more expensive (doesn't seem to save much time in either PoolCaller or UG though).	2012-05-02 09:24:28 -04:00
Joel Thibault	4d732fa586	Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb	2012-05-01 18:23:51 -04:00
Eric Banks	619a69a5f1	As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?).	2012-05-01 16:18:24 -04:00
Joel Thibault	c255dd5917	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:10:38 -04:00
Ryan Poplin	51af61b5d7	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:07:23 -04:00
Ryan Poplin	fc55dcec3c	Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now.	2012-05-01 16:02:36 -04:00
Ryan Poplin	20a0078f23	Merging active regions across shard boundaries if they are contiguous, have the same active status and don't grow too big.	2012-05-01 15:51:36 -04:00
Eric Banks	0f3af9555b	Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case.	2012-05-01 14:58:06 -04:00
Joel Thibault	aa4d41cce0	Minor cleanup before push	2012-05-01 14:16:44 -04:00
Joel Thibault	b101b9c30b	Add Mongo switch	2012-05-01 14:00:48 -04:00
Joel Thibault	1b609e9075	Move Mongo to server couchdb	2012-05-01 13:59:47 -04:00
Joel Thibault	fd57d27f45	Move MongoDB connection handling to a separate class	2012-05-01 13:59:37 -04:00
Joel Thibault	db3cd1abd5	Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes.	2012-05-01 13:57:23 -04:00
Joel Thibault	04e1be9106	Better handling of Mongo errors + exceptions	2012-05-01 13:57:23 -04:00
Joel Thibault	ca737479cf	Query for stop locations because we don't have that information in the reference	2012-05-01 13:57:23 -04:00
Joel Thibault	1cda87a4ad	Set ROD priority list to input	2012-05-01 13:57:23 -04:00
Joel Thibault	a7fe847faf	Set the priority list and don't bother combining if not needed	2012-05-01 13:57:23 -04:00
Joel Thibault	f739305f43	Combine the variants found at a location	2012-05-01 13:57:23 -04:00
Joel Thibault	020f884d5a	Use new key of source ROD plus alleles	2012-05-01 13:57:23 -04:00
Joel Thibault	221ce9c3d6	Add alleles to the primary key	2012-05-01 13:57:23 -04:00
Joel Thibault	3198ce5471	Can have multiple variants at a location	2012-05-01 13:57:22 -04:00
Joel Thibault	11ed8e61c9	Add referenceBaseForIndel to the Mongo VariantContext objects	2012-05-01 13:53:44 -04:00
Joel Thibault	7ed0ee7ed0	Skip locations with no genotypes instead of throwing a NPE	2012-05-01 13:53:44 -04:00
Joel Thibault	4bdfeacdaa	Handle multiple samples/genotypes per location TODO: sample selection	2012-05-01 13:53:43 -04:00
Joel Thibault	1f7c628796	Insert the ROD filename into MongoDB as part of the primary key	2012-05-01 13:53:43 -04:00
Joel Thibault	bb8a6e9b0a	Initial test of write and read from MongoDB	2012-05-01 13:53:43 -04:00
David Roazen	c0084c741b	Pilot BCF2 Implementation: Checkpointing the code * Not working yet, still very much a work-in-progress with lots of placeholders * Needed to check this in to enable possible collaboration, since it's going slower than anticipated and the conference deadline looms.	2012-05-01 12:23:10 -04:00
Eric Banks	0c8e801021	Removing public to private dependency	2012-05-01 11:04:11 -04:00
Eric Banks	e964d17518	Removing public to private dependency	2012-05-01 11:02:28 -04:00
Mauricio Carneiro	462450c3e3	disabling all BQSR unit tests with the changes to the cycle covariate, some tests need updates, others need to be completely re-written.	2012-04-30 14:39:55 -04:00
Guillermo del Angel	e185632013	Exhaustive unit tests for Pool SNP genotype likelihoods: a) Add ability for ErrorModel to be specified by external log-probability vector for testing. b) For a given depth and ploidy(=2*samples/pool), create artificial high quality pileup testing from AC=0 to AC=ploidy, and test that pool GL's have expected content.Misc. refactorings and cleanups c) Misc. cleanups and beautification.	2012-04-30 14:29:46 -04:00
Christopher Hartl	7d029b9a28	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-30 12:16:30 -04:00
Christopher Hartl	944a7d815e	Bringing VQSRV3 up to date. Lots of new features (un-classifying the worst-performing training sites, treating the x% best/worst sites as positive/negative points, ability to pass in a monomorphic track to see ROC curves output). Minor changes to AlleleBalance: weighted average was incorrectly specified (using logscale actually biased the average towards the AB of low-quality genotypes), and breaking out AB by het, hom, and diploid to bring it in line with some (private) changes to the indel likelihood model that (correctly) computes these values for indels.	2012-04-28 11:31:03 -04:00
Ryan Poplin	54a9bc2da2	Bug fix in reverse trim alleles for the case of mixed records that become non-mixed after subsetting the alleles.	2012-04-28 09:12:26 -04:00

2235 Commits (5ec737f008b8906c5cf7f5e7ddf7fd75e2ae4fbb)