gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Ryan Poplin	693bfac341	Bug fix for missing annotations in CombineReferenceCalculationVariants. They were being dropped in the handoff between engines in a couple of places. -- Updated single sample pipeline test data using Valentin's files and re-enabled CRCV tests	2014-02-05 12:58:48 -05:00
Eric Banks	740b33acbb	We were never validating the sequence dictionary of tabix indexed VCFs for some reason. Fixed. These changes happened in Tribble, but Joel clobbered them with his commit. We can now change the logging priority on failures to validate the sequence dictionary to WARN. Thanks to Tim F for indirectly pointing this out.	2014-02-05 10:12:38 -05:00
Eric Banks	9cac24d1e6	Moving logging status of VCF indexing to DEBUG instead of INFO, otherwise it's painful when reading in lots of files	2014-02-05 10:12:37 -05:00
Eric Banks	91bdf069d3	Some updates to CRCV. 1. Throw a user error when the input data for a given genotype does not contain PLs. 2. Add VCF header line for --dbsnp input 3. Need to check that the UG result is not null 4. Don't error out at positions with no gVCFs (which is possible when using a dbSNP rod)	2014-02-05 10:12:37 -05:00
Joel Thibault	7923e786e9	Rev Picard (public) to 1.107.1676 - Rename snappy to snappy-java - Add maven-metadata-local.xml to .gitignore	2014-02-04 22:04:28 -05:00
Joel Thibault	0025fe190d	Exclude sam's older TestNG	2014-02-04 22:04:27 -05:00
Karthik Gururaj	24f8aef344	Contains profiling, exception tracking, PAPI code Contains Sandbox Java	2014-02-04 16:27:29 -08:00
David Roazen	76086f30b7	Temporarily disable tests that started failing post-maven Joel is working on these failures in a separate branch. Since maven (currently! we're working on this..) won't run the whole test suite to completion if there's a failure early on, we need to temporarily disable these tests in order to allow group members to run tests on their branches again.	2014-02-04 15:31:24 -05:00
David Roazen	3b2f07990d	Re-break the MWUnitTest for Joel to debug	2014-02-04 15:19:09 -05:00
David Roazen	c9032f0b5c	Fix failing unit tests	2014-02-04 03:05:30 -05:00
Khalid Shakir	a4289711e2	Distinct failsafe summary reports, just like invoker report directories.	2014-02-03 13:50:47 -05:00
Khalid Shakir	857e6e0d6f	Bumped version to 2.8-SNAPSHOT, using new update_pom_versions.sh script.	2014-02-03 13:50:46 -05:00
Khalid Shakir	9ca3004fc3	Setting the test-utils' type to test-jar, such that the multi-module build uses testClasses instead of classes as a directory dependency.	2014-02-03 13:50:46 -05:00
Khalid Shakir	de13f41fc3	One step closer to a proper test-utils artifact. Using the maven-jar-plugin to create a test classifier, excluding actual tests, until we can properly separate the classes into separate artifacts/modules.	2014-02-03 13:50:46 -05:00
Khalid Shakir	25aee7164e	Fixed missing "mvn" command execution in ant-bridge. Added pom.xml workarounds for duplicate classpath error, due to gatk-framework dependency containing required BaseTest, and jarred UnitTest/IntegrationTest classes that also exist as files under target/test-classes.	2014-02-03 13:50:46 -05:00
Khalid Shakir	caa76cdac4	Added maven pom.xmls for various artifacts.	2014-02-03 13:50:46 -05:00
Khalid Shakir	d1a689af33	Added new utility files used by maven build, including the ant-bridge script.	2014-02-03 13:50:46 -05:00
Khalid Shakir	88150e0166	Switched committed dependency repository from ivy to maven.	2014-02-03 13:50:46 -05:00
Khalid Shakir	1e25a758f5	Moved files to maven directories. Here are the git moved directories in case other files need to be moved during a merge: git-mv private/java/src/ private/gatk-private/src/main/java/ git-mv private/R/scripts/ private/gatk-private/src/main/resources/ git-mv private/java/test/ private/gatk-private/src/test/java/ git-mv private/testdata/ private/gatk-private/src/test/resources/ git-mv private/scala/qscript/ private/queue-private/src/main/qscripts/ git-mv private/scala/src/ private/queue-private/src/main/scala/ git-mv protected/java/src/ protected/gatk-protected/src/main/java/ git-mv protected/java/test/ protected/gatk-protected/src/test/java/ git-mv public/java/src/ public/gatk-framework/src/main/java/ git-mv public/java/test/ public/gatk-framework/src/test/java/ git-mv public/testdata/ public/gatk-framework/src/test/resources/ git-mv public/scala/qscript/ public/queue-framework/src/main/qscripts/ git-mv public/scala/src/ public/queue-framework/src/main/scala/ git-mv public/scala/test/ public/queue-framework/src/test/scala/	2014-02-03 13:50:44 -05:00
Khalid Shakir	faaef236ea	Moved gsalib, R and other resources, Queue GATK extensions generator, Queue version java files.	2014-02-03 13:49:21 -05:00
Khalid Shakir	eb52dc6a9b	Moved build.xml, ivy.xml, ivysettings.xml, ivy properties, public/packages/*.xml into private/archive/ant	2014-02-03 13:49:20 -05:00
Karthik Gururaj	6d4d776633	Includes code for all debug code for obtaining profiling info	2014-01-30 12:08:06 -08:00
Valentin Ruano-Rubio	89c4e57478	gVCF <NON_REF> in all VCF lines, including variant ones, when –ERC gVCF is requested. Changes: ------- The <NON_REF> likelihood at variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for each read it is calculated as the second-best likelihood amongst the reported alleles. When –ERC gVCF is requested, stand_conf_emit and stand_conf_call are forcibly set to 0. Also, dontGenotype is set to false for consistency's sake. Integration test MD5s have been changed accordingly. Additional fix: -------------- Especially after adding the <NON_REF> allele, but also without it, QUAL values tended to go to 0 (very large integer number in log 10) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that, combineGLs has been replaced by combineGLsPrecise, which uses the log-sum-exp trick. In just a few cases this change results in genotype changes in integration tests, but after double-checking with unit tests and the difference between combineGLs and combineGLsPrecise in the affected integration tests, the previous GT calls were either borderline cases and/or due to the underflow.	2014-01-30 11:23:33 -05:00
Karthik Gururaj	5c7427e48c	Temporary commit containing debug profiling code - commented out	2014-01-29 12:10:29 -08:00
Karthik Gururaj	0c63d6264f	1. Added synchronization block around loadLibrary in VectorLoglessPairHMM 2. Edited Makefile to use static libraries where possible	2014-01-27 15:34:58 -08:00
Karthik Gururaj	a15137a667	Modified run.sh	2014-01-27 14:56:46 -08:00
Karthik Gururaj	2c0d70c863	Moved vector JNI code to public/c++/VectorPairHMM	2014-01-27 14:52:59 -08:00
Karthik Gururaj	85a748860e	1. Added more profiling code 2. Modified JNI_README	2014-01-27 14:32:44 -08:00
Valentin Ruano-Rubio	748d2fdf92	Added Integration test to verify the bugs are not there anymore as reported in pivotracker	2014-01-26 23:29:31 -05:00
Karthik Gururaj	018e9e2c5f	1. Cleaned up code 2. Split into DebugJNILoglessPairHMM and VectorLoglessPairHMM with base class JNILoglessPairHMM. DebugJNILoglessPairHMM can, in principle, invoke any other child class of JNILoglessPairHMM. 3. Added more profiling code for Java parts of LoglessPairHMM	2014-01-26 19:18:12 -08:00
Valentin Ruano-Rubio	9e7bf75e89	Fix for the PairHMM transition probability miscalculation. Problem: the matchToMatch transition calculation was wrong, resulting in transition probabilities coming out of the Match state that added up to more than 1. Reports: https://www.pivotaltracker.com/s/projects/793457/stories/62471780 https://www.pivotaltracker.com/s/projects/793457/stories/61082450 Changes: The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies. The MatchToMatch transition calculation has been fixed and implemented in PairHMMModel. Affected integration test MD5s have been updated; there were no differences in GT fields, and the example differences always implied small changes in likelihoods, which is what is expected.	2014-01-26 16:30:36 -05:00
Karthik Gururaj	81bdfbd00d	Temporary commit before moving to new native library	2014-01-24 16:29:35 -08:00
Karthik Gururaj	733a84e4f9	Added support to transfer haplotypes once per region to the JNI Re-use transferred haplotypes (stored in GlobalRef) across calls to computeLikelihoods	2014-01-22 10:52:41 -08:00
Karthik Gururaj	88c08e78e7	1. Inserted #define in sandbox pairhmm-template-main.cc 2. Wrapped _mm_empty() with ifdef SIMD_TYPE_SSE 3. OpenMP disabled 4. Added code for initializing PairHMM's data inside initializePairHMM - not used yet	2014-01-21 09:57:14 -08:00
Karthik Gururaj	7180c392af	1. Integrated Mohammad's SSE4.2 code, Mustafa's bug fix and code to fix the SSE compilation warning. 2. Added code to dynamically select between AVX, SSE4.2 and normal C++ (in that order) 3. Created multiple files to compile with different compilation flags: avx_function_prototypes.cc is compiled with -xAVX while sse_function_instantiations.cc is compiled with -xSSE4.2 flag. 4. Added jniClose() and support in Java (HaplotypeCaller, PairHMMLikelihoodCalculationEngine) to call this function at the end of the program. 5. Removed debug code, kept assertions and profiling in C++ 6. Disabled OpenMP for now.	2014-01-20 08:03:42 -08:00
Yossi Farjoun	c79e8ca53e	Added an info log containing the SAM/BAM files that were eventually found from the commandline (useful for when there are files hiding inside bam.lists which may or may not have been constructed correctly...) Added a @hidden option controlling the appearance of the full BamList in the log	2014-01-17 11:25:21 -05:00
Karthik Gururaj	f1c772ceea	Same log message as before - forgot -a option 1. Moved computeLikelihoods from PairHMM to native implementation 2. Disabled debug - debug code still left (hopefully, not part of bytecode) 3. Added directory PairHMM_JNI in the root which holds the C++ library that contains the PairHMM AVX implementation. See PairHMM_JNI/JNI_README first	2014-01-16 21:40:04 -08:00
Eric Banks	de56134579	Fixed up and refactored what seems to be a useful private tool to create simulated reads around a VCF. It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now. Since I thought that it might prove useful to others, I moved it to protected and added integration tests. GERALDINE: NEW TOOL ALERT!	2014-01-15 13:49:31 -05:00
Geraldine Van der Auwera	edf5880022	Updated SAMPileup codec and pileup-related docs Problem: the codec was written to take in consensus pileups produced with pileup -c option (which consists of 10 or 13 fields per line depending on the variant type) but errored out on the basic pileup format (which only has 6 fields per line). This was inconsistent and confusing to users. Solution: I added a switch in the parsing to recognize and handle both cases more appropriately, and updated related docs. While I was at it I also improved error messages in CheckPileup, which now emits User Error: Bad Input exceptions when reporting mismatches. Which may not be the best thing to do (ultimately they're not really errors, they're just reporting unwelcome results) but it beats emitting Runtime Exceptions. Tested by CheckPileupIntegrationTest which tests both format cases.	2014-01-14 09:14:16 -05:00
Eric Banks	16ecc53749	Merge pull request #469 from broadinstitute/gg_gatkdoc_fixes Assorted fixes and improvements to gatkdocs	2014-01-14 05:56:07 -08:00
droazen	347fab4717	Merge pull request #471 from broadinstitute/eb_output_log_info_for_tim Adding more meta information about the user to the GATK logging output, per Tim F's request.	2014-01-13 17:48:40 -08:00
Geraldine Van der Auwera	bdb3954eb3	removed maxRuntime minValue	2014-01-13 20:45:43 -05:00
Geraldine Van der Auwera	8fcad6680b	Assorted fixes and improvements to gatkdocs -Added docs for ERC mode in HC -Moved RecalibrationPerformance walker to private since it is experimental and unsupported -Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them -Improved error msg for conflict between per-interval aggregation and -nt -Minor clean up in exception docs -Added Toy Walkers category for devs and dev supercat (to build out docs for developers) -Added more detailed info to GenotypeConcordance doc based on Chris forum post -Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples) -Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it) -Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)	2014-01-13 17:46:22 -05:00
Eric Banks	851ec67bdc	Adding more meta information about the user to the GATK logging output, per Tim F's request.	2014-01-13 14:36:02 -05:00
droazen	7cd304fb41	Merge pull request #470 from broadinstitute/mf_new_RBP Mf new rbp	2014-01-13 08:46:27 -08:00
Eric Banks	0323caefc8	Added some bug fixes to the gVCF merging code after finally getting some real data to play with. Still under construction, awaiting more test data from Valentin.	2014-01-08 08:34:35 -05:00
Eric Banks	f172c349f6	Adding the functionality to enable users to input a file of VCFs for -V. To do this I have added a RodBindingCollection which can represent either a VCF or a file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so that one can intermix VCFs and VCF lists. For VariantContext tags with a list, by default the tags for the -V argument are applied unless overridden by the individual line. In other words, any given line can have either one token (the file path) or two tokens (the new tags and the file path). For example: foo.vcf VCF,name=bar bar.vcf Note that a VCF list file name must end with '.list'. Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.	2014-01-08 00:45:00 -05:00
Menachem Fromer	d1275651ae	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2014-01-03 01:13:40 -05:00
Ami Levy-Moonshine	6da53aea09	Write a new tool for splitting reads that have N in the cigar string. For example, this tool can be used for processing bowtie RNA-seq data. Each read with k N-cigar elements is split into k+1 reads. The split is done by hard clipping the rest of the bases. In order to do it, a few changes were introduced to some other clipping methods: - make a significant change in ClippingOp.hardClip() that prevents the splitting of a read with cigar: 1M2I1N1M3I. - change getReadCoordinateForReferenceCoordinate in ReadUtils to recognize Ns create unitTests for that walker: - change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication - move some useful methods from ReadClipperTestUtils to CigarUtils create integration test for that class small change in a comment in FullProcessingPipeline last commit: Address review comments: - move to protected under walkers/rnaseq - change the read splitting methods to be more readable and more efficient - change (minor changes) some methods in ReadClipper to allow the changes in split reads - add (minor change) one method to CigarUtils to allow the changes in split reads - change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar - address the rest of the review comments (minor changes) - fix ReadUtilsUnitTest.testReadWithNs according to the default behaviour of getReadCoordinateForReferenceCoordinate (in case of a reference index that falls into a deletion, return the read index of the base before the deletion). - add another test to ReadUtilsUnitTest.testReadWithNs - Allow the user to print the split positions (not working properly currently)	2014-01-01 22:21:36 -05:00
Mauricio Carneiro	d1febb89c8	Better documentation for ReadClippingStats walker * add overall walker GATKDocs * add explanation for skip parameter and make it advanced * reverse the logic on excluding unmapped reads for clarity * fix read length calculation to no longer include indels ps: I am not sure how useful this walker is (I didn't write it) but the skip logic is poor and calculates the entire statistic for the reads it is eventually going to skip. This would be an easy fix, but only worth our time if people actually use this.	2014-01-01 14:26:26 -05:00
Eric Banks	f82a7c3f4c	Updating variant jar. The update contains: 1. documentation changes for VariantContext and Allele (which used to discuss the now obsolete null allele) 2. better error messages for VCFs containing complex rearrangements with breakends 3. instead of failing badly on format field lists with '.'s, just ignore them Also, there is a trivial change to use a more efficient method to remove a bunch of attributes from a VC. Delivers PT#s 59675378, 59496612, and 60524016.	2013-12-31 22:48:29 -05:00
Eric Banks	5a1564d1f2	Merge pull request #456 from broadinstitute/eb_unify_hc_combination_steps Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline.	2013-12-31 18:57:27 -08:00
Eric Banks	83e09b1f64	Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline. Basically, it does 3 things (as opposed to having to call into 3 separate walkers): 1. merge the records at any given position into a single one with all alleles and appropriate PLs 2. re-genotype the record using the exact AF calculation model 3. re-annotate the record using the VariantAnnotatorEngine In the course of this work it became clear that we couldn't just use the simpleMerge() method used by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new utility method to handle this merging and pulled any related code out of CombineVariants. I tried to clean up a lot of that code, but ultimately that's out of the scope of this project. Added unit tests for correctness testing. Integration tests cannot be used yet because the HC doesn't output correct gVCFs.	2013-12-31 12:07:56 -05:00
Menachem Fromer	48ef7a1a2f	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2013-12-19 10:42:20 -05:00
David Roazen	4a79831adc	Add ability to specify min/max required/recommended values for numeric arguments in the @Argument annotation -You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes to @Argument annotations for command-line arguments -"minValue" and "maxValue" specify hard limits that generate an exception if violated -"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated -Works only for numeric arguments (int, double, etc.) with @Argument annotations -Only considers values actually specified by the user on the command line, not default values assigned in the code As requested by Geraldine	2013-12-18 18:09:08 -05:00
Eric Banks	400e7c1404	Fixed bug in the filtering of lifted over variants where a deletion at the end of a contig could cause it to error out. Added a unit test.	2013-12-11 14:07:18 -05:00
Eric Banks	418fbdfbab	Added HC trio calls and NA12878 KB snapshot to resource bundle. Also, don't touch the current link until the resources are finished being produced.	2013-12-07 22:08:34 -05:00
David Roazen	932cd3ada7	Fix 3rd-party library dependency issues in the HC/PairHMM tests In general, test classes cannot use 3rd-party libraries that are not also dependencies of the GATK proper without causing problems when, at release time, we test that the GATK jar has been packaged correctly with all required dependencies. If a test class needs to use a 3rd-party library that is not a GATK dependency, write wrapper methods in the GATK utils/* classes, and invoke those wrapper methods from the test class.	2013-12-06 13:16:55 -05:00
David Roazen	0e65296efb	Rev picard, sam-jdk, tribble, and variant jars to 1.104.1628 -update VariantFiltration to work with new Lazy wrapper around the JexlEngine in VariantContextUtils	2013-12-05 12:45:32 -05:00
Joel Thibault	5fe0531b4d	Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy	2013-12-03 23:12:14 -05:00
Joel Thibault	8571a641bf	Add @Advanced to variant_index_type and variant_index_parameter	2013-12-03 23:12:14 -05:00
Joel Thibault	fd0a02e52e	New VCF engine arguments to specify an alternate IndexCreator - CatVariants updates to use custom VCF indices - Scala scripts for VCF index testing	2013-12-03 13:31:02 -05:00
Joel Thibault	42f78bdb3a	Add a class-based DataProvider	2013-12-03 13:31:01 -05:00
Joel Thibault	cd3ee2ae7e	whitespace	2013-12-03 13:31:01 -05:00
Eric Banks	6bee6a1b53	Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles. Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info. Now the PLs and AD get correctly selected down. While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just broke it out into smaller chunks as per our best practices. Added unit tests and updated integration tests. Addressed reviews.	2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio	0f99778a59	Adding Graph-based likelihood ratio calculation to HC To activate this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line. New HC Options (both Advanced and Hidden): ========================================== --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM) Specifies what engine should be used to generate read vs haplotype likelihoods. PairHMM : standard full-PairHMM approach. GraphBased : using the assembly graph to accelerate the process. Random : generate random likelihoods - used for benchmarking purposes only. --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN) It indicates how to merge haplotypes produced using different kmerSizes. Only has effect when used in combination with (--likelihoodCalculationEngine GraphBased) COMBO_MIN : use the smallest kmerSize with all haplotypes. COMBO_MAX : use the larger kmerSize with all haplotypes. MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it. MAX_ONLY : use the larger kmerSize with haplotypes assembled using it. Major code changes: =================== * Introduce multiple likelihood calculation engines (before there was just one). * Assembly results from different kmerSizes are now packed together using the AssemblyResultSet class. * Added yet another PairHMM implementation with a different API in order to support local PairHMM calculations (e.g. a segment of the read vs a segment of the haplotype). Major components: ================ * FastLoglessPairHMM: New pair-hmm implementation using some heuristic to speed up partial PairHMM calculations * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the execution of the graph-based likelihood approach. * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals to calculate the likelihoods using the graph as a scaffold. * HaplotypeGraph: haplotype threading graph built from the assembly haplotypes. This structure is the one used by GraphBasedLikelihoodCalculationEngineInstance to do its work. * ReadAnchoring and KmerSequenceGraphMap: contain information on how a read maps onto the HaplotypeGraph, which is used by GraphBasedLikelihoodCalculationEngineInstance to do its work. Remove mergeCommonChains from HaplotypeGraph creation Fixed bamboo issues with HaplotypeGraphUnitTest Fixed problems with HaplotypeCallerIntegrationTest Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest Fixed ReadThreadingLikelihoodCalculationEngine issues Moved event-block iteration outside GraphBasedEngineInstance Removed unnecessary parameter from ReadAnchoring constructor. Fixed test problem Added a bit more documentation to EventBlockSearchEngine Fixing some private - protected dependency issues Further refactoring making GraphBasedInstance and HaplotypeGraph slimmer. Addressed last pull request commit comments Fixed FastLoglessPairHMM public -> protected dependency Fixed problem with HaplotypeGraph unit test	2013-12-02 19:37:19 -05:00
Chris Hartl	1f777c4898	Introducing the latest-and-greatest in genotyping: CalculatePosteriors. CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution. Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows: while not converged: * AC = [external AC] + [sample AC] * Prior = DirichletMultinomial[AC] * Posteriors = [sample GL + Prior] * sample AC = MLEAC(Posteriors) This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase. Fully unit tested. Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.	2013-11-27 13:00:45 -05:00
Geraldine Van der Auwera	429582589f	Set SAMFileWriter to create index in ReadUtils to fix SplitSamFile issue	2013-11-26 15:54:47 -05:00
Geraldine Van der Auwera	25bc6e64ae	Patched Queue extensions lacking a main class definition	2013-11-22 14:57:09 -05:00
Ami Levy-Moonshine	6ad841cec5	Rewrite ReadLengthDistribution to count the read lengths into a hash table first and only at the end produce a GATK report table. Before this fix, the tool could not work with more than one RG. - Address all review comments	2013-11-18 17:29:31 -05:00
Ami Levy-Moonshine	9c1023c933	fix a (ugly) weird error from last commit that changed all the scala files to end with MoleculoPipeline.scala	2013-11-18 11:44:24 -05:00
MauricioCarneiro	7f08250870	Merge pull request #417 from broadinstitute/bt_pairhmm_api_cleanup2 Improve the PairHMM API for better FPGA integration	2013-11-14 10:47:07 -08:00
bradtaylor	e40a07bb58	Improve the PairHMM API for better FPGA integration Motivation: The API was different between the regular PairHMM and the FPGA-implementation via CnyPairHMM. As a result, the LikelihoodCalculationEngine had to account for this. The goal is to change the API to be the same for all implementations, and make it easier to access. PairHMM PairHMM now accepts a list of reads and a map of alleles/haplotypes and returns a PerReadAlleleLikelihoodMap. Added a new primary method that loops the reads and haplotypes, extracts qualities, and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method. Did not alter that method, or its subcompute method, at all. PairHMM also now handles its own (re)initialization, so users don't have to worry about that. CnyPairHMM Added that same new primary access method to this FPGA class. Method overrides the default implementation in PairHMM. Walks through a list of reads. Individual-read quals and the full haplotype list are fed to batchAdd(), as before. However, instead of waiting for every read to get added, and then walking through the reads again to extract results, we just get the haplotype-results array for each read as soon as it is generated, and pack it into a perReadAlleleLikelihoodMap for return. The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not. LikelihoodCalculationEngine The functionality to loop through the reads and haplotypes and get individual log10-likelihoods was moved to the PairHMM, and so removed from here. However, this class does need to retain the ability to pre-process the reads, and post-process the resulting likelihoods map. Those features were separated from running the HMM and refactored into their own methods. Commented out the (unused) system for finding best N haplotypes for genotyping. PairHMMIndelErrorModel Similar changes were made as to the LCE. However, in this case the haplotypes are modified based on each individual read, so the read-list we feed into the HMM only has one read.	2013-11-14 09:45:33 -05:00
Geraldine Van der Auwera	f22ab033f6	Merge pull request #424 from broadinstitute/gg_yetanothergatkdocfix Yet another gatkdoc fix	2013-11-13 11:35:59 -08:00
Geraldine Van der Auwera	dac3dbc997	Improved gatkdocs for InbreedingCoefficient, ReduceReads, ErrorRatePerCycle Clarified caveat for InbreedingCoefficient Cleaned up docstrings for ReduceReads Brushed up doc for ErrorRatePerCycle	2013-11-13 14:33:04 -05:00
Phillip Dexheimer	296bcc7fb1	Changed name of jobs submitted to cluster job runners -- Added 'jobRunnerJobName' definition to QFunction, defaults to value of shortDescription -- Edited Lsf and Drmaa JobRunners to use this string instead of description for naming jobs in the scheduler Signed-off-by: Joel Thibault <thibault@broadinstitute.org>	2013-11-12 14:34:56 -05:00
Mauricio Carneiro	725656ae7e	Generalizing the FullProcessingPipeline Qscript We have generalized the processing script to be able to handle multiple scenarios. Originally it was designed for PCR-free data only; we added all the steps necessary to start from fastq and process RNA-seq as well as non-human data. This is our go-to script in TechDev. * add optional "starting from fastq" path to the pipeline * add mark duplicates (optionally) to the pipeline * add an option to run with the mouse data (without dbsnp and with single ended fastq) * add option to process RNA-seq data from topHat (add RG and reassign mapping quality if necessary) * add option to filter or include reads with N in the cigar string * add parameter to allow keeping the intermediate files	2013-11-07 16:34:29 -05:00
Eric Banks	f15355856a	Merge pull request #418 from broadinstitute/eb_fix_liftover_script Fixing the liftover script to not require strict VCF header validation.	2013-11-07 06:04:56 -08:00
Eric Banks	2fc40a0aed	Fixing the liftover script to not require strict VCF header validation. Apparently no one has used the liftover script for a while (which I guess is a good thing)...	2013-11-07 09:02:17 -05:00
Eric Banks	0e3d83d1ef	Merge pull request #413 from broadinstitute/rp_qd_and_qual_updates_in_ref_model_pipeline Improvements to the reference model pipeline.	2013-11-05 06:33:17 -08:00
Eric Banks	09dfaf1a68	Merge pull request #416 from broadinstitute/mc_quick_fixes_to_cser_pipeline Add interpretation to QualifyMissingIntervals	2013-11-05 06:08:13 -08:00
Eric Banks	96024403bf	Update the dbsnp version in the bundle from 137 to 138; resolves PT #59771004.	2013-11-04 10:01:22 -05:00
Ryan Poplin	b22c9c2cb4	Improvements to the reference model pipeline. -- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants) -- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.	2013-11-01 17:58:25 -04:00
Eric Banks	cafcb34855	Merge pull request #411 from broadinstitute/eb_add_exome_intervals_to_bundle_script Updated the GATK bundle script to:	2013-10-29 07:38:44 -07:00
Eric Banks	209f2a61aa	Updated the GATK bundle script to: 1. Include exome target list for b37 2. Not delete the 'current' link unless -run is applied to the command line! (sorry, Ryan)	2013-10-29 10:33:51 -04:00
Louis Bergelson	9498950b1c	Adding more specific error message when one of the scripts doesn't exist. --Previously it gave a cryptic message: ----IO error while decoding blarg.script with UTF-8 ----Please try specifying another one using the -encoding option	2013-10-21 14:57:42 -04:00
David Roazen	5a2ef37ead	Tweak dcov documentation to help prevent user confusion Geraldine-approved!	2013-10-16 15:24:33 -04:00
Mauricio Carneiro	efbfdb64fe	Qscript to downsample and analyze an exome BAM. This script downsamples an exome BAM several times and makes a coverage distribution analysis (of bases that pass filters) as well as haplotype caller calls with a NA12878 Knowledge Base assessment with comparison against multi-sample calling with the UG. This script was used for the "downsampling the exome" presentation	2013-10-10 14:37:33 -04:00
Chris Hartl	55bab9fa87	Merged bug fix from Stable into Unstable	2013-10-10 13:01:12 -04:00
Chris Hartl	06d28c7f8b	VariantsToBinaryPed: Move .fam file writing to initialize to ensure ordering matches the ordering of the VCF. Change the documentation to clarify that the fam files are not directly copied, but subset and re-ordered.	2013-10-10 12:53:15 -04:00
Mauricio Carneiro	5d6421494b	Fix mismatching number of columns in report Quick fix for the missing column header in the QualifyMissingIntervals report. Adding a QScript for the tool as well as a few minor updates to the GATKReportGatherer.	2013-10-09 14:38:15 -04:00
Ryan Poplin	f3a67edc24	Merge pull request #402 from broadinstitute/gg_dcov_docs Improvements to gatkdocs related to downsampling	2013-09-27 07:07:21 -07:00
kshakir	a29f1f84bf	Merge pull request #397 from lbergelson/lb_scala_2.10.2 Update scala from 2.9 to 2.10.2	2013-09-26 21:51:43 -07:00
Geraldine Van der Auwera	66d0235efc	Minor clarifications & formatting tweaks for dcov docs	2013-09-26 14:28:22 -04:00
Michael McCowan	5113e21437	Bug fix: annotation values are parsed as Doubles when they should be parsed as Integers due to implicit conversion. * Updated expected test data in which an integer annotation (MQ0) was formatted as a double.	2013-09-25 13:12:02 -04:00
Louis Bergelson	c05208ecec	Resolving warnings --specifying exception types in cases where none was already specified ----mostly changed to catch Exception instead of Throwable ----EmailMessage has a point where it should only be expecting a RetryException but was catching everything --changing build.xml so that it prints scala feature warning details --added necessary imports needed to remove feature warnings --updating a newly deprecated enum declaration to match the new syntax	2013-09-23 12:42:22 -04:00
Louis Bergelson	b32ad99d3f	Changing from scala 2.9.2 to 2.10.2. --modified ivy dependencies --modified scala classpath in build.xml to include scala-reflect --changed imports to point to the new scala scala.reflect.internal.util --set the bootclasspath in QScriptManager as well as the classpath variable. --removing Set[File] <-> Set[String] conversions ----Set is invariant now and the conversions broke --removing unit tests for Set[File] <-> Set[String] conversions	2013-09-23 12:42:22 -04:00
chapmanb	2f5064dd1d	Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions. Allows utilization of CoveredLocusView via the API Signed-off-by: David Roazen <droazen@broadinstitute.org>	2013-09-10 15:32:54 -04:00
Geraldine Van der Auwera	292426b504	Merge pull request #390 from broadinstitute/mc_update_clipreads Added REVERT SOFTCLIPPED bases to ClipReads	2013-09-09 16:43:03 -07:00
Geraldine Van der Auwera	8b829255e7	Clarified docs on using clipping options	2013-09-09 19:40:03 -04:00

4216 Commits (ac1cefce298765f782cdfa2cd4e1eb79fccc4daf)