gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Joel Thibault	cb7ad01202	Re-enable the relevant tests	2014-02-14 12:34:08 -05:00
Joel Thibault	c8a5007c85	Add a comment to the method where the error appears	2014-02-14 11:40:22 -05:00
Joel Thibault	ec16439387	Clear the ReadCovariates keysCache before runs of individual Unit Tests - normal runs have a constant covariate count, so this is not necessary	2014-02-14 10:41:28 -05:00
Eric Banks	7095a60c8e	Merge pull request #516 from broadinstitute/dr_reenable_tests_failing_due_to_java_update Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation	2014-02-13 21:05:18 -05:00
David Roazen	4b4b93ad1b	Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation After extensive detective work, Joel determined that these tests were failing due to changes in the implementation of Math.pow() in newer versions of Java 1.7. All GSA members should ensure that they're using a JDK that is at least as current as the one in the Java-1.7 dotkit on the Broad servers (build 1.7.0_51-b13).	2014-02-12 16:08:16 -05:00
Joel Thibault	cc9477aedb	Minimal test for the multi-allelic reordering bug	2014-02-12 13:38:32 -05:00
Eric Banks	300b474c96	Several improvements to the single sample combining steps. 1. Updated QualByDepth not to use AD-restricted depth if it is zero. Added unit test for this change. 2. Fixed small bug in CombineGVCFs where spanning deletions were not being treated consistently throughout. Added test for this situation. 3. Make sure GenotypeGVCFs puts in the required headers. Updated test files to make sure this is covered. 4. Have GenotypeGVCFs propagate up the MLEAC/AF (which were getting clobbered out). Tests updated to account for this.	2014-02-12 10:15:12 -05:00
David Roazen	95e1402d21	Add ability to run *KnowledgeBaseTests to maven Run with: mvn verify -Dsting.knowledgebasetests.skipped=false	2014-02-11 14:08:24 -05:00
Eric Banks	303a60c8c6	Adding smarts to the QD annotation: when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Added new unit tests for QualByDepth.	2014-02-11 12:56:49 -05:00
Eric Banks	2e36dd9001	Refactoring of CombineGVCFs to make it run a lot faster. Creating new VariantContexts each time we broke up a block was very expensive because we break up blocks so often. Also, calling into GATKVariantContextUtils.simpleMerge was really hurting performance. MD5 changes because we no longer propagate any INFO fields (except for END) for reference blocks; the tests have the now unused BLOCK_SIZE field that now gets dropped.	2014-02-11 03:18:52 -05:00
Eric Banks	abef6cfcb6	Removing parameters that were incorrectly copied over from RegenotypeVariants.	2014-02-08 23:44:32 -05:00
Eric Banks	659a9f0e79	Removing the test for BLOCK_SIZE since we no longer emit it	2014-02-08 21:28:07 -05:00
Valentin Ruano-Rubio	bf630abe88	Fixed nocall (./.) without PLs bug in GVCF output Story: https://www.pivotaltracker.com/story/show/65388246 Additional changes and notes: 1. The fix consists of forcing the output of all PLs by setting the standard flag for that, '-allSitePLs'. 2. BP_RESOLUTION was handled differently from GVCF in some aspects that should be common. That has been fixed.	2014-02-07 19:30:26 -05:00
Karthik Gururaj	20a46e4098	Check only for SSE 4.1 (rather than SSE 4.2) when trying to use the SSE implementation of PairHMM	2014-02-07 15:19:55 -08:00
Karthik Gururaj	dc44b64ad8	1. Added support for building the PairHMM vector library into build.xml. The library is compiled using makefile and copied into the directory: build/java/classes/org/broadinstitute/sting/utils/pairhmm/ 2. Bundled the library into StingUtils.jar. Unpacked and loaded at runtime without the need to set java.library.path Caveats: Platform independence has probably been thrown out of the window. Assumptions: a. make command exists at /usr/bin/make b. rsync command exists at /usr/bin/rsync c. icc is in the PATH of the user	2014-02-07 13:13:59 -08:00
Eric Banks	d689f61005	Fixed up some of the genotype-level annotations being propagated in the single sample HC pipeline. 1. AD values now propagate up (they weren't before). 2. MIN_DP gets transferred over to DP and removed. 3. SB gets removed after FS is calculated. Also, added a bunch of new integration tests for GenotypeGVCFs.	2014-02-07 12:47:54 -05:00
Eric Banks	67ed0d2403	The UG engine can return a null VC if there are tons of alt alleles, causing Tim's merge jobs to fail. Pushing the null check up so that it doesn't error out in such cases.	2014-02-07 12:41:20 -05:00
Valentin Ruano-Rubio	4a3c8e68fa	Fixed out of order non-variant gVCF entries when trimming is active. Story: https://www.pivotaltracker.com/story/show/65319564	2014-02-07 11:03:26 -05:00
Eric Banks	eb463b505d	Remove a whole bunch of unused annotations from gVCF output. AC,AF,AN,FS,QD - they'll all be recomputed later. BLOCK_SIZE and MIN_GQ were not necessary. I also made the StrandBiasBySample annotation forced on when in gVCF mode. It turns out that its output wasn't compatible with BCF so I patched it (and the variant jar too).	2014-02-07 08:49:36 -05:00
Eric Banks	2648219c42	Implementation of a hierarchical merger for gVCFs, called CombineGVCFs. This tool will take any number of gVCFs and create a merged gVCF (as opposed to GenotypeGVCFs which produces a standard VCF). Added unit/integration tests and fixed up GATK docs.	2014-02-07 08:49:18 -05:00
Eric Banks	71b47a6148	Rename CombineReferenceCalculationVariants to GenotypeGVCFs	2014-02-06 15:46:19 -05:00
Khalid Shakir	3848159086	Added a set of serial tests to gatk/queue packages, which runs all tests under their package in one TestNG execution. New properties to disable regenerating example resources artifact when each parallel test runs under packagetest. Moved collection of packagetest parameters from shell scripts into maven profiles. Fixed necessity of test-utils jar by removing incorrect dependenciesToScan element during packagetests. When building picard libraries, run clean first. Fixed tools jar dependency in picard pom. Integration tests properly use the ant-bridge.sh test.debug.port variable, like unit tests.	2014-02-06 08:25:38 -05:00
Valentin Ruano Rubio	988e3b4890	Merge pull request #487 from broadinstitute/vrr_reference_model_with_trimming Get gVCF to work without --dontTrimActiveRegions	2014-02-05 22:52:17 -05:00
Valentin Ruano-Rubio	98ffcf6833	Get gVCF to work without --dontTrimActiveRegions Story: https://www.pivotaltracker.com/story/show/65048706 https://www.pivotaltracker.com/story/show/65116908 Changes: ActiveRegionTrimmer is now an argument collection and it returns not only the trimmed down active region but also the non-variant containing flanking regions. HaplotypeCaller code has been simplified significantly, pushing some functionality to other classes like ActiveRegion and AssemblyResultSet. Fixed a problem with the way the trimming was done causing some gVCF non-variant records to have conservative 0,0,0 PLs	2014-02-05 22:50:45 -05:00
Ryan Poplin	693bfac341	Bug fix for missing annotations in CombineReferenceCalculationVariants. They were being dropped in the handoff between engines in a couple of places. -- Updated single sample pipeline test data using Valentin's files and re-enabled CRCV tests	2014-02-05 12:58:48 -05:00
Eric Banks	91bdf069d3	Some updates to CRCV. 1. Throw a user error when the input data for a given genotype does not contain PLs. 2. Add VCF header line for --dbsnp input 3. Need to check that the UG result is not null 4. Don't error out at positions with no gVCFs (which is possible when using a dbSNP rod)	2014-02-05 10:12:37 -05:00
Joel Thibault	9eaee8c73c	Integration test for the -nt race condition corrupting AD and PL fields	2014-02-04 22:04:27 -05:00
David Roazen	1de7a27471	Disable an additional test that is runtime dependent on one of the temporarily-disabled tests	2014-02-04 16:07:58 -05:00
David Roazen	76086f30b7	Temporarily disable tests that started failing post-maven Joel is working on these failures in a separate branch. Since maven (currently! we're working on this..) won't run the whole test suite to completion if there's a failure early on, we need to temporarily disable these tests in order to allow group members to run tests on their branches again.	2014-02-04 15:31:24 -05:00
Khalid Shakir	857e6e0d6f	Bumped version to 2.8-SNAPSHOT, using new update_pom_versions.sh script.	2014-02-03 13:50:46 -05:00
Khalid Shakir	9ca3004fc3	Setting the test-utils' type to test-jar, such that the multi-module build uses testClasses instead of classes as a directory dependency.	2014-02-03 13:50:46 -05:00
Khalid Shakir	de13f41fc3	One step closer to a proper test-utils artifact. Using the maven-jar-plugin to create a test classifer, excluding actual tests, until we can properly separate the classes into separate artifacts/modules.	2014-02-03 13:50:46 -05:00
Khalid Shakir	caa76cdac4	Added maven pom.xmls for various artifacts.	2014-02-03 13:50:46 -05:00
Khalid Shakir	1e25a758f5	Moved files to maven directories. Here are the git moved directories in case other files need to be moved during a merge: git-mv private/java/src/ private/gatk-private/src/main/java/ git-mv private/R/scripts/ private/gatk-private/src/main/resources/ git-mv private/java/test/ private/gatk-private/src/test/java/ git-mv private/testdata/ private/gatk-private/src/test/resources/ git-mv private/scala/qscript/ private/queue-private/src/main/qscripts/ git-mv private/scala/src/ private/queue-private/src/main/scala/ git-mv protected/java/src/ protected/gatk-protected/src/main/java/ git-mv protected/java/test/ protected/gatk-protected/src/test/java/ git-mv public/java/src/ public/gatk-framework/src/main/java/ git-mv public/java/test/ public/gatk-framework/src/test/java/ git-mv public/testdata/ public/gatk-framework/src/test/resources/ git-mv public/scala/qscript/ public/queue-framework/src/main/qscripts/ git-mv public/scala/src/ public/queue-framework/src/main/scala/ git-mv public/scala/test/ public/queue-framework/src/test/scala/	2014-02-03 13:50:44 -05:00
Valentin Ruano-Rubio	89c4e57478	gVCF <NON_REF> in all vcf lines including variant ones when -ERC gVCF is requested. Changes: <NON_REF> likelihood in variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for each read it is calculated as the second best likelihood amongst the reported alleles. When -ERC gVCF, stand_conf_emit and stand_conf_call are forcefully set to 0. Also dontGenotype is set to false for consistency's sake. Integration test MD5s have been changed accordingly. Additional fix: Especially after adding the <NON_REF> allele, but also happened without that, QUAL values tend to go to 0 (very large integer number in log 10) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that, combineGLs has been substituted by combineGLsPrecise, which uses the log-sum-exp trick. In just a few cases this change results in genotype changes in integration tests, but after double-checking using unit tests and the difference between combineGLs and combineGLsPrecise in the affected integration tests, the previous GT calls were either border-line cases and/or due to the underflow.	2014-01-30 11:23:33 -05:00
Karthik Gururaj	0c63d6264f	1. Added synchronization block around loadLibrary in VectorLoglessPairHMM 2. Edited Makefile to use static libraries where possible	2014-01-27 15:34:58 -08:00
Karthik Gururaj	85a748860e	1. Added more profiling code 2. Modified JNI_README	2014-01-27 14:32:44 -08:00
Valentin Ruano-Rubio	748d2fdf92	Added Integration test to verify the bugs are not there anymore as reported in pivotracker	2014-01-26 23:29:31 -05:00
Karthik Gururaj	018e9e2c5f	1. Cleaned up code 2. Split into DebugJNILoglessPairHMM and VectorLoglessPairHMM with base class JNILoglessPairHMM. DebugJNILoglessPairHMM can, in principle, invoke any other child class of JNILoglessPairHMM. 3. Added more profiling code for Java parts of LoglessPairHMM	2014-01-26 19:18:12 -08:00
Valentin Ruano-Rubio	9e7bf75e89	Fix for the PairHMM transition probability miscalculation. Problem: matchToMatch transition calculation was wrong, resulting in transition probabilities coming out of the Match state that added up to more than 1. Reports: https://www.pivotaltracker.com/s/projects/793457/stories/62471780 https://www.pivotaltracker.com/s/projects/793457/stories/61082450 Changes: The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies. MatchToMatch transition calculation has been fixed and implemented in PairHMMModel. Affected integration test md5s have been updated; there were no differences in GT fields, and example differences always implied small changes in likelihoods, which is what is expected.	2014-01-26 16:30:36 -05:00
Karthik Gururaj	81bdfbd00d	Temporary commit before moving to new native library	2014-01-24 16:29:35 -08:00
Karthik Gururaj	936e9e175e	1. Converted q,i,d,c in C++ from int* to char* 2. Use clock_gettime to measure performance 3. Disabled OpenMP 4. Moved LoadTimeInitializer to different file	2014-01-22 22:57:32 -08:00
Karthik Gururaj	733a84e4f9	Added support to transfer haplotypes once per region to the JNI Re-use transferred haplotypes (stored in GlobalRef) across calls to computeLikelihoods	2014-01-22 10:52:41 -08:00
Karthik Gururaj	88c08e78e7	1. Inserted #define in sandbox pairhmm-template-main.cc 2. Wrapped _mm_empty() with ifdef SIMD_TYPE_SSE 3. OpenMP disabled 4. Added code for initializing PairHMM's data inside initializePairHMM - not used yet	2014-01-21 09:57:14 -08:00
Ryan Poplin	bdd06ebfc2	Merge pull request #478 from broadinstitute/eb_generalize_hc_values_as_args Pulled out some hard-coded values from the read-threading and isActive c...	2014-01-21 09:01:54 -08:00
Karthik Gururaj	7180c392af	1. Integrated Mohammad's SSE4.2 code, Mustafa's bug fix and code to fix the SSE compilation warning. 2. Added code to dynamically select between AVX, SSE4.2 and normal C++ (in that order) 3. Created multiple files to compile with different compilation flags: avx_function_prototypes.cc is compiled with -xAVX while sse_function_instantiations.cc is compiled with -xSSE4.2 flag. 4. Added jniClose() and support in Java (HaplotypeCaller, PairHMMLikelihoodCalculationEngine) to call this function at the end of the program. 5. Removed debug code, kept assertions and profiling in C++ 6. Disabled OpenMP for now.	2014-01-20 08:03:42 -08:00
Eric Banks	9e858270d7	Moving this test up one level to where it actually belongs.	2014-01-19 02:33:11 -05:00
Eric Banks	64d5bf650e	Pulled out some hard-coded values from the read-threading and isActive code of the HC, and made them into a single argument. In unifying the arguments it was clear that the values were inconsistent throughout the code, so now there's a single value that is intended to be more liberal in what it allows in (in an attempt to increase sensitivity). Very little code actually changes here, but just about every md5 in the HC integration tests are different (as expected). Added another integration test for the new argument. To be used by David R to test his per-branch QC framework: does this commit make the HC look better against the KB?	2014-01-19 01:15:13 -05:00
Karthik Gururaj	25aecb96e0	Added support for dynamic selection between AVX and un-vectorized C++, still to include SSE code from Mohammad. Debug flags turned on in this commit.	2014-01-18 11:07:23 -08:00
Karthik Gururaj	f1c772ceea	Same log message as before - forgot -a option 1. Moved computeLikelihoods from PairHMM to native implementation 2. Disabled debug - debug code still left (hopefully, not part of bytecode) 3. Added directory PairHMM_JNI in the root which holds the C++ library that contains the PairHMM AVX implementation. See PairHMM_JNI/JNI_README first	2014-01-16 21:40:04 -08:00
Karthik Gururaj	e8a5022777	1. Added support for JNI integration for LoglessCaching PairHMM AVX implementation. 2. Contains lots of debug code 3. Only invokes JNI for subComputeReadLikelihoodGivenHaplotypeLog10	2014-01-15 11:07:09 -08:00
Eric Banks	de56134579	Fixed up and refactored what seems to be a useful private tool to create simulated reads around a VCF. It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now. Since I thought that it might prove useful to others, I moved it to protected and added integration tests. GERALDINE: NEW TOOL ALERT!	2014-01-15 13:49:31 -05:00
Eric Banks	9f1ab0087a	Added in a check for what would be an empty allele after trimming.	2014-01-15 11:04:19 -05:00
Ryan Poplin	201ad398ac	Merge pull request #473 from broadinstitute/eb_fix_qd_indel_normalization The QD normalization for indels was busted and is now fixed.	2014-01-14 08:56:19 -08:00
Eric Banks	e4fdc5ac44	Merge pull request #474 from broadinstitute/eb_fix_haplotype_resolver_PT63333488 Fixing the Haplotype Resolver so that it doesn't complain about missing header lines	2014-01-14 07:36:53 -08:00
Eric Banks	fd511d12a2	Fixing the Haplotype Resolver so that it doesn't complain about missing header lines. The code comments very clearly state that INFO fields shouldn't be propagated into the output, but someone must have accidentally changed it afterwards. This is just a simple one-line fix to make sure the code adhered to the comments. Delivers #63333488.	2014-01-13 22:47:43 -05:00
Geraldine Van der Auwera	8fcad6680b	Assorted fixes and improvements to gatkdocs -Added docs for ERC mode in HC -Moved RecalibrationPerformance walker to private since it is experimental and unsupported -Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them -Improved error msg for conflict between per-interval aggregation and -nt -Minor clean up in exception docs -Added Toy Walkers category for devs and dev supercat (to build out docs for developers) -Added more detailed info to GenotypeConcordance doc based on Chris forum post -Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples) -Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it) -Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)	2014-01-13 17:46:22 -05:00
Eric Banks	c7e08965d0	The QD normalization for indels was busted and is now fixed. It is true that indels of length > 1 have higher QUALS than those of length = 1. But for the HC those QUALS are not that much higher, and it doesn't continue scaling up as the indels get larger. So we no longer normalize by indel length (which massively over-penalizes larger events and effectively drops their QD to 0). For the UG the previous normalization also wasn't perfect. Now we divide the indel length by a factor of 3 to make sure that QD is consistent over the range of indel lengths. Integration tests change because QD is different for indels. Also, got permission from Valentin to archive a failing test that no longer applies. Thanks to Kurt on the GATK forum for pointing this all out.	2014-01-13 15:23:36 -05:00
droazen	7cd304fb41	Merge pull request #470 from broadinstitute/mf_new_RBP Mf new rbp	2014-01-13 08:46:27 -08:00
MauricioCarneiro	50cd6781b3	Merge pull request #465 from broadinstitute/eb_improvements_to_ref_confidence_merger Improvements to ref confidence merger	2014-01-08 10:51:01 -08:00
Eric Banks	f172c349f6	Adding the functionality to enable users to input a file of VCFs for -V. To do this I have added a RodBindingCollection which can represent either a VCF or a file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so that one can intermix VCFs and VCF lists. For VariantContext tags with a list, by default the tags for the -V argument are applied unless overridden by the individual line. In other words, any given line can have either one token (the file path) or two tokens (the new tags and the file path). For example: foo.vcf VCF,name=bar bar.vcf Note that a VCF list file name must end with '.list'. Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.	2014-01-08 00:45:00 -05:00
Eric Banks	c133909d32	Fixed edge condition in the realigner where a realigned read can sometimes get partially aligned off the end of the contig. Now we ignore such reads (which is much easier than trying to figure out when to soft-clip). Added unit test.	2014-01-08 00:37:28 -05:00
Menachem Fromer	e33d3dafc6	Add documentation for RBP, and also update the MD5 for the tests now that the output uses HP tags instead of '\|', which is now reserved for trio-based phasing	2014-01-03 12:04:47 -05:00
Menachem Fromer	d1275651ae	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2014-01-03 01:13:40 -05:00
Ryan Poplin	856c1f87c1	Allow for additional input data to be used in the VQSR for clustering but don't carry it forward into the output VCF file. -- New -a argument in the VQSR for specifying additional data to be used in the clustering -- New NA12878KB walker which creates ROC curves by partitioning the data along VQSLOD and calculating how many KB TP/FP's are called.	2014-01-02 14:46:04 -05:00
amilev	f81a38f596	Merge pull request #446 from broadinstitute/ami-RNAseq-tools Write a new tool for splitting reads that have N cigar string.	2014-01-01 21:06:25 -08:00
MauricioCarneiro	1223345726	Merge pull request #459 from broadinstitute/eb_fix_bad_hmm_clipping Fixed up edge condition for clipping long reads in the HMM.	2014-01-01 20:00:34 -08:00
Ami Levy-Moonshine	6da53aea09	Write a new tool for splitting reads that have N cigar string. For example, this tool can be used for processing bowtie RNA-seq data. Each read with k N-cigar elements is split into k+1 reads. The split is done by hard clipping the rest of the bases. In order to do it, a few changes were introduced to some other clipping methods: - make a significant change in ClippingOp.hardClip() that prevents the splitting of a read with cigar: 1M2I1N1M3I. - change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns create unitTests for that walker: - change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication - move some useful methods from ReadClipperTestUtils to CigarUtils create integration test for that class small change in a comment in FullProcessingPipeline last commit: Address review comments: - move to protected under walkers/rnaseq - change the read splitting methods to be more readable and more efficient - change (minor changes) some methods in ReadClipper to allow the changes in split reads - add (minor change) one method to CigarUtils to allow the changes in split reads - change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar - address the rest of the review comments (minor changes) - fix ReadUtilsUnitTest.testReadWithNs according to the default behaviour of getReadCoordinateForReferenceCoordinate (in case of a reference index that falls into a deletion, return the read index of the base before the deletion). - add another test to ReadUtilsUnitTest.testReadWithNs - Allow the user to print the split positions (not working properly currently)	2014-01-01 22:21:36 -05:00
Eric Banks	bb4c4b1fcd	Fixed up edge condition for clipping long reads in the HMM. MD5s change because some reads were incorrectly getting clipped before. [delivers #62584746]	2014-01-01 19:05:09 -05:00
Mauricio Carneiro	d52bd44867	Move CompareBAMs to private This is a tool that we use internally validate the ReduceReads development. I think it should be private. There is no need to improve docs. [delivers #54703398]	2014-01-01 14:33:23 -05:00
Eric Banks	9665f75ad4	Don't fail in annotations if the wrong tools are calling them, just silently skip them. This is important for cases when users want to use annotation groups (like all experimental annotations).	2013-12-31 23:45:21 -05:00
Eric Banks	83e09b1f64	Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline. Basically, it does 3 things (as opposed to having to call into 3 separate walkers): 1. merge the records at any given position into a single one with all alleles and appropriate PLs 2. re-genotype the record using the exact AF calculation model 3. re-annotate the record using the VariantAnnotatorEngine In the course of this work it became clear that we couldn't just use the simpleMerge() method used by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new utility method to handle this merging and pulled any related code out of CombineVariants. I tried to clean up a lot of that code, but ultimately that's out of the scope of this project. Added unit tests for correctness testing. Integration tests cannot be used yet because the HC doesn't output correct gVCFs.	2013-12-31 12:07:56 -05:00
Menachem Fromer	48ef7a1a2f	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2013-12-19 10:42:20 -05:00
Valentin Ruano-Rubio	5db520c6fa	Fixed issue > 0 log likelihoods using GraphBased likelihood engine reported by Mauricio Added some integration test to check on the fix	2013-12-13 11:19:57 -05:00
Eric Banks	ab33db625f	Merge pull request #449 from broadinstitute/eb_move_calc_posteriors_to_protected Moved CalculatePosteriors from private to protected, in preparation for 3.0	2013-12-07 22:18:46 -08:00
Eric Banks	f1970b923e	Moved CalculatePosteriors from private to protected, in preparation for 3.0. Renamed it CalculateGenotypePosteriors. Also, moved the utility code to a proper utility class instead of where Chris left it. No actual code modifications made in this commit.	2013-12-08 00:08:34 -05:00
David Roazen	932cd3ada7	Fix 3rd-party library dependency issues in the HC/PairHMM tests In general, test classes cannot use 3rd-party libraries that are not also dependencies of the GATK proper without causing problems when, at release time, we test that the GATK jar has been packaged correctly with all required dependencies. If a test class needs to use a 3rd-party library that is not a GATK dependency, write wrapper methods in the GATK utils/* classes, and invoke those wrapper methods from the test class.	2013-12-06 13:16:55 -05:00
Eric Banks	e022db4690	Added docs for the minPruning argument in the HC	2013-12-05 11:50:56 -05:00
Geraldine Van der Auwera	3ab2f4edb2	Fixed documentation for -deletions argument in the UAC	2013-12-04 19:55:24 -05:00
amilev	0d94019bd6	Merge pull request #434 from broadinstitute/mc_dt_gccontent Add GC Content to DiagnoseTargets	2013-12-04 09:42:26 -08:00
Joel Thibault	5fe0531b4d	Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy	2013-12-03 23:12:14 -05:00
Mauricio Carneiro	701ede2817	Add GC Content to DiagnoseTargets	2013-12-03 23:04:40 -05:00
Eric Banks	6bee6a1b53	Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles. Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info. Now the PLs and AD get correctly selected down. While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just broke it out into smaller chunks as per our best practices. Added unit tests and updated integration tests. Addressed reviews.	2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio	0f99778a59	Adding Graph-based likelihood ratio calculation to HC To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line. New HC Options (both Advanced and Hidden): ========================================== --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM) Specifies what engine should be used to generate read vs haplotype likelihoods. PairHMM : standard full-PairHMM approach. GraphBased : using the assembly graph to accelarate the process. Random : generate random likelihoods - used for benchmarking purposes only. --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN) It idicates how to merge haplotypes produced using different kmerSizes. Only has effect when used in combination with (--likelihooCalculationEngine GraphBased) COMBO_MIN : use the smallest kmerSize with all haplotypes. COMBO_MAX : use the larger kmerSize with all haplotypes. MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it. MAX_ONLY : use the larger kmerSize with haplotypes asembled using it. Major code changes: =================== * Introduce multiple likelihood calculation engines (before there was just one). * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class. * Added yet another PairHMM implementation with a different API in order to spport local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype). Major components: ================ * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution of the graph-based likelihood approach. * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals to calcualte the likelihoods using the graph as an scafold. 
* HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one used by GraphBasedLikelihoodCalculationEngineInstance to do its work. * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is used by GraphBasedLikelihoodCalcuationEngineInstance to do its work. Remove mergeCommonChains from HaplotypeGraph creation Fixed bamboo issues with HaplotypeGraphUnitTest Fixed probrems with HaplotypeCallerIntegrationTest Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest Fixed ReadThreadingLikelihoodCalculationEngine issues Moved event-block iteration outside GraphBasedEngineInstance Removed unecessary parameter from ReadAnchoring constructor. Fixed test problem Added a bit more documentation to EventBlockSearchEngine Fixing some private - protected dependency issues Further refactoring making GraphBasedInstance and HaplotypeGraph slimmer. Addressed last pull request commit comments Fixed FastLoglessPairHMM public -> protected dependency Fixed probrem with HaplotypeGraph unit test Adding Graph-based likelihood ratio calculation to HC To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line. New HC Options (both Advanced and Hidden): ========================================== --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM) Specifies what engine should be used to generate read vs haplotype likelihoods. PairHMM : standard full-PairHMM approach. GraphBased : using the assembly graph to accelarate the process. Random : generate random likelihoods - used for benchmarking purposes only. --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN) It idicates how to merge haplotypes produced using different kmerSizes. Only has effect when used in combination with (--likelihooCalculationEngine GraphBased) COMBO_MIN : use the smallest kmerSize with all haplotypes. 
COMBO_MAX : use the larger kmerSize with all haplotypes. MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it. MAX_ONLY : use the larger kmerSize with haplotypes asembled using it. Major code changes: =================== * Introduce multiple likelihood calculation engines (before there was just one). * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class. * Added yet another PairHMM implementation with a different API in order to spport local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype). Major components: ================ * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution of the graph-based likelihood approach. * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals to calcualte the likelihoods using the graph as an scafold. * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one used by GraphBasedLikelihoodCalculationEngineInstance to do its work. * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is used by GraphBasedLikelihoodCalcuationEngineInstance to do its work. Remove mergeCommonChains from HaplotypeGraph creation Fixed bamboo issues with HaplotypeGraphUnitTest Fixed probrems with HaplotypeCallerIntegrationTest Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest Fixed ReadThreadingLikelihoodCalculationEngine issues Moved event-block iteration outside GraphBasedEngineInstance Removed unecessary parameter from ReadAnchoring constructor. 
Fixed test problem Added a bit more documentation to EventBlockSearchEngine Fixing some private - protected dependency issues Further refactoring making GraphBasedInstance and HaplotypeGraph slimmer. Addressed last pull request commit comments Fixed FastLoglessPairHMM public -> protected dependency Fixed problem with HaplotypeGraph unit test	2013-12-02 19:37:19 -05:00
Eric Banks	84ddfb41b5	Merge pull request #438 from broadinstitute/rp_vqsr_num_bad_stability_fixes_and_runtime_optimizations Various VQSR optimizations in runtime and stability.	2013-12-02 08:37:37 -08:00
Ryan Poplin	6a922e7aca	Merge pull request #435 from broadinstitute/eb_fix_ug_bug_for_long_deletions Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads.	2013-12-02 08:09:23 -08:00
Ryan Poplin	b57054c63c	Various VQSR optimizations in both runtime and accuracy. -- For very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build the Gaussian mixture model. -- Annotations are ordered by the difference in means between known and novel instead of by their standard deviation. -- Removed the training set quality score threshold. -- Now uses 2 gaussians by default for the negative model. -- Num bad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores. -- Model plots are now generated much faster. -- Stricter threshold for determining model convergence. -- All VQSR integration tests change because of these changes to the model. -- Add test for downsampling of training data.	2013-11-29 13:04:46 -05:00
Eric Banks	df6499e58c	Bug fix for RR: stop (incorrectly) pulling the MQ out of the SAMRecord as a byte instead of an int. For reads with high MQs (greater than max byte) the MQ was being treated as negative and failing the min MQ filter. Added unit test. Delivers PT#61567540.	2013-11-27 18:55:03 -05:00
Eric Banks	51d1a26725	Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads. His code was excessively clipping reads because it was looking at their cigar string instead of just the read length. This meant that it was basically impossible to call large deletions in UG even with perfect evidence in the reads (as reported by Craig D). Integration tests change because (IMO after looking at sites in IGV) reads with indels similar to the one being genotyped used to be given too much likelihood and now give less. Added unit tests for new methods.	2013-11-27 13:54:39 -05:00
Chris Hartl	1f777c4898	Introducing the latest-and-greatest in genotyping: CalculatePosteriors. CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution. Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows: while not converged: * AC = [external AC] + [sample AC] * Prior = DirichletMultinomial[AC] * Posteriors = [sample GL + Prior] * sample AC = MLEAC(Posteriors) This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase. Fully unit tested. Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.	2013-11-27 13:00:45 -05:00
Eric Banks	0fac4fb3b6	Make the reference model calculation work with reduced reads. It's just a matter of using PileupElement.getRepresentativeCount() instead of '++'.	2013-11-21 10:53:33 -05:00
Eric Banks	adb77b406f	Fixed poor implementation of isRefSource() and isRefSink() among others. There was already a note in the code about how wrong the implementation was. The bad code was causing a single-node graph to get cleaned up into nothing when pruning tails. Delivers PT #61069820.	2013-11-21 10:53:27 -05:00
Ami Levy-Moonshine	e6ef37de1d	Add an option to filter the read bases that are taken into account for the covered intervals. For that, two new arguments were added: minBaseQuality and minMappingQuality	2013-11-18 17:29:32 -05:00
MauricioCarneiro	7f08250870	Merge pull request #417 from broadinstitute/bt_pairhmm_api_cleanup2 Improve the PairHMM API for better FPGA integration	2013-11-14 10:47:07 -08:00
bradtaylor	e40a07bb58	Improve the PairHMM API for better FPGA integration Motivation: The API was different between the regular PairHMM and the FPGA-implementation via CnyPairHMM. As a result, the LikelihoodCalculationEngine had to account for this. The goal is to change the API to be the same for all implementations, and make it easier to access. PairHMM PairHMM now accepts a list of reads and a map of alleles/haplotypes and returns a PerReadAlleleLikelihoodMap. Added a new primary method that loops over the reads and haplotypes, extracts qualities, and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method. Did not alter that method, or its subcompute method, at all. PairHMM also now handles its own (re)initialization, so users don't have to worry about that. CnyPairHMM Added that same new primary access method to this FPGA class. Method overrides the default implementation in PairHMM. Walks through a list of reads. Individual-read quals and the full haplotype list are fed to batchAdd(), as before. However, instead of waiting for every read to get added, and then walking through the reads again to extract results, we just get the haplotype-results array for each read as soon as it is generated, and pack it into a perReadAlleleLikelihoodMap for return. The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not. LikelihoodCalculationEngine The functionality to loop through the reads and haplotypes and get individual log10-likelihoods was moved to the PairHMM, and so removed from here. However, this class does need to retain the ability to pre-process the reads, and post-process the resulting likelihoods map. Those features were separated from running the HMM and refactored into their own methods. Commented out the (unused) system for finding best N haplotypes for genotyping. PairHMMIndelErrorModel Similar changes were made as to the LCE. 
However, in this case the haplotypes are modified based on each individual read, so the read-list we feed into the HMM only has one read.	2013-11-14 09:45:33 -05:00
Geraldine Van der Auwera	dac3dbc997	Improved gatkdocs for InbreedingCoefficient, ReduceReads, ErrorRatePerCycle Clarified caveat for InbreedingCoefficient Cleaned up docstrings for ReduceReads Brushed up doc for ErrorRatePerCycle	2013-11-13 14:33:04 -05:00
Eric Banks	0e3d83d1ef	Merge pull request #413 from broadinstitute/rp_qd_and_qual_updates_in_ref_model_pipeline Improvements to the reference model pipeline.	2013-11-05 06:33:17 -08:00
Eric Banks	09dfaf1a68	Merge pull request #416 from broadinstitute/mc_quick_fixes_to_cser_pipeline Add interpretation to QualifyMissingIntervals	2013-11-05 06:08:13 -08:00
Ryan Poplin	b22c9c2cb4	Improvements to the reference model pipeline. -- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants) -- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.	2013-11-01 17:58:25 -04:00
Mauricio Carneiro	5ed47988b8	Changed the parameter names from cds to baits Making the usage more clear since the parameter is being used over and over to define baited regions. Updated the headers accordingly and made it more readable.	2013-10-24 17:15:56 -04:00
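
The message for commit 1f777c4898 (CalculatePosteriors) describes a Dirichlet prior on allele frequency combined with two independent allele draws per diploid genotype. The following is a minimal Python sketch of that calculation, not the GATK implementation: the function names are hypothetical, the Dirichlet parameters are taken to be the raw panel allele counts (all assumed positive), and genotype likelihoods are assumed to be given in log10 space keyed by allele-index pairs.

```python
import math
from itertools import combinations_with_replacement


def diploid_genotype_priors(allele_counts):
    """Dirichlet-multinomial probability of each unordered diploid genotype.

    allele_counts: positive panel allele counts, used directly as the
    Dirichlet parameters (an assumption made for this sketch).
    """
    if any(c <= 0 for c in allele_counts):
        raise ValueError("all allele counts must be positive")
    total = sum(allele_counts)
    denom = total * (total + 1)
    priors = {}
    for i, j in combinations_with_replacement(range(len(allele_counts)), 2):
        if i == j:
            # Homozygous: the same allele is drawn twice.
            p = allele_counts[i] * (allele_counts[i] + 1) / denom
        else:
            # Heterozygous: two different alleles, in either order.
            p = 2 * allele_counts[i] * allele_counts[j] / denom
        priors[(i, j)] = p
    return priors


def genotype_posteriors(log10_likelihoods, allele_counts):
    """Combine log10 genotype likelihoods with the prior and normalize."""
    priors = diploid_genotype_priors(allele_counts)
    unnormalized = {
        g: log10_likelihoods[g] + math.log10(p) for g, p in priors.items()
    }
    # Shift by the maximum before exponentiating to avoid underflow.
    shift = max(unnormalized.values())
    weights = {g: 10 ** (v - shift) for g, v in unnormalized.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}
```

With flat likelihoods the posteriors reduce to the prior itself, which makes the effect of the panel easy to inspect: for panel counts of 3 reference and 1 alternate allele, the sketch weights the homozygous-reference genotype most heavily.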

1035 Commits (4e74e77e746e79fee8eaabb2a5f9f8a62c3a5700)