gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Valentin Ruano-Rubio	69bf2b3247	Added a more efficient implementation of the KBest haplotype finder code (CONT.) Changes: 1. Addressed review comments on new K-best haplotype assembly graph finder. 2. Generalize KBestHaplotypeFinder to deal with multiple source and sink vertices. 3. Updated test to use KBestHaplotypeFinder instead of KBestPaths 4. Retired KBestPaths to the archive. 5. Small improvements to the code and documentation.	2014-03-04 23:22:27 -05:00
Valentin Ruano-Rubio	7acf2eb0e7	Added a more efficient implementation of the KBest haplotype finder code. Story: https://www.pivotaltracker.com/story/show/66238286 Changes: 1. Created a new k-best haplotype search implementation in class KBestHaplotypeFinder. 2. Changed HC code to use the new implementation. This seems to fix the original problem without causing significant changes in outputs using some empirical data test cases 3. Moved haplotype's cigar calculation code from Path to CigarUtils; need that in order to gain independence from Path in some parts of the code. In any case that seems like a more natural location for that functionality.	2014-03-04 12:22:14 -05:00
Eric Banks	b99bf85ec8	Fixed bug where dangling tail merging occasionally created a cycle in the graph. Added unit tests to cover this case. Delivers PT#66690470.	2014-03-03 22:42:56 -05:00
Eric Banks	4d69af189e	Minor change: make the --dontUseSoftClippedBases @Advanced instead of @Hidden	2014-03-03 15:59:32 -05:00
Eric Banks	fa65716fe9	Added code to retrieve dangling heads from the read threading graph (previously we were rescuing just the tails). The purpose of this is to be able to call SNPs that fall at the beginning of a capture region (or exon). Before, the read threading code would only start threading from the first kmer that matched the reference. But that means that, in the case of a SNP at the beginning of an exome, it wouldn't start threading the read until after the SNP position - so we'd lose the SNP. For now, this is still very experimental. It works well for RNAseq data, but does introduce FPs in normal exomes. I know why this is and how to fix it, but it requires a much larger fix to the HC: the HC needs to pass all reads and bases to the annotation engine (like UG does) instead of just the high quality ones. So for now, the head merging is disabled by default. As per reviewer comments, I moved the head and tail merging code out into their own class.	2014-03-03 15:59:26 -05:00
amilev	cecdd2f2c5	Merge pull request #539 from broadinstitute/eb_hard_clip_exon_overhangs_for_ami Add the capability to the N-cigar splitter to also hard-clip off overhan...	2014-03-03 12:23:11 -05:00
Eric Banks	6c872308d8	Add the capability to the N-cigar splitter to also hard-clip off overhangs based on observed split positions. We use a "manager" to keep track of observed splits and previous reads. This can be extended/modified in the future to try to salvage those overhangs instead of hard-clipping them and/or try other possible strategies. Added unit tests and more integration tests.	2014-03-02 21:10:34 -05:00
Eric Banks	22ad18b919	Moving Reduce Reads to the archive. The GATK now fails with a user error if you try to run with a reduced bam. (I added a unit test for that; everything else here is just the removal of all traces of RR)	2014-03-02 02:03:14 -05:00
Karthik Gururaj	1b395a871a	1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java 2. Committing the right set of files after rebase	2014-02-28 16:08:28 -08:00
Karthik Gururaj	37526dfad5	1. Added the catch UnsatisfiedLinkError exception in PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING in case the native library could not be loaded. Made VECTOR_LOGLESS_CACHING as the default implementation. 2. Updated the README with Mauricio's comments 3. baseline.cc is used within the library - if the machine supports neither AVX nor SSE4.1, the native library falls back to un-vectorized C++ in baseline.cc. 4. pairhmm-1-base.cc: This is not part of the library, but is being heavily used for debugging/profiling. Can I request that we keep it there for now? In the next release, we can delete it from the repository. 5. I agree with Mauricio about the ifdefs. I am sure you already know, but just to reassure you the debug code is not compiled into the library (because of the ifdefs) and will not affect performance.	2014-02-28 08:59:55 -08:00
Karthik Gururaj	0fe843bfd9	Followed Khalid's suggestion for packing libVectorLoglessCaching into the jar file with Maven	2014-02-26 11:47:42 -08:00
Karthik Gururaj	15fe244e4b	Now has PAPI values	2014-02-26 11:47:42 -08:00
Intel Repocontact	e32e9e6af6	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2014-02-26 11:47:01 -08:00
Intel Repocontact	ff2a972ab5	Merge branch 'master' of github.com:broadinstitute/gsa-unstable Conflicts: .gitignore	2014-02-25 20:56:28 -08:00
Khalid Shakir	f02ce6eca7	Added tests for cleaning up scattered .bai files, and using the log directory. Re-added import java.io.File for BamGatherFunction. Other cleanup to resolve scala syntax warnings from intellij. Moved Example UG script from protected to public.	2014-02-26 02:11:28 +08:00
Eric Banks	0f30df0356	Stopgap procedure to rescue Fisher Strand for cases where there's lots of data. This commit consists of 2 main changes: 1. When the strand table gets too large, we normalize it down to values that are more reasonable. 2. We don't include a particular sample's contribution unless the total ref and alt counts are at least 2 each; this is a heuristic method for dealing only with hets. MD5s change as expected. Hopefully we'll have a more robust implementation for GATK 3.1.	2014-02-25 01:04:27 -05:00
droazen	e8ea9f58d3	Merge pull request #531 from broadinstitute/ks_build_patches Build patches	2014-02-24 15:13:16 -05:00
Valentin Ruano-Rubio	0b3a70b8c1	Fix for a bug in (Assembly Graph) Routes. The slicePrefix method functionality was broken. Story: https://www.pivotaltracker.com/story/show/64595624 Changes: 1. Fixed the bug. 2. Added unit test to check on the method functionality. 3. Added an integration test to verify the bug has been fixed in a reproducible empirical-data case.	2014-02-24 10:54:39 -05:00
Khalid Shakir	7e516b294f	Replaced local drmaa and Jama artifacts with versions from maven central. Removed unused caliper binary from local repo.	2014-02-22 01:21:35 +08:00
Valentin Ruano-Rubio	463af7143f	Activate reverse allele trimming in GVCF Story: https://www.pivotaltracker.com/s/projects/1007536 Changes: 1. HC's GenotypingEngine now invokes reverseAlleleTrimming on GVCF variant output lines. 2. GenotypeGVCFs also reverse trim after regenotyping as some alt. alleles are dropped (observed in real-data).	2014-02-20 03:17:24 -05:00
Eric Banks	53a7d5cbae	Fixing a bug in the GVCF writer. The writer was never resetting the pointer to the end of the last non-ref VariantContext that it saw. This was fine except when it jumped to a new contig - and a lower position on that contig - where it thought that it was still part of that previous non-ref VariantContext so wouldn't emit a reference block. Therefore, ref blocks were missing from the beginnings of all chromosomes (except chr1). Added unit test to cover this case.	2014-02-20 02:33:43 -05:00
Valentin Ruano-Rubio	c167fb5fdf	Fixing GenotypeGVCFs. Bug uncovered by some untrimmed alleles in the single sample pipeline output. Note, however, that this does not fix the untrimmed alleles in general. Story: https://www.pivotaltracker.com/story/show/65481104 Changes: 1. Fixed the bug itself. 2. Fixed non-working tests (silently skipped due to exception in dataProvider).	2014-02-19 14:20:39 -05:00
Ryan Poplin	43c20264b0	Initial commit of the random forest classifier.	2014-02-17 13:07:27 -05:00
droazen	688792c5b0	Merge pull request #520 from broadinstitute/jt_fix_failing_tests_post_maven Fix for the Array Out of Bounds test error	2014-02-14 14:02:17 -05:00
Eric Banks	3724d4e5f3	Various small fixes for CalculateGenotypePosteriors based on feedback from guys in Ben Neale's group. Note that this tool is still a work in progress and very experimental, so isn't 100% stable. Most of the features are untested (both by people and by unit/integration tests) because Chris Hartl implemented it right before he left, and we're going to need to add tests at some point soon. I added a first integration test in this commit, but it's just a start. The fixes include: 1. Stop having the genotyping code strip out AD values. It doesn't make sense that it should do this so I don't know why it was doing that at all. Updated GenotypeGVCFs so that it doesn't need to manually recover them anymore. This also helps CalculateGenotypePosteriors which was losing the AD values. Updated code in LeftAlignAndTrimVariants to strip out PLs and AD, since it wasn't doing that before. Updated the integration test for that walker to include such data. 2. Chris was calling Math.pow directly on the normalized posteriors which isn't safe. Instead, the normalization routine itself can revert back to log scale in a safe manner so let's use it. Also, renamed the variable to posteriorProbabilities (and not likelihoods). 3. Have CGP update the AC/AF/AN counts after fixing GTs.	2014-02-14 13:48:14 -05:00
Joel Thibault	cb7ad01202	Re-enable the relevant tests	2014-02-14 12:34:08 -05:00
Joel Thibault	c8a5007c85	Add a comment to the method where the error appears	2014-02-14 11:40:22 -05:00
Joel Thibault	ec16439387	Clear the ReadCovariates keysCache before runs of individual Unit Tests - normal runs have a constant covariate count, so this is not necessary	2014-02-14 10:41:28 -05:00
Eric Banks	7095a60c8e	Merge pull request #516 from broadinstitute/dr_reenable_tests_failing_due_to_java_update Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation	2014-02-13 21:05:18 -05:00
David Roazen	4b4b93ad1b	Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation After extensive detective work, Joel determined that these tests were failing due to changes in the implementation of Math.pow() in newer versions of Java 1.7. All GSA members should ensure that they're using a JDK that is at least as current as the one in the Java-1.7 dotkit on the Broad servers (build 1.7.0_51-b13).	2014-02-12 16:08:16 -05:00
Joel Thibault	cc9477aedb	Minimal test for the multi-allelic reordering bug	2014-02-12 13:38:32 -05:00
Eric Banks	300b474c96	Several improvements to the single sample combining steps. 1. updated QualByDepth not to use AD-restricted depth if it is zero. Added unit test for this change. 2. Fixed small bug in CombineGVCFs where spanning deletions were not being treated consistently throughout. Added test for this situation. 3. Make sure GenotypeGVCFs puts in the required headers. Updated test files to make sure this is covered. 4. Have GenotypeGVCFs propagate up the MLEAC/AF (which were getting clobbered out). Tests updated to account for this.	2014-02-12 10:15:12 -05:00
David Roazen	95e1402d21	Add ability to run *KnowledgeBaseTests to maven Run with: mvn verify -Dsting.knowledgebasetests.skipped=false	2014-02-11 14:08:24 -05:00
Eric Banks	303a60c8c6	Adding smarts to the QD annotation: when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Added new unit tests for QualByDepth.	2014-02-11 12:56:49 -05:00
Eric Banks	2e36dd9001	Refactoring of CombineGVCFs to make it run a lot faster. Creating new VariantContexts each time we broke up a block was very expensive because we break up blocks so often. Also, calling into GATKVariantContextUtils.simpleMerge was really hurting performance. MD5 changes because we no longer propagate any INFO fields (except for END) for reference blocks; the tests have the now unused BLOCK_SIZE field that now gets dropped.	2014-02-11 03:18:52 -05:00
Eric Banks	abef6cfcb6	Removing parameters that were incorrectly copied over from RegenotypeVariants.	2014-02-08 23:44:32 -05:00
Eric Banks	659a9f0e79	Removing the test for BLOCK_SIZE since we no longer emit it	2014-02-08 21:28:07 -05:00
Valentin Ruano-Rubio	bf630abe88	Fixed nocall (./.) without PLs bug in GVCF output Story: https://www.pivotaltracker.com/story/show/65388246 Additional changes and notes: 1. The fix consists of forcing the output of all PLs by setting the standard flag for that, '-allSitePLs'. 2. BP_RESOLUTION was handled differently from GVCF in some aspects that should be common. That has been fixed.	2014-02-07 19:30:26 -05:00
Karthik Gururaj	20a46e4098	Check only for SSE 4.1 (rather than SSE 4.2) when trying to use the SSE implementation of PairHMM	2014-02-07 15:19:55 -08:00
Karthik Gururaj	dc44b64ad8	1. Added support for building the PairHMM vector library into build.xml. The library is compiled using makefile and copied into the directory: build/java/classes/org/broadinstitute/sting/utils/pairhmm/ 2. Bundled the library into StingUtils.jar. Unpacked and loaded at runtime without the need to set java.library.path Caveats: Platform independence has probably been thrown out of the window. Assumptions: a. make command exists at /usr/bin/make b. rsync command exists at /usr/bin/rsync c. icc is in the PATH of the user	2014-02-07 13:13:59 -08:00
Eric Banks	d689f61005	Fixed up some of the genotype-level annotations being propagated in the single sample HC pipeline. 1. AD values now propagate up (they weren't before). 2. MIN_DP gets transferred over to DP and removed. 3. SB gets removed after FS is calculated. Also, added a bunch of new integration tests for GenotypeGVCFs.	2014-02-07 12:47:54 -05:00
Eric Banks	67ed0d2403	The UG engine can return a null VC if there are tons of alt alleles, causing Tim's merge jobs to fail. Pushing the null check up so that it doesn't error out in such cases.	2014-02-07 12:41:20 -05:00
Valentin Ruano-Rubio	4a3c8e68fa	Fixed out of order non-variant gVCF entries when trimming is active. Story: https://www.pivotaltracker.com/story/show/65319564	2014-02-07 11:03:26 -05:00
Eric Banks	eb463b505d	Remove a whole bunch of unused annotations from gVCF output. AC,AF,AN,FS,QD - they'll all be recomputed later. BLOCK_SIZE and MIN_GQ were not necessary. I also made the StrandBiasBySample annotation forced on when in gVCF mode. It turns out that its output wasn't compatible with BCF so I patched it (and the variant jar too).	2014-02-07 08:49:36 -05:00
Eric Banks	2648219c42	Implementation of a hierarchical merger for gVCFs, called CombineGVCFs. This tool will take any number of gVCFs and create a merged gVCF (as opposed to GenotypeGVCFs which produces a standard VCF). Added unit/integration tests and fixed up GATK docs.	2014-02-07 08:49:18 -05:00
Eric Banks	71b47a6148	Rename CombineReferenceCalculationVariants to GenotypeGVCFs	2014-02-06 15:46:19 -05:00
Khalid Shakir	3848159086	Added a set of serial tests to gatk/queue packages, which runs all tests under their package in one TestNG execution. New properties to disable regenerating example resources artifact when each parallel test runs under packagetest. Moved collection of packagetest parameters from shell scripts into maven profiles. Fixed necessity of test-utils jar by removing incorrect dependenciesToScan element during packagetests. When building picard libraries, run clean first. Fixed tools jar dependency in picard pom. Integration tests properly use the ant-bridge.sh test.debug.port variable, like unit tests.	2014-02-06 08:25:38 -05:00
Valentin Ruano Rubio	988e3b4890	Merge pull request #487 from broadinstitute/vrr_reference_model_with_trimming Get gVCF to work without --dontTrimActiveRegions	2014-02-05 22:52:17 -05:00
Valentin Ruano-Rubio	98ffcf6833	Get gVCF to work without --dontTrimActiveRegions Story: https://www.pivotaltracker.com/story/show/65048706 https://www.pivotaltracker.com/story/show/65116908 Changes: ActiveRegionTrimmer is now an argument collection and it returns not only the trimmed down active region but also the non-variant containing flanking regions. HaplotypeCaller code has been simplified significantly, pushing some functionality to other classes like ActiveRegion and AssemblyResultSet. Fixed a problem with the way the trimming was done causing some gVCF non-variant records to have conservative 0,0,0 PLs	2014-02-05 22:50:45 -05:00
Ryan Poplin	693bfac341	Bug fix for missing annotations in CombineReferenceCalculationVariants. They were being dropped in the handoff between engines in a couple of places. -- Updated single sample pipeline test data using Valentin's files and re-enabled CRCV tests	2014-02-05 12:58:48 -05:00
Eric Banks	91bdf069d3	Some updates to CRCV. 1. Throw a user error when the input data for a given genotype does not contain PLs. 2. Add VCF header line for --dbsnp input 3. Need to check that the UG result is not null 4. Don't error out at positions with no gVCFs (which is possible when using a dbSNP rod)	2014-02-05 10:12:37 -05:00
Joel Thibault	9eaee8c73c	Integration test for the -nt race condition corrupting AD and PL fields	2014-02-04 22:04:27 -05:00
David Roazen	1de7a27471	Disable an additional test that is runtime dependent on one of the temporarily-disabled tests	2014-02-04 16:07:58 -05:00
David Roazen	76086f30b7	Temporarily disable tests that started failing post-maven Joel is working on these failures in a separate branch. Since maven (currently! we're working on this..) won't run the whole test suite to completion if there's a failure early on, we need to temporarily disable these tests in order to allow group members to run tests on their branches again.	2014-02-04 15:31:24 -05:00
Khalid Shakir	857e6e0d6f	Bumped version to 2.8-SNAPSHOT, using new update_pom_versions.sh script.	2014-02-03 13:50:46 -05:00
Khalid Shakir	9ca3004fc3	Setting the test-utils' type to test-jar, such that the multi-module build uses testClasses instead of classes as a directory dependency.	2014-02-03 13:50:46 -05:00
Khalid Shakir	de13f41fc3	One step closer to a proper test-utils artifact. Using the maven-jar-plugin to create a test classifier, excluding actual tests, until we can properly separate the classes into separate artifacts/modules.	2014-02-03 13:50:46 -05:00
Khalid Shakir	caa76cdac4	Added maven pom.xmls for various artifacts.	2014-02-03 13:50:46 -05:00
Khalid Shakir	1e25a758f5	Moved files to maven directories. Here are the git moved directories in case other files need to be moved during a merge: git-mv private/java/src/ private/gatk-private/src/main/java/ git-mv private/R/scripts/ private/gatk-private/src/main/resources/ git-mv private/java/test/ private/gatk-private/src/test/java/ git-mv private/testdata/ private/gatk-private/src/test/resources/ git-mv private/scala/qscript/ private/queue-private/src/main/qscripts/ git-mv private/scala/src/ private/queue-private/src/main/scala/ git-mv protected/java/src/ protected/gatk-protected/src/main/java/ git-mv protected/java/test/ protected/gatk-protected/src/test/java/ git-mv public/java/src/ public/gatk-framework/src/main/java/ git-mv public/java/test/ public/gatk-framework/src/test/java/ git-mv public/testdata/ public/gatk-framework/src/test/resources/ git-mv public/scala/qscript/ public/queue-framework/src/main/qscripts/ git-mv public/scala/src/ public/queue-framework/src/main/scala/ git-mv public/scala/test/ public/queue-framework/src/test/scala/	2014-02-03 13:50:44 -05:00
Valentin Ruano-Rubio	89c4e57478	gVCF <NON_REF> in all vcf lines including variant ones when -ERC gVCF is requested. Changes: <NON_REF> likelihood in variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for each read it is calculated as the second best likelihood amongst the reported alleles. When -ERC gVCF, stand_conf_emit and stand_conf_call are forcefully set to 0. Also dontGenotype is set to false for consistency's sake. Integration test MD5s have been changed accordingly. Additional fix: Especially after adding the <NON_REF> allele, but also without it, QUAL values tend to go to 0 (very large integer number in log 10) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that, combineGLs has been substituted by combineGLsPrecise, which uses the log-sum-exp trick. In just a few cases this change results in genotype changes in integration tests, but after double-checking using unit tests and the difference between combineGLs and combineGLsPrecise in the affected integration tests, the previous GT calls were either borderline cases and/or due to the underflow.	2014-01-30 11:23:33 -05:00
Karthik Gururaj	0c63d6264f	1. Added synchronization block around loadLibrary in VectorLoglessPairHMM 2. Edited Makefile to use static libraries where possible	2014-01-27 15:34:58 -08:00
Karthik Gururaj	85a748860e	1. Added more profiling code 2. Modified JNI_README	2014-01-27 14:32:44 -08:00
Valentin Ruano-Rubio	748d2fdf92	Added integration test to verify that the bugs reported in Pivotal Tracker are no longer present	2014-01-26 23:29:31 -05:00
Karthik Gururaj	018e9e2c5f	1. Cleaned up code 2. Split into DebugJNILoglessPairHMM and VectorLoglessPairHMM with base class JNILoglessPairHMM. DebugJNILoglessPairHMM can, in principle, invoke any other child class of JNILoglessPairHMM. 3. Added more profiling code for Java parts of LoglessPairHMM	2014-01-26 19:18:12 -08:00
Valentin Ruano-Rubio	9e7bf75e89	Fix for the PairHMM transition probability miscalculation. Problem: matchToMatch transition calculation was wrong, resulting in transition probabilities coming out of the Match state that summed to more than 1. Reports: https://www.pivotaltracker.com/s/projects/793457/stories/62471780 https://www.pivotaltracker.com/s/projects/793457/stories/61082450 Changes: The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies. MatchToMatch transition calculation has been fixed and implemented in PairHMMModel. Affected integration test md5s have been updated; there were no differences in GT fields, and example differences always implied small changes in likelihoods, which is what is expected.	2014-01-26 16:30:36 -05:00
Karthik Gururaj	81bdfbd00d	Temporary commit before moving to new native library	2014-01-24 16:29:35 -08:00
Karthik Gururaj	936e9e175e	1. Converted q,i,d,c in C++ from int* to char* 2. Use clock_gettime to measure performance 3. Disabled OpenMP 4. Moved LoadTimeInitializer to different file	2014-01-22 22:57:32 -08:00
Karthik Gururaj	733a84e4f9	Added support to transfer haplotypes once per region to the JNI Re-use transferred haplotypes (stored in GlobalRef) across calls to computeLikelihoods	2014-01-22 10:52:41 -08:00
Karthik Gururaj	88c08e78e7	1. Inserted #define in sandbox pairhmm-template-main.cc 2. Wrapped _mm_empty() with ifdef SIMD_TYPE_SSE 3. OpenMP disabled 4. Added code for initializing PairHMM's data inside initializePairHMM - not used yet	2014-01-21 09:57:14 -08:00
Ryan Poplin	bdd06ebfc2	Merge pull request #478 from broadinstitute/eb_generalize_hc_values_as_args Pulled out some hard-coded values from the read-threading and isActive c...	2014-01-21 09:01:54 -08:00
Karthik Gururaj	7180c392af	1. Integrated Mohammad's SSE4.2 code, Mustafa's bug fix and code to fix the SSE compilation warning. 2. Added code to dynamically select between AVX, SSE4.2 and normal C++ (in that order) 3. Created multiple files to compile with different compilation flags: avx_function_prototypes.cc is compiled with -xAVX while sse_function_instantiations.cc is compiled with -xSSE4.2 flag. 4. Added jniClose() and support in Java (HaplotypeCaller, PairHMMLikelihoodCalculationEngine) to call this function at the end of the program. 5. Removed debug code, kept assertions and profiling in C++ 6. Disabled OpenMP for now.	2014-01-20 08:03:42 -08:00
Eric Banks	9e858270d7	Moving this test up one level to where it actually belongs.	2014-01-19 02:33:11 -05:00
Eric Banks	64d5bf650e	Pulled out some hard-coded values from the read-threading and isActive code of the HC, and made them into a single argument. In unifying the arguments it was clear that the values were inconsistent throughout the code, so now there's a single value that is intended to be more liberal in what it allows in (in an attempt to increase sensitivity). Very little code actually changes here, but just about every md5 in the HC integration tests are different (as expected). Added another integration test for the new argument. To be used by David R to test his per-branch QC framework: does this commit make the HC look better against the KB?	2014-01-19 01:15:13 -05:00
Karthik Gururaj	25aecb96e0	Added support for dynamic selection between AVX and un-vectorized C++, still to include SSE code from Mohammad. Debug flags turned on in this commit.	2014-01-18 11:07:23 -08:00
Karthik Gururaj	f1c772ceea	Same log message as before - forgot -a option 1. Moved computeLikelihoods from PairHMM to native implementation 2. Disabled debug - debug code still left (hopefully, not part of bytecode) 3. Added directory PairHMM_JNI in the root which holds the C++ library that contains the PairHMM AVX implementation. See PairHMM_JNI/JNI_README first	2014-01-16 21:40:04 -08:00
Karthik Gururaj	e8a5022777	1. Added support for JNI integration for LoglessCaching PairHMM AVX implementation. 2. Contains lots of debug code 3. Only invokes JNI for subComputeReadLikelihoodGivenHaplotypeLog10	2014-01-15 11:07:09 -08:00
Eric Banks	de56134579	Fixed up and refactored what seems to be a useful private tool to create simulated reads around a VCF. It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now. Since I thought that it might prove useful to others, I moved it to protected and added integration tests. GERALDINE: NEW TOOL ALERT!	2014-01-15 13:49:31 -05:00
Eric Banks	9f1ab0087a	Added in a check for what would be an empty allele after trimming.	2014-01-15 11:04:19 -05:00
Ryan Poplin	201ad398ac	Merge pull request #473 from broadinstitute/eb_fix_qd_indel_normalization The QD normalization for indels was busted and is now fixed.	2014-01-14 08:56:19 -08:00
Eric Banks	e4fdc5ac44	Merge pull request #474 from broadinstitute/eb_fix_haplotype_resolver_PT63333488 Fixing the Haplotype Resolver so that it doesn't complain about missing header lines	2014-01-14 07:36:53 -08:00
Eric Banks	fd511d12a2	Fixing the Haplotype Resolver so that it doesn't complain about missing header lines. The code comments very clearly state that INFO fields shouldn't be propagated into the output, but someone must have accidentally changed it afterwards. This is just a simple one-line fix to make sure the code adhered to the comments. Delivers #63333488.	2014-01-13 22:47:43 -05:00
Geraldine Van der Auwera	8fcad6680b	Assorted fixes and improvements to gatkdocs -Added docs for ERC mode in HC -Move RecalibrationPerformance walker to private since it is experimental and unsupported -Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them -Improved error msg for conflict between per-interval aggregation and -nt -Minor clean up in exception docs -Added Toy Walkers category for devs and dev supercat (to build out docs for developers) -Added more detailed info to GenotypeConcordance doc based on Chris forum post -Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples) -Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it) -Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)	2014-01-13 17:46:22 -05:00
Eric Banks	c7e08965d0	The QD normalization for indels was busted and is now fixed. It is true that indels of length > 1 have higher QUALS than those of length = 1. But for the HC those QUALS are not that much higher, and it doesn't continue scaling up as the indels get larger. So we no longer normalize by indel length (which massively over-penalizes larger events and effectively drops their QD to 0). For the UG the previous normalization also wasn't perfect. Now we divide the indel length by a factor of 3 to make sure that QD is consistent over the range of indel lengths. Integration tests change because QD is different for indels. Also, got permission from Valentin to archive a failing test that no longer applies. Thanks to Kurt on the GATK forum for pointing this all out.	2014-01-13 15:23:36 -05:00
droazen	7cd304fb41	Merge pull request #470 from broadinstitute/mf_new_RBP Mf new rbp	2014-01-13 08:46:27 -08:00
MauricioCarneiro	50cd6781b3	Merge pull request #465 from broadinstitute/eb_improvements_to_ref_confidence_merger Improvements to ref confidence merger	2014-01-08 10:51:01 -08:00
Eric Banks	f172c349f6	Adding the functionality to enable users to input a file of VCFs for -V. To do this I have added a RodBindingCollection which can represent either a VCF or a file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so that one can intermix VCFs and VCF lists. For VariantContext tags with a list, by default the tags for the -V argument are applied unless overridden by the individual line. In other words, any given line can have either one token (the file path) or two tokens (the new tags and the file path). For example: foo.vcf VCF,name=bar bar.vcf Note that a VCF list file name must end with '.list'. Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.	2014-01-08 00:45:00 -05:00
Eric Banks	c133909d32	Fixed edge condition in the realigner where a realigned read can sometimes get partially aligned off the end of the contig. Now we ignore such reads (which is much easier than trying to figure out when to soft-clip). Added unit test.	2014-01-08 00:37:28 -05:00
Menachem Fromer	e33d3dafc6	Add documentation for RBP, and also update the MD5 for the tests now that the output uses HP tags instead of '\|', which is now reserved for trio-based phasing	2014-01-03 12:04:47 -05:00
Menachem Fromer	d1275651ae	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2014-01-03 01:13:40 -05:00
Ryan Poplin	856c1f87c1	Allow for additional input data to be used in the VQSR for clustering but don't carry it forward into the output VCF file. -- New -a argument in the VQSR for specifying additional data to be used in the clustering -- New NA12878KB walker which creates ROC curves by partitioning the data along VQSLOD and calculating how many KB TP/FP's are called.	2014-01-02 14:46:04 -05:00
amilev	f81a38f596	Merge pull request #446 from broadinstitute/ami-RNAseq-tools Write a new tool for splitting reads that have N cigar string.	2014-01-01 21:06:25 -08:00
MauricioCarneiro	1223345726	Merge pull request #459 from broadinstitute/eb_fix_bad_hmm_clipping Fixed up edge condition for clipping long reads in the HMM.	2014-01-01 20:00:34 -08:00
Ami Levy-Moonshine	6da53aea09	Write a new tool for splitting reads that have N cigar string. For example, this tool can be used for processing bowtie RNA-seq data. Each read with k N-cigar elements is split into k+1 reads. The split is done by hard clipping the rest of the bases. In order to do it, a few changes were introduced to some other clipping methods: - make a significant change in ClippingOp.hardClip() that prevents the splitting of a read with cigar: 1M2I1N1M3I. - change getReadCoordinateForReferenceCoordinate in ReadUtils to recognize Ns create unitTests for that walker: - change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication - move some useful methods from ReadClipperTestUtils to CigarUtils create integration test for that class small change in a comment in FullProcessingPipeline last commit: Address review comments: - move to protected under walkers/rnaseq - change the read splitting methods to be more readable and more efficient - change (minor changes) some methods in ReadClipper to allow the changes in split reads - add (minor change) one method to CigarUtils to allow the changes in split reads - change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar - address the rest of the review comments (minor changes) - fix ReadUtilsUnitTest.testReadWithNs according to the default behaviour of getReadCoordinateForReferenceCoordinate (in case of a reference index that falls into a deletion, return the read index of the base before the deletion). - add another test to ReadUtilsUnitTest.testReadWithNs - Allow the user to print the split positions (not working properly currently)	2014-01-01 22:21:36 -05:00
Eric Banks	bb4c4b1fcd	Fixed up edge condition for clipping long reads in the HMM. MD5s change because some reads were incorrectly getting clipped before. [delivers #62584746]	2014-01-01 19:05:09 -05:00
Mauricio Carneiro	d52bd44867	Move CompareBAMs to private This is a tool that we use internally to validate the ReduceReads development. I think it should be private. There is no need to improve docs. [delivers #54703398]	2014-01-01 14:33:23 -05:00
Eric Banks	9665f75ad4	Don't fail in annotations if the wrong tools are calling them, just silently skip them. This is important for cases when users want to use annotation groups (like all experimental annotations).	2013-12-31 23:45:21 -05:00
Eric Banks	83e09b1f64	Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline. Basically, it does 3 things (as opposed to having to call into 3 separate walkers): 1. merge the records at any given position into a single one with all alleles and appropriate PLs 2. re-genotype the record using the exact AF calculation model 3. re-annotate the record using the VariantAnnotatorEngine In the course of this work it became clear that we couldn't just use the simpleMerge() method used by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new utility method to handle this merging and pulled any related code out of CombineVariants. I tried to clean up a lot of that code, but ultimately that's out of the scope of this project. Added unit tests for correctness testing. Integration tests cannot be used yet because the HC doesn't output correct gVCFs.	2013-12-31 12:07:56 -05:00
Menachem Fromer	48ef7a1a2f	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2013-12-19 10:42:20 -05:00
Valentin Ruano-Rubio	5db520c6fa	Fixed issue with > 0 log likelihoods when using the GraphBased likelihood engine, reported by Mauricio. Added some integration tests to check the fix.	2013-12-13 11:19:57 -05:00
Eric Banks	ab33db625f	Merge pull request #449 from broadinstitute/eb_move_calc_posteriors_to_protected Moved CalculatePosteriors from private to protected, in preparation for 3.0	2013-12-07 22:18:46 -08:00
Eric Banks	f1970b923e	Moved CalculatePosteriors from private to protected, in preparation for 3.0. Renamed it CalculateGenotypePosteriors. Also, moved the utility code to a proper utility class instead of where Chris left it. No actual code modifications made in this commit.	2013-12-08 00:08:34 -05:00
David Roazen	932cd3ada7	Fix 3rd-party library dependency issues in the HC/PairHMM tests In general, test classes cannot use 3rd-party libraries that are not also dependencies of the GATK proper without causing problems when, at release time, we test that the GATK jar has been packaged correctly with all required dependencies. If a test class needs to use a 3rd-party library that is not a GATK dependency, write wrapper methods in the GATK utils/* classes, and invoke those wrapper methods from the test class.	2013-12-06 13:16:55 -05:00
Eric Banks	e022db4690	Added docs for the minPruning argument in the HC	2013-12-05 11:50:56 -05:00
Geraldine Van der Auwera	3ab2f4edb2	Fixed documentation for -deletions argument in the UAC	2013-12-04 19:55:24 -05:00
amilev	0d94019bd6	Merge pull request #434 from broadinstitute/mc_dt_gccontent Add GC Content to DiagnoseTargets	2013-12-04 09:42:26 -08:00
Joel Thibault	5fe0531b4d	Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy	2013-12-03 23:12:14 -05:00
Mauricio Carneiro	701ede2817	Add GC Content to DiagnoseTargets	2013-12-03 23:04:40 -05:00
Eric Banks	6bee6a1b53	Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles. Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info. Now the PLs and AD get correctly selected down. While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just broke it out into smaller chunks as per our best practices. Added unit tests and updated integration tests. Addressed reviews.	2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio	0f99778a59	Adding Graph-based likelihood ratio calculation to HC To activate this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line. New HC Options (both Advanced and Hidden): ========================================== --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM) Specifies what engine should be used to generate read vs haplotype likelihoods. PairHMM : standard full-PairHMM approach. GraphBased : using the assembly graph to accelerate the process. Random : generate random likelihoods - used for benchmarking purposes only. --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN) It indicates how to merge haplotypes produced using different kmerSizes. Only has effect when used in combination with (--likelihoodCalculationEngine GraphBased) COMBO_MIN : use the smallest kmerSize with all haplotypes. COMBO_MAX : use the larger kmerSize with all haplotypes. MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it. MAX_ONLY : use the larger kmerSize with haplotypes assembled using it. Major code changes: =================== * Introduce multiple likelihood calculation engines (before there was just one). * Assembly results from different kmerSizes are now packed together using the AssemblyResultSet class. * Added yet another PairHMM implementation with a different API in order to support local PairHMM calculations (e.g. a segment of the read vs a segment of the haplotype). Major components: ================ * FastLoglessPairHMM: New pair-hmm implementation using some heuristic to speed up partial PairHMM calculations * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the execution of the graph-based likelihood approach. * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals to calculate the likelihoods using the graph as a scaffold. * HaplotypeGraph: haplotype threading graph built from the assembly haplotypes. This structure is the one used by GraphBasedLikelihoodCalculationEngineInstance to do its work. * ReadAnchoring and KmerSequenceGraphMap: contain information on how a read maps onto the HaplotypeGraph that is used by GraphBasedLikelihoodCalculationEngineInstance to do its work. Remove mergeCommonChains from HaplotypeGraph creation Fixed bamboo issues with HaplotypeGraphUnitTest Fixed problems with HaplotypeCallerIntegrationTest Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest Fixed ReadThreadingLikelihoodCalculationEngine issues Moved event-block iteration outside GraphBasedEngineInstance Removed unnecessary parameter from ReadAnchoring constructor. Fixed test problem Added a bit more documentation to EventBlockSearchEngine Fixing some private - protected dependency issues Further refactoring making GraphBasedInstance and HaplotypeGraph slimmer. Addressed last pull request commit comments Fixed FastLoglessPairHMM public -> protected dependency Fixed problem with HaplotypeGraph unit test	2013-12-02 19:37:19 -05:00
Eric Banks	84ddfb41b5	Merge pull request #438 from broadinstitute/rp_vqsr_num_bad_stability_fixes_and_runtime_optimizations Various VQSR optimizations in runtime and stability.	2013-12-02 08:37:37 -08:00
Ryan Poplin	6a922e7aca	Merge pull request #435 from broadinstitute/eb_fix_ug_bug_for_long_deletions Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads.	2013-12-02 08:09:23 -08:00
Ryan Poplin	b57054c63c	Various VQSR optimizations in both runtime and accuracy. -- For very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build the Gaussian mixture model. -- Annotations are ordered by the difference in means between known and novel instead of by their standard deviation. -- Removed the training set quality score threshold. -- Now uses 2 gaussians by default for the negative model. -- Num bad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores. -- Model plots are now generated much faster. -- Stricter threshold for determining model convergence. -- All VQSR integration tests change because of these changes to the model. -- Add test for downsampling of training data.	2013-11-29 13:04:46 -05:00
Eric Banks	df6499e58c	Bug fix for RR: stop (incorrectly) pulling the MQ out of the SAMRecord as a byte instead of an int. For reads with high MQs (greater than max byte) the MQ was being treated as negative and failing the min MQ filter. Added unit test. Delivers PT#61567540.	2013-11-27 18:55:03 -05:00
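The byte-versus-int issue described in df6499e58c can be illustrated with a minimal Java sketch. The class and method names below are hypothetical and are not taken from the GATK code base; the sketch only shows how narrowing an int mapping quality to a byte makes values above 127 negative, so that they fail a minimum-MQ check.

```java
// Hypothetical illustration, not GATK code: narrowing an int MQ to a byte.
class MappingQualityNarrowingSketch {

    // Buggy variant: the MQ is narrowed to a byte before the comparison,
    // so any value above Byte.MAX_VALUE (127) wraps around to a negative number.
    static boolean passesWhenNarrowedToByte(int mappingQuality, int minMappingQuality) {
        byte narrowed = (byte) mappingQuality; // e.g. 200 becomes -56
        return narrowed >= minMappingQuality;
    }

    // Fixed variant: the MQ stays an int for the comparison.
    static boolean passesAsInt(int mappingQuality, int minMappingQuality) {
        return mappingQuality >= minMappingQuality;
    }

    public static void main(String[] args) {
        System.out.println(passesWhenNarrowedToByte(200, 20)); // false
        System.out.println(passesAsInt(200, 20));              // true
    }
}
```

With an MQ of 200 and a minimum of 20, the narrowed comparison rejects the read while the int comparison accepts it.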
Eric Banks	51d1a26725	Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads. His code was excessively clipping reads because it was looking at their cigar string instead of just the read length. This meant that it was basically impossible to call large deletions in UG even with perfect evidence in the reads (as reported by Craig D). Integration tests change because (IMO after looking at sites in IGV) reads with indels similar to the one being genotyped used to be given too much likelihood and now give less. Added unit tests for new methods.	2013-11-27 13:54:39 -05:00
Chris Hartl	1f777c4898	Introducing the latest-and-greatest in genotyping: CalculatePosteriors. CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution. Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows: while not converged: * AC = [external AC] + [sample AC] * Prior = DirichletMultinomial[AC] * Posteriors = [sample GL + Prior] * sample AC = MLEAC(Posteriors) This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase. Fully unit tested. Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.	2013-11-27 13:00:45 -05:00
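The prior described in 1f777c4898 can be sketched for the biallelic, ploidy=2 case. This is a sketch under stated assumptions, not the CalculatePosteriors implementation: the class and method names are hypothetical, the Dirichlet parameters are taken to be the raw panel allele counts (whether the tool adds pseudo-counts is not stated in the commit message), and the likelihoods are in linear space rather than log10.

```java
// Hypothetical illustration, not GATK code: Dirichlet-Multinomial genotype prior
// for two independent allele draws, combined with sample genotype likelihoods.
class DirichletMultinomialPriorSketch {

    // Prior over genotypes {ref/ref, ref/alt, alt/alt} given panel allele counts.
    // For Dirichlet parameters (a, b) with A = a + b, two draws give:
    //   P(ref/ref) = a(a+1) / (A(A+1)), P(ref/alt) = 2ab / (A(A+1)), P(alt/alt) = b(b+1) / (A(A+1))
    static double[] diploidPrior(double refCount, double altCount) {
        double total = refCount + altCount; // assumed to be > 0
        double denom = total * (total + 1.0);
        return new double[] {
            refCount * (refCount + 1.0) / denom,
            2.0 * refCount * altCount / denom,
            altCount * (altCount + 1.0) / denom
        };
    }

    // Posterior is proportional to likelihood times prior, normalized to sum to 1.
    static double[] posterior(double[] likelihoods, double[] prior) {
        double[] post = new double[likelihoods.length];
        double sum = 0.0;
        for (int i = 0; i < post.length; i++) {
            post[i] = likelihoods[i] * prior[i];
            sum += post[i];
        }
        for (int i = 0; i < post.length; i++) {
            post[i] /= sum;
        }
        return post;
    }

    public static void main(String[] args) {
        double[] prior = diploidPrior(3.0, 1.0); // approximately 0.6, 0.3, 0.1
        double[] post = posterior(new double[] {0.2, 0.7, 0.1}, prior);
        System.out.println(java.util.Arrays.toString(prior));
        System.out.println(java.util.Arrays.toString(post));
    }
}
```

The three prior terms sum to 1 because a(a+1) + 2ab + b(b+1) = A(A+1).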
Eric Banks	0fac4fb3b6	Make the reference model calculation work with reduced reads. It's just a matter of using PileupElement.getRepresentativeCount() instead of '++'.	2013-11-21 10:53:33 -05:00
Eric Banks	adb77b406f	Fixed poor implementation of isRefSource() and isRefSink() among others. There was already a note in the code about how wrong the implementation was. The bad code was causing a single-node graph to get cleaned up into nothing when pruning tails. Delivers PT #61069820.	2013-11-21 10:53:27 -05:00
Ami Levy-Moonshine	e6ef37de1d	Add an option to filter the read bases that are taken into account for the covered intervals. For that, two new arguments were added: minBaseQuality and minMappingQuality	2013-11-18 17:29:32 -05:00
MauricioCarneiro	7f08250870	Merge pull request #417 from broadinstitute/bt_pairhmm_api_cleanup2 Improve the PairHMM API for better FPGA integration	2013-11-14 10:47:07 -08:00
bradtaylor	e40a07bb58	Improve the PairHMM API for better FPGA integration Motivation: The API was different between the regular PairHMM and the FPGA-implementation via CnyPairHMM. As a result, the LikelihoodCalculationEngine had to account for this. The goal is to change the API to be the same for all implementations, and make it easier to access. PairHMM PairHMM now accepts a list of reads and a map of alleles/haplotypes and returns a PerReadAlleleLikelihoodMap. Added a new primary method that loops over the reads and haplotypes, extracts qualities, and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method. Did not alter that method, or its subcompute method, at all. PairHMM also now handles its own (re)initialization, so users don't have to worry about that. CnyPairHMM Added that same new primary access method to this FPGA class. Method overrides the default implementation in PairHMM. Walks through a list of reads. Individual-read quals and the full haplotype list are fed to batchAdd(), as before. However, instead of waiting for every read to get added, and then walking through the reads again to extract results, we just get the haplotype-results array for each read as soon as it is generated, and pack it into a perReadAlleleLikelihoodMap for return. The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not. LikelihoodCalculationEngine The functionality to loop through the reads and haplotypes and get individual log10-likelihoods was moved to the PairHMM, and so removed from here. However, this class does need to retain the ability to pre-process the reads, and post-process the resulting likelihoods map. Those features were separated from running the HMM and refactored into their own methods. Commented out the (unused) system for finding best N haplotypes for genotyping. PairHMMIndelErrorModel Similar changes were made as to the LCE. 
However, in this case the haplotypes are modified based on each individual read, so the read-list we feed into the HMM only has one read.	2013-11-14 09:45:33 -05:00
Geraldine Van der Auwera	dac3dbc997	Improved gatkdocs for InbreedingCoefficient, ReduceReads, ErrorRatePerCycle Clarified caveat for InbreedingCoefficient Cleaned up docstrings for ReduceReads Brushed up doc for ErrorRatePerCycle	2013-11-13 14:33:04 -05:00
Eric Banks	0e3d83d1ef	Merge pull request #413 from broadinstitute/rp_qd_and_qual_updates_in_ref_model_pipeline Improvements to the reference model pipeline.	2013-11-05 06:33:17 -08:00
Eric Banks	09dfaf1a68	Merge pull request #416 from broadinstitute/mc_quick_fixes_to_cser_pipeline Add interpretation to QualifyMissingIntervals	2013-11-05 06:08:13 -08:00
Ryan Poplin	b22c9c2cb4	Improvements to the reference model pipeline. -- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants) -- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.	2013-11-01 17:58:25 -04:00
Mauricio Carneiro	5ed47988b8	Changed the parameter names from cds to baits Making the usage more clear since the parameter is being used over and over to define baited regions. Updated the headers accordingly and made it more readable.	2013-10-24 17:15:56 -04:00
Chris Hartl	9d932e8c60	Merged bug fix from Stable into Unstable	2013-10-10 14:31:33 -04:00
Chris Hartl	6f46d1187a	Remember to copy the integration test changes as well as the walker changes	2013-10-10 14:30:37 -04:00
Mauricio Carneiro	5d6421494b	Fix mismatching number of columns in report Quick fix for the missing column header in the QualifyMissingIntervals report. Adding a QScript for the tool as well as a few minor updates to the GATKReportGatherer.	2013-10-09 14:38:15 -04:00
Mauricio Carneiro	63ace685c9	add unit tests	2013-10-04 11:44:07 -04:00
Mauricio Carneiro	839b918f58	Length metric updates to QualifyMissingIntervals * add a length of the overlapping interval metric as per CSER request * standardized the distance metrics to be positive when fully overlapping and the longest off-target tail (as a negative number) when not overlapping * add gatkdocs to the tool (finally!)	2013-10-04 10:18:13 -04:00
Geraldine Van der Auwera	9f7fa247f6	Disable VQSR tranche plots in INDEL mode	2013-09-30 17:14:37 -04:00
Ryan Poplin	ef1d58b7ff	Bugfix for hom ref records that aren't GVCF blocks.	2013-09-29 19:19:26 -04:00
Geraldine Van der Auwera	27808d336a	Minor clarifications regarding ignoreFilter argument	2013-09-26 13:13:53 -04:00
Geraldine Van der Auwera	a9fa5206ee	Merge pull request #399 from broadinstitute/eb_update_docs_for_DepthPerSampleHC Updated docs for DepthPerSampleHC to deliver PT #54237024.	2013-09-25 15:20:19 -07:00
Ryan Poplin	f362597f69	Merge pull request #400 from broadinstitute/mm_bugfix_combine_variants_implicit_casting Bug fix: annotation values are parsed as Doubles when they should be parsed as Integers due to implicit conversion.	2013-09-25 11:47:17 -07:00
Michael McCowan	5113e21437	Bug fix: annotation values are parsed as Doubles when they should be parsed as Integers due to implicit conversion. * Updated expected test data in which an integer annotation (MQ0) was formatted as a double.	2013-09-25 13:12:02 -04:00
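The commit message for 5113e21437 does not say where the implicit conversion happened. One common way this kind of bug arises in Java is a conditional expression with Integer and Double operands, which is promoted to double before boxing. The sketch below is a hypothetical illustration of that language rule only (class and method names are invented, and this is an assumption about the failure mode rather than the actual CombineVariants code).

```java
// Hypothetical illustration, not GATK code: numeric promotion in a ternary expression.
class NumericPromotionSketch {

    // Both operands are numeric, so the conditional expression has type double.
    // Even when the Integer branch is taken, the result is boxed as a Double.
    static Object viaTernary(String value, boolean isInteger) {
        return isInteger ? Integer.valueOf(value) : Double.valueOf(value);
    }

    // Separate return statements avoid the promotion and keep the Integer.
    static Object viaBranches(String value, boolean isInteger) {
        if (isInteger) {
            return Integer.valueOf(value);
        }
        return Double.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(viaTernary("29", true));  // 29.0
        System.out.println(viaBranches("29", true)); // 29
    }
}
```

An integer annotation routed through the ternary form would be written out as 29.0 instead of 29, which matches the symptom described for MQ0 in the test data.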
Eric Banks	2783c84c6b	Updated docs for DepthPerSampleHC to deliver PT #54237024 .	2013-09-24 22:32:19 -04:00
Eric Banks	d6992d1263	Updated docs to tell users not to use PCR indel error modeling for PCR free data.	2013-09-23 15:48:47 -04:00
Mauricio Carneiro	5bbad75402	Changing max coverage threshold Because Integer.maxValue is not unit testable	2013-09-20 18:54:40 -04:00
Geraldine Van der Auwera	175388de1d	Merge pull request #396 from broadinstitute/mc_dt_excessive_coverage_defaults Updating excessive coverage default parameter & docs+test for QualifyMissingIntervals	2013-09-20 15:12:16 -07:00
Mauricio Carneiro	5e2ffc74fc	Automated interpretation for QualifyMissingIntervals * add a new column to do what I have been doing manually for every project, understand why we got no usable coverage in that interval * add unit tests -- this tool is now public, we need tests. * slightly better docs -- in an effort to produce better docs for this tool	2013-09-20 16:47:12 -04:00
Mauricio Carneiro	74639463b9	Updating excessive coverage default parameter most people don't care about excessive coverage (unless you're very particular about your analysis). Therefore the best possible default value for this is Integer.maxValue so it doesn't get in the way. Itemized Changes: * change maximumCoverage threshold to Integer.maxValue [delivers #57353620]	2013-09-19 23:07:20 -04:00
MauricioCarneiro	014bc4269e	Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation Add Array Logless PairHMM	2013-09-08 20:16:53 -07:00
Ryan Poplin	3503050a39	Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller -- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller. -- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median -- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs -- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model -- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions -- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings -- We only realign reads in the reference model if the read is informative for a particular haplotype over another -- GVCF blocks will now track and output the minimum PLs over the block -- MD5 changes! -- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny -- GVCF tests: from HC changes above and adding in active region trimming	2013-09-06 16:56:34 -04:00
Ryan Poplin	add17dc463	Merge pull request #388 from broadinstitute/eb_change_record_size_mismatch_to_user_error Changed the error for the record size mismatch in the genotyping engine ...	2013-08-30 10:29:54 -07:00
Eric Banks	ea0deb1bb2	Changed the error for the record size mismatch in the genotyping engine to be a user error since it is possible to reach this state with input VCFs that contain the same event multiple times (and it's not something we want to handle in the code).	2013-08-30 12:18:19 -04:00
Louis Bergelson	4473b0065e	adding a check for the UNAVAILABLE case of GenotypeType in CountVariants	2013-08-29 17:27:00 -04:00
bradtaylor	0435bbe38f	Retrieved PairHMM benchmarks from archive and made improvements PairHMMSyntheticBenchmark and PairHMMEmpiricalBenchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely. Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords". I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data. It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to: PairHMMBandedEmpiricalBenchmark PairHMMTestData BandedLoglessPairHMMUnitTest I decided against this as the banded benchmark and unit test are archived.	2013-08-28 17:23:35 -04:00
bradtaylor	86fe9fae76	Changes to Array PairHMM to address review comments Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.	2013-08-28 17:23:29 -04:00
bradtaylor	3671e41b0c	Add Array Logless PairHMM A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version. Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix. This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations. We also had to adjust the classes that employ the pairHMM: LikelihoodCalculationEngine (used by HaplotypeCaller) PairHMMIndelErrorModel (used by indel genotyping classes) Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection. The latter affects classes: ErrorModel GeneralPloidyIndelGenotypeLikelihoodsCalculationModel IndelGenotypeLikelihoodsCalculationModel ... all of which use the pairHMM via PairHMMIndelErrorModel	2013-08-28 17:21:23 -04:00
Ryan Poplin	7479152977	Merged bug fix from Stable into Unstable	2013-08-28 12:40:25 -04:00
Ryan Poplin	6bda569666	One of the log10sumLog10s in the VQSR was missed in a previous bug fix. Thanks to Mike McCowan for spotting this one.	2013-08-28 12:40:08 -04:00
Geraldine Van der Auwera	ed465cd2a5	Fixed a few typos and clarified some doc points.	2013-08-26 17:33:17 -04:00
David Roazen	42d771f748	Remove org.apache.commons.collections.IteratorUtils dependency from the test suite -This was a dependency of the test suite, but not the GATK proper, which caused problems when running the test suite on the packaged GATK jar at release time -Use GATKVCFUtils.readVCF() instead	2013-08-21 19:44:02 -04:00
Eric Banks	d4dc5ba04a	Fixed bug in PhaseByTransmission where it was completely dropping multi-allelic records. Added test to make sure this is no longer happening.	2013-08-21 15:46:57 -04:00
Eric Banks	6663d48ffe	Merge pull request #381 from broadinstitute/mm_rev_picard_to_get_tribble_updates Adaptations to accommodate Tribble API changes.	2013-08-19 18:31:02 -07:00
Michael McCowan	c3a933ce84	Adaptations to accommodate Tribble API changes, consisting mostly of the following. * Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable. * Galvanizing of generic types. * Test fixups, mostly to pass around LineIterators instead of LineReaders. * New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances. * New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).	2013-08-19 15:52:47 -04:00
Eric Banks	17eb7b49fe	Adding ability to use Ryan's PCR error modeling to the Haplotype Caller. There is now a command-line option to set the model to use in the HC. Incorporated Ryan's current (unmerged) branch in for most of these changes. Because small changes to the math can have drastic effects, I decided not to let users tweak the calculations themselves. Instead they can select either NONE, CONSERVATIVE (the default), or AGGRESSIVE. Note that any base insertion/deletion qualities from BQSR are still used. Also, note that the repeat unit x repeat length approach gave very poor results against the KB, so it is not included as an option here.	2013-08-16 01:53:04 -04:00
Eric Banks	69e78efeae	Merge pull request #366 from broadinstitute/gg_gatkdocfixes More gatkdoc fixes	2013-08-13 04:52:03 -07:00
Eric Banks	bcf9a1cda5	Merge pull request #370 from broadinstitute/rp_dont_output_filtered_variants_in_VQSR Adding mode to VQSR to not output variant records that are filtered out ...	2013-08-12 12:01:50 -07:00
Ryan Poplin	a45011d7e7	Adding mode to VQSR to not output variant records that are filtered out after applying the recalibration. Necessary for 1000G calling.	2013-08-12 11:22:59 -04:00
Ryan Poplin	59f56bef30	Cleaning up help text for the -numBad argument.	2013-08-12 09:51:56 -04:00
Geraldine Van der Auwera	4d20c71e09	Improvements to various gatkdocs - Make -rod required - Document that contaminationFile is currently not functional with HC - Document liftover process more clearly - Document VariantEval combinations of ST and VE that are incompatible - Added a caveat about using MVLR from HC and UG. - Added caveat about not using -mte with -nt - Clarified masking options - Fixed docs based on Eric's comments	2013-08-10 10:01:31 -07:00
Mark DePristo	b7d1096ced	Added onlyEmitSamples argument to UnifiedGenotyper -- When provided, this argument causes us to only emit the selected samples into the VCF. No INFO field annotations (AC for example) or other features are modified. Its current primary use is for efficiently evaluating joint calling. -- Add integration test for onlyEmitSamples	2013-08-09 11:00:15 -04:00
Mark DePristo	ccf0df0fea	Misc. debugging functionality to FS calculation (disabled by default)	2013-08-08 12:06:23 -04:00
Mark DePristo	00f4d767e4	Merge pull request #364 from broadinstitute/md_vqsr_improvements Separate num Gaussians for + and - GMM in VQSR	2013-08-07 04:37:45 -07:00
Mark DePristo	c21402d4af	Separate num Gaussians for + and - GMM in VQSR -- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models. However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model. -- Now the maxGaussians argument only applies to the positive model -- This update builds a GMM for the negative model with a default 4 max gaussians (though this can be controlled via command line parameter) -- Removes the percentBadVariants argument. The only way to control how many variants are included in the negative model is with minNumBad -- Reduced the minNumBad argument default to 1000 from 2500 -- Update MD5s for VQSR. md5s changed significantly due to underlying changes in the default GMM model. Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected. -- minNumBad is now numBad -- Plot all negative training points as well, since this significantly changes our view of the GMM PDF	2013-08-07 07:36:50 -04:00
Mark DePristo	318f7e74e4	Better docs on the meaning of heterozygosity -- [delivers #53522209]	2013-08-07 07:27:45 -04:00
Mark DePristo	40bc7d6a9c	Bugfix for ReferenceConfidenceModel -- In the case where there's some variation to assemble and evaluate but the resulting haplotypes don't result in any called variants, the reference model would exception out with "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype". Now we detect this case and emit the standard no variation output. -- [delivers #54625060]	2013-08-06 16:00:32 -04:00
Ryan Poplin	a46f633bd6	Fix for the VQSR visualization script with the new ordering of annotations.	2013-08-02 19:10:45 -04:00
Mauricio Carneiro	285ab2ac62	Better caching for the HaplotypeCaller Problem ------- Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless. Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will cluster haplotypes of different lengths together, therefore wasting tons of re-compute. Solution ------- Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.	2013-08-02 01:27:29 -04:00
Eric Banks	1e396af4d0	Two reduce reads updates/fixes: 1. Removing old legacy code that was capping the positional depth for reduced reads to 127. Unfortunately this cap effectively performs biased down-sampling and throws off e.g. FS numbers. Added end to end unit test that depth counts in RR can be higher than max byte. Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127. 2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads. Refactored it so that it can only remove non-consensus reads.	2013-08-01 14:34:59 -04:00
Ryan Poplin	4f3411f3d4	Max number of haplotypes to evaluate no longer grows unbounded with the number of samples. This is necessary for multi-sample calling projects with over 100 samples.	2013-07-31 10:48:55 -04:00
Yossi Farjoun	284176cd7b	moved SnpEffUtilUnitTest to public tree	2013-07-30 17:51:40 -04:00
droazen	b8709b1942	Merge pull request #332 from broadinstitute/st_fpga_hmm FPGA support for PairHMM	2013-07-30 14:21:21 -07:00
Joseph Rose	d2860a5486	Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose)	2013-07-30 15:38:32 -04:00
Mauricio Carneiro	7b731dd596	Removed native method call and fixed indentation.	2013-07-30 13:59:58 -04:00
Geraldine Van der Auwera	edbd17b8e0	Added note of caution to VQSR gatkdocs for option BOTH of recalibration mode	2013-07-26 15:51:29 -04:00
Ryan Poplin	f52196496d	Merge pull request #347 from broadinstitute/eb_more_dnagling_tail_improvements More specific fix for the dangling tail edge case with a single leading deletion.	2013-07-26 07:25:47 -07:00
Ryan Poplin	8c205dda1b	Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line.	2013-07-26 10:22:43 -04:00
Eric Banks	9372c5ef41	Merge pull request #334 from broadinstitute/mc_generic_input_for_qualify_missing_intervals QualifyMissingIntervals: support different formats	2013-07-25 12:39:26 -07:00
sathibault	71eb944e62	Adding CnyPairHMMUnitTest	2013-07-25 14:19:50 -05:00
Eric Banks	5dfa863caa	Fully stranded implementation of RR (plus bug fix for insertions and het compression). Now only filtered reads are unstranded. All consensus reads have strand, so that we emit 2 consensus reads in general now: one for each strand. This involved some refactoring of the sliding window which cleaned it up a lot. Also included is a bug fix: insertions downstream of a variant region weren't triggering a stop to the compression.	2013-07-25 14:48:53 -04:00
Eric Banks	0a2b5ddadf	More specific fix for the dangling tail edge case with a single leading deletion. The previous fix was too general (and therefore incorrect) and caused the HC to exception out. Added "unit" test for this exact case.	2013-07-25 12:24:46 -04:00
Mauricio Carneiro	31ab0824b1	quick indentation fixes to FPGA code	2013-07-24 14:09:49 -04:00
Eric Banks	6df43f730a	Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes. Having them as bytes caused problems for downstream programmers who had data with high MQs.	2013-07-23 23:47:15 -04:00
Guillermo del Angel	9dd109b79a	Last feature request from Reich/Paavo labs: the allSitePLs feature in UG worked but did not quite fulfill the requirements. What's needed is the ability to have all 10 PLs for EVERY site, regardless of whether they are variant or not. Previous version only emitted the 10 PLs in reference sites. Problem is that, if all PLs are emitted in all sites and every single site is quad-allelic (only way to have the PLs printed out in a valid way) then the ability to filter variants and to use the INFO fields may be compromised. So, compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which has the 10 values, but all other statistics and regular PLs are computed as before. Note that integration test had to be disabled, as the BCF2 codec apparently doesn't support writing into genotype fields other than PL,DP,AD,GQ,FT and GT.	2013-07-18 12:54:52 -04:00
Scott Thibault	5d198d3400	Added write to likelihoods.txt for batch hmm	2013-07-15 10:16:39 -05:00
sathibault	0a8f75b953	Merge branch 'master' into st_fpga_hmm Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java	2013-07-15 08:17:32 -05:00
Mauricio Carneiro	8c07614321	QualifyMissingIntervals: support different formats Problem ------- Qualify Missing Intervals only accepted GATK formatted interval files for its coding sequence and bait parameters. Solution ------- There is no reason for such limitation, I erased all the code that did the parsing and used IntervalUtils to parse it (therefore, now it handles any type of interval file that the GATK can handle). ps: Also added an average depth column to the output	2013-07-12 17:32:53 -04:00
Yossi Farjoun	afcf7b96db	- Added per-sample AlleleBiasedDownsampling capability to HaplotypeCaller - Added integration test to show that providing a contamination value and providing same value via a file results in the same VCF - overrode default contamination value in test	2013-07-12 16:22:02 -04:00
Eric Banks	b16c7ce050	A whole slew of improvements to the Haplotype Caller and related code. 1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout. 2. Updates to the KB assessment functionality: a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call. b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling. 3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived the DeBruijn assembler. 4. Improvements to the likelihoods for the HC a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need to divide e by 3 because the observed base could have come from any of the non-observed alleles. b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are turned into Q0s for now. c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real sites with off kilter allele balances and we occasionally lose them to contamination removal. 5. Improved the dangling tail merging implementation.	2013-07-12 10:09:10 -04:00
sathibault	23fe3e449a	Revert "Fixed batching bug." This reverts commit 3e56c83d0eec7c374e5f187d1ef124d42ecc071e.	2013-07-11 11:30:37 -05:00
sathibault	7458b59bb3	Fixed batching bug.	2013-07-11 11:08:46 -05:00
Menachem Fromer	a8ea57df9e	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-07-10 16:44:35 -04:00
Guillermo del Angel	aba55dbb23	Moved some HC parameters related to active region extensions to command line arguments so that they're more easily modified. Some of these parameters need tinkering in order to call some large indels. See GSA-891 and subtasks for particular examples thereof.	2013-07-10 14:31:10 -04:00
Eric Banks	73fc7f6ab1	Reduce Reads output should never be expected to be sorted (hence the need to sort on disk) but for some reason it was with -nwayout mode.	2013-07-08 10:33:36 -04:00
Eric Banks	5f5c90e65c	Fix bug introduced recently in the VariantAnnotator where only the last -comp was being annotated at a site. Trivial fix, added integration test to cover it.	2013-07-05 00:04:52 -04:00
Mark DePristo	5f34054cc1	Remove filtering of MAPQ 0 reads from CalledHaplotypeBAMWriter	2013-07-02 15:46:49 -04:00
Mark DePristo	ed0b1c5aba	Fix bug in ReadThreadingAssembler in cycle failures causing NPE	2013-07-02 15:46:48 -04:00
Mark DePristo	e3e8631ff5	Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences -- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction. Fixing an annoying bug where you'd perfectly assemble the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because null was also used to indicate assembly failure -- Output format looks like: 20 10026072 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026073 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119 20 10026074 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,121 20 10026075 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119 20 10026076 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026077 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217 20 10026079 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,240 20 10026080 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,268 20 10026081 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,267 We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values. Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty. -- Can be enabled for single samples with --emitRefConfidence (-ERC). -- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval. The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads -- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures. Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class. -- Includes GVCF writer -- Add 1 mb of WEx data to private/testdata -- Integration tests for reference model output for WGS and WEx data -- Emit GQ block information into VCF header for GVCF mode -- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC -- Control max indel size for the reference confidence model from the command line. Increase default to 10 -- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest -- Unittests for ReferenceConfidenceModel -- Unittests for new MathUtils functions	2013-07-02 15:46:38 -04:00
Mark DePristo	41aba491c0	Critical bugfix for adapter clipping in HaplotypeCaller -- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference. Terrible. -- Removed the broken logic of determining whether a read adaptor is too long. -- Doesn't require isProperPairFlag to be set for a read to be adapter clipped -- Update integration tests for new adapter clipping code	2013-07-02 15:46:36 -04:00
Scott Thibault	82dcdc01c0	Merge branch 'master' into st_fpga_hmm Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java	2013-06-28 10:13:05 -05:00
Scott Thibault	e691fa3e19	FPGA null pointer bug fix	2013-06-28 08:52:09 -05:00
Ryan Poplin	825b603acb	Merge pull request #298 from broadinstitute/md_likelihood_rank_sum Md likelihood rank sum	2013-06-27 11:14:25 -07:00
Mark DePristo	a514dd0643	Merge pull request #307 from broadinstitute/eb_rr_off_by_one_error Proper fix for previous RR -cancer_mode fix.	2013-06-26 13:02:23 -07:00
Eric Banks	876e40466a	Proper fix for previous RR -cancer_mode fix. I "fixed" this once before but instead of testing with unit tests I used integration tests. Bad decision. The proper fix is in now, with a bonafide unit test included.	2013-06-26 14:48:09 -04:00
Eric Banks	f242be12c0	Make this walker @Hidden	2013-06-26 11:45:21 -04:00
Mark DePristo	ff76d0c877	Merge pull request #304 from broadinstitute/eb_rr_header_negative_fix_again Fixing the 'header is negative' problem in Reduce Reads... again.	2013-06-24 11:55:52 -07:00
Eric Banks	165b936fcd	Fixing the 'header is negative' problem in Reduce Reads... again. Previous fixes and tests only covered trailing soft-clips. Now that up front hard-clipping is working properly though, we were failing on those in the tool. Added a patch for this as well as a separate test independent of the soft-clips to make sure that it's working properly.	2013-06-24 14:06:21 -04:00
Valentin Ruano-Rubio	b97f9a487d	Merged bug fix from Stable into Unstable	2013-06-24 14:00:01 -04:00
Mark DePristo	191e4ca251	Merge pull request #300 from broadinstitute/mc_move_qualify_intervals_to_protected Few bug fixes to this tool now that it is in protected	2013-06-24 09:35:45 -07:00
Valentin Ruano-Rubio	3e5ff6095f	Added the pertinent DocumentedGATKFeature annotation to AnalyzeCovariates	2013-06-21 17:02:26 -04:00
Eric Banks	d976aae2b1	Another fix for the Indel Realigner that arises because of secondary alignments. This time we don't accidentally drop reads (phew), but this bug does cause us not to update the alignment start of the mate. Fixed and added unit test to cover it.	2013-06-21 16:59:22 -04:00
Mark DePristo	8caf39cb65	Experimental LikelihoodRankSum annotation -- Added experimental LikelihoodRankSum, which required slightly more detailed access to the information managed by the base class, so added an overloaded getElementForRead also provides access to the MostLikelyAllele class -- Added base class default implementation of getElementForPileupElement() which returns null, indicating that the pileup version isn't supported. -- Added @Override to many of the RankSum classes for safety's sake -- Updates to GeneralCallingPipeline: annotate sites with dbSNP IDs, -- R script to assess the value of annotations for VQSR	2013-06-21 13:57:11 -04:00
Mark DePristo	f726d8130a	VariantRecalibrator bugfix for bad log10sumlog10 values -- The VR, when the model is bad, may evaluate log10sumlog10 where some of the values in the vector are NaN. This case is now trapped in VR and handled as previously -- indicating that the model has failed and evaluation continues.	2013-06-21 12:28:53 -04:00
Mark DePristo	dee51c4189	Error out when NCT and BAMOUT are used with the HaplotypeCaller -- Currently we don't support writing a BAM file from the haplotype caller when nct is enabled. Check in initialize if this is the case, and throw a UserException	2013-06-21 09:25:57 -04:00
Mark DePristo	fdfe4e41d5	Better GATK version and command line output -- Previous version emitted command lines that look like: ##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..." the new version provides additional information on when the GATK was run and the GATK version in a nicer format: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ..."> -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff"> ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff"> -- Removed the ProtectedEngineFeaturesIntegrationTest -- Actual unit tests for these features!	2013-06-20 11:19:13 -04:00
sathibault	3db8908ae8	Remove debug print statement	2013-06-20 08:28:58 -05:00
Mark DePristo	0672ac5032	Fix public / protected dependency	2013-06-19 19:42:09 -04:00
Valentin Ruano-Rubio	1f8282633b	Removed plots generation from the BaseRecalibration software Improved AnalyzeCovariates (AC) integration test. Renamed AC test files ending with .grp to .table Implementation: * Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly. * Fixed BQSRIntegrationTest to work with new changes	2013-06-19 14:47:56 -04:00
Valentin Ruano-Rubio	08f92bb6f9	Added AnalyzeCovariates tool to generate BQSR assessment quality plots. Implementation details: * Added tool class .AnalyzeCovariates Added convenient addAll method to Utils to be able to add elements of an array. * Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple input recalibration reports are compatible and comparable. * Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output. * Added an integration test class.	2013-06-19 14:38:02 -04:00
Guillermo del Angel	f176c854c6	Swapping in logless Pair HMM for default usage with UG: -- Changed default HMM model. -- Removed check. -- Changed md5's: PL's in the high 100s change by a point or two due to new implementation. -- Resulting performance improvement is about 30 to 50% less runtime when using -glm INDEL.	2013-06-18 10:06:27 -04:00
Ryan Poplin	8511c4385c	Adding new pruning parameter to ReadThreadingAssembler -- numPruningSamples allows one to specify that the minPruning factor must be met by this many samples for a path to be considered good (e.g. seen twice in three samples). By default this is just one sample. -- adding unit test to test this new functionality	2013-06-17 16:46:40 -04:00
Guillermo del Angel	f6025d25ae	Feature requested by Reich lab and Paavo lab in Leipzig for ancient DNA processing: -- When doing cross-species comparisons and studying population history and ancient DNA data, having SOME measure of confidence is needed at every single site that doesn't depend on the reference base, even in a naive per-site SNP mode. Old versions of GATK provided GQ and some wrong PL values at reference sites but these were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, that, if enabled will: a) Emit all 3 ALT snp alleles in the ALT column. b) Emit all corresponding 10 PL values. It's up to the user to process these PL values downstream to make sense of these. Note that, in order to follow VCF spec, the QUAL field in a reference call when there are non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields. -- Tweaks and fixes to processing pipelines for Reich lab.	2013-06-17 13:21:09 -04:00
delangel	485ceb1e12	Merge pull request #283 from broadinstitute/md_beagleoutput Simpler FILTER and info field encoding for BeagleOutputToVCF	2013-06-17 09:31:03 -07:00
Eric Banks	e48f754478	Fixes to several of the annotations for reduced reads (and other issues). 1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads. 2. The rank sum tests should not be using reduced reads (because they do not represent distinct observations). 3. Fixed a massive bug in the BaseQualityRankSumTest annotation! It was not using the base qualities but rather the read likelihoods?! Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values. Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and makes insignificant distributions look significant (so it can falsely cause the filtering of good sites). Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.	2013-06-16 01:18:20 -04:00
Mark DePristo	1677a0a458	Simpler FILTER and info field encoding for BeagleOutputToVCF -- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE. So if you had a A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field. This obviously would cause problems for indels, and so the tool was blowing up in this case. Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead. This works in general with any type of allele. -- Here's an example output line from the previous and current versions: old: 20 64150 rs7274499 C . 3041.68 BGL_RM_WAS_A AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88 new: 20 64062 . G . 100.39 BGL_SET_TO_MONOMORPHIC AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15 -- update MD5s to reflect these changes -- [delivers #50847721]	2013-06-14 15:56:13 -04:00
Mark DePristo	dd5674b3b8	Add genotyping accuracy assessment to AssessNA12878 -- Now table looks like: Name VariantType AssessmentType Count variant SNPS TRUE_POSITIVE 1220 variant SNPS FALSE_POSITIVE 0 variant SNPS FALSE_NEGATIVE 1 variant SNPS TRUE_NEGATIVE 150 variant SNPS CALLED_NOT_IN_DB_AT_ALL 0 variant SNPS HET_CONCORDANCE 100.00 variant SNPS HOMVAR_CONCORDANCE 99.63 variant INDELS TRUE_POSITIVE 273 variant INDELS FALSE_POSITIVE 0 variant INDELS FALSE_NEGATIVE 15 variant INDELS TRUE_NEGATIVE 79 variant INDELS CALLED_NOT_IN_DB_AT_ALL 2 variant INDELS HET_CONCORDANCE 98.67 variant INDELS HOMVAR_CONCORDANCE 89.58 -- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB. Add lots of unit tests for these functions (from 0 previously) -- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record: 20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6 Indicating that the call was a HET but the expected result was HOM_VAR -- Forbid subsetting of diploid genotypes to just a single allele. -- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG	2013-06-13 15:05:32 -04:00
Mark DePristo	33720b83eb	No longer merge overlapping fragments from HaplotypeCaller -- Merging overlapping fragments turns out to be a bad idea. In the case where you can safely merge the reads you only gain a small amount of overlapping kmers, so the potential gains are relatively small. That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, and you artificially construct reads that look like the reference but actually may carry a larger true insertion w.r.t. the reference. Because this problem isn't limited to repetitive sequence, but in principle could occur in any sequence, it's just not safe to do this merging. Best to leave haplotype construction to the assembly graph.	2013-06-13 15:05:32 -04:00
Mark DePristo	c837d67b2f	Merge pull request #273 from broadinstitute/rp_readIsPoorlyModelled Relaxing the constraints on the readIsPoorlyModelled function.	2013-06-13 08:40:24 -07:00
Ryan Poplin	f44efc27ae	Relaxing the constraints on the readIsPoorlyModelled function. -- Turns out we were aggressively throwing out borderline-good reads.	2013-06-13 11:06:23 -04:00
Ryan Poplin	d5f0848bd5	HC bam writer now sets the read to MQ0 if it isn't informative -- Makes visualization of read evidence easier in IGV.	2013-06-13 10:11:54 -04:00
sathibault	336050ab71	Merge branch 'master' into st_fpga_hmm Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java	2013-06-13 07:28:24 -05:00
Ryan Poplin	d1f397c711	Fixing bug with dangling tails in which the tail connects all the way back to the reference source node. -- List of vertices can't contain a source node.	2013-06-12 12:23:01 -04:00
Ryan Poplin	e1fd3dff9a	Merge pull request #268 from broadinstitute/eb_calling_accuracy_improvements_to_HC Eb calling accuracy improvements to hc	2013-06-11 11:18:51 -07:00
Eric Banks	2c3c680eb7	Misc changes and cleanup from all previous commits in this push. 1. By default, do not include the UG CEU callset for assessment. 2. Updated md5s that are different now with all the HC changes.	2013-06-11 12:53:11 -04:00
Eric Banks	dadcfe296d	Reworking of the dangling tails merging code. We now run Smith-Waterman on the dangling tail against the corresponding reference tail. If we can generate a reasonable, low entropy alignment then we trigger the merge to the reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs and don't let those pass through. Added tests for this implementation that checks exact SW results and correct edges added.	2013-06-11 12:53:04 -04:00
Guillermo del Angel	55d5f2194c	Read Error Corrector for haplotype assembly Principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer but correct sequence will be supported by many reads, so correct sequences will look like common k-mers. So, algorithm has 3 main steps: 1. K-mer graph buildup. For each read in an active region, a map from k-mers to the number of times they have been seen is built. 2. Building correction map. All "rare" k-mers that are sparse (by default, seen only once), get mapped to k-mers that are good (by default, seen at least 20 times but this is a CL argument), and that lie within a given Hamming distance (by default, =1). This map can be empty (i.e. k-mers can be uncorrectable). 3. Correction proposal For each constituent k-mer of each read, if this k-mer is rare and maps to a good k-mer, get differing base positions in k-mer and add these to a list of corrections for each base in each read. Then, correct read at positions where correction proposal is unanimous and non-empty. The algorithm defaults are chosen to be very stringent and conservative in the correction: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in read if all correction proposals are congruent. By default, algorithm is disabled but can be enabled in HaplotypeCaller via the -readErrorCorrect CL option. However, at this point it's about 3x-10x more expensive so it needs to be optimized if it's to be used.	2013-06-11 12:26:24 -04:00
Eric Banks	c0030f3f2d	We no longer subset down to the best N haplotypes for the GL calculation. I explain in comments within the code that this was causing problems with the marginalization over events.	2013-06-11 11:51:26 -04:00
Eric Banks	c0e3874db0	Change the HC's phredScaledGlobalReadMismappingRate from 60 to 45, because Ryan and Mark told me to.	2013-06-11 11:51:26 -04:00
Eric Banks	77868d034f	Do not allow the use of Ns in reads for graph construction. Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones. This was the cause of at least one FN where there were Ns at a SNP position.	2013-06-11 11:51:26 -04:00
Eric Banks	e4e7d39e2c	Fix FN problem stemming from sequence graphs that contain cycles. Problem: The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers. Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final sequence graph. Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions. Solution: We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them. If (and only if) we can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate a graph without cycles or attempt too many iterations.	2013-06-11 11:51:26 -04:00
Ryan Poplin	58e354176e	Minor changes to docs in the graph pruning.	2013-06-11 10:33:22 -04:00
Mark DePristo	1c03ebc82d	Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP -- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure -- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map -- Cleanup misc. code as it came up -- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC	2013-06-10 16:20:31 -04:00
Mark DePristo	0d593cff70	Refactor rsID and overlap detection in VariantOverlapAnnotator utility class -- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele. This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field. -- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation -- Added unit tests for this VariantOverlapAnnotator class -- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator -- Now requires strict allele matching, without any option to just use site annotation.	2013-06-10 15:51:13 -04:00
Mauricio Carneiro	1d67d07cf1	better docs for Qualify Missing Intervals now that it's available to the public, better give'em good docs!	2013-06-10 15:17:40 -04:00
Mauricio Carneiro	c84f0deb1d	Don't crash if cds file is not provided CDS file should be optional.	2013-06-10 13:42:00 -04:00
Mauricio Carneiro	a95fbd48e5	Moving QualifyMissingIntervals to protected Making this walker available so we can share it with the CSER group for CLIA analysis.	2013-06-10 13:11:41 -04:00
Valentin Ruano-Rubio	96073c3058	This commit addresses JIRA issue GSA-948: Prevent users from doing the wrong thing with RNA-Seq data and the GATK. The previous behavior was to process reads with N CIGAR operators as they are, even though many of the tools do not actually support that operator and results become unpredictable. Now, if any read contains the N operator, the engine throws a user exception. The error message indicates what the problem is (including the offending read and mapping position) and gives a couple of alternatives that the user can take in order to move forward: a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC) b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL) Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring. Implementation: * Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work. * Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants. * Added ReadFilterTest class as a common parent for ReadFilter test cases. * Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class. * Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality. * Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used. * Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used. * Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior. * Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior	2013-06-10 10:44:42 -04:00
Michael McCowan	00c06e9e52	Performance improvements: - Memoized MathUtil's cumulative binomial probability function. - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.	2013-06-09 11:26:52 -04:00
Mark DePristo	209dd64268	HaplotypeCaller now emits per-sample DP -- Created a new annotation DepthPerSampleHC that is by default on in the HaplotypeCaller -- The depth for the HC is the sum of the informative alleles at this site. It's not perfect (as we cannot differentiate between reads that align over the event but aren't informative vs. those that aren't even close) but it's a pretty good proxy and it matches with the AD field (i.e., sum(AD) = DP). -- Update MD5s -- delivers [#48240601]	2013-06-06 09:47:32 -04:00
Mark DePristo	34bdf20132	Bugfix for bad AD values in UG/HC -- In the case where we have multiple potential alternative alleles and we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles. -- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future. -- delivers [#50897077]	2013-06-05 17:48:41 -04:00
Mark DePristo	c9f5b53efa	Bugfix for HC can fail to assemble the correct reference sequence in some cases -- Ultimately this was caused by overly aggressive merging of CommonSuffixMerger. In the case where you have this graph: ACT [ref source] -> C G -> ACT -> C we would merge into G -> ACT -> C which would linearize into GACTC, causing us to add bases to the reference source node that couldn't be recovered. The solution was to ensure that CommonSuffixMerger only operates when all nodes to be merged aren't source nodes themselves. -- Added a convenient argument to the haplotype caller (captureAssemblyFailureBAM) that will write out the exact reads to a BAM file that went into a failed assembly run (going to a file called AssemblyFailure.BAM). This can be used to rerun the haplotype caller to produce the exact error, which can be hard in regions of deep coverage where the downsampler state determines the exact reads going into assembly and therefore makes running with a sub-interval not reproduce the error -- Did some misc. cleanup of code while debugging -- [delivers #50917729]	2013-06-03 16:16:39 -04:00
Ryan Poplin	ab40f4af43	Break out the GGA kmers and the read kmers into separate functions for the DeBruijn assembler. -- Added unit test for new function.	2013-06-03 14:00:35 -04:00
Ryan Poplin	21334e728d	Merge pull request #252 from broadinstitute/md_bqsr_index_out_of_bounds Make BQSR calculateIsIndel robust to indel CIGARs at start/end of read	2013-06-03 07:13:00 -07:00
sathibault	de2a2a4cc7	Added command-line flag to disable FPGA. Completed integration with FPGA driver	2013-06-03 07:30:32 -05:00
Mark DePristo	6555361742	Fix error in merging code in HC -- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested. -- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases. -- delivers #50655421	2013-05-31 16:29:29 -04:00
Mark DePristo	64b4d80729	Make BQSR calculateIsIndel robust to indel CIGARs at start/end of read -- The previous implementation attempted to be robust to this, but not all cases were handled properly. Added a helper function updateInde() that bounds the update to be in the range of the indel array, and cleaned up logic of how the method works. The previous behavior was inconsistent across read fwd/rev strand, so that the indel cigars at the end of read were put at the start of reads if the reads were in the forward strand but not if they were in the reverse strand. Everything is now consistent, as can be seen in the symmetry of the unit tests: tests.add(new Object[]{"1D3M", false, EventType.BASE_DELETION, new int[]{0,0,0}}); tests.add(new Object[]{"1M1D2M", false, EventType.BASE_DELETION, new int[]{1,0,0}}); tests.add(new Object[]{"2M1D1M", false, EventType.BASE_DELETION, new int[]{0,1,0}}); tests.add(new Object[]{"3M1D", false, EventType.BASE_DELETION, new int[]{0,0,1}}); tests.add(new Object[]{"1D3M", true, EventType.BASE_DELETION, new int[]{1,0,0}}); tests.add(new Object[]{"1M1D2M", true, EventType.BASE_DELETION, new int[]{0,1,0}}); tests.add(new Object[]{"2M1D1M", true, EventType.BASE_DELETION, new int[]{0,0,1}}); tests.add(new Object[]{"3M1D", true, EventType.BASE_DELETION, new int[]{0,0,0}}); tests.add(new Object[]{"4M1I", false, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}}); tests.add(new Object[]{"3M1I1M", false, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}}); tests.add(new Object[]{"2M1I2M", false, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}}); tests.add(new Object[]{"1M1I3M", false, EventType.BASE_INSERTION, new int[]{1,0,0,0,0}}); tests.add(new Object[]{"1I4M", false, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}}); tests.add(new Object[]{"4M1I", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}}); tests.add(new Object[]{"3M1I1M", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,1}}); tests.add(new Object[]{"2M1I2M", true, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}}); tests.add(new Object[]{"1M1I3M", true, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}}); tests.add(new Object[]{"1I4M", true, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}}); -- delivers #50445353	2013-05-31 13:58:37 -04:00
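The strand symmetry in the unit-test table of commit 64b4d80729 can be reproduced with a small stand-alone sketch. This is not the BQSR code: the class IndelEventSketch, its methods, and the rule it encodes (mark the read base just before the indel on the forward strand, the base just after it on the reverse strand, and drop the mark when that index falls outside the read) are assumptions inferred only from the expected arrays listed in that commit message. Behavior for indels longer than one base is a guess.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only; names and logic are hypothetical, not GATK's calculateIsIndel.
class IndelEventSketch {
    private static final Pattern CIGAR_ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    // CIGAR operators that advance the position within the read.
    static boolean consumesReadBases(char op) {
        return op == 'M' || op == 'I' || op == 'S' || op == '=' || op == 'X';
    }

    // Bounded update: silently ignore indices outside the event array.
    static void markIfInRange(int[] events, int index) {
        if (index >= 0 && index < events.length) events[index] = 1;
    }

    // indelOp is 'I' or 'D'; returns a 0/1 array with one entry per read base.
    static int[] indelEvents(String cigar, boolean negativeStrand, char indelOp) {
        // First pass: compute the read length from the CIGAR.
        int readLength = 0;
        Matcher m = CIGAR_ELEMENT.matcher(cigar);
        while (m.find()) {
            if (consumesReadBases(m.group(2).charAt(0))) readLength += Integer.parseInt(m.group(1));
        }
        // Second pass: place one event per matching indel element.
        int[] events = new int[readLength];
        int readPos = 0;
        m.reset();
        while (m.find()) {
            int length = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            if (op == indelOp && (op == 'I' || op == 'D')) {
                // Index of the first read base after the indel.
                int firstBaseAfter = (op == 'I') ? readPos + length : readPos;
                markIfInRange(events, negativeStrand ? firstBaseAfter : readPos - 1);
            }
            if (consumesReadBases(op)) readPos += length;
        }
        return events;
    }
}
```

For example, IndelEventSketch.indelEvents("1M1D2M", false, 'D') yields {1,0,0} and the same CIGAR with negativeStrand set to true yields {0,1,0}, matching the corresponding rows of the test table.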
Ryan Poplin	b5b9d745a7	New implementation of the GGA mode in the HaplotypeCaller -- We now inject the given alleles into the reference haplotype and add them to the graph. -- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode. -- This unifies how Smith-Waterman is performed between discovery and GGA modes. -- Misc minor cleanup in several places.	2013-05-31 10:35:36 -04:00
Ryan Poplin	61af37d0d2	Create a new normalDistributionLog10 function that is unit tested for use in the VQSR.	2013-05-30 16:00:08 -04:00
Eric Banks	a5a68c09fa	Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads. The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not clips should be ignored. Also rebased so that return type is no longer a Pair. Added unit test to cover this situation.	2013-05-29 16:41:01 -04:00
Mark DePristo	684c91c2e7	Merge pull request #245 from broadinstitute/dr_enforce_min_dcov Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage	2013-05-29 09:52:13 -07:00
David Roazen	a7cb599945	Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage -Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200, since low dcov values can result in problematic downsampling artifacts for locus-based traversals. -Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals controls the number of reads per alignment start position, and even a dcov value of 1 might be safe/desirable in some circumstances. -Also reorganize the global downsampling defaults so that they are specified as annotations to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the DownsamplingMethod class. -The default downsampling settings have not been changed: they are still -dcov 1000 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.	2013-05-29 12:07:12 -04:00
Mauricio Carneiro	f1affa9fbb	Turn off downsampling for DiagnoseTargets Diagnose targets should never be downsampled. (and I didn't know there was a default downsampling going on for locus walkers)	2013-05-28 14:58:50 -04:00
Ryan Poplin	85905dba92	Bugfix for GGA mode in UG silently ignoring indels -- Started by Mark. Finished up by Ryan. -- GGA mode still respected glm argument for SNP and INDEL models, so that you would silently fail to genotype indels at all if the -glm INDEL wasn't provided, but you'd still emit the sites, so you'd see records in the VCF but all alleles would be no calls. -- https://www.pivotaltracker.com/story/show/48924339 for more information -- [resolves #48924339]	2013-05-24 13:47:26 -04:00
Mauricio Carneiro	da21924b44	Make the missing targets output never use stdout Problem -------- Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided Solution -------- Make it NOT default to stdout [Delivers #50386741]	2013-05-22 14:22:54 -04:00
Mark DePristo	d167743852	Archived banded logless PairHMM BandedHMM --------- -- An implementation of a linear runtime, linear memory usage banded logless PairHMM. Though about 50% faster than the current PairHMM, this implementation will be superseded by the GraphHMM when it becomes available. The implementation is being archived for future reference Useful infrastructure changes ----------------------------- -- Split PairHMM into a N2MemoryPairHMM that allows smarter implementation to not allocate the double[][] matrices if they don't want, which was previously occurring in the base class PairHMM -- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for use in unit or performance testing. Added example of 100kb of data to private/testdata. Can be easily read in with the PairHMMTestData class. -- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.	2013-05-22 12:24:00 -04:00
Mark DePristo	a1093ad230	Optimization for ActiveRegion.removeAll -- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element. New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one. -- Update HaplotypeCaller filterReadsInRegion to use a Set not a List. -- Expanded the unit tests a bit for ActiveRegion.removeAll	2013-05-21 16:18:57 -04:00
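The set-plus-iterator removal described in commit a1093ad230 can be sketched as follows. This is a minimal illustration, not the ActiveRegion code: the class RemoveAllSketch and its generic signature are hypothetical, and the real method also updates the read span while iterating, which is omitted here.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the approach; not the GATK ActiveRegion class.
class RemoveAllSketch {
    // Walk the list once and drop every element found in the set.
    // With a hash-based set, each membership check is a hash lookup
    // rather than a linear scan that calls equals() on every element.
    static <T> void removeAll(List<T> reads, Set<T> toRemove) {
        Iterator<T> it = reads.iterator();
        while (it.hasNext()) {
            if (toRemove.contains(it.next())) {
                it.remove(); // safe removal during iteration
            }
        }
    }
}
```

Calling RemoveAllSketch.removeAll on a list holding r1, r2, r3, r2 with a set containing only r2 leaves r1, r3 in their original order.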
Eric Banks	1f3624d204	Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing	2013-05-21 11:35:05 -04:00
Valentin Ruano Rubio	71bbb25c9e	Merge pull request #231 from broadinstitute/md_combinevariants_bugfix CombineVariants no longer adds PASS to unfiltered records	2013-05-20 14:28:20 -07:00
Mark DePristo	62fc88f92e	CombineVariants no longer adds PASS to unfiltered records -- [Delivers #49876703] -- Add integration test and test file -- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!	2013-05-20 16:53:51 -04:00
Ryan Poplin	507853c583	Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result. -- The trouble stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.	2013-05-20 14:29:04 -04:00
Eric Banks	8a442d3c9f	@Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating.	2013-05-17 10:04:10 -04:00
sathibault	195f0c3e98	Disable CnyPairHMM	2013-05-17 08:30:23 -05:00
Yossi Farjoun	9234a0efcd	Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs Bug fixes and missing interval functionality for Diagnose Targets While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred a unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule). It is with these caveats that I accept the pull request.	2013-05-16 09:25:54 -07:00
Chris Hartl	6da0aed30f	Update GCIT md5s to account for trivial changes to description strings	2013-05-14 19:45:30 -04:00
Yossi Farjoun	409a202492	Merge pull request #214 from broadinstitute/chartl_genotype_concordance_diploid_and_OGC Add overall genotype concordance to the genotype concordance tool. In ad...	2013-05-14 14:19:54 -07:00
Menachem Fromer	de54223aed	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-05-14 10:15:21 -04:00
Mauricio Carneiro	adcbf947bf	Update MD5s and the Diagnose Target scala script	2013-05-13 12:06:17 -04:00
Mauricio Carneiro	9eceae793a	Tool to manipulate intervals outside the GATK Performs basic set operations on intervals like union, intersect and difference between two or more intervals. Useful for techdev and QC purposes.	2013-05-13 11:56:24 -04:00
Mauricio Carneiro	3dbb86b052	Outputting missing intervals in DiagnoseTargets Problem ------ Diagnose Targets identifies holes in the coverage of a targeted experiment, but it only reports them and doesn't list the actual missing loci Solution ------ This commit implements an optional intervals file output listing the exact loci that did not pass filters Itemized changes -------------- * Cache callable statuses (to avoid recalculation) * Add functionality to output missing intervals * Implement new tool to qualify the missing intervals (QualifyMissingIntervals) by gc content, size, type of missing coverage and origin (coding sequence, intron, ...)	2013-05-13 11:51:56 -04:00
Mauricio Carneiro	1466396a31	Diagnose target is outputting intervals out of order Problem ------- When the interval had no reads, it was being sent to the VCF before the intervals that just got processed, therefore violating the sort order of the VCF. Solution -------- Use a linked hash map, and make the insertion and removal all happen in one place regardless of having reads or not. Since the input is ordered, the output has to be ordered as well. Itemized changes -------------- * Clean up code duplication in LocusStratification and SampleStratification * Add number of uncovered sites and number of low covered sites to the VCF output. * Add new VCF format fields * Fix outputting multiple status when threshold is 0 (ratio must be GREATER THAN not equal to the threshold to get reported) [fixes #48780333] [fixes #48787311]	2013-05-13 11:50:22 -04:00
Mark DePristo	b4f482a421	NanoScheduled ActiveRegionTraversal and HaplotypeCaller -- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results -- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk -- Tests for NanoScheduling -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data -- Some misc. code cleanup of HaplotypeCaller -- Analysis scripts to assess performance of nano scheduled HC -- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.	2013-05-13 11:09:02 -04:00
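The race condition mentioned in commit b4f482a421 arises when a plain static int counter is read and incremented by several threads at once. A minimal sketch of the AtomicInteger fix is shown below; VertexIdSketch and countDistinctIds are hypothetical stand-ins, not SeqVertex or MultiDebrujinVertex themselves.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical vertex class; only the ID assignment mirrors the commit's description.
class VertexIdSketch {
    // Shared, class-wide counter. getAndIncrement() is atomic, so two
    // threads constructing vertices concurrently never get the same value.
    private static final AtomicInteger NEXT_ID = new AtomicInteger(0);

    final int id;

    VertexIdSketch() {
        this.id = NEXT_ID.getAndIncrement();
    }

    // Demo helper: build vertices on several threads and count distinct IDs.
    static int countDistinctIds(int nThreads, int perThread) {
        final Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    seen.add(new VertexIdSketch().id);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }
}
```

With the atomic counter, countDistinctIds(4, 1000) returns 4000; with an unsynchronized `static int nextId++` the count could come out lower because two constructors may observe the same value.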
Eric Banks	2f5ef6db44	New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points. * This version inherits from the original SW implementation so it can use the same matrix creation method. * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the right thing for indels at the edges of the alignments. * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy. * Lots of systematic testing added for this implementation. * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.	2013-05-13 09:36:39 -04:00
Mark DePristo	111e8cef0f	Merge pull request #219 from broadinstitute/eb_rr_multisample_fix Fix bug in Reduce Reads that arises in multi-sample mode.	2013-05-09 15:31:14 -07:00
Eric Banks	8b9c6aae3e	Fix bug in Reduce Reads that arises in multi-sample mode. * bitset could legitimately be in an unfinished state but we were trying to access it without finalizing. * added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode. * tests were easiest to implement as integration tests (this was a really complicated case).	2013-05-08 23:23:51 -04:00
Mark DePristo	fa8a47ceef	Replace DeBruijnAssembler with ReadThreadingAssembler Problem ------- The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp). This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was a chain of kmers of longer length that started before and after the variant. Solution -------- The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers. The algorithm operates as follows: identify all non-unique kmers of size K among all reads and the reference for each sequence (ref and read): find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if none exist for each base in the sequence from the starting vertex kmer: look at the existing outgoing nodes of current vertex V. If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue If no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well with even very short kmers. 
This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes; the key changes are highlighted below: -- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (e.g., 100 exomes) -- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path is < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph. -- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that depends only on the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used. -- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway. -- The old DebruijnAssembler is still available with a command line option -- The number of paths we take forward from each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples. 
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (or any) read could contribute is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes -- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler Infrastructure changes / improvements ------------------------------------- -- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler -- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement an assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. 
This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers -- Refactored active region trimming code into ActiveRegionTrimmer class -- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default -- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions. -- Misc. updates to Smith Waterman code. Added a generic interface called, not surprisingly, SmithWaterman, making it easier to have alternative implementations. -- Many many more unit tests throughout the entire assembler, and in random utilities	2013-05-08 21:41:42 -04:00
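The effect of the global mismapping parameter described in commit fa8a47ceef can be illustrated numerically. The sketch below assumes the cap works by flooring each read's per-haplotype log10 likelihoods at (best likelihood minus rate/10); that mechanism, the class MismappingCapSketch, and the method capLikelihoods are assumptions made for illustration, since the commit message states only the resulting bound.

```java
// Hypothetical illustration of the cap; not the HaplotypeCaller likelihood engine.
class MismappingCapSketch {
    // log10Likelihoods: one read's log10 likelihood against each haplotype.
    // phredScaledGlobalReadMismappingRate: e.g. 30 for Q30, 60 for Q60.
    static double[] capLikelihoods(double[] log10Likelihoods,
                                   double phredScaledGlobalReadMismappingRate) {
        // Find the read's best-supported haplotype.
        double best = Double.NEGATIVE_INFINITY;
        for (double l : log10Likelihoods) {
            best = Math.max(best, l);
        }
        // No haplotype may be worse than the best by more than the mismapping rate.
        double floor = best - phredScaledGlobalReadMismappingRate / 10.0;
        double[] capped = new double[log10Likelihoods.length];
        for (int i = 0; i < capped.length; i++) {
            capped[i] = Math.max(log10Likelihoods[i], floor);
        }
        return capped;
    }
}
```

Taking the commit's own example, a read with 5 mismatches at Q30 each has log10 likelihoods of roughly {-15.0, 0.0} for reference versus the 5-mismatch haplotype. With the parameter at Q30 the reference entry is raised to -3.0, so the read contributes at most Q30 of evidence against the reference; at the Q60 default it is raised to -6.0.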
sathibault	d79b5f0931	Adding Convey HC-1 HMM acceleration	2013-05-08 11:01:20 -05:00
Eric Banks	d242f1bba3	Secondary alignments were not handled correctly in IndelRealigner * This is emerging now because BWA-MEM produces lots of reads that are not primary alignments * The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it was getting confused by these secondary alignments * Added unit test to cover this case	2013-05-06 19:09:10 -04:00
Eric Banks	b53336c2d0	Added hidden mode for BQSR to force all read groups to be the same one. * Very useful for debugging sample-specific issues * This argument got lost in the transition from BQSR v1 to v2 * Added unit test to cover this case	2013-05-06 19:09:10 -04:00
Menachem Fromer	c7dcc2b53b	Fix to deal with multi-generational families being allowed if only one level (one 'trio', effectively) appears in the VCF	2013-05-06 15:47:27 -04:00
Menachem Fromer	86287dce76	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-05-06 13:52:55 -04:00
Menachem Fromer	13240588cf	Fix to only consider the samples that are both in the PED file and in the VCF file	2013-05-06 13:52:14 -04:00
Chris Hartl	6ff74deac7	Add overall genotype concordance to the genotype concordance tool. In addition, protect from non-diploid genotypes, which can cause very strange behavior. Update MD5 sums. As expected, md5 changes are consistent with the genotype concordance field being added to each output.	2013-05-06 13:06:30 -04:00
chartl	98021db264	Merge pull request #208 from broadinstitute/yf_fix_molten_GenotypeConcordance - Fixed a small bug in the printout of molten data in GenotypeConcordanc...	2013-05-06 08:42:06 -07:00
Menachem Fromer	78e958bf39	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-05-06 10:39:21 -04:00
Guillermo del Angel	874dc8f9c1	Re-fix md5's that changed due to conflicting pushes	2013-05-03 14:59:16 -04:00
Mark DePristo	f42bb86bdd	This is a combination of 2 commits. Only try to clip adaptors when both reads of the pair are on opposite strands -- Read pairs that have unusual alignments, such as two reads both oriented like: <----- <----- were previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs. This caused us to clip extra good bases from reads. -- Update MD5s due to the change in adaptor clipping, which adds some coverage in some places	2013-05-03 11:19:14 -04:00
Mark DePristo	2bcbdd469f	leftAlignCigarSequentially now supports haplotypes with insertions and deletions where the deletion allele was previously removed by the leftAlignSingleIndel during its cleanup phase.	2013-05-03 09:32:05 -04:00
Guillermo del Angel	0c30a5ebc6	Rev'd up Picard to get PL fix: PLs were saturated to 32767 (Short.MAX_VALUE) when converting from GL to integers. Increase capping to Integer.MAX_VALUE (2^31-1) which should be enough for reasonable sites now. Integration tests change because some tests have some hyper-deep pileups where this case was hit	2013-05-02 16:31:43 -04:00
Yossi Farjoun	4b8b411b92	- Fixed a small bug in the printout of molten data in GenotypeConcordance Output didn't "mix-up" the genotypes, it printed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc. This was only a problem in the text, _not_ the actual numbers, which were output correctly. - Updated MD5's after looking at diffs to verify that the change is what I expected.	2013-05-02 09:16:07 -04:00
David Roazen	f3c94a3c87	Update expected test output for Java 7 -Changes in Java 7 related to comparators / sorting produce a large number of innocuous differences in our test output. Updating expectations now that we've moved to using Java 7 internally. -Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent intermittent failures.	2013-05-01 16:18:01 -04:00
Eric Banks	58424e56be	Setting the reduce reads count tag was all wrong in a previous commit; fixing. RR counts are represented as offsets from the first count, but that wasn't being done correctly when counts are adjusted on the fly. Also, we were triggering the expensive conversion and writing to binary tags even when we weren't going to write the read to disk. The code has been updated so that unconverted counts are passed to the GATKSAMRecord and it knows how to encode the tag correctly. Also, there are now methods to write to the reduced counts array without forcing the conversion (and methods that do force the conversion). Also: 1. counts are now maintained as ints whenever possible. Only the GATKSAMRecord knows about the internal encoding. 2. as discussed in meetings today, we updated the encoding so that it can now handle a range of values that extends to 255 instead of 127 (and is backwards compatible). 3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.	2013-04-30 13:45:42 -04:00
Guillermo del Angel	20d3137928	Fix for indel calling with UG in presence of reduced reads: When a read is long enough so that there's no reference context available, the read gets clipped so that it falls again within the reference context range. However, the clipping is incorrect, as it makes the read end precisely at the end of the reference context coordinates. This might lead to a case where a read might span beyond the haplotype if one of the candidate haplotypes is shorter than the reference context (as is the case, e.g., with deletions). In this case, the HMM will not work properly and the likelihood will be bad, since "insertions" at end of reads when haplotype is done will be penalized and likelihood will be much lower than it should. -- Added check to see if read spans beyond reference window MINUS padding and event length. This guarantees that read will always be contained in haplotype. -- Changed md5's that happen when long reads from old 454 data have their likelihoods changed because of the extra base clipping.	2013-04-29 19:33:02 -04:00
Mark DePristo	0387ea8df9	Bugfix for ReadClipper with ReducedReads -- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts. Updated the hardClip operation to handle reduce reads, and added a unit test to make sure this works properly. Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read -- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously	2013-04-29 11:12:09 -04:00
Mark DePristo	5dd73ba2d1	Merge pull request #198 from broadinstitute/mc_reduce_reads_ds_doc Updates GATKDocs for ReduceReads downsampling	2013-04-27 05:49:47 -07:00
Mauricio Carneiro	76e997895e	Updates GATKDocs for ReduceReads downsampling [fixes #48258295]	2013-04-26 23:33:44 -04:00
Guillermo del Angel	4168aaf280	Add feature to specify Allele frequency priors by command line when calling variants. Use case: The default AF priors used (infinite sites model, neutral variation) are appropriate in the case where the reference allele is ancestral, and the called allele is a derived allele. Most of the time this is true, but in several population studies and in ancient DNA analyses this might introduce reference biases, and in some other cases it's hard to ascertain what the ancestral allele is (normally requiring to look up homologous chimp sequence). Specifying no prior is one solution, but this may introduce a lot of artifactual het calls in shallower coverage regions. With this option, users can specify what the prior for each AC should be according to their needs, subject to the restrictions documented in the code and in GATK docs. -- Updated ancient DNA single sample calling script with filtering options and other cleanups. -- Added integration test. Removed old -noPrior syntax.	2013-04-26 19:06:39 -04:00
Mark DePristo	759c531d1b	Merge pull request #197 from broadinstitute/dr_disable_snpeff_version_check Add support for snpEff "GATK compatibility mode" (-o gatk)	2013-04-26 13:55:14 -07:00
David Roazen	7d90bbab08	Add support for snpEff "GATK compatibility mode" (-o gatk) -Do not throw an exception when parsing snpEff output files generated by not-officially-supported versions of snpEff, PROVIDED that snpEff was run with -o gatk -Requested by the snpEff author -Relevant integration tests updated/expanded	2013-04-26 15:47:15 -04:00
Mark DePristo	071fd67d55	Merge pull request #193 from broadinstitute/eb_contamination_fixing_for_reduced_reads Eb contamination fixing for reduced reads	2013-04-26 09:48:45 -07:00
Mark DePristo	92a6c7b561	Merge pull request #195 from broadinstitute/eb_exclude_sample_file_bug_in_select_variants Fixed bug reported on the forum where using the --exclude_sample_file ar...	2013-04-26 09:47:38 -07:00
Eric Banks	360e2ba87e	Fixed bug reported on the forum where using the --exclude_sample_file argument in SV was giving bad results. Added integration test. https://www.pivotaltracker.com/s/projects/793457/stories/47399245	2013-04-26 12:23:11 -04:00
Eric Banks	021adf4220	WTF - I thought we had disabled the randomized dithering of rank sum tests for integration tests?! Well, it wasn't done so I went ahead and did so. Lots of MD5 changes accordingly.	2013-04-26 11:24:05 -04:00
Eric Banks	ba2c3b57ed	Extended the allele-biased down-sampling functionality to handle reduced reads. Note that this works only in the case of pileups (i.e. coming from UG); allele-biased down-sampling for RR just cannot work for haplotypes. Added lots of unit tests for new functionality.	2013-04-26 11:23:17 -04:00
Mark DePristo	d20be41fee	Bugfix for FragmentUtils.mergeOverlappingPairedFragments -- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences. If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference. Updated the code to first clip the adapter sequences from the incoming fragments -- Update MD5s	2013-04-25 11:11:15 -04:00
Eric Banks	379a9841ce	Various bug fixes for recent Reduce Reads additions plus solution implemented for low MQ reads. 1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites use binomial prob and at high coverage sites revert to using the old base proportions. Then we get the best of both worlds. As a note, coverage refers to just the individual base counts and not the entire pileup. 2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We now use a PriorityQueue instead of a TreeSet to allow for such cases. 3. Each consensus keeps track of its own number of softclipped bases. There was no reason that that number should be shared between them. 4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now. Don't lose that information. Maybe we'll decide to change this in the future, but for now we are conservative. 5. Also implemented various small performance optimizations based on profiling. Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.	2013-04-24 18:18:50 -04:00
MauricioCarneiro	45fec382e7	Merge pull request #180 from broadinstitute/mc_diagnosetargets_missing_targets DiagnoseTargets Global Refactor	2013-04-24 14:54:55 -07:00
Mauricio Carneiro	367f0c0ac1	Split class names into stratification and metrics Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has its own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric. Metrics are generalized by the Metric interface. (for generic access) Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)	2013-04-24 14:15:49 -04:00
Ryan Poplin	80131ac996	Adding the 1000G_phase1.snps.high_confidence callset to the GATK resource bundle for use in the April 2013 updated best practices.	2013-04-24 11:41:32 -04:00
Guillermo del Angel	2ab270cf3f	Corner case fix to General Ploidy SNP likelihood model. -- In case there are no informative bases in a pileup but pileup isn't empty (like when all bases have Q < min base quality) the GLs were still computed (but were all zeros) and fed to the exact model. Now, mimic case of diploid GL computation where GLs are only added if # good bases > 0 -- I believe general case where only non-informative GLs are fed into AF calc model is broken and yields bogus QUAL, will investigate separately.	2013-04-23 21:13:18 -04:00
Mauricio Carneiro	8f8f339e4b	Abstract class for the statistics Addressing the code duplication issue raised by Mark.	2013-04-23 18:02:27 -04:00
Mauricio Carneiro	38662f1d47	Limiting access to the DT classes * Make most classes final, others package local * Move to diagnostics.diagnosetargets package * Aggregate statistics and walker classes on the same package for simplified visibility. * Make status list a LinkedList instead of a HashSet	2013-04-23 14:01:43 -04:00
Ryan Poplin	cb4ec3437a	After debate reverting SW parameter changes temporarily while we explore global SW plans.	2013-04-23 13:32:06 -04:00
Mauricio Carneiro	fdd16dc6f9	DiagnoseTargets refactor A plugin enabled implementation of DiagnoseTargets Summarized Changes: ------------------- * move argument collection into Thresholder object * make thresholder object private member of all statistics classes * rework the logic of the mate pairing thresholds * update unit and integration tests to reflect the new behavior * Implements Locus Statistic plugins * Extend Locus Statistic plugins to determine sample status * Export all common plugin functionality into utility class * Update tests accordingly [fixes #48465557]	2013-04-22 23:53:10 -04:00
Mauricio Carneiro	eb6308a0e4	General DiagnoseTargets documentation cleanup * remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps. * add gatkdocs to all the parameters * clean up the logic on callable status a bit (still need to be re-worked into a plugin system) * update integration tests	2013-04-22 23:53:09 -04:00
Mauricio Carneiro	b3c0abd9e8	Remove REF_N status from DiagnoseTargets This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply. [fixes #48442663]	2013-04-22 23:53:09 -04:00
Mauricio Carneiro	2b923f1568	fix for DiagnoseTargets multiple filter output Problem ------- Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval Solution -------- Only allow low median coverage check if there are reads [fixes #48442675]	2013-04-22 23:53:09 -04:00
Mauricio Carneiro	cf7afc1ad4	Fixed "skipped intervals" bug on DiagnoseTargets Problem ------- Diagnose targets was skipping intervals when they were not covered by any reads. Solution -------- Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of the BAM file. Summarized Changes ------------------ * Outputs all intervals it iterates over, even if uncovered * Outputs leftover intervals in the end of the traversal * Updated integration tests [fixes #47813825]	2013-04-22 23:53:09 -04:00
Mark DePristo	be66049a6f	Bugfix for CommonSuffixSplitter -- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases. In this case we'd eliminate the reference source vertex. Fixed by checking for this condition and aborting the simplification -- Update MD5s, including minor improvements	2013-04-21 19:37:01 -04:00
Mark DePristo	f0e64850da	Two sensitivity / specificity improvements to the haplotype caller -- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC. Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data -- Reduce the min usable base qual to 8 by default in the HC. In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL. This is a bit too aggressive of a requirement, so I lowered it to 8. -- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller NA12878 MEM chr20:10-11 Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL branch SNPS 1216 0 2 194 0 branch INDELS 312 2 13 71 7 master SNPS 1214 0 4 194 1 master INDELS 309 2 16 71 10 -- Update MD5s in the integration tests to reflect these two new changes	2013-04-17 12:32:31 -04:00
Eric Banks	5bce0e086e	Refactored binomial probability code in MathUtils. * Moved redundant code out of UGEngine * Added overloaded methods that assume p=0.5 for speed efficiency * Added unit test for the binomialCumulativeProbability method	2013-04-16 18:19:07 -04:00
Eric Banks	df189293ce	Improve compression in Reduce Reads by incorporating probabilistic model and global het compression The Problem: Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most regions in an exome) causes RR (with default settings) to consider it a variant region. This seriously hurts compression performance. The Solution: 1. We now use a probabilistic model for determining whether we can create a consensus (in other words, whether we can error correct a site) instead of the old ratio threshold. We calculate the cumulative binomial probability of seeing the given ratio and trigger consensus creation if that pvalue is lower than the provided threshold (0.01 by default, so rather conservative). 2. We also allow het compression globally, not just at known sites. So if we cannot create a consensus at a given site then we try to perform het compression; and if we cannot perform het compression then we just don't reduce the variant region. This way very wonky regions stay uncompressed, regions with one errorful read get fully compressed, and regions with one errorful locus get het compressed. Details: 1. -minvar is now deprecated in favor of -min_pvalue. 2. Added integration test for bad pvalue input. 3. -known argument still works to force het compression only at known sites; if it's not included then we allow het compression anywhere. Added unit tests for this. 4. This commit includes fixes to het compression problems that were revealed by systematic qual testing. Before finalizing het compression, we now check for insertions or other variant regions (usually due to multi-allelics) which can render a region incompressible (and we back out if we find one). We were checking for excessive softclips before, but now we add these tests too. 5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the consensuses is not possible (e.g. 
because of excessive softclips) then we just back that one consensus out instead of backing out all of them. 6. We no longer create a mini read at the stop of the variant window for het compression. Instead, we allow it to be part of the next global consensus. 7. The coverage test is no longer run systematically on all integration tests because the quals test supersedes it. The systematic quals test is now much stricter in order to catch bugs and edge cases (very useful!). 8. Each consensus (both the normal and filtered) keeps track of its own mapping qualities (before the MQ for a consensus was affected by good and bad bases/reads). 9. We now completely ignore low quality bases, unless they are the only bases present in a pileup. This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15. 10. Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start with insertions from a header. Note that this commit must come serially before the next commit in which I am refactoring the binomial prob code in MathUtils (which is failing and slow).	2013-04-16 18:19:06 -04:00
Ryan Poplin	e0dfe5ca14	Restore the read filter function in the HaplotypeCaller.	2013-04-16 12:01:30 -04:00
Geraldine Van der Auwera	e176fc3af1	Merge pull request #159 from broadinstitute/md_bqsr_ion Trivial BQSR bug fixes and improvement	2013-04-16 08:54:47 -07:00
Ryan Poplin	936f4da1f6	Merge pull request #166 from broadinstitute/md_hc_persample_haplotypes Select the haplotypes we move forward for genotyping per sample, not poo...	2013-04-16 08:46:56 -07:00
Mark DePristo	17982bcbf8	Update MD5s for VQSR header change	2013-04-16 11:45:45 -04:00
Mark DePristo	067d24957b	Select the haplotypes we move forward for genotyping per sample, not pooled -- The previous algorithm would compute the likelihood of each haplotype pooled across samples. This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes. The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects. After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller. -- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast. -- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected. -- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871. -- Remove old pooled likelihood model. 
It's worse than the current version in both single and multiple samples: 1000G EUR samples: 10Kb per sample: 7.17 minutes pooled: 7.36 minutes Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL per_sample SNPS 50 0 5 8 1 per_sample INDELS 6 0 7 2 1 pooled SNPS 49 0 6 8 1 pooled INDELS 5 0 8 2 1 100 kb per sample: 140.00 minutes pooled: 145.27 minutes Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL per_sample SNPS 144 0 22 28 1 per_sample INDELS 28 1 16 9 11 pooled SNPS 143 0 23 28 1 pooled INDELS 27 1 17 9 11 java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug haplotypes from samples: 8 seconds haplotypes from pools: 8 seconds java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug haplotypes from samples: 173.32 seconds haplotypes from pools: 167.12 seconds	2013-04-16 09:42:03 -04:00
Mark DePristo	5a74a3190c	Improvements to the VariantRecalibrator R plots -- VariantRecalibrator now emits plots with denormalized values (original values) instead of their normalized values ((x - mu) / sigma), which helps to understand the distribution of values that are good and bad	2013-04-16 09:09:51 -04:00
Mark DePristo	564fe36d22	VariantRecalibrator's VQSR.vcf now contains NEG/POS labels -- It's useful to know which sites have been used in the training of the model. The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE -- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels	2013-04-16 09:09:47 -04:00
Mauricio Carneiro	9bfa5eb70f	Quick optimization to the PairHMM Problem -------- the logless HMM scale factor (to avoid double under-flows) was 10^300. Although this serves the purpose this value results in a complex mantissa that further complicates cpu calculations. Solution --------- initialize with 2^1020 (2^1023 is the max value), and adjust the scale factor accordingly.	2013-04-14 23:25:33 -04:00
Mark DePristo	3144eae51c	UnifiedGenotyper bugfix: don't create haplotypes with 0 bases -- The PairHMM no longer allows us to create haplotypes with 0 bases. The UG indel caller used to create such haplotypes. Now we assign -Double.MAX_VALUE likelihoods to such haplotypes. -- Add integration test to cover this case, along with private/testdata BAM -- [Fixes #47523579]	2013-04-13 14:57:55 -04:00
Mauricio Carneiro	f11c8d22d4	Updating java 7 md5's to java 6 md5's	2013-04-13 08:21:48 -04:00
Mark DePristo	b32457be8d	Merge pull request #163 from broadinstitute/mc_hmm_caching_again Fix another caching issue with the PairHMM	2013-04-12 12:34:49 -07:00
Mauricio Carneiro	403f9de122	Fix another caching issue with the PairHMM The Problem ---------- Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result. Solution -------- The HaplotypeCaller only initializes the PairHMM once and then feeds it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes. Summarized Changes ------------------ * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0) * Updated related MD5's Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9	2013-04-12 14:52:45 -04:00
Mark DePristo	0e627bce93	Slight update to Path SW parameters. -- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.	2013-04-12 12:43:52 -04:00
Mark DePristo	50cdffc61f	Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons Key improvement --------------- -- The haplotype caller was producing unstable calls when comparing the following two haplotypes: ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA in which the alt and ref haplotypes differ in having an indel at both the start and end of the bubble. The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble. ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls. I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches. -- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat systematically that we are finding many types of expected mutational events. -- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all General code cleanup -------------------- -- Move Smith-Waterman to its own package in utils -- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly. Deprecated old call to inline constants. 
This makes it easier to group all of the SW parameters into a single object for callers -- Update users of SW code to use new Parameter class -- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected -- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default. Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage	2013-04-11 18:22:55 -04:00
Mark DePristo	74196ff7db	Trivial BQSR bug fixes and improvement -- Ensure that BQSR works properly for an Ion Torrent BAM. (Added integration test and bam) -- Improve the error message when a unknown platform is found (integration test added)	2013-04-11 17:08:35 -04:00
Ryan Poplin	a507381a33	Updating BQSR RecalibrationEngine to work correctly with empty BQSR tables. -- Previously would crash when a scatter/gather interval contained no usable data. -- Added unit test to cover this case.	2013-04-11 16:27:59 -04:00
Mark DePristo	fb86887bf2	Fast algorithm for determining which kmers are good in a read -- old algorithm was O(kmerSize * readLen) for each read. New algorithm is O(readLen) -- Added real unit tests for the addKmersFromReads to the graph. Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.	2013-04-11 09:54:22 -04:00
Mark DePristo	bf42be44fc	Fast DeBruijnGraph creation using the kmer counter -- The previous creation algorithm used the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to the graph add edge kmer1 -> kmer2 in the graph, if it's not present (does check) update edge count by 1 if kmer1 -> kmer2 already existed in the graph -- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)). This is actually pretty expensive because getting and adding edges is expensive in jgrapht. -- The new approach uses the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap for each kmer pair 1 and 2 in the hash counter add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map update edge count by count from map if kmer1 -> kmer2 already existed in the graph -- This algorithm ensures that we add far fewer edges -- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself. -- Overall runtimes are greatly reduced using this algorithm	2013-04-10 17:10:59 -04:00
Ryan Poplin	850be5e9da	Bug fix in SWPairwiseAlignment. -- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.	2013-04-10 16:04:37 -04:00
Mark DePristo	b115e5c582	Critical bugfix for CommonSuffixSplitter to avoid infinite loops -- The previous version would enter into an infinite loop in the case where we have a graph that looks like: X -> A -> B Y -> A -> B So that the incoming vertices of B all have the same sequence. This would cause us to remodel the graph endlessly by extracting the common sequence A and rebuilding exactly the same graph. Fixed and unit tested -- Additionally add a max to the number of simplification cycles that are run (100), which will throw an error and write out the graph for future debugging. So the GATK will always error out, rather than just go on forever -- After 5 rounds of simplification we start keeping a copy of the previous graph, and then check if the current graph is actually different from the previous graph. Equals here means that all vertices have equivalents in both graphs, as do all edges. If the two graphs are equal we stop simplifying. It can be a bit expensive but it only happens when we end up cycling due to the structure of the graph. -- Added a unit test that goes into an infinite loop (found empirically in running the CEU trio) and confirmed that the new approach aborts out correctly -- #resolves GSA-924 -- See https://jira.broadinstitute.org/browse/GSA-924 for more details -- Update MD5s due to change in assembly graph construction	2013-04-09 16:19:26 -04:00
Mark DePristo	51954ae3e5	HaplotypeCaller doesn't support EXACT_GENERAL_PLOIDY model -- HC now throws a UserException if this model is provided. Documented this option as not being supported in the HC in the docs for EXACT_GENERAL_PLOIDY	2013-04-09 15:18:42 -04:00
Mark DePristo	33ecec535d	Turn off the LD merging code by default -- It's just too hard to interpret the called variation when we merge variants via LD. -- Can now be turned on with -mergeVariantsViaLD -- Update MD5s	2013-04-09 10:08:06 -04:00
Mark DePristo	21410690a2	Address reviewer comments	2013-04-08 12:48:20 -04:00
Mark DePristo	caf15fb727	Update MD5s to reflect new HC algorithms and parameter values	2013-04-08 12:48:16 -04:00
Mark DePristo	6d22485a4c	Critical bugfix to ReduceRead functionality of the GATKSAMRecord -- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5]. The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding. Now GATKSAMRecord decodes the tag into the proper depths vector so that getReducedCounts() returns what one reasonably expects it to, and getReducedCount(i) merely looks up the value at i. Added unit test to ensure this behavior going forward. -- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.	2013-04-08 12:47:50 -04:00
Mark DePristo	5a54a4155a	Change key Haplotype default parameter values -- Extension increased to 200 bp -- Min prune factor defaults to 0 -- LD merging enabled by default for complex variants, only when there are 10+ samples for SNP + SNP merging -- Active region trimming enabled by default	2013-04-08 12:47:50 -04:00
Mark DePristo	3a19266843	Fix residual merge conflicts	2013-04-08 12:47:50 -04:00
Mark DePristo	9c7a35f73f	HaplotypeCaller no longer creates haplotypes that involve cycles in the SeqGraph -- The kbest paths algorithm now takes an explicit set of starting and ending vertices, which is conceptually cleaner and works for either the cycle or no-cycle models. Allowing cycles can be re-enabled with an HC command line switch.	2013-04-08 12:47:50 -04:00
Mark DePristo	5545c629f5	Rename Utils to GraphUtils to avoid conflicts with the sting.Utils class; fix broken unit test in SharedVertexSequenceSplitterUnitTest	2013-04-08 12:47:49 -04:00
Mark DePristo	15461567d7	HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype -- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes. This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype. All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).	2013-04-08 12:47:49 -04:00
Mark DePristo	9b5c55a84a	LikelihoodCalculationEngine will now only use reads longer than the minReadLength, which is currently fixed at 20 bp	2013-04-08 12:47:49 -04:00
Mark DePristo	af593094a2	Major improvements to HC that trims down active regions before genotyping -- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension). Radically speeds up calculations when using large active region extensions. The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extent as possible. The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes -- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter. The previous error corrector was just broken (conceptually) and was disabled by default in the engine. Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java -- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler	2013-04-08 12:47:49 -04:00
Mark DePristo	4d389a8234	Optimizations for HC infrastructure -- outgoingVerticesOf and incomingVerticesOf return a list not a set now, as the corresponding values must be unique since our super directed graph doesn't allow multiple edges between vertices -- Make DeBruijnGraph, SeqGraph, SeqVertex, and DeBruijnVertex all final -- Cache HashCode calculation in BaseVertex -- Better docs before the pruneGraph call	2013-04-08 12:47:49 -04:00
Mark DePristo	e916998784	Bugfix for head and tail merging code in SeqGraph -- The previous version of the head merging (and tail merging to a lesser degree) would inappropriately merge sources and sinks without sufficient evidence to do so. This would introduce large deletion events at the start / end of the assemblies. Refactored code to require 20 bp of overlap in the head or tail nodes, as well as unit tested functions to support this.	2013-04-08 12:47:48 -04:00
Mark DePristo	2aac9e2782	More efficient ZipLinearChains algorithm -- Goes through the graph looking for chains to zip, accumulates the vertices of the chains, and then finally goes through and updates the graph in one big go. Vastly more efficient than the previous version, but unfortunately doesn't actually work now -- Also incorporate edge weight propagation into SeqGraph zipLinearChains. The edge weights for all incoming and outgoing edges are now their previous value, plus the sum of the internal chain edges / n such edges	2013-04-08 12:47:48 -04:00
Mark DePristo	f1d772ac25	LD-based merging algorithm for nearby events in the haplotypes -- Moved R^2 LD haplotype merging system to the utils.haplotype package -- New LD merging only enabled with HC argument. -- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enable it via a static variable -- A few misc. bug fixes in LDMerger itself -- Refactoring of Haplotype event splitting and merging code -- Renamed EventExtractor to EventMap -- EventMap has a static method that computes the event maps among n haplotypes -- Refactor Haplotype score and base comparators into their own classes and unit tested them -- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it. -- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator. For those who haven't seen this testing idiom, have a look; it's very useful -- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population. -- Fixed fundamental bug in the way the previous R^2 implementation worked -- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are -- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed. Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events -- Bugfixes for EventMap in the HaplotypeCaller as well. Previous version was overly restrictive, requiring that the first event to make it into a block substitution was a snp. In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M. The new code supports this. UnitTested and documented as well.
LDMerger handles case where merging two alleles results in a no-op event. Merging CA/C + A/AA -> CAA/CAA -> no op. Handles this case by removing the two events. UnitTested -- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled -- This new version does a much more specific test (that's actually right). Here's the new algorithm: * Compute probability that two variants are in phase with each other and that no * compound hets exist in the population. * * Implemented as a likelihood ratio test of the hypothesis: * * x11 and x22 are the only haplotypes in the populations * * vs. * * all four haplotype combinations (x11, x12, x21, and x22) all exist in the population. * * Now, since we have to have both variants in the population, we exclude the x11 & x11 state. So the * p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22). * * Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are: * * - P(x11 & x12 & x21) -- we have hom-ref and both hets * - P(x22 & x12 & x21) -- we have hom-alt and both hets * - P(x22 & x12) -- one haplotype is 22 and the other is het 12 * - P(x22 & x21) -- one haplotype is 22 and the other is het 21	2013-04-08 12:47:48 -04:00
Mark DePristo	67cd407854	The GenotypingEngine now uses the samples from the mapping of Samples -> PerReadAllele likelihoods instead of passing around a redundant list of samples	2013-04-08 12:47:47 -04:00
Mark DePristo	0310499b65	System to merge multiple nearby alleles into block substitutions -- Block substitution algorithm that merges nearby events based on distance. -- Also does some cleanup of GenotypingEngine	2013-04-08 12:47:47 -04:00
Mark DePristo	bff13bb5c5	Move Haplotype class to its own package in utils	2013-04-08 12:47:47 -04:00
Mauricio Carneiro	ebe2edbef3	Fix caching indices in the PairHMM Problem: -------- PairHMM was generating positive likelihoods (even after the re-work of the model) Solution: --------- The caching indices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM. Summarized Changes: ------------------ * Redesign the matrices to have only 1 col/row of padding instead of 2. * PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start) * Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version) * Adjust the prior and probability matrices to be one based (logless) * Update Log10PairHMM to work with prior and probability matrices as well * Move prior and probability matrices to parent class * Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations * Simple cleanup of PairHMMUnitTest class for a little speedup * Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base). * Create static indices for the transition probabilities (for better readability) [fixes #47399227]	2013-04-08 11:05:12 -04:00
Eric Banks	6253ba164e	Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs * This occurred when one or more alleles were lost from the record after selection * Discussed here: http://gatkforums.broadinstitute.org/discussion/comment/4718#Comment_4718 * Added some integration tests for --keepOriginalAC (there were none before)	2013-04-05 00:53:28 -04:00
Eric Banks	7897d52f32	Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec). * As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270 * This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF * Added integration test to confirm failure with User Error * Removed illegal header line in KB test VCF that was causing related tests to fail.	2013-04-05 00:52:32 -04:00
Ryan Poplin	8a93bb687b	Critical bug fix for the case of duplicate map calls in ActiveRegionWalkers with exome interval lists. -- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create duplicate active regions. -- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside of the current interval. -- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.	2013-04-03 13:15:30 -04:00
Mark DePristo	bb42c90f2b	Use LinkedHashSets in incoming and outgoing vertex functions in BaseGraph -- Using a LinkedHashSet changed the md5 for HCTestComplexVariants.	2013-04-02 17:58:20 -04:00
David Roazen	b4b58a3968	Fix unprintable character in a comment from the BaseEdge class Compiler warnings about this were starting to get to me...	2013-04-02 14:24:23 -04:00
Mark DePristo	c191d7de8c	Critical bugfix for CommonSuffixSplitter -- Graphs with cycles from the bottom node to one of the middle nodes would introduce an infinite cycle in the algorithm. Created unit test that reproduced the issue, and then fixed the underlying issue.	2013-04-02 09:22:33 -04:00
Ryan Poplin	a58a3e7e1e	Merge pull request #134 from broadinstitute/mc_phmm_experiments PairHMM rework	2013-04-01 12:10:43 -07:00
Ryan Poplin	f65206e758	Two changes to HC GGA mode to make it more like the UG. -- Only try to genotype PASSing records in the alleles file -- Don't attempt to genotype multiple records with the same start location. Instead take the first record and throw a warning message.	2013-04-01 10:20:23 -04:00
Mark DePristo	7c83efc1b9	Merge pull request #135 from broadinstitute/mc_pgtag_fix Fixing @PG tag uniqueness issue	2013-03-31 11:36:40 -07:00
Eric Banks	7dd58f671f	Merge pull request #132 from broadinstitute/gda_filter_unmasked_sites Added small feature to VariantFiltration to filter sites outside of a gi...	2013-03-31 06:27:26 -07:00
Guillermo del Angel	9686e91a51	Added small feature to VariantFiltration to filter sites outside of a given mask: -- Sometimes it's desirable to specify a set of "good" regions and filter out other stuff (like say an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. New argument -filterNotInMask will reverse default logic and filter outside of a given mask. -- Added integration test, and made sure we also test with a BED rod.	2013-03-31 08:48:16 -04:00
Eric Banks	8e2094d2af	Updated AssessReducedQuals and applied it systematically to all ReduceReads integration tests. * Moved to protected for packaging purposes. * Cleaned up and removed debugging output. * Fixed logic for epsilons so that we really only test significant differences between BAMs. * Other small fixes (e.g. don't include low quality reduced reads in overall qual). * Most RR integration tests now automatically run the quals test on output. * A few are disabled because we expect them to fail in various locations (e.g. due to downsampling).	2013-03-31 00:27:14 -04:00
Mauricio Carneiro	ec475a46b1	Fixing @PG tag uniqueness issue The Problem: ------------ the SAM spec does not allow multiple @PG tags with the same id. Our @PG tag writing routines were allowing that to happen with the boolean parameter "keep_all_pg_records". How this fixes it: ------------------ This commit removes that option from all the utility functions and cleans up the code around the classes that used these methods off-spec. Summarized changes: ------------------- * Remove keep_all_pg_records option from setupWriter utility methods in Util * Update all walkers to now replace the last @PG tag of the same walker (if it already exists) * Cleanup NWaySamFileWriter now that it doesn't need to keep track of the keep_all_pg_records variable * Simplify the multiple implementations to setupWriter Bamboo: ------- http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL31 Issue Tracker: -------------- [fixes 47100885]	2013-03-30 20:31:33 -04:00
Mauricio Carneiro	68bf470524	making LoglessPairHMM final	2013-03-30 20:00:45 -04:00
Guillermo del Angel	6b8bed34d0	Big bad bug fix: feature added to LeftAlignAndTrimVariants to left align multiallelic records didn't work. -- Corrected logic to pick biallelic vc to left align. -- Added integration test to make sure this feature is tested and feature to trim bases is also tested.	2013-03-30 19:31:28 -04:00
Mauricio Carneiro	0de6f55660	PairHMM rework The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because: # Initial conditions were not being set properly # Emission probabilities in the last row were not adding up to 1 The following commit fixes both by # averaging all potential start locations (giving an equal prior to the state machine in its first iteration -- allowing the read to start its alignment anywhere in the haplotype with equal probability) # discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state) Summarized changes: * Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm * Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1. * Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now) * Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity * Rename metric lengths to read and haplotype lengths for clarity * Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity * Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix * Remove unnecessary parameters from updateCell() * Fix the expected probabilities coming from the exact model in PairHMMUnitTest * Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables) * Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s. [fix 47164949]	2013-03-30 10:50:06 -04:00
Chris Hartl	74a17359a8	MathUtils.randomSubset() now uses Collections.shuffle() (indirectly, through the other methods that are tested), resulting in slightly different numbers of calls to the RNG, and ultimately different sets of selected variants. This commit updates the md5 values for the validation site selector integration test to reflect these new random subsets of variants that are selected.	2013-03-29 14:52:10 -04:00
Guillermo del Angel	8fbf9c947f	Upgrades and changes to LeftAlignVariants, motivated by 1000G consensus indel production: -- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken for complex records. -- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-alignable indels. -- Motivated by this, renamed walker to LeftAlignAndTrimVariants. -- Code refactoring, cleanup and bring up to latest coding standards. -- Added unit testing to make sure left alignment is performed correctly for all offsets. -- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.	2013-03-29 10:02:06 -04:00
Chris Hartl	73d1c319bf	Rarely-occurring logic bugfix for GenotypeConcordance, streamlining and testing of MathUtils Currently, the multi-allelic test is covering the following case: Eval A T,C Comp A C reciprocate this so that the reverse can be covered. Eval A C Comp A T,C And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample. This comes from the fact that "truth" reference alleles can be paired with any allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine: Eval: A G,T 0/0 2/0 2/2 1/1 Comp: A C,T 0/0 1/0 0/0 0/0 Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows: Sample1: ref called ref Sample2: alleles don't match (the alt allele of the comp was not assessed in eval) Sample3: ref called hom-var Sample4: alleles don't match (the alt allele of the eval was not assessed in comp) Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference. Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed: - dotProduct(several types). logDotProduct is used extensively, but not the real-space version. 
- vectorSum - array shuffle, random subset - countOccurances (general forms, the char form is used in the codebase) - getNMaxElements - array permutation - sorted array permutation - compare floats - sum() (for integer arrays and lists). Final keyword was extensively added to MathUtils. The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired). The PaperGenotyper still made use of one of these array methods; since it was the only walker using it, the method was migrated into the genotyper itself. In addition, more extensive tests were added for - logBinomialCoefficient (Newton's identity should always hold) - logFactorial - log10sumlog10 and its approximation All unit tests pass	2013-03-28 23:25:28 -04:00
MauricioCarneiro	a2b69790a6	Merge pull request #128 from broadinstitute/eb_rr_polyploid_compression_GSA-639	2013-03-28 06:39:43 -07:00
Mark DePristo	fde7d36926	Updating md5s due to changes in assembly graph creation algorithms and default parameter	2013-03-27 15:31:24 -04:00
Mark DePristo	197d149495	Increase the maxNumHaplotypesInPopulation to 25 -- A somewhat arbitrary increase, and will need some evaluation but necessary to get good results on the AFR integrationtest.	2013-03-27 15:31:24 -04:00
Mark DePristo	66910b036c	Added new and improved suffix and node merging algorithms -- These new algorithms are more powerful than the restricted diamond merging algorithms, in that they can merge nodes with multiple incoming and outgoing edges. Together the splitter + merger algorithms will correctly merge many more cases than the original headless and tailless diamond merger. -- Refactored haplotype caller infrastructure into graphs package, code cleanup -- Cleanup new merging / splitting algorithms, with proper docs and unit tests -- Fix bug in zipping of linear chains. Because the multiplicity can be 0, protect ourselves with a max function call -- Fix BaseEdge.max unit test -- Add docs and some more unit tests -- Move error correct from DeBruijnGraph to DeBruijnAssembler -- Replaced uses of System.out.println with logger.info -- Don't make multiplicity == 0 nodes look like they should be pruned -- Fix toString of Path	2013-03-27 15:31:18 -04:00
Mark DePristo	39f2e811e5	Increase max cigar elements from SW before failing path creation to 20 from 6 -- This allows more diversity in paths, which is sometimes necessary when we cannot simplify graphs that have large bubbles	2013-03-26 14:27:18 -04:00
Mark DePristo	b1b615b668	BaseGraph shouldn't implement getEdge -- no idea why I added this	2013-03-26 14:27:18 -04:00
Mark DePristo	a97576384d	Fix bug in the HC not respecting the requested pruning	2013-03-26 14:27:18 -04:00
Mark DePristo	78c672676b	Bugfix for pruning and removing non-reference edges in graph -- Previous algorithms were applying pruneGraph inappropriately on the raw sequence graph (where each vertex is a single base). This results in overpruning of the graph, as prunegraph really relied on the zipping of linear chains (and the sharing of weight this provides) to avoid over-pruning the graph. Probably we should think hard about this. This commit fixes this logic, so we zip the graph between pruning -- In the process, identified a fundamental problem with how we were trimming away vertices that occur on a path from the reference source to sink. In fact, we were leaving in any vertex that happened to be accessible from source, any vertices in cycles, and any vertex that wasn't the absolute end of a chain going to a sink. The new algorithm fixes all of this, using a BaseGraphIterator that's a general approach to walking the base graph. Other routines that use the same traversal idiom refactored to use this iterator. Added unit tests for all of these capabilities. -- Created new BaseGraphIterator, which abstracts common access patterns to graph, and use this where appropriate	2013-03-26 14:27:18 -04:00
Mark DePristo	ad04fdb233	PerReadAlleleLikelihoodMap getMostLikelyAllele returns a MostLikelyAllele object now -- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative. -- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads. All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as the reads only span a single event. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting)	2013-03-26 14:27:13 -04:00
Mark DePristo	2472828e1c	HC bug fixes: no longer create reference graphs with cycles -- Though not intended, it was possible to create reference graphs with cycles in the case where you started the graph with a homopolymer of length > the kmer. The previous test would fail to catch this case. Now it's not possible -- Lots of code cleanup and refactoring in this push. Split the monolithic createGraphFromSequences into simple calls to addReferenceKmersToGraph and addReadKmersToGraph which themselves share lower level functions like addKmerPairFromSeqToGraph. -- Fix performance problem with reduced reads and the HC, where we were calling add kmer pair for each count in the reduced read, instead of just calling it once with a multiplicity of count. -- Refactor addKmersToGraph() to use things like addOrUpdateEdge, now the code is very clear	2013-03-26 10:12:24 -04:00
Mark DePristo	1917d55dc2	Bugfix for DeBruijnAssembler: don't fail when read length > haplotype length -- The previous version would generate graphs that had no reference bases at all in the situation where the reference haplotype was < the longer read length, which would cause the kmer size to exceed the reference haplotype length. Now return immediately with a null graph when this occurs as opposed to continuing and eventually causing an error	2013-03-26 10:12:17 -04:00
Mark DePristo	464e65ea96	Disable error correcting kmers by default in the HC -- The error correction algorithm can break the reference graph in some cases by error correcting us into a bad state for the reference sequence. Because we know that the error correction algorithm isn't ideal, and worse, doesn't actually seem to improve the calling itself on chr20, I've simply disabled error correction by default and allowed it to be turned on with a hidden argument. -- In the process I've changed the assembly interface a bit, moving some common arguments up into the LocalAssemblyEngine, which are turned on/off via setter methods. -- Went through the updated arguments in the HC to be @Hidden and @Advanced as appropriate -- Don't write out an errorcorrected graph when debugging and error correction isn't enabled	2013-03-26 10:05:17 -04:00
Eric Banks	593d3469d4	Refactored the het (polyploid) consensus creation in ReduceReads. * It is now cleaner and easier to test; added tests for newly implemented methods. * Many fixes to the logic to make it work * The most important change was that after triggering het compression we actually need to back it out if it creates reads that incorporated too many softclips at any one position (because they get unclipped). * There was also an off-by-one error in the general code that only manifested itself with het compression. * Removed support for creating a het consensus around deletions (which was broken anyways). * Mauricio gave his blessing for this. * Het compression now works only against known sites (with -known argument). * The user can pass in one or more VCFs with known SNPs (other variants are ignored). * If no known SNPs are provided het compression will automatically be disabled. * Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their strandedness from normal reduced reads. * GATKSAMRecord now checks for this tag when determining whether or not the read is stranded. * This allows us to update the FisherStrand annotation to count het compressed reduced reads towards the FS calculation. * [It would have been nice to mark the normal reads as unstranded but then we wouldn't be backwards compatible.] * Updated integration tests accordingly with new het compressed bams (both for RR and UG). * In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling RR properly, so I fixed it too. * Also, the test in the UG engine for determining whether there are too many overlapping deletions is updated to handle RR. * I added a special hook in the RR integration tests to additionally run the systematic coverage checking tool I wrote earlier. * AssessReducedCoverage is now run against all RR integration tests to ensure coverage is not lost from original to reduced bam. 
* This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads from all but 1 sample (now fixed). * AssessReducedCoverage moved from private to protected for packaging reasons. * #resolve GSA-639 At this point, this commit encompasses most of what is needed for het compression to go live. There are still a few TODO items that I want to get in before the 2.5 release, but I will save those for a separate branch because as it is I feel bad for the person who needs to review all these changes (sorry, Mauricio).	2013-03-25 09:34:54 -04:00
Mark DePristo	965043472a	Vastly more powerful, cleaner graph simplification approach -- Generalizes previous node merging and splitting approaches. Can split common prefixes and suffixes among nodes, build a subgraph representing this new structure, and incorporate it into the original graph. Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons). Fully UnitTested. These new algorithms can now really simplify diamond configurations as well as ones with sources and sinks that arrive / depart linearly at a common single root node. -- This new suite of algorithms is fully integrated into the HC, replacing previous approaches -- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph. This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.	2013-03-23 17:40:55 -04:00
Ryan Poplin	c15453542e	Merge pull request #124 from broadinstitute/md_hc_lowmapq_read_filter HC now by default only uses reads with MAPQ >= 20 for assembly and calli...	2013-03-21 12:00:28 -07:00
Mark DePristo	7ae15dadbe	HC now by default only uses reads with MAPQ >= 20 for assembly and calling -- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements. The new version simply uses a read filter to remove them upfront. This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB -- Update MD5s to reflect changes due to no longer including mmq < 20 by default	2013-03-21 13:10:50 -04:00
Ryan Poplin	b9c331c2fa	Bug fix in HC gga mode. -- Don't try to test alleles which haven't had haplotypes assigned to them	2013-03-21 11:02:41 -04:00
Mark DePristo	aa7f172b18	Cap the computational cost of the kmer based error correction in the DeBruijnGraph -- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph. If the number of ops would exceed this threshold, the original graph is used. -- Overall the algorithm is just extremely computationally expensive, and actually doesn't implement the correct correction. So we live with these limitations while we continue to explore better algorithms -- Updating MD5s to reflect changes in assembly algorithms	2013-03-21 09:21:35 -04:00
Mark DePristo	d94b3f85bc	Increase NUM_BEST_PATHS_PER_KMER_GRAPH in DeBruijnAssembler to 25 -- The value of 11 was too small to properly return a real low-frequency variant in our 1000G AFR integration test.	2013-03-20 22:54:38 -04:00
Mark DePristo	6d7d21ca47	Bugfix for incorrect branch diamond merging algorithm -- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices. Fixed. In the process simplified the entire algorithm and codebase -- Resolves https://jira.broadinstitute.org/browse/GSA-884	2013-03-20 22:54:37 -04:00
Mark DePristo	3a8f001c27	Misc. fixes upon pull request review -- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!) -- Remove the maxHaplotypesToConsider feature of HC as it's not useful	2013-03-20 22:54:37 -04:00
Mark DePristo	d3b756bdc7	BaseVertex optimization: don't clone byte[] unnecessarily -- Don't clone sequence upon construction or in getSequence(), as these are frequently called, memory allocating routines and cloning will be prohibitively expensive	2013-03-20 22:54:37 -04:00
Mark DePristo	5226b24a11	HaplotypeCaller infrastructure cleanup and unit testing -- UnitTest for isRootOfDiamond along with key bugfix detected while testing -- Fix up the equals methods in BaseEdge. Now called hasSameSourceAndTarget and seqEquals. A much more meaningful naming -- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs -- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence -- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself -- Massive expansion of the SeqGraph unit tests. We now really test out the zipping and branch merging code. -- Near final cleanup of the current codebase -- DeBruijnVertex cleanup and optimizations. Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[]. Optimize the code to make use of this constraint	2013-03-20 22:54:37 -04:00
Mark DePristo	2e36f15861	Update md5s to reflect new downsampling and assembly algorithm output -- Only minor differences, with improvement in allele discovery where the sites differ. The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome	2013-03-20 22:54:37 -04:00
Mark DePristo	1fa5050faf	Cleanup, unit test, and optimize KBestPaths and Path -- Split Path from inner class of KBestPaths -- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation -- Path now properly typed throughout the code -- Path maintains an on-demand hashset of BaseEdges so that path.containsEdge is fast	2013-03-20 22:54:36 -04:00
Mark DePristo	98c4cd060d	HaplotypeCaller now uses SeqGraph instead of kmer graph to build haplotypes. -- DeBruijnAssembler functions are no longer static. This isn't the right way to unit test your code -- Add a HaplotypeCaller command line option to use low-quality bases in the assembly -- Refactored DeBruijnGraph and associated libraries into base class -- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents. These DeBruijn versions now inherit from these base classes. Added some reasonable unit tests for the base and Debruijn edges and vertex classes. -- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct -- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split -- Moved generic methods in DeBruijnAssembler into BaseGraph -- Created a simple SeqGraph that contains SeqVertex objects -- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs -- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -> Y -> a, a -> Z -- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges -- Fix all unit tests so they work with the new SeqGraph system. All tests passed without modification. -- Debugging makes it easier to tell which kmer graph contributes to a haplotype -- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector -- Remove unnecessary printing of cleaning info in BaseGraph -- Turn off kmer graph creation in DeBruijnAssembler.java -- Only print SeqGraphs when debugGraphTransformations is set to true -- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest. Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality. -- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs -- DebruijnVertex no longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now	2013-03-20 22:54:36 -04:00
Mark DePristo	0f4328f6fe	Basic kmer error correction algorithm for the HaplotypeCaller -- Error correction algorithm for the assembler. Only error correct reads to others that are exactly 1 mismatch away -- The assembler logic is now: build initial graph, error correct, merge nodes, prune dead nodes, merge again, make haplotypes. The * elements are new -- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing. -- Easier way to control the testing of the graph assembly algorithms -- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler -- Simple protected parsing function for making DeBruijnAssemblyGraph -- Change the default prune factor for the graph to 1, from 2 -- debugging graph transformations are controllable from command line	2013-03-20 22:54:36 -04:00
Mark DePristo	53a904bcbd	Bugfix for HaplotypeCaller: GSA-822 for trimming softclipped reads -- Previous version would not trim down soft clip bases that extend beyond the active region, causing the assembly graph to go haywire. The new code explicitly reverts soft clips to M bases with the ever useful ReadClipper, and then trims. Note this isn't a 100% fix for the issue, as it's possible that the newly unclipped bases might in reality extend beyond the active region, should their true alignment include a deletion in the reference. Needs to be fixed. JIRA added -- See https://jira.broadinstitute.org/browse/GSA-822 -- #resolve #fix GSA-822	2013-03-20 22:54:36 -04:00
Mark DePristo	ffea6dd95f	HaplotypeCaller now has the ability to only consider the best N haplotypes for genotyping -- Added a -dontGenotype mode for testing assembly efficiency -- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted	2013-03-20 22:54:36 -04:00
Mark DePristo	a783f19ab1	Fix for potential HaplotypeCaller bug in annotation ordering -- Annotations were being called on VariantContext that might need to be trimmed. Simply inverted the order of operations so trimming occurs before the annotations are added. -- Minor cleanup of call to PairHMM in LikelihoodCalculationEngine	2013-03-20 22:54:35 -04:00
Eric Banks	1fae750ebe	Merge pull request #120 from broadinstitute/aw_reduce_reads_clear_name_cache Clear ReduceReads name cache after each set of reads produced by ReduceR...	2013-03-20 19:47:42 -07:00
Guillermo del Angel	ea01dbf130	Fix to issue encountered when running HaplotypeCaller in GGA mode with data from other 1000G callers. In particular, someone produced a tandem repeat site with 57 alt alleles (sic) which made the caller blow up. Inelegant fix is to detect if # of alleles is > our max cached capacity, and if so, emit an informative warning and skip site. -- Added unit test to UG engine to cover this case. -- Commit to posterity private scala script currently used for 1000G indel consensus (still very much subject to changes). GSA-878 #resolve	2013-03-20 14:30:37 -04:00
Geraldine Van der Auwera	95a9ed853d	Made some documentation updates & fixes --Mostly doc block tweaks --Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.	2013-03-20 06:15:20 -04:00
Alec Wysoker	bccc9d79e5	Clear ReduceReads name cache after each set of reads produced by ReduceReadsStash. Name cache was filling up with names of all reads in entire file, which for large file eventually consumes all of memory. Only keep read name cache for the reads that are together in one variant region, so that a pair of reads within the same variant region will still be joined via read name. Otherwise the ability to connect a read to its mate is lost. Update MD5s in integration test to reflect altered output. Add new integration test that confirms that pair within variant region is joined by read name.	2013-03-19 14:12:33 -04:00
Ryan Poplin	0cf5d30dac	Bug fix in assembly for edge case in which the extendPartialHaplotype function was filling in deletions in the middle of haplotypes.	2013-03-15 14:20:25 -04:00
Ryan Poplin	b8991f5e98	Fix for edge case bug of trying to create insertions/deletions on the edge of contigs. -- Added integration test using MT that previously failed	2013-03-15 12:32:13 -04:00
Mark DePristo	2d35065238	QualityByDepth remaps QD values > 40 to a gaussian around 30 -- This is a temporary fix / hack to deal with the very high QD values that are generated by the haplotype caller when nearby events occur within reads. In that case, the QUAL field can be many fold higher than normal, and results in an inflated QD value. This hack projects such high QD values back into the good range (as these are good variants in general) so they aren't filtered away by VQSR. -- The long-term solution to this problem is to move the HaplotypeCaller to the full bubble calling algorithm -- Update md5s	2013-03-14 16:09:41 -04:00
droazen	0fd9f0e77c	Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837 Fixed the logic of the @Output annotation and its interaction with 'required'	2013-03-14 12:52:00 -07:00
Ryan Poplin	38914384d1	Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the simplified stats for AssessNA12878.	2013-03-14 14:44:18 -04:00
Geraldine Van der Auwera	61349ecefa	Cleaned up annotations - Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden - Removed unnecessary check in AverageAltAlleleLength	2013-03-14 14:26:48 -04:00
Eric Banks	7cab709a88	Fixed the logic of the @Output annotation and its interaction with 'required'. ALL GATK DEVELOPERS PLEASE READ NOTES BELOW: I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag. * The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided. * The logic for @Output is now: * if required==true then -o MUST be provided or a User Error is generated. * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided. * this is the default behavior (i.e. @Output with no modifiers). * if required==false and defaultToStdout==false then the output object is null. * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878). * I have updated walkers so that previous behavior has been maintained (as best I could). * In general, all @Outputs with default long/short names have required=false. * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this) * I added unit tests for @Output changes with David's help (thanks!). * #resolve GSA-837	2013-03-14 11:58:51 -04:00
Mark DePristo	b5b63eaac7	New GATKSAMRecord concept of a strandless read, update to FS -- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads. Added GATKSAMRecord support for such reads, along with unit tests -- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments -- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands. This means that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together. -- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called. Added new GATKSAMRecord method setReducedCounts() that does the right thing. Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone. -- Update MD5s for change to FS calculation. Differences are just minor updates to the FS	2013-03-13 11:16:36 -04:00
MauricioCarneiro	4403e3572a	Merge pull request #94 from broadinstitute/gg_gatkdoc_docfixes_GSATDG-111	2013-03-12 13:02:35 -07:00
MauricioCarneiro	3a16ba04d4	Merge pull request #97 from broadinstitute/eb_refactor_sliding_window Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug	2013-03-12 12:27:26 -07:00
Geraldine Van der Auwera	f972963918	Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs) GATK-73 updated docs for bqsr args GATK-9 differentiate CountRODs from CountRODsByRef GATK-76 generate GATKDoc for CatVariants GATK-4 made resource arg required GATK-10 added -o, some docs to CountMales; some docs to CountLoci GATK-11 fixed by MC's -o change; straightened out the docs. GATK-77 fixed references to wiki GATK-76 Added Ami's doc block GATK-14 Added note that these annotations can only be used with VariantAnnotator GATK-15 specified required=false for two arguments GATK-23 Added documentation block GATK-33 Added documentation GATK-34 Added documentation GATK-32 Corrected arg name and docstring in DiffObjects GATK-32 Added note to DO doc about reference (required but unused) GATK-29 Added doc block to CountIntervals GATK-31 Added @Output PrintStream to enable -o GATK-35 Touched up docs GATK-36 Touched up docs, specified verbosity is optional GATK-60 Corrected GContent annot module location in gatkdocs GATK-68 touched up docs and arg docstrings GATK-16 Added note of caution about calling RODRequiringAnnotations as a group GATK-61 Added run requirements (num samples, min genotype quality) Tweaked template and generic doc block formatting (h2 to h3 titles) GATK-62 Added a caveat to HR annot Made experimental annotation hidden GATK-75 Added setup info regarding BWA GATK-22 Clarified some argument requirements GATK-48 Clarified -G doc comments GATK-67 Added arg requirement GATK-58 Added annotation and usage docs GSATDG-96 Corrected doc Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)	2013-03-12 10:57:14 -04:00
Eric Banks	05e69b6294	Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug. * Allow RR to write its BAM to stdout by setting required=true for @Output. * Fixed bug in sliding window where a break in coverage after a long stretch without a variant region was causing a doubling of all the reads before the break. * Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods. * Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().	2013-03-12 09:06:55 -04:00
Ryan Poplin	c96fbcb995	Use the indel heterozygosity prior when calling indels with the HC	2013-03-11 14:12:43 -04:00
Guillermo del Angel	695723ba43	Two features useful for ancient DNA processing. Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly. Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert. If this adaptor is not removed, data will not be alignable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence. -- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence. -- Unit tests added for trimming operation. -- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data. -- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs). Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced. -- Added -noPrior argument to StandardCallerArgumentCollection. -- Added option not to fill priors if such argument is set. -- Added an integration test.	2013-03-09 18:18:13 -05:00
Yossi Farjoun	baad965a57	- Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed. - This was needed since samples with spaces in their names are regularly found in the picard pipeline. - Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly) - Cleaned up the unit tests using a @DataProvider (I'm in love...). - Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)	2013-03-07 13:04:24 -05:00
Eric Banks	3759d9dd67	Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine. * ReadTransformers can say they must be first, must be last, or don't care. * By default, none of the existing ones care about ordering except BQSR (must be first). * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR. * The engine now orders the read transformers up front before applying iterators. * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully). * Added unit tests.	2013-03-06 12:38:59 -05:00
Menachem Fromer	928f646afd	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-03-06 10:01:29 -05:00
Eric Banks	78721ee09b	Added new walker to split MNPs into their allelic primitives (SNPs). * Can be extended to complex alleles at some point. * Currently only works for bi-allelics (documented). * Added unit and integration tests.	2013-03-05 23:16:42 -05:00
Eric Banks	bbbaf9ad20	Revert push from stable (I forgot that pushing from stable overwrites current unstable changes)	2013-03-05 09:06:02 -05:00
Eric Banks	a037423225	Merged bug fix from Stable into Unstable	2013-03-05 09:03:48 -05:00
Eric Banks	7e1bfd6a7c	Included an accidental change from unstable into the previous push	2013-03-05 09:03:31 -05:00
Eric Banks	bd4e4f4ee3	Merged bug fix from Stable into Unstable	2013-03-04 23:24:44 -05:00
Eric Banks	b715218bfe	Fix for mismatching indel quals error: need to adjust for softclips just like we do for bases and normal quals.	2013-03-04 23:23:18 -05:00
Ryan Poplin	ce7554e9d6	Merged bug fix from Stable into Unstable	2013-03-04 12:36:04 -05:00
Ryan Poplin	0697594778	Active regions that don't contain any usable reads should just be skipped over instead of throwing an IllegalStateException.	2013-03-04 12:35:40 -05:00
Mark DePristo	42d3919ca4	Expanded functionality for writing BAMs from HaplotypeCaller -- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV. -- Previous functionality maintained, with bug fixes -- Haplotype BAM writing code now lives in utils -- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes. -- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes -- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best -- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars -- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils -- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization. -- Added extensive docs to HaplotypeCaller on how to use this capability -- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC -- Cleaned up SWPairwiseAlignment. Refactored out the big main and supplementary static methods. Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW -- Integration test to make sure we can actually write a BAM for each mode. This test only ensures that the code runs and doesn't exception out. It doesn't actually enforce any MD5s -- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype. Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case -- Writes out haplotypes for both all haplotype and called haplotype mode -- Haplotype writers now get the active region call, regardless of whether an actual call was made. Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter	2013-03-03 12:07:29 -05:00
David Roazen	c5c99c8339	Split long-running integration test classes into multiple classes This is to facilitate the current experiment with class-level test suite parallelism. It's our hope that with these changes, we can get the runtime of the integration test suite down to 20 minutes or so. -UnifiedGenotyper tests: these divided nicely into logical categories that also happened to distribute the runtime fairly evenly -UnifiedGenotyperPloidy: these had to be divided arbitrarily into two classes in order to halve the runtime -HaplotypeCaller: turns out that the tests for complex and symbolic variants make up half the runtime here, so merely moving these into a separate class was sufficient -BiasedDownsampling: most of these tests use excessively large intervals that likely can't be reduced without defeating the goals of the tests. I'm disabling these tests for now until they can either be redesigned to use smaller intervals around the variants of interest, or refactored into unit tests (creating a JIRA for Yossi for this task)	2013-03-01 13:55:23 -05:00
depristo	cac3f80c64	Merge pull request #73 from broadinstitute/eb_remove_nested_hashmap_GSA-732 Replace uses of NestedHashMap with NestedIntegerArray.	2013-02-28 05:19:56 -08:00
Eric Banks	d2904cb636	Update docs for RTC.	2013-02-27 14:56:44 -05:00
Eric Banks	69b8173535	Replace uses of NestedHashMap with NestedIntegerArray. * Removed from codebase NestedHashMap since it is unused and untested. * Integration tests change because the BQSR CSV is now sorted automatically. * Resolves GSA-732	2013-02-27 14:03:39 -05:00
Alec Wysoker	c8368ae2a5	Eliminate 7-element arrays in BaseCounts and BaseAndQualsCount and replace with in-line primitive attributes. This is ugly but reduces heap overhead, and changes are localized. When used in conjunction with Mauricio's FastUtil changes it saves an additional 9% or so of execution time.	2013-02-27 12:49:56 -05:00
David Roazen	752f4335a5	Merged bug fix from Stable into Unstable	2013-02-27 05:20:41 -05:00
David Roazen	2a7af43164	Fix improper dependencies in QScripts used by pipeline tests, and attempt to fix the flawed MisencodedBaseQualityUnitTest -Some QScripts used by public pipeline tests unnecessarily used the (now protected) UnifiedGenotyper. Changed them to use PrintReads instead. -Moved ExampleUnifiedGenotyperPipelineTest to protected -Attempt to fix the flawed and sporadically failing MisencodedBaseQualityUnitTest: After looking at this class a bit, I think the problem was the use of global arrays for the quals shared across all reads in all tests (BAMRecord class definitely does not make a separate copy for each read!). One test (testFixBadQuals) modifies the bad quals array, and if this happens to run before the testBadQualsThrowsError test the bad quals array will have been "fixed" and no exception will be thrown.	2013-02-27 04:45:53 -05:00
David Roazen	6466463d5a	Merged bug fix from Stable into Unstable	2013-02-26 21:54:54 -05:00
David Roazen	12a3d7ecad	Fix licenses on files modified in 2.4-1	2013-02-26 21:53:17 -05:00
David Roazen	a53b4a7521	Merged bug fix from Stable into Unstable	2013-02-26 21:41:13 -05:00
David Roazen	65d31ba4ad	Fix runtime public -> protected dependencies in the test suite -replace unnecessary uses of the UnifiedGenotyper by public integration tests with PrintReads -move NanoSchedulerIntegrationTest to protected, since it's completely dependent on the UnifiedGenotyper	2013-02-26 21:19:12 -05:00
depristo	93205154b5	Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776 Eb fix pairhmm unittest gsa 776	2013-02-26 11:56:58 -08:00
Eric Banks	734353e9df	Merge pull request #60 from broadinstitute/mc_fastutil_GSATDG-83 Brought all of ReduceReads to fastutils	2013-02-26 11:56:41 -08:00
David Roazen	8b29030467	Change default downsampling coverage target for the HaplotypeCaller to 250 -was previously set to 30, which seems far too aggressive given that with ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any pileup returned by LIBS -250 is a more conservative default used by the UG -can adjust down/up later based on further experiments (GSA-699 will remain open) -verified with Ryan that all integration test differences are either innocent or represent an improvement GSA-699	2013-02-26 09:33:25 -05:00
Eric Banks	396b7e0933	Fixed the intermittent PairHMM unit test failure. The issue here is that the OptimizedLikelihoodTestProvider uses the same basic underlying class as the BasicLikelihoodTestProvider and we were using the BasicTestProvider functionality to pull out tests of that class; so if the optimized tests were run first we were unintentionally running those same tests again with the basic ones (but expecting different results).	2013-02-25 15:05:13 -05:00
Eric Banks	7519484a38	Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods.	2013-02-25 15:04:23 -05:00
Ryan Poplin	89e2943dd1	The maximum kmer length is derived from the reads. -- This is done to take advantage of longer reads which can produce less ambiguous haplotypes -- Integration tests change for HC and BiasedDownsampling	2013-02-25 14:40:25 -05:00
Mauricio Carneiro	0ff3343282	Addressing Eric's comments -- added @param docs to the new variables -- made all variables final -- switched to string builder instead of String for performance. GSATDG-83	2013-02-25 13:33:47 -05:00
Mauricio Carneiro	9e5a31b595	Brought all of ReduceReads to fastutils -- Added unit tests to ReduceReads name compression -- Updated reduce reads walker for unit testing GSATDG-83	2013-02-23 22:53:23 -05:00
Ryan Poplin	6a639c8ffc	Replace Smith-Waterman alignment with the bubble traversal. -- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph. -- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime. -- Refactoring graph functions into a new DeBruijnAssemblyGraph class. -- Bug fix in path.getBases(). -- Adding validation code to the assembly engine. -- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class. -- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes. -- Added ability to ignore bubbles that are too divergent from the reference -- Max kmer can't be bigger than the extension size. -- Reverse the order that we create the assembly graphs so that the bigger kmers are used first. -- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment. -- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation. -- Updating HaplotypeCaller and BiasedDownsampling integration tests. -- Rebased everything into one commit as requested by Eric -- improvements to the bubble traversal are coming as a separate push	2013-02-22 15:42:16 -05:00
Mauricio Carneiro	e3f01673e1	Implementation of the find and diagnose Queue script -- Added 'uncovered intervals' output for FindCoveredIntervals -- updated scala script to make use of it.	2013-02-22 10:19:01 -05:00
Ryan Poplin	62e14f5b58	Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores.	2013-02-21 14:34:17 -05:00
Eric Banks	6996a953a8	Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample). These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons): * Don't unnecessarily overload Allele.getBases() in the Haplotype class. * Haplotype.getBases() was calling clone() on the byte array. * Added a constructor to Allele (and Haplotype) that takes in an Allele as input. * It makes a copy of the given allele without having to go through the validation of the bases (since the Allele has already been validated). * Rev'ed the variant jar accordingly. For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.	2013-02-21 10:14:11 -05:00
Eric Banks	551d33686c	Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761 Reduce memory footprint of SyntheticRead by replacing several Lists with...	2013-02-20 04:49:07 -08:00
Eric Banks	9dfdb9528b	Merge pull request #49 from broadinstitute/gda_hidden_ug_args Hide arguments related to reference sample operation in UG - for interna...	2013-02-19 16:18:32 -08:00
Eric Banks	0055a6f1cd	Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774 Fix to the Indel Realigner bug described in GSA-774	2013-02-19 16:16:48 -08:00
Guillermo del Angel	5a0a9bc488	Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished.	2013-02-19 19:06:42 -05:00
Mauricio Carneiro	371ea2f24c	Fixed IndelRealigner reference length bug (GSA-774) -- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases -- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases -- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them) -- pulled out ReadBin class so it can be testable -- added unit tests for ReadBin with soft-clips -- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads GSA-774 #resolve	2013-02-19 16:00:36 -05:00
Alec Wysoker	ab75e053da	Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static class that contains the attributes that were scattered across the several Lists.	2013-02-19 15:33:33 -05:00
Ryan Poplin	c025e84c8b	Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping. -- Updated a few integration tests for HC, UG, and UG general ploidy	2013-02-19 14:09:24 -05:00
Mark DePristo	be45edeff2	ActivityProfile and ActiveRegions respects engine interval boundaries -- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present. -- UnitTests for ActiveRegion.splitAndTrimToIntervals -- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals -- UnitTesting overlap function in GenomeLocSortedSet -- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet. Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775 -- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals -- Added docs for the constructors in GLSS -- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s -- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser	2013-02-18 10:40:25 -05:00
Ryan Poplin	b7e9c342c7	Reducing the size of the reference padding in the HaplotypeCaller.	2013-02-17 11:09:00 -05:00
Mark DePristo	73a363b166	Update MD5s due to new QualityUtils calculations -- Increase the allowed runtime of one UG integration test -- The GGA indels mode runs two UG commands, and was barely under the 10 minute limit before. Some updates can push this right over the edge. Increased limit -- CalibrateGenotypeLikelihoods runs on a small data set now, so it's faster -- Updating MD5s due to more correct quality utils. DuplicatesWalkers quality estimates have changed. One UG test has different FS and rank sum tests because the conversion to phred scores are slightly (second decimal place) different	2013-02-16 07:31:38 -08:00
Mark DePristo	3231031c1a	Bugfix for FisherStrand -- FisherStrand pValues can sum to slightly greater than 1.0, so they need to be capped to convert to a Phred-scaled quality score	2013-02-16 07:31:38 -08:00
Mark DePristo	9a29d6d4be	Fix a catastrophic bug (WoW!) in the reference calculation of the UG -- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and as a side effect other utils converted this to a meaningless 0.0. This is all because there wasn't a unit test. -- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.	2013-02-16 07:31:38 -08:00
Mark DePristo	9e28d1e347	Cleanup and unit tests for QualityUtils -- Fixed a few conversion bugs with edge case quals (ones that were very high) -- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s -- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK -- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly. -- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate. -- Renamed probToQual to trueProbToQual -- Added goodProbability and log10OneMinusX to MathUtils -- Went through the GATK and cleaned up many uses of QualityUtils -- Cleanup constants in QualityUtils -- Added full docs for all of the constants -- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity -- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's a BQSR-specific feature -- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x) -- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils	2013-02-16 07:31:37 -08:00
MauricioCarneiro	d80b99143f	Merge pull request #37 from broadinstitute/rp_left_alignment_hc_contract_GSA-771	2013-02-15 08:32:45 -08:00
MauricioCarneiro	1dd284a5bb	Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720 PrintReads writes a header when used with -BQSR	2013-02-15 07:18:28 -08:00
MauricioCarneiro	b58a0eca6b	Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62 Refactored GATKDocs categories some more ( GSATDG-62 )	2013-02-14 22:35:07 -08:00
Tad Jordan	6cb80591e3	PrintReads writes a header when used with -BQSR	2013-02-14 22:19:14 -05:00
Guillermo del Angel	b18f216033	Updated md5's from BiasedDownsamplerIntegrationTest that changed due to changes in HaplotypeCaller - changing HashMaps to LinkedHashMaps changed ordering of reads presented to BiasedDownSampler which changed reads chosen, thereby marginally changing PL's and some site info.	2013-02-14 20:18:49 -05:00
Ryan Poplin	871c8b3866	No need to consider haplotypes which Smith-Waterman aligns off the end of the large padded reference.	2013-02-14 11:18:10 -05:00
Geraldine Van der Auwera	6208742f7c	Refactored GATKDocs categories some more ( GSATDG-62 ) -- Renamed ValidatePileup to CheckPileup since validation is reserved word -- Renamed AlignmentValidation to CheckAlignment (same as above) -- Refactored category definitions to use constants defined in HelpConstants -- Fixed a couple of minor typos and an example error -- Reorganized the GATKDocs index template to use supercategories -- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)	2013-02-13 16:49:18 -05:00
depristo	357d196dad	Merge pull request #32 from broadinstitute/yf_per-sample-downsampling_GSA_765 Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.	2013-02-13 10:08:11 -08:00
Yossi Farjoun	6d12e5a54f	Fixed md5s for the per-sample downsampling IntegrationTests that were disabled. - got md5s from an interim version that does not have the per-sample downsampling hooked up - added an integration test that forces the result from flat-downsampling to equal that which results from an equivalent flat contamination file	2013-02-13 12:49:39 -05:00
Guillermo del Angel	4308b27f8c	Fixed non-determinism in HaplotypeCaller and some UG calls - -- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads. -- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast) -- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling). -- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test -- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).	2013-02-12 15:43:29 -05:00
Geraldine Van der Auwera	dff5ef562b	Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details) -- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools -- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities -- Reworded some category names and descriptions to be more explicit and user-friendly	2013-02-12 13:36:15 -05:00
Ryan Poplin	3f2f837b6a	Optimization to ReadPosRankSumTest: Don't do the work of parsing through the cigar string for non-informative reads.	2013-02-11 11:36:09 -05:00
Mark DePristo	b4417dff5b	Updating MD5s due to changes in HMM -- New HMM has two impacts on MD5s. First, all indel calls with UG and all calls by HC no longer have the HaplotypeScore computed. This is for the good, especially given the computational cost of this annotation and unclear value for HC. Second, the BaseQualityRankSum values are changing by tiny amounts because of the changes in the HMM likelihoods. -- Disabled three tests from Yossi that cause strange MD5 differences with calls for HC, created a JIRA for him to enable and fix -- Disabled the non-deterministic GGA test. Assigned JIRA to Guillermo -- With this push I expect all integration tests to pass	2013-02-09 19:19:28 -05:00
Mark DePristo	35139cf990	HaplotypeScore only annotates SNPs -- With the new HMM edge conditions, the likelihoods are offset by log10(n possible starts), so the results don't really mean "fits the haplotype well" any longer. This results in grossly inflated HaplotypeScores for indels and with the HaplotypeCaller. So I'm simply not going to emit this annotation value any longer for indels and for the HC	2013-02-09 19:19:28 -05:00
Mark DePristo	e40d83f00e	Final version of PairHMMs with correct edge conditions -- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites -- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done -- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow -- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact -- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC -- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now -- Added lots of docs -- Generalize unit tests with multiple equivalent matches of read to haplotype -- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10 -- Functions to dumpMatrices for debugging -- Fix nasty bug (without original unit tests) in LoglessPairHMM -- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again. -- Added dupString(string, n) method to Utils -- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes -- Unit tests for the hapStartIndex functionality of PairHMM -- Moved computeFirstDifferingPosition to PairHMM, and added unit tests -- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10 -- Still TODOs left in the code that I'll fix up -- Logless now computes constants, if they haven't been yet initialized, even if you forgot to say so -- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal -- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors	2013-02-09 19:19:22 -05:00
Mark DePristo	09595cdeb9	Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments	2013-02-09 13:06:54 -05:00
Mark DePristo	2d802e17a4	Delete the CachingPairHMM	2013-02-09 13:06:54 -05:00
Mark DePristo	7dcafe8b81	Preliminary version of LoglessCachingPairHMM that avoids positive likelihoods -- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs -- Actual working unit tests for PairHMMUnitTest -- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results -- PairHMM has protected variables used throughout the subclasses	2013-02-09 13:06:54 -05:00
Mark DePristo	7fb620dce7	Generalize and fixup ContigComparator -- Now uses a SAMSequenceDictionary to do the comparison of contigs (which is the right way to do it) -- Added unit tests	2013-02-09 09:52:13 -05:00
Mauricio Carneiro	d004bfbe6f	walker to calculate per base coverage distribution -- Base distribution optionally includes deletions -- Implemented an optional filtered coverage distribution option -- Integration tests added for every feature of the traversal This walker is especially fast for the task due to the ability to calculate uncovered bases without having to visit the loci. This capability should be made generic in the future for the advantage of DiagnoseTargets and DepthOfCoverage. GSATDG-45 #resolve	2013-02-07 16:33:05 -05:00
Mauricio Carneiro	5f49c95cc1	Added distance across contigs calculation to GenomeLocs -- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader) -- unit test added GSATDG-45	2013-02-07 16:31:41 -05:00
depristo	cd4aec177a	Merge pull request #20 from broadinstitute/aw_reduceread_perf_1_GSA-761 Aw reduceread perf 1 gsa 761	2013-02-07 12:11:05 -08:00
Eric Banks	9826192854	Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class! * AlignmentUtils.getMismatchCount() * AlignmentUtils.calcAlignmentByteArrayOffset() * AlignmentUtils.readToAlignmentByteArray(). * AlignmentUtils.leftAlignIndel()	2013-02-07 13:04:24 -05:00
Alec Wysoker	e88bc753aa	Replace map.containsKey followed by map.get with map.get followed by null check.	2013-02-07 11:58:41 -05:00
Alec Wysoker	72e496d6f3	Eliminate unnecessary zeroing out of primitive arrays immediately after new.	2013-02-07 11:57:43 -05:00
Eric Banks	481982202d	Fixing the failing RR integration tests. * After consulting Tim/David/Mauricio we determined that the md5 changes were due to different encodings of binary arrays in samjdk * However, it made no functional difference to the results (confirmed by Eric) so we agreed to update md5s * Also, the header of one of the test bams was malformed but old picard jar didn't perform checks so it only started failing now * Fixed the bam	2013-02-06 12:40:56 -05:00
Mark DePristo	59df329776	Fast path for biallelic variants in IndependentAllelesDiploidExactAFCalc -- If the VariantContext is a bi-allelic variant already, don't split up the VC (it doesn't do anything) and then combine it back together. This saves us a lot of work on average -- Be more protective of calls to AFCalc with a VariantContext that might only have ref allele, throwing an exception	2013-02-06 10:34:09 -05:00
eitanbanks	584899329c	Merge pull request #13 from broadinstitute/dr_variant_migration_GSA-692 Replace org.broadinstitute.variant with jar built from the Picard repo	2013-02-06 07:22:30 -08:00
Eric Banks	562f2406d7	Added check that BaseRecalibrator is not being run on a reduced bam. - Throws user exception if it is. - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument. - Added test to check this is working. - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end. - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process. - Added comment about not changing the program record name, as per reviewer comments. - Removed unused variable.	2013-02-06 10:14:27 -05:00
Eric Banks	4e5ff3d6f1	Bug fix for NPE in HC with --dbsnp argument. - I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null. - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.	2013-02-05 21:59:19 -05:00
Eric Banks	e7c35a907f	Fixes to BQSR for the --maximum_cycle_value argument. - It's now written into the recal report so that it can be used in the PrintReads step. - Note that we also now write the --deletions_default_quality value which accidentally wasn't being written before! - Added tests to make sure that the value of the --maximum_cycle_value is being used properly by PR with -BQSR. (This is my last non-branch commit; all future pushes will follow new GATK practices)	2013-02-05 17:38:03 -05:00
David Roazen	e7e76ed76e	Replace org.broadinstitute.variant with jar built from the Picard repo The migration of org.broadinstitute.variant into the Picard repo is complete. This commit deletes the org.broadinstitute.variant sources from our repo and replaces it with a jar built from a checkout of the latest Picard-public svn revision.	2013-02-05 17:24:25 -05:00
Ryan Poplin	cb2dd470b6	Moving the random number generator over to using GenomeAnalysisEngine.getRandomGenerator in the logless versus exact pair hmm unit test. We don't believe this will fix the problem with the non-deterministic test failures but it will give us more information the next time it fails.	2013-02-05 12:56:20 -05:00
MauricioCarneiro	050c4794a5	Merge pull request #11 from yfarjoun/per_sample2 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @A...	2013-02-05 08:04:29 -08:00
Eric Banks	23c6aee236	Added in some basic unit tests for polyploid consensus creation in RR. - Uncovered small bug in the fix that I added yesterday, which is now fixed properly. - Uncovered massive general bug: polyploid consensus is totally busted for deletions (because of call to read.getReadBases()[readPos]). - Need to consult Mauricio on what to do here (are we supporting het compression for deletions? Insertions are definitely not supported)	2013-02-05 10:35:45 -05:00
Yossi Farjoun	de03f17be4	-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling. -Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixture of two broadly consented samples) were added -includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test. -(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.) -New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map). -Unit tests to the new class and function are provided. -Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly. -Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or UnifiedGenotyper, into a new IntegrationTest class.	2013-02-04 18:24:36 -05:00
Eric Banks	70f3997a38	More RR tests and fixes. * Fixed implementation of polyploid (het) compression in RR. * The test for a usable site was all wrong. Worked out details with Mauricio to get it right. * Added comprehensive unit tests in HeaderElement class to make sure this is done right. * Still need to add tests for the actual polyploid compression. * No longer allow non-diploid het compression; I don't want to test/handle it, do you? * Added nearly full coverage of tests for the BaseCounts class.	2013-02-04 15:55:15 -05:00
Ryan Poplin	79ef41e7b1	Added some docs, unit test, and contracts to SimpleDeBruijnAssembler. -- Testing that cycles in the reference graph fail graph construction appropriately. -- Minor bug fix in assembly with reduced reads.	2013-02-04 15:17:22 -05:00
Geraldine Van der Auwera	43e3a040b6	Updated UnifiedGenotyper GATKDoc (note on ploidy model)	2013-02-04 14:18:56 -05:00
Chris Hartl	41a030f4b7	Apparently I'm a failure at rebasing...there should have been only one commit message to write. But whatever, here it is again: Part 1 of Variant Annotator Unit tests: PerReadAlleleLikelihoodMap - Added contract enforcement for public methods - Refactored the conversion from read -> (allele -> likelihood) to allele -> list[read] into its own method - added method documentation for non getters/setters - finals, finals everywhere - Add in a unit test for the PerReadAlleleLikelihoodMap. Complete coverage except for .clear() and a method that is a straight call into a separately-tested utility class.	2013-02-04 14:16:28 -05:00
Ryan Poplin	d9fd89ecaa	Somehow these md5 updates got lost in my previous git rebase disaster. Sorry for the trouble.	2013-02-04 13:26:18 -05:00
Eric Banks	2d518f3063	More RR-related updates and tests. - ReduceReads by default now sets up-front ReadWalker downsampling to 40x per start position. - This is the value I used in my tests with Picard to show that memory issues pretty much disappeared. - This should hopefully take care of the memory issues being reported on the forum. - Added javadocs to SlidingWindow (the main RR class) to follow GATK conventions. - Added more unit tests to increase coverage of BaseCounts class. - Added more unit tests to test I/D operators in the SlidingWindow class.	2013-02-04 12:57:43 -05:00
Menachem Fromer	9b77cdec4b	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-02-04 12:03:42 -05:00
Guillermo del Angel	971ded341b	Swap java Random generator for GATK one to ensure test determinism	2013-02-04 10:57:34 -05:00
Guillermo del Angel	f31bf37a6f	First step in better BQSR unit tests for covariates (not done yet): more test coverage in basic covariates, test logging several read groups/read lengths and more combinations simultaneously. Add basic Javadocs headers for PerReadAlleleLikelihoodMap.	2013-02-03 15:31:30 -05:00
Eric Banks	03df5e6ee6	- Added more comprehensive tests for consensus creation to RR. Still need to add tests for I/D ops. - Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure). - Also added minor cleanup of BaseUtils	2013-02-01 15:37:19 -05:00
Ryan Poplin	2fee000dba	Adding unit tests for KBestPaths class and fixing edge case bugs.	2013-02-01 13:51:31 -05:00
David Roazen	c6581e4953	Update MD5s to reflect version number change in the BAM header I've confirmed via a script that all of these differences only involve the version number bump in the BAM headers and nothing else: < @HD VN:1.0 GO:none SO:coordinate --- > @HD VN:1.4 GO:none SO:coordinate	2013-02-01 13:51:31 -05:00
Guillermo del Angel	a520058ef6	Add option to specify maximum STR length to RepeatCovariates from command line to ease testing	2013-02-01 13:51:31 -05:00
Mark DePristo	22f7fe0d52	Expanded unit tests for AlignmentUtils -- Added JIRA entries for the remaining capabilities to be fixed up and unit tested	2013-02-01 13:51:31 -05:00
Ryan Poplin	ac033ce41a	Intermediate commit of new bubble assembly graph traversal algorithm for the HaplotypeCaller. Adding functionality for a path from an assembly graph to calculate its own cigar string from each of the bubbles instead of doing a massive Smith-Waterman alignment between the path's full base composition and the reference.	2013-01-31 11:32:19 -05:00
Ryan Poplin	495bca3d1a	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-31 10:12:26 -05:00
Ryan Poplin	ca6968d038	Use base List and Map types in the GenotypingEngineUnitTest.	2013-01-31 10:12:18 -05:00
Eric Banks	75ceddf9e5	Adding new unit tests for RR. These tests took a frustratingly long time to get to pass, but now we have a framework for testing the adding of reads into the SlidingWindow plus consensus creation. Will flesh these out more after I take care of some other items on my plate.	2013-01-31 09:46:38 -05:00
Ryan Poplin	bb29bd7df7	Use base List and Map types in the HaplotypeCaller when possible.	2013-01-30 17:09:27 -05:00
Ryan Poplin	5f4a063def	Breaking up my massive commits into smaller pieces that I can successfully merge and digest. This one enables downsampling in the HaplotypeCaller (by lowering the default dcov to 20) and removes my long-standing, temporary region-based downsampling.	2013-01-30 16:14:07 -05:00
David Roazen	591df2be44	Move additional VariantContext utility methods back to the GATK Thanks to Eric for his feedback	2013-01-30 13:58:17 -05:00
Ryan Poplin	ff8ba03249	Updating BQSR integration test md5s to reflect the updates to the hierarchicalBayesianQualityEstimate function	2013-01-30 13:30:18 -05:00
Ryan Poplin	85dabd321f	Adding unit tests for hierarchicalBayesianQualityEstimate function	2013-01-30 13:26:07 -05:00
Ryan Poplin	07fe3dd1ef	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-30 13:19:24 -05:00
David Roazen	9985f82a7a	Move BaseUtils back to the GATK by request, along with associated utility methods	2013-01-30 13:09:44 -05:00

1510 Commits (bc3b3ac0ec4b4fd72a9e856470edaeb4c7566a06)