gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	17982bcbf8	Update MD5s for VQSR header change	2013-04-16 11:45:45 -04:00
Ryan Poplin	0ee21e58c3	Merge pull request #165 from broadinstitute/md_vqsr_improvements Two simple VQSR usability improvements	2013-04-16 06:26:38 -07:00
Mark DePristo	5a74a3190c	Improvements to the VariantRecalibrator R plots -- VariantRecalibrator now emits plots with denormalized values (original values) instead of their normalized values ((x - mu) / sigma), which helps to understand the distribution of values that are good and bad	2013-04-16 09:09:51 -04:00
Mark DePristo	564fe36d22	VariantRecalibrator's VQSR.vcf now contains NEG/POS labels -- It's useful to know which sites have been used in the training of the model. The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE -- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels	2013-04-16 09:09:47 -04:00
Mark DePristo	ee51195bf5	Merge pull request #170 from broadinstitute/mc_hmm_mantissa_optimize Quick optimization to the PairHMM	2013-04-15 10:23:56 -07:00
Mauricio Carneiro	9bfa5eb70f	Quick optimization to the PairHMM Problem -------- the logless HMM scale factor (to avoid double under-flows) was 10^300. Although this serves the purpose this value results in a complex mantissa that further complicates cpu calculations. Solution --------- initialize with 2^1020 (2^1023 is the max value), and adjust the scale factor accordingly.	2013-04-14 23:25:33 -04:00
MauricioCarneiro	55547b68bb	Merge pull request #169 from broadinstitute/md_ug_bugfix UnifiedGenotyper bugfix: don't create haplotypes with 0 bases	2013-04-13 12:05:53 -07:00
Mark DePristo	3144eae51c	UnifiedGenotyper bugfix: don't create haplotypes with 0 bases -- The PairHMM no longer allows us to create haplotypes with 0 bases. The UG indel caller used to create such haplotypes. Now we assign -Double.MAX_VALUE likelihoods to such haplotypes. -- Add integration test to cover this case, along with private/testdata BAM -- [Fixes #47523579]	2013-04-13 14:57:55 -04:00
delangel	6c9360b020	Merge pull request #167 from broadinstitute/gda_read_adaptor_trimmer_improvements Several improvements to ReadAdaptorTrimmer so that it can be incorporate...	2013-04-13 10:43:45 -07:00
Guillermo del Angel	a971e7ab6d	Several improvements to ReadAdaptorTrimmer so that it can be incorporated into ancient DNA processing pipelines (for which it was developed): -- Add pair cleaning feature. Reads in query-name sorted order are required and pairs need to appear consecutively, but if -cleanPairs option is set, a malformed pair where second read is missing is just skipped instead of erroring out. -- Add integration tests -- Move walker to public	2013-04-13 13:41:36 -04:00
Mark DePristo	a5301e17a2	Merge pull request #168 from broadinstitute/mc_update_example_grp Updating the exampleGRP.grp test file	2013-04-13 08:01:07 -07:00
Mauricio Carneiro	a063e79597	Updating the exampleGRP.grp test file It had been generated with an old version of BQSRv2 and wasn't compatible with exampleBAM anymore.	2013-04-13 09:07:13 -04:00
Mauricio Carneiro	f11c8d22d4	Updating java 7 md5's to java 6 md5's	2013-04-13 08:21:48 -04:00
Mark DePristo	776f5a2f6f	Merge pull request #164 from broadinstitute/mc_clia_scripts Clia Scripts	2013-04-12 13:08:18 -07:00
Mauricio Carneiro	09d29e5d0d	In HaplotypeCallerScript: * fix downsampling parameter * fix the default value of required fields (_ instead of .) * add support to multiple interval files	2013-04-12 15:54:54 -04:00
Mauricio Carneiro	802ae76905	Script for coverage evaluation of exomes and targeted sequencing projects in the Genomics Platform	2013-04-12 15:54:53 -04:00
Mark DePristo	b32457be8d	Merge pull request #163 from broadinstitute/mc_hmm_caching_again Fix another caching issue with the PairHMM	2013-04-12 12:34:49 -07:00
Mauricio Carneiro	403f9de122	Fix another caching issue with the PairHMM The Problem ---------- Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result. Solution -------- The HaplotypeCaller only initializes the PairHMM once and then feeds it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes. Summarized Changes ------------------ * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0) * Updated related MD5's Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9	2013-04-12 14:52:45 -04:00
Ryan Poplin	e5b9b6041c	Merge pull request #162 from broadinstitute/md_path_sw_update Slight update to Path SW parameters.	2013-04-12 10:38:26 -07:00
Mark DePristo	0e627bce93	Slight update to Path SW parameters. -- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.	2013-04-12 12:43:52 -04:00
Ryan Poplin	4ef6e0deb1	Merge pull request #161 from broadinstitute/md_unstable_sw Slightly improved Smith-Waterman parameter values for HaplotypeCaller Pa...	2013-04-12 07:59:43 -07:00
Mark DePristo	50cdffc61f	Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons Key improvement --------------- -- The haplotype caller was producing unstable calls when comparing the following two haplotypes: ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA in which the alt and ref haplotypes differ in having an indel at both the start and end of the bubble. The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble. ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls. I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches. -- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat systematically that we are finding many types of expected mutational events. -- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all General code cleanup -------------------- -- Move Smith-Waterman to its own package in utils -- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly. Deprecated old call to inline constants. This makes it easier to group all of the SW parameters into a single object for callers -- Update users of SW code to use new Parameter class -- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected -- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default. Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage	2013-04-11 18:22:55 -04:00
Mark DePristo	35293cde49	Merge pull request #160 from broadinstitute/rp_bqsr_gatherer_works_with_empty_tables Updating BQSR RecalibrationEngine to work correctly with empty BQSR tabl...	2013-04-11 14:07:24 -07:00
Ryan Poplin	a507381a33	Updating BQSR RecalibrationEngine to work correctly with empty BQSR tables. -- Previously would crash when a scatter/gather interval contained no usable data. -- Added unit test to cover this case.	2013-04-11 16:27:59 -04:00
delangel	44230b97eb	Merge pull request #158 from broadinstitute/hc_fast_kmer_graph_creation Performance improvements for building DeBruijn graphs	2013-04-11 12:09:01 -07:00
Mark DePristo	fb86887bf2	Fast algorithm for determining which kmers are good in a read -- old algorithm was O(kmerSize * readLen) for each read. New algorithm is O(readLen) -- Added real unit tests for the addKmersFromReads to the graph. Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.	2013-04-11 09:54:22 -04:00
Mark DePristo	bf42be44fc	Fast DeBruijnGraph creation using the kmer counter -- The previous creation algorithm used the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to the graph add edge kmer1 -> kmer2 in the graph, if it's not present (does check) update edge count by 1 if kmer1 -> kmer2 already existed in the graph -- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)). This is actually pretty expensive because get and add edges is expensive in jgrapht. -- The new approach uses the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap for each kmer pair 1 and 2 in the hash counter add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map update edge count by count from map if kmer1 -> kmer2 already existed in the graph -- This algorithm ensures that we add very much fewer edges -- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself. -- Overall runtimes are greatly reduced using this algorith	2013-04-10 17:10:59 -04:00
Mark DePristo	7d267639d8	Merge pull request #157 from broadinstitute/rp_smith_waterman_failing_test Bug fix in SWPairwiseAlignment.	2013-04-10 13:54:25 -07:00
Ryan Poplin	850be5e9da	Bug fix in SWPairwiseAlignment. -- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.	2013-04-10 16:04:37 -04:00
Eric Banks	2fb8b61dd0	Merge pull request #152 from broadinstitute/md_commonsuffix_infinite_loop_bugfix Critical bugfix for CommonSuffixSplitter to avoid infinite loops	2013-04-09 17:43:11 -07:00
droazen	97fd8c5893	Merge pull request #156 from broadinstitute/dr_github_mirror_daemon Daemon script to continually update local mirrors of our github repositories	2013-04-09 16:56:20 -07:00
David Roazen	e187258481	Daemon script to continually update local mirrors of our github repositories -Bamboo now uses the local mirrors for git checkouts, which should hopefully resolve the intermittent long delays we've been seeing in Bamboo during git operations -Use a daemon process rather than a cron job to guarantee strict serial execution of the git updates (since the time git update operations require can vary). The spawn script for the daemon runs as a cron job instead to make sure the daemon is always running and restart it if necessary.	2013-04-09 16:49:32 -04:00
Mark DePristo	b115e5c582	Critical bugfix for CommonSuffixSplitter to avoid infinite loops -- The previous version would enter into an infinite loop in the case where we have a graph that looks like: X -> A -> B Y -> A -> B So that the incoming vertices of B all have the same sequence. This would cause us to remodel the graph endlessly by extracting the common sequence A and rebuilding exactly the same graph. Fixed and unit tested -- Additionally add a max to the number of simplification cycles that are run (100), which will throw an error and write out the graph for future debugging. So the GATK will always error out, rather than just go on forever -- After 5 rounds of simplification we start keeping a copy of the previous graph, and then check if the current graph is actually different from the previous graph. Equals here means that all vertices have equivalents in both graphs, as do all edges. If the two graphs are equal we stop simplifying. It can be a bit expensive but it only happens when we end up cycling due to the structure of the graph. -- Added a unit test that goes into an infinite loop (found empirically in running the CEU trio) and confirmed that the new approach aborts out correctly -- #resolves GSA-924 -- See https://jira.broadinstitute.org/browse/GSA-924 for more details -- Update MD5s due to change in assembly graph construction	2013-04-09 16:19:26 -04:00
Mark DePristo	0194c9492d	Merge pull request #155 from broadinstitute/md_pnrm HaplotypeCaller doesn't support EXACT_GENERAL_PLOIDY model	2013-04-09 12:20:55 -07:00
Mark DePristo	51954ae3e5	HaplotypeCaller doesn't support EXACT_GENERAL_PLOIDY model -- HC now throws a UserException if this model is provided. Documented this option as not being supported in the HC in the docs for EXACT_GENERAL_PLOIDY	2013-04-09 15:18:42 -04:00
Mark DePristo	55c9542bfd	Merge pull request #154 from broadinstitute/mc_printreads_presorted Fix PrintReads out of space issue	2013-04-09 11:58:57 -07:00
Ryan Poplin	b8bd10469d	Merge pull request #153 from broadinstitute/md_hc_ld_merge_off Turn off the LD merging code by default	2013-04-09 07:12:16 -07:00
Mark DePristo	33ecec535d	Turn off the LD merging code by default -- It's just too hard to interpret the called variation when we merge variants via LD. -- Can now be turned on with -mergeVariantsViaLD -- Update MD5s	2013-04-09 10:08:06 -04:00
Mauricio Carneiro	3960733c88	Fix PrintReads out of space issue Problem: -------- Print Reads was running out of disk space when using the -BQSR option even for small bam files Solution: --------- Configure setupWriter to expect pre sorted reads	2013-04-09 08:19:52 -04:00
droazen	ae0612b6e8	Merge pull request #150 from broadinstitute/md_hc_excessive_coverage Make ActiveRegionTraversal robust to excessive coverage	2013-04-08 13:40:39 -07:00
Mark DePristo	1b36db8940	Make ActiveRegionTraversal robust to excessive coverage -- Add a maximum per sample and overall maximum number of reads held in memory by the ART at any one time. Does this in a new TAROrderedReadCache data structure that uses a reservoir downsampler to limit the total number of reads to a constant amount. This constant is set to be by default 3000 reads * nSamples to a global maximum of 1M reads, all controlled via the ActiveRegionTraversalParameters annotation. -- Added an integration test and associated excessively covered BAM excessiveCoverage.1.121484835.bam (private/testdata) that checks that the system is operating correctly. -- #resolves GSA-921	2013-04-08 15:48:19 -04:00
Mark DePristo	317dc4c323	Add size() method to Downsampler interface -- This method provides client with the current number of elements, without having to retrieve the underlying list<T>. Added unit tests for LevelingDownsampler and ReservoirDownsampler as these are the only two complex ones. All of the others are trivially obviously correct.	2013-04-08 15:48:13 -04:00
Ryan Poplin	0c2f795fa5	Merge pull request #147 from broadinstitute/md_hc_symbolic_allele Large number of fundamental improvements to the HaplotypeCaller	2013-04-08 10:29:07 -07:00
Mark DePristo	469dc7f22c	Update KMerErrorCorrectorUnitTest license text	2013-04-08 12:48:20 -04:00
Mark DePristo	21410690a2	Address reviewer comments	2013-04-08 12:48:20 -04:00
Mark DePristo	caf15fb727	Update MD5s to reflect new HC algorithms and parameter values	2013-04-08 12:48:16 -04:00
Mark DePristo	6d22485a4c	Critical bugfix to ReduceRead functionality of the GATKSAMRecord -- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5]. The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding. Now GATKSAMRecord decodes the tag into the proper depths vector so that getReduceCounts() returns what one reasonably expects it to, and getReduceCount(i) merely looks up the value at i. Added unit test to ensure this behavior going forward. -- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.	2013-04-08 12:47:50 -04:00
Mark DePristo	3097936a3d	Update GeneralCallingPipeline -- Bugfix to put all files in the subdirectory, regardless of whether the outputDir is provided with an ending / or not -- UG now runs single threaded in GeneralCallingPipeline -- GCP HC only needs 2 GB now	2013-04-08 12:47:50 -04:00
Mark DePristo	5a54a4155a	Change key Haplotype default parameter values -- Extension increased to 200 bp -- Min prune factor defaults to 0 -- LD merging enabled by default for complex variants, only when there are 10+ samples for SNP + SNP merging -- Active region trimming enabled by default	2013-04-08 12:47:50 -04:00
Mark DePristo	3a19266843	Fix residual merge conflicts	2013-04-08 12:47:50 -04:00
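
Commit 6d22485a4c above turns on how the reduced-read tag is encoded. Going only by the example in that message (tag [10, 5, -1, -5] for depths [10, 15, 9, 5]), the first entry appears to be an absolute depth and each later entry an offset from that first depth. The Python sketch below illustrates that reading. GATK itself is written in Java, and the function names here are hypothetical, not part of GATKSAMRecord.

```python
def decode_reduced_counts(tag):
    """Turn an offset-encoded tag into per-base depths.

    Assumption (inferred from the commit message's example only):
    tag[0] is the first depth, and every later entry is an offset
    relative to tag[0].
    """
    if not tag:
        return []
    base = tag[0]
    return [base] + [base + offset for offset in tag[1:]]


def encode_reduced_counts(depths):
    """Inverse of decode_reduced_counts under the same assumption."""
    if not depths:
        return []
    base = depths[0]
    return [base] + [depth - base for depth in depths[1:]]


# The example from the commit message.
print(decode_reduced_counts([10, 5, -1, -5]))  # [10, 15, 9, 5]
```

Under this reading, returning the raw tag instead of the decoded list is exactly the bug described: a caller asking for depths would see 5, -1 and -5 where 15, 9 and 5 were meant.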

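The graph-building change in commit bf42be44fc above can be sketched in a few lines: count every adjacent kmer pair in a hash map first, then touch the graph once per distinct pair with the accumulated multiplicity. This is a minimal Python illustration, not GATK's Java code. The dict-of-edges "graph" and the function names are assumptions made for the example, and the sketch slices strings, whereas the commit describes a kmer class that avoids copying the underlying byte[].

```python
from collections import Counter


def count_kmer_pairs(reads, k):
    """Count each adjacent (kmer1, kmer2) pair across all reads."""
    pairs = Counter()
    for read in reads:
        # A read of length n holds n - k adjacent kmer pairs.
        for i in range(len(read) - k):
            pairs[(read[i:i + k], read[i + 1:i + 1 + k])] += 1
    return pairs


def build_edge_multiplicities(reads, k):
    """Add each distinct edge once, carrying its count from the map."""
    edges = {}
    for pair, count in count_kmer_pairs(reads, k).items():
        # One graph update per distinct pair instead of one per occurrence.
        edges[pair] = edges.get(pair, 0) + count
    return edges


edges = build_edge_multiplicities(["ACGT", "ACGT", "CGTA"], k=3)
print(edges)  # {('ACG', 'CGT'): 2, ('CGT', 'GTA'): 1}
```

The saving comes from the second loop: the expensive graph lookup and insert run once per distinct kmer pair, while the per-occurrence work is a cheap hash-map increment.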
12228 Commits (17982bcbf86dc4aac940cbe1f2f96c9ef40eb669)