gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mauricio Carneiro	cf7afc1ad4	Fixed "skipped intervals" bug on DiagnoseTargets Problem ------- Diagnose targets was skipping intervals when they were not covered by any reads. Solution -------- Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of the BAM file. Summarized Changes ------------------ * Outputs all intervals it iterates over, even if uncovered * Outputs leftover intervals at the end of the traversal * Updated integration tests [fixes #47813825]	2013-04-22 23:53:09 -04:00
Mark DePristo	be66049a6f	Bugfix for CommonSuffixSplitter -- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases. In this case we'd eliminate the reference source vertex. Fixed by checking for this condition and aborting the simplification -- Update MD5s, including minor improvements	2013-04-21 19:37:01 -04:00
Mark DePristo	f0e64850da	Two sensitivity / specificity improvements to the haplotype caller -- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC. Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data -- Reduce the min usable base qual to 8 by default in the HC. In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL. This is a bit too aggressive of a requirement, so I lowered it to 8. -- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller NA12878 MEM chr20:10-11 Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL branch SNPS 1216 0 2 194 0 branch INDELS 312 2 13 71 7 master SNPS 1214 0 4 194 1 master INDELS 309 2 16 71 10 -- Update MD5s in the integration tests to reflect these two new changes	2013-04-17 12:32:31 -04:00
Eric Banks	5bce0e086e	Refactored binomial probability code in MathUtils. * Moved redundant code out of UGEngine * Added overloaded methods that assume p=0.5 for speed efficiency * Added unit test for the binomialCumulativeProbability method	2013-04-16 18:19:07 -04:00
Eric Banks	df189293ce	Improve compression in Reduce Reads by incorporating probabilistic model and global het compression The Problem: Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most regions in an exome) causes RR (with default settings) to consider it a variant region. This seriously hurts compression performance. The Solution: 1. We now use a probabilistic model for determining whether we can create a consensus (in other words, whether we can error correct a site) instead of the old ratio threshold. We calculate the cumulative binomial probability of seeing the given ratio and trigger consensus creation if that pvalue is lower than the provided threshold (0.01 by default, so rather conservative). 2. We also allow het compression globally, not just at known sites. So if we cannot create a consensus at a given site then we try to perform het compression; and if we cannot perform het compression then we just don't reduce the variant region. This way very wonky regions stay uncompressed, regions with one errorful read get fully compressed, and regions with one errorful locus get het compressed. Details: 1. -minvar is now deprecated in favor of -min_pvalue. 2. Added integration test for bad pvalue input. 3. -known argument still works to force het compression only at known sites; if it's not included then we allow het compression anywhere. Added unit tests for this. 4. This commit includes fixes to het compression problems that were revealed by systematic qual testing. Before finalizing het compression, we now check for insertions or other variant regions (usually due to multi-allelics) which can render a region incompressible (and we back out if we find one). We were checking for excessive softclips before, but now we add these tests too. 5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the consensuses is not possible (e.g.
because of excessive softclips) then we just back that one consensus out instead of backing out all of them. 6. We no longer create a mini read at the stop of the variant window for het compression. Instead, we allow it to be part of the next global consensus. 7. The coverage test is no longer run systematically on all integration tests because the quals test supersedes it. The systematic quals test is now much stricter in order to catch bugs and edge cases (very useful!). 8. Each consensus (both the normal and filtered) keeps track of its own mapping qualities (before the MQ for a consensus was affected by good and bad bases/reads). 9. We now completely ignore low quality bases, unless they are the only bases present in a pileup. This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15. 10. Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start with insertions from a header. Note that this commit must come serially before the next commit in which I am refactoring the binomial prob code in MathUtils (which is failing and slow).	2013-04-16 18:19:06 -04:00
Ryan Poplin	e0dfe5ca14	Restore the read filter function in the HaplotypeCaller.	2013-04-16 12:01:30 -04:00
Geraldine Van der Auwera	e176fc3af1	Merge pull request #159 from broadinstitute/md_bqsr_ion Trivial BQSR bug fixes and improvement	2013-04-16 08:54:47 -07:00
Ryan Poplin	936f4da1f6	Merge pull request #166 from broadinstitute/md_hc_persample_haplotypes Select the haplotypes we move forward for genotyping per sample, not poo...	2013-04-16 08:46:56 -07:00
Mark DePristo	17982bcbf8	Update MD5s for VQSR header change	2013-04-16 11:45:45 -04:00
Mark DePristo	067d24957b	Select the haplotypes we move forward for genotyping per sample, not pooled -- The previous algorithm would compute the likelihood of each haplotype pooled across samples. This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes. The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects. After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller. -- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast. -- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected. -- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871. -- Remove old pooled likelihood model. 
It's worse than the current version in both single and multiple samples: 1000G EUR samples: 10Kb per sample: 7.17 minutes pooled: 7.36 minutes Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL per_sample SNPS 50 0 5 8 1 per_sample INDELS 6 0 7 2 1 pooled SNPS 49 0 6 8 1 pooled INDELS 5 0 8 2 1 100 kb per sample: 140.00 minutes pooled: 145.27 minutes Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL per_sample SNPS 144 0 22 28 1 per_sample INDELS 28 1 16 9 11 pooled SNPS 143 0 23 28 1 pooled INDELS 27 1 17 9 11 java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug haplotypes from samples: 8 seconds haplotypes from pools: 8 seconds java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug haplotypes from samples: 173.32 seconds haplotypes from pools: 167.12 seconds	2013-04-16 09:42:03 -04:00
Mark DePristo	5a74a3190c	Improvements to the VariantRecalibrator R plots -- VariantRecalibrator now emits plots with denormalized values (original values) instead of their normalized values ((x - mu) / sigma), which helps to understand the distribution of values that are good and bad	2013-04-16 09:09:51 -04:00
Mark DePristo	564fe36d22	VariantRecalibrator's VQSR.vcf now contains NEG/POS labels -- It's useful to know which sites have been used in the training of the model. The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE -- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels	2013-04-16 09:09:47 -04:00
Mauricio Carneiro	9bfa5eb70f	Quick optimization to the PairHMM Problem -------- the logless HMM scale factor (to avoid double under-flows) was 10^300. Although this serves the purpose this value results in a complex mantissa that further complicates cpu calculations. Solution --------- initialize with 2^1020 (2^1023 is the max value), and adjust the scale factor accordingly.	2013-04-14 23:25:33 -04:00
Mark DePristo	3144eae51c	UnifiedGenotyper bugfix: don't create haplotypes with 0 bases -- The PairHMM no longer allows us to create haplotypes with 0 bases. The UG indel caller used to create such haplotypes. Now we assign -Double.MAX_VALUE likelihoods to such haplotypes. -- Add integration test to cover this case, along with private/testdata BAM -- [Fixes #47523579]	2013-04-13 14:57:55 -04:00
Mauricio Carneiro	f11c8d22d4	Updating java 7 md5's to java 6 md5's	2013-04-13 08:21:48 -04:00
Mark DePristo	b32457be8d	Merge pull request #163 from broadinstitute/mc_hmm_caching_again Fix another caching issue with the PairHMM	2013-04-12 12:34:49 -07:00
Mauricio Carneiro	403f9de122	Fix another caching issue with the PairHMM The Problem ---------- Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result. Solution -------- The HaplotypeCaller only initializes the PairHMM once and then feeds it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes. Summarized Changes ------------------ * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0) * Updated related MD5's Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9	2013-04-12 14:52:45 -04:00
Mark DePristo	0e627bce93	Slight update to Path SW parameters. -- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.	2013-04-12 12:43:52 -04:00
Mark DePristo	50cdffc61f	Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons Key improvement --------------- -- The haplotype caller was producing unstable calls when comparing the following two haplotypes: ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA in which the alt and ref haplotypes differ in having an indel at both the start and end of the bubble. The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble. ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls. I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches. -- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat systematically that we are finding many types of expected mutational events. -- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all General code cleanup -------------------- -- Move Smith-Waterman to its own package in utils -- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly. Deprecated old call to inline constants.
This makes it easier to group all of the SW parameters into a single object for callers -- Update users of SW code to use new Parameter class -- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected -- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default. Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage	2013-04-11 18:22:55 -04:00
Mark DePristo	74196ff7db	Trivial BQSR bug fixes and improvement -- Ensure that BQSR works properly for an Ion Torrent BAM. (Added integration test and bam) -- Improve the error message when an unknown platform is found (integration test added)	2013-04-11 17:08:35 -04:00
Ryan Poplin	a507381a33	Updating BQSR RecalibrationEngine to work correctly with empty BQSR tables. -- Previously would crash when a scatter/gather interval contained no usable data. -- Added unit test to cover this case.	2013-04-11 16:27:59 -04:00
Mark DePristo	fb86887bf2	Fast algorithm for determining which kmers are good in a read -- old algorithm was O(kmerSize * readLen) for each read. New algorithm is O(readLen) -- Added real unit tests for the addKmersFromReads to the graph. Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.	2013-04-11 09:54:22 -04:00
Mark DePristo	bf42be44fc	Fast DeBruijnGraph creation using the kmer counter -- The previous creation algorithm used the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to the graph add edge kmer1 -> kmer2 in the graph, if it's not present (does check) update edge count by 1 if kmer1 -> kmer2 already existed in the graph -- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)). This is actually pretty expensive because getting and adding edges is expensive in jgrapht. -- The new approach uses the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap for each kmer pair 1 and 2 in the hash counter add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map update edge count by count from map if kmer1 -> kmer2 already existed in the graph -- This algorithm ensures that we add far fewer edges -- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself. -- Overall runtimes are greatly reduced using this algorithm	2013-04-10 17:10:59 -04:00
Ryan Poplin	850be5e9da	Bug fix in SWPairwiseAlignment. -- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.	2013-04-10 16:04:37 -04:00
Mark DePristo	b115e5c582	Critical bugfix for CommonSuffixSplitter to avoid infinite loops -- The previous version would enter into an infinite loop in the case where we have a graph that looks like: X -> A -> B Y -> A -> B So that the incoming vertices of B all have the same sequence. This would cause us to remodel the graph endlessly by extracting the common sequence A and rebuilding exactly the same graph. Fixed and unit tested -- Additionally add a max to the number of simplification cycles that are run (100), which will throw an error and write out the graph for future debugging. So the GATK will always error out, rather than just go on forever -- After 5 rounds of simplification we start keeping a copy of the previous graph, and then check if the current graph is actually different from the previous graph. Equals here means that all vertices have equivalents in both graphs, as do all edges. If the two graphs are equal we stop simplifying. It can be a bit expensive but it only happens when we end up cycling due to the structure of the graph. -- Added a unittest that goes into an infinite loop (found empirically in running the CEU trio) and confirmed that the new approach aborts out correctly -- #resolves GSA-924 -- See https://jira.broadinstitute.org/browse/GSA-924 for more details -- Update MD5s due to change in assembly graph construction	2013-04-09 16:19:26 -04:00
Mark DePristo	51954ae3e5	HaplotypeCaller doesn't support EXACT_GENERAL_PLOIDY model -- HC now throws a UserException if this model is provided. Documented this option as not being supported in the HC in the docs for EXACT_GENERAL_PLOIDY	2013-04-09 15:18:42 -04:00
Mark DePristo	33ecec535d	Turn off the LD merging code by default -- It's just too hard to interpret the called variation when we merge variants via LD. -- Can now be turned on with -mergeVariantsViaLD -- Update MD5s	2013-04-09 10:08:06 -04:00
Mark DePristo	21410690a2	Address reviewer comments	2013-04-08 12:48:20 -04:00
Mark DePristo	caf15fb727	Update MD5s to reflect new HC algorithms and parameter values	2013-04-08 12:48:16 -04:00
Mark DePristo	6d22485a4c	Critical bugfix to ReduceRead functionality of the GATKSAMRecord -- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5]. The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding. Now GATKSAMRecord decodes the tag into the proper depths vector so that getReducedCounts() returns what one reasonably expects it to, and getReducedCount(i) merely looks up the value at i. Added unit test to ensure this behavior going forward. -- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.	2013-04-08 12:47:50 -04:00
Mark DePristo	5a54a4155a	Change key Haplotype default parameter values -- Extension increased to 200 bp -- Min prune factor defaults to 0 -- LD merging enabled by default for complex variants, only when there are 10+ samples for SNP + SNP merging -- Active region trimming enabled by default	2013-04-08 12:47:50 -04:00
Mark DePristo	3a19266843	Fix residual merge conflicts	2013-04-08 12:47:50 -04:00
Mark DePristo	9c7a35f73f	HaplotypeCaller no longer creates haplotypes that involve cycles in the SeqGraph -- The kbest paths algorithm now takes an explicit set of starting and ending vertices, which is conceptually cleaner and works for either the cycle or no-cycle models. Allowing cycles can be re-enabled with an HC command line switch.	2013-04-08 12:47:50 -04:00
Mark DePristo	5545c629f5	Rename Utils to GraphUtils to avoid conflicts with the sting.Utils class; fix broken unit test in SharedVertexSequenceSplitterUnitTest	2013-04-08 12:47:49 -04:00
Mark DePristo	15461567d7	HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype -- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes. This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype. All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).	2013-04-08 12:47:49 -04:00
Mark DePristo	9b5c55a84a	LikelihoodCalculationEngine will now only use reads longer than the minReadLength, which is currently fixed at 20 bp	2013-04-08 12:47:49 -04:00
Mark DePristo	af593094a2	Major improvements to HC that trims down active regions before genotyping -- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension). Radically speeds up calculations when using large active region extensions. The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extent as possible. The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes -- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter. The previous error corrector was just broken (conceptually) and was disabled by default in the engine. Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java -- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler	2013-04-08 12:47:49 -04:00
Mark DePristo	4d389a8234	Optimizations for HC infrastructure -- outgoingVerticesOf and incomingVerticesOf return a list not a set now, as the corresponding values must be unique since our super directed graph doesn't allow multiple edges between vertices -- Make DeBruijnGraph, SeqGraph, SeqVertex, and DeBruijnVertex all final -- Cache HashCode calculation in BaseVertex -- Better docs before the pruneGraph call	2013-04-08 12:47:49 -04:00
Mark DePristo	e916998784	Bugfix for head and tail merging code in SeqGraph -- The previous version of the head merging (and tail merging to a lesser degree) would inappropriately merge source and sinks without sufficient evidence to do so. This would introduce large deletion events at the start / end of the assemblies. Refactored code to require 20 bp of overlap in the head or tail nodes, as well as unit tested functions to support this.	2013-04-08 12:47:48 -04:00
Mark DePristo	2aac9e2782	More efficient ZipLinearChains algorithm -- Goes through the graph looking for chains to zip, accumulates the vertices of the chains, and then finally goes through and updates the graph in one big go. Vastly more efficient than the previous version, but unfortunately doesn't actually work now -- Also incorporate edge weight propagation into SeqGraph zipLinearChains. The edge weights for all incoming and outgoing edges are now their previous value, plus the sum of the internal chain edges / n such edges	2013-04-08 12:47:48 -04:00
Mark DePristo	f1d772ac25	LD-based merging algorithm for nearby events in the haplotypes -- Moved R^2 LD haplotype merging system to the utils.haplotype package -- New LD merging only enabled with HC argument. -- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enable it via a static variable -- A few misc. bug fixes in LDMerger itself -- Refactoring of Haplotype event splitting and merging code -- Renamed EventExtractor to EventMap -- EventMap has a static method that computes the event maps among n haplotypes -- Refactor Haplotype score and base comparators into their own classes and unit tested them -- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it. -- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator. For those who haven't seen this testing idiom, have a look, it's very useful -- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population. -- Fixed fundamental bug in the way the previous R^2 implementation worked -- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are -- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed. Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events -- Bugfixes for EventMap in the HaplotypeCaller as well. Previous version was overly restrictive, requiring that the first event to make it into a block substitution was a snp. In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M. The new code supports this. UnitTested and documented as well.
LDMerger handles case where merging two alleles results in a no-op event. Merging CA/C + A/AA -> CAA/CAA -> no op. Handles this case by removing the two events. UnitTested -- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled -- This new version does a much more specific test (that's actually right). Here's the new algorithm: * Compute probability that two variants are in phase with each other and that no * compound hets exist in the population. * * Implemented as a likelihood ratio test of the hypothesis: * * x11 and x22 are the only haplotypes in the populations * * vs. * * all four haplotype combinations (x11, x12, x21, and x22) all exist in the population. * * Now, since we have to have both variants in the population, we exclude the x11 & x11 state. So the * p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22). * * Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are: * * - P(x11 & x12 & x21) -- we have hom-ref and both hets * - P(x22 & x12 & x21) -- we have hom-alt and both hets * - P(x22 & x12) -- one haplotype is 22 and the other is het 12 * - P(x22 & x21) -- one haplotype is 22 and the other is het 21	2013-04-08 12:47:48 -04:00
Mark DePristo	67cd407854	The GenotypingEngine now uses the samples from the mapping of Samples -> PerReadAllele likelihoods instead of passing around a redundant list of samples	2013-04-08 12:47:47 -04:00
Mark DePristo	0310499b65	System to merge multiple nearby alleles into block substitutions -- Block substitution algorithm that merges nearby events based on distance. -- Also does some cleanup of GenotypingEngine	2013-04-08 12:47:47 -04:00
Mark DePristo	bff13bb5c5	Move Haplotype class to its own package in utils	2013-04-08 12:47:47 -04:00
Mauricio Carneiro	ebe2edbef3	Fix caching indices in the PairHMM Problem: -------- PairHMM was generating positive likelihoods (even after the re-work of the model) Solution: --------- The caching indices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM. Summarized Changes: ------------------ * Redesign the matrices to have only 1 col/row of padding instead of 2. * PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start) * Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version) * Adjust the prior and probability matrices to be one based (logless) * Update Log10PairHMM to work with prior and probability matrices as well * Move prior and probability matrices to parent class * Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations * Simple cleanup of PairHMMUnitTest class for a little speedup * Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base). * Create static indices for the transition probabilities (for better readability) [fixes #47399227]	2013-04-08 11:05:12 -04:00
Eric Banks	6253ba164e	Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs * This occurred when one or more alleles were lost from the record after selection * Discussed here: http://gatkforums.broadinstitute.org/discussion/comment/4718#Comment_4718 * Added some integration tests for --keepOriginalAC (there were none before)	2013-04-05 00:53:28 -04:00
Eric Banks	7897d52f32	Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec). * As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270 * This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF * Added integration test to confirm failure with User Error * Removed illegal header line in KB test VCF that was causing related tests to fail.	2013-04-05 00:52:32 -04:00
Ryan Poplin	8a93bb687b	Critical bug fix for the case of duplicate map calls in ActiveRegionWalkers with exome interval lists. -- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create duplicate active regions. -- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside of the current interval. -- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.	2013-04-03 13:15:30 -04:00
Mark DePristo	bb42c90f2b	Use LinkedHashSets in incoming and outgoing vertex functions in BaseGraph -- Using a LinkedHashSet changed the md5 for HCTestComplexVariants.	2013-04-02 17:58:20 -04:00
David Roazen	b4b58a3968	Fix unprintable character in a comment from the BaseEdge class Compiler warnings about this were starting to get to me...	2013-04-02 14:24:23 -04:00
Mark DePristo	c191d7de8c	Critical bugfix for CommonSuffixSplitter -- Graphs with cycles from the bottom node to one of the middle nodes would introduce an infinite cycle in the algorithm. Created unit test that reproduced the issue, and then fixed the underlying issue.	2013-04-02 09:22:33 -04:00
Ryan Poplin	a58a3e7e1e	Merge pull request #134 from broadinstitute/mc_phmm_experiments PairHMM rework	2013-04-01 12:10:43 -07:00
Ryan Poplin	f65206e758	Two changes to HC GGA mode to make it more like the UG. -- Only try to genotype PASSing records in the alleles file -- Don't attempt to genotype multiple records with the same start location. Instead take the first record and throw a warning message.	2013-04-01 10:20:23 -04:00
Mark DePristo	7c83efc1b9	Merge pull request #135 from broadinstitute/mc_pgtag_fix Fixing @PG tag uniqueness issue	2013-03-31 11:36:40 -07:00
Eric Banks	7dd58f671f	Merge pull request #132 from broadinstitute/gda_filter_unmasked_sites Added small feature to VariantFiltration to filter sites outside of a gi...	2013-03-31 06:27:26 -07:00
Guillermo del Angel	9686e91a51	Added small feature to VariantFiltration to filter sites outside of a given mask: -- Sometimes it's desirable to specify a set of "good" regions and filter out other stuff (like say an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. New argument -filterNotInMask will reverse default logic and filter outside of a given mask. -- Added integration test, and made sure we also test with a BED rod.	2013-03-31 08:48:16 -04:00
Eric Banks	8e2094d2af	Updated AssessReducedQuals and applied it systematically to all ReduceReads integration tests. * Moved to protected for packaging purposes. * Cleaned up and removed debugging output. * Fixed logic for epsilons so that we really only test significant differences between BAMs. * Other small fixes (e.g. don't include low quality reduced reads in overall qual). * Most RR integration tests now automatically run the quals test on output. * A few are disabled because we expect them to fail in various locations (e.g. due to downsampling).	2013-03-31 00:27:14 -04:00
Mauricio Carneiro	ec475a46b1	Fixing @PG tag uniqueness issue The Problem: ------------ The SAM spec does not allow multiple @PG tags with the same id. Our @PG tag writing routines were allowing that to happen with the boolean parameter "keep_all_pg_records". How this fixes it: ------------------ This commit removes that option from all the utility functions and cleans up the code around the classes that used these methods off-spec. Summarized changes: ------------------- * Remove keep_all_pg_records option from setupWriter utility methods in Util * Update all walkers to now replace the last @PG tag of the same walker (if it already exists) * Cleanup NWaySamFileWriter now that it doesn't need to keep track of the keep_all_pg_records variable * Simplify the multiple implementations to setupWriter Bamboo: ------- http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL31 Issue Tracker: -------------- [fixes 47100885]	2013-03-30 20:31:33 -04:00
Mauricio Carneiro	68bf470524	making LoglessPairHMM final	2013-03-30 20:00:45 -04:00
Guillermo del Angel	6b8bed34d0	Big bad bug fix: feature added to LeftAlignAndTrimVariants to left align multiallelic records didn't work. -- Corrected logic to pick biallelic vc to left align. -- Added integration test to make sure this feature is tested and feature to trim bases is also tested.	2013-03-30 19:31:28 -04:00
Mauricio Carneiro	0de6f55660	PairHMM rework The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because: # Initial conditions were not being set properly # Emission probabilities in the last row were not adding up to 1 The following commit fixes both by # averaging all potential start locations (giving an equal prior to the state machine in its first iteration -- allowing the read to start its alignment anywhere in the haplotype with equal probability) # discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state) Summarized changes: * Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm * Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1. * Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now) * Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity * Rename metric lengths to read and haplotype lengths for clarity * Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity * Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix * Remove unnecessary parameters from updateCell() * Fix the expected probabilities coming from the exact model in PairHMMUnitTest * Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables) * Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s. [fix 47164949]	2013-03-30 10:50:06 -04:00
Chris Hartl	74a17359a8	MathUtils.randomSubset() now uses Collections.shuffle() (indirectly, through the other methods that are tested), resulting in slightly different numbers of calls to the RNG, and ultimately different sets of selected variants. This commit updates the md5 values for the validation site selector integration test to reflect these new random subsets of variants that are selected.	2013-03-29 14:52:10 -04:00
Guillermo del Angel	8fbf9c947f	Upgrades and changes to LeftAlignVariants, motivated by 1000G consensus indel production: -- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken for complex records. -- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-alignable indels. -- Motivated by this, renamed walker to LeftAlignAndTrimVariants. -- Code refactoring, cleanup and bring up to latest coding standards. -- Added unit testing to make sure left alignment is performed correctly for all offsets. -- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.	2013-03-29 10:02:06 -04:00
Chris Hartl	73d1c319bf	Rarely-occurring logic bugfix for GenotypeConcordance, streamlining and testing of MathUtils Currently, the multi-allelic test is covering the following case: Eval A T,C Comp A C reciprocate this so that the reverse can be covered. Eval A C Comp A T,C And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample. This comes from the fact that "truth" reference alleles can be paired with any allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine: Eval: A G,T 0/0 2/0 2/2 1/1 Comp: A C,T 0/0 1/0 0/0 0/0 Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows: Sample1: ref called ref Sample2: alleles don't match (the alt allele of the comp was not assessed in eval) Sample3: ref called hom-var Sample4: alleles don't match (the alt allele of the eval was not assessed in comp) Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference. Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed: - dotProduct(several types). logDotProduct is used extensively, but not the real-space version. 
- vectorSum - array shuffle, random subset - countOccurances (general forms, the char form is used in the codebase) - getNMaxElements - array permutation - sorted array permutation - compare floats - sum() (for integer arrays and lists). Final keyword was extensively added to MathUtils. The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired). The PaperGenotyper still made use of one of these array methods; since it was the only walker using it, the method was migrated into the genotyper itself. In addition, more extensive tests were added for - logBinomialCoefficient (Newton's identity should always hold) - logFactorial - log10sumlog10 and its approximation All unit tests pass	2013-03-28 23:25:28 -04:00
MauricioCarneiro	a2b69790a6	Merge pull request #128 from broadinstitute/eb_rr_polyploid_compression_GSA-639	2013-03-28 06:39:43 -07:00
Mark DePristo	fde7d36926	Updating md5s due to changes in assembly graph creation algorithms and default parameter	2013-03-27 15:31:24 -04:00
Mark DePristo	197d149495	Increase the maxNumHaplotypesInPopulation to 25 -- A somewhat arbitrary increase that will need some evaluation, but necessary to get good results on the AFR integration test.	2013-03-27 15:31:24 -04:00
Mark DePristo	66910b036c	Added new and improved suffix and node merging algorithms -- These new algorithms are more powerful than the restricted diamond merging algorithms, in that they can merge nodes with multiple incoming and outgoing edges. Together the splitter + merger algorithms will correctly merge many more cases than the original headless and tailless diamond merger. -- Refactored haplotype caller infrastructure into graphs package, code cleanup -- Cleanup new merging / splitting algorithms, with proper docs and unit tests -- Fix bug in zipping of linear chains. Because the multiplicity can be 0, protect ourselves with a max function call -- Fix BaseEdge.max unit test -- Add docs and some more unit tests -- Move error correct from DeBruijnGraph to DeBruijnAssembler -- Replaced uses of System.out.println with logger.info -- Don't make multiplicity == 0 nodes look like they should be pruned -- Fix toString of Path	2013-03-27 15:31:18 -04:00
Mark DePristo	39f2e811e5	Increase max cigar elements from SW before failing path creation to 20 from 6 -- This allows more diversity in paths, which is sometimes necessary when we cannot simplify graphs that have large bubbles	2013-03-26 14:27:18 -04:00
Mark DePristo	b1b615b668	BaseGraph shouldn't implement getEdge -- no idea why I added this	2013-03-26 14:27:18 -04:00
Mark DePristo	a97576384d	Fix bug in the HC not respecting the requested pruning	2013-03-26 14:27:18 -04:00
Mark DePristo	78c672676b	Bugfix for pruning and removing non-reference edges in graph -- Previous algorithms were applying pruneGraph inappropriately on the raw sequence graph (where each vertex is a single base). This results in overpruning of the graph, as pruneGraph really relied on the zipping of linear chains (and the sharing of weight this provides) to avoid over-pruning the graph. Probably we should think hard about this. This commit fixes this logic, so we zip the graph between pruning -- In the process, identified a fundamental problem with how we were trimming away vertices that occur on a path from the reference source to sink. In fact, we were leaving in any vertex that happened to be accessible from source, any vertices in cycles, and any vertex that wasn't the absolute end of a chain going to a sink. The new algorithm fixes all of this, using a BaseGraphIterator that's a general approach to walking the base graph. Other routines that use the same traversal idiom refactored to use this iterator. Added unit tests for all of these capabilities. -- Created new BaseGraphIterator, which abstracts common access patterns to graph, and use this where appropriate	2013-03-26 14:27:18 -04:00
Mark DePristo	ad04fdb233	PerReadAlleleLikelihoodMap getMostLikelyAllele returns a MostLikelyAllele object now -- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative. -- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads. All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as the reads only span a single event. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting)	2013-03-26 14:27:13 -04:00
Mark DePristo	2472828e1c	HC bug fixes: no longer create reference graphs with cycles -- Though not intended, it was possible to create reference graphs with cycles in the case where you started the graph with a homopolymer of length > the kmer. The previous test would fail to catch this case. Now it's not possible -- Lots of code cleanup and refactoring in this push. Split the monolithic createGraphFromSequences into simple calls to addReferenceKmersToGraph and addReadKmersToGraph which themselves share lower level functions like addKmerPairFromSeqToGraph. -- Fix performance problem with reduced reads and the HC, where we were calling add kmer pair for each count in the reduced read, instead of just calling it once with a multiplicity of count. -- Refactor addKmersToGraph() to use things like addOrUpdateEdge, now the code is very clear	2013-03-26 10:12:24 -04:00
Mark DePristo	1917d55dc2	Bugfix for DeBruijnAssembler: don't fail when read length > haplotype length -- The previous version would generate graphs that had no reference bases at all in the situation where the reference haplotype was < the longer read length, which would cause the kmer size to exceed the reference haplotype length. Now return immediately with a null graph when this occurs as opposed to continuing and eventually causing an error	2013-03-26 10:12:17 -04:00
Mark DePristo	464e65ea96	Disable error correcting kmers by default in the HC -- The error correction algorithm can break the reference graph in some cases by error correcting us into a bad state for the reference sequence. Because we know that the error correction algorithm isn't ideal, and worse, doesn't actually seem to improve the calling itself on chr20, I've simply disabled error correction by default and allowed it to be turned on with a hidden argument. -- In the process I've changed the assembly interface a bit, moving some common arguments up into the LocalAssemblyEngine, which are turned on/off via setter methods. -- Went through the updated arguments in the HC to be @Hidden and @Advanced as appropriate -- Don't write out an errorcorrected graph when debugging and error correction isn't enabled	2013-03-26 10:05:17 -04:00
Eric Banks	593d3469d4	Refactored the het (polyploid) consensus creation in ReduceReads. * It is now cleaner and easier to test; added tests for newly implemented methods. * Many fixes to the logic to make it work * The most important change was that after triggering het compression we actually need to back it out if it creates reads that incorporated too many softclips at any one position (because they get unclipped). * There was also an off-by-one error in the general code that only manifested itself with het compression. * Removed support for creating a het consensus around deletions (which was broken anyways). * Mauricio gave his blessing for this. * Het compression now works only against known sites (with -known argument). * The user can pass in one or more VCFs with known SNPs (other variants are ignored). * If no known SNPs are provided het compression will automatically be disabled. * Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their strandedness from normal reduced reads. * GATKSAMRecord now checks for this tag when determining whether or not the read is stranded. * This allows us to update the FisherStrand annotation to count het compressed reduced reads towards the FS calculation. * [It would have been nice to mark the normal reads as unstranded but then we wouldn't be backwards compatible.] * Updated integration tests accordingly with new het compressed bams (both for RR and UG). * In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling RR properly, so I fixed it too. * Also, the test in the UG engine for determining whether there are too many overlapping deletions is updated to handle RR. * I added a special hook in the RR integration tests to additionally run the systematic coverage checking tool I wrote earlier. * AssessReducedCoverage is now run against all RR integration tests to ensure coverage is not lost from original to reduced bam. 
* This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads from all but 1 sample (now fixed). * AssessReducedCoverage moved from private to protected for packaging reasons. * #resolve GSA-639 At this point, this commit encompasses most of what is needed for het compression to go live. There are still a few TODO items that I want to get in before the 2.5 release, but I will save those for a separate branch because as it is I feel bad for the person who needs to review all these changes (sorry, Mauricio).	2013-03-25 09:34:54 -04:00
Mark DePristo	965043472a	Vastly more powerful, cleaner graph simplification approach -- Generalizes previous node merging and splitting approaches. Can split common prefixes and suffixes among nodes, build a subgraph representing this new structure, and incorporate it into the original graph. Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons). Fully UnitTested. These new algorithms can now really simplify diamond configurations as well as ones with sources and sinks that arrive / depart linearly at a common single root node. -- This new suite of algorithms is fully integrated into the HC, replacing previous approaches -- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph. This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.	2013-03-23 17:40:55 -04:00
Ryan Poplin	c15453542e	Merge pull request #124 from broadinstitute/md_hc_lowmapq_read_filter HC now by default only uses reads with MAPQ >= 20 for assembly and calli...	2013-03-21 12:00:28 -07:00
Mark DePristo	7ae15dadbe	HC now by default only uses reads with MAPQ >= 20 for assembly and calling -- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements. The new version simply uses a read filter to remove them upfront. This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB -- Update MD5s to reflect changes due to no longer including mmq < 20 by default	2013-03-21 13:10:50 -04:00
Ryan Poplin	b9c331c2fa	Bug fix in HC gga mode. -- Don't try to test alleles which haven't had haplotypes assigned to them	2013-03-21 11:02:41 -04:00
Mark DePristo	aa7f172b18	Cap the computational cost of the kmer based error correction in the DeBruijnGraph -- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph. If the number of ops would exceed this threshold, the original graph is used. -- Overall the algorithm is just extremely computationally expensive, and actually doesn't implement the correct correction. So we live with these limitations while we continue to explore better algorithms -- Updating MD5s to reflect changes in assembly algorithms	2013-03-21 09:21:35 -04:00
Mark DePristo	d94b3f85bc	Increase NUM_BEST_PATHS_PER_KMER_GRAPH in DeBruijnAssembler to 25 -- The value of 11 was too small to properly return a real low-frequency variant in our 1000G AFR integration test.	2013-03-20 22:54:38 -04:00
Mark DePristo	6d7d21ca47	Bugfix for incorrect branch diamond merging algorithm -- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices. Fixed. In the process simplified the entire algorithm and codebase -- Resolves https://jira.broadinstitute.org/browse/GSA-884	2013-03-20 22:54:37 -04:00
Mark DePristo	3a8f001c27	Misc. fixes upon pull request review -- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!) -- Remove the maxHaplotypesToConsider feature of HC as it's not useful	2013-03-20 22:54:37 -04:00
Mark DePristo	d3b756bdc7	BaseVertex optimization: don't clone byte[] unnecessarily -- Don't clone sequence upon construction or in getSequence(), as these are frequently called, memory allocating routines and cloning will be prohibitively expensive	2013-03-20 22:54:37 -04:00
Mark DePristo	5226b24a11	HaplotypeCaller infrastructure cleanup and unit testing -- UnitTest for isRootOfDiamond along with key bugfix detected while testing -- Fix up the equals methods in BaseEdge. Now called hasSameSourceAndTarget and seqEquals. A much more meaningful naming -- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs -- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence -- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself -- Massive expansion of the SeqGraph unit tests. We now really test out the zipping and branch merging code. -- Near final cleanup of the current codebase -- DeBruijnVertex cleanup and optimizations. Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[]. Optimize the code to make use of this constraint	2013-03-20 22:54:37 -04:00
Mark DePristo	2e36f15861	Update md5s to reflect new downsampling and assembly algorithm output -- Only minor differences, with improvement in allele discovery where the sites differ. The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome	2013-03-20 22:54:37 -04:00
Mark DePristo	1fa5050faf	Cleanup, unit test, and optimize KBestPaths and Path -- Split Path from inner class of KBestPaths -- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation -- Path now properly typed throughout the code -- Path maintains an on-demand hashset of BaseEdges so that path.containsEdge is fast	2013-03-20 22:54:36 -04:00
Mark DePristo	98c4cd060d	HaplotypeCaller now uses SeqGraph instead of kmer graph to build haplotypes. -- DeBruijnAssembler functions are no longer static. This isn't the right way to unit test your code -- An a HaplotypeCaller command line option to use low-quality bases in the assembly -- Refactored DeBruijnGraph and associated libraries into base class -- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents. These DeBruijn versions now inherit from these base classes. Added some reasonable unit tests for the base and Debruijn edges and vertex classes. -- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct -- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split -- Moved generic methods in DeBruijnAssembler into BaseGraph -- Created a simple SeqGraph that contains SeqVertex objects -- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs -- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -Y -> a, a -> Z -- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges -- Fix all unit tests so they work with the new SeqGraph system. All tests passed without modification. -- Debugging makes it easier to tell which kmer graph contributes to a haplotype -- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector -- Remove unnecessary printing of cleaning info in BaseGraph -- Turn off kmer graph creation in DeBruijnAssembler.java -- Only print SeqGraphs when debugGraphTransformations is set to true -- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest. Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality. 
-- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs -- DebruijnVertex no longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now	2013-03-20 22:54:36 -04:00
Mark DePristo	0f4328f6fe	Basic kmer error correction algorithm for the HaplotypeCaller -- Error correction algorithm for the assembler. Only error correct reads to others that are exactly 1 mismatch away -- The assembler logic is now: build initial graph, error correct, merge nodes, prune dead nodes, merge again, make haplotypes. The * elements are new -- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing. -- Easier way to control the testing of the graph assembly algorithms -- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler -- Simple protected parsing function for making DeBruijnAssemblyGraph -- Change the default prune factor for the graph to 1, from 2 -- debugging graph transformations are controllable from command line	2013-03-20 22:54:36 -04:00
Mark DePristo	53a904bcbd	Bugfix for HaplotypeCaller: GSA-822 for trimming softclipped reads -- Previous version would not trim down soft clip bases that extend beyond the active region, causing the assembly graph to go haywire. The new code explicitly reverts soft clips to M bases with the ever useful ReadClipper, and then trims. Note this isn't a 100% fix for the issue, as it's possible that the newly unclipped bases might in reality extend beyond the active region, should their true alignment include a deletion in the reference. Needs to be fixed. JIRA added -- See https://jira.broadinstitute.org/browse/GSA-822 -- #resolve #fix GSA-822	2013-03-20 22:54:36 -04:00
Mark DePristo	ffea6dd95f	HaplotypeCaller now has the ability to only consider the best N haplotypes for genotyping -- Added a -dontGenotype mode for testing assembly efficiency -- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted	2013-03-20 22:54:36 -04:00
Mark DePristo	a783f19ab1	Fix for potential HaplotypeCaller bug in annotation ordering -- Annotations were being called on VariantContext that might need to be trimmed. Simply inverted the order of operations so trimming occurs before the annotations are added. -- Minor cleanup of call to PairHMM in LikelihoodCalculationEngine	2013-03-20 22:54:35 -04:00
Eric Banks	1fae750ebe	Merge pull request #120 from broadinstitute/aw_reduce_reads_clear_name_cache Clear ReduceReads name cache after each set of reads produced by ReduceR...	2013-03-20 19:47:42 -07:00
Guillermo del Angel	ea01dbf130	Fix to issue encountered when running HaplotypeCaller in GGA mode with data from other 1000G callers. In particular, someone produced a tandem repeat site with 57 alt alleles (sic) which made the caller blow up. Inelegant fix is to detect if # of alleles is > our max cached capacity, and if so, emit an informative warning and skip site. -- Added unit test to UG engine to cover this case. -- Commit to posterity private scala script currently used for 1000G indel consensus (still very much subject to changes). GSA-878 #resolve	2013-03-20 14:30:37 -04:00
Geraldine Van der Auwera	95a9ed853d	Made some documentation updates & fixes --Mostly doc block tweaks --Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.	2013-03-20 06:15:20 -04:00
Alec Wysoker	bccc9d79e5	Clear ReduceReads name cache after each set of reads produced by ReduceReadsStash. Name cache was filling up with names of all reads in entire file, which for large file eventually consumes all of memory. Only keep read name cache for the reads that are together in one variant region, so that a pair of reads within the same variant region will still be joined via read name. Otherwise the ability to connect a read to its mate is lost. Update MD5s in integration test to reflect altered output. Add new integration test that confirms that pair within variant region is joined by read name.	2013-03-19 14:12:33 -04:00
Ryan Poplin	0cf5d30dac	Bug fix in assembly for edge case in which the extendPartialHaplotype function was filling in deletions in the middle of haplotypes.	2013-03-15 14:20:25 -04:00
Ryan Poplin	b8991f5e98	Fix for edge case bug of trying to create insertions/deletions on the edge of contigs. -- Added integration test using MT that previously failed	2013-03-15 12:32:13 -04:00
Mark DePristo	2d35065238	QualityByDepth remaps QD values > 40 to a gaussian around 30 -- This is a temporary fix / hack to deal with the very high QD values that are generated by the haplotype caller when nearby events occur within reads. In that case, the QUAL field can be many fold higher than normal, and results in an inflated QD value. This hack projects such high QD values back into the good range (as these are good variants in general) so they aren't filtered away by VQSR. -- The long-term solution to this problem is to move the HaplotypeCaller to the full bubble calling algorithm -- Update md5s	2013-03-14 16:09:41 -04:00
droazen	0fd9f0e77c	Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837 Fixed the logic of the @Output annotation and its interaction with 'required'	2013-03-14 12:52:00 -07:00
Ryan Poplin	38914384d1	Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the simplified stats for AssessNA12878.	2013-03-14 14:44:18 -04:00
Geraldine Van der Auwera	61349ecefa	Cleaned up annotations - Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden - Removed unnecessary check in AverageAltAlleleLength	2013-03-14 14:26:48 -04:00
Eric Banks	7cab709a88	Fixed the logic of the @Output annotation and its interaction with 'required'. ALL GATK DEVELOPERS PLEASE READ NOTES BELOW: I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag. * The 'defaultToStdout' tag lets walkers specify whether to default to stdout if -o is not provided. * The logic for @Output is now: * if required==true then -o MUST be provided or a User Error is generated. * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided. * this is the default behavior (i.e. @Output with no modifiers). * if required==false and defaultToStdout==false then the output object is null. * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878). * I have updated walkers so that previous behavior has been maintained (as best I could). * In general, all @Outputs with default long/short names have required=false. * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this) * I added unit tests for @Output changes with David's help (thanks!). * #resolve GSA-837	2013-03-14 11:58:51 -04:00
Mark DePristo	b5b63eaac7	New GATKSAMRecord concept of a strandless read, update to FS -- Strandless GATK reads are ones that don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads. Added GATKSAMRecord support for such reads, along with unit tests -- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments -- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands. This means that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together. -- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called. Added new GATKSAMRecord method setReducedCounts() that does the right thing. Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone. -- Update MD5s for change to FS calculation. Differences are just minor updates to the FS	2013-03-13 11:16:36 -04:00
MauricioCarneiro	4403e3572a	Merge pull request #94 from broadinstitute/gg_gatkdoc_docfixes_GSATDG-111	2013-03-12 13:02:35 -07:00
MauricioCarneiro	3a16ba04d4	Merge pull request #97 from broadinstitute/eb_refactor_sliding_window Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug	2013-03-12 12:27:26 -07:00
Geraldine Van der Auwera	f972963918	Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs) GATK-73 updated docs for bqsr args GATK-9 differentiate CountRODs from CountRODsByRef GATK-76 generate GATKDoc for CatVariants GATK-4 made resource arg required GATK-10 added -o, some docs to CountMales; some docs to CountLoci GATK-11 fixed by MC's -o change; straightened out the docs. GATK-77 fixed references to wiki GATK-76 Added Ami's doc block GATK-14 Added note that these annotations can only be used with VariantAnnotator GATK-15 specified required=false for two arguments GATK-23 Added documentation block GATK-33 Added documentation GATK-34 Added documentation GATK-32 Corrected arg name and docstring in DiffObjects GATK-32 Added note to DO doc about reference (required but unused) GATK-29 Added doc block to CountIntervals GATK-31 Added @Output PrintStream to enable -o GATK-35 Touched up docs GATK-36 Touched up docs, specified verbosity is optional GATK-60 Corrected GContent annot module location in gatkdocs GATK-68 touched up docs and arg docstrings GATK-16 Added note of caution about calling RODRequiringAnnotations as a group GATK-61 Added run requirements (num samples, min genotype quality) Tweaked template and generic doc block formatting (h2 to h3 titles) GATK-62 Added a caveat to HR annot Made experimental annotation hidden GATK-75 Added setup info regarding BWA GATK-22 Clarified some argument requirements GATK-48 Clarified -G doc comments GATK-67 Added arg requirement GATK-58 Added annotation and usage docs GSATDG-96 Corrected doc Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)	2013-03-12 10:57:14 -04:00
Eric Banks	05e69b6294	Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug. * Allow RR to write its BAM to stdout by setting required=true for @Output. * Fixed bug in sliding window where a break in coverage after a long stretch without a variant region was causing a doubling of all the reads before the break. * Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods. * Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().	2013-03-12 09:06:55 -04:00
Ryan Poplin	c96fbcb995	Use the indel heterozygosity prior when calling indels with the HC	2013-03-11 14:12:43 -04:00
Guillermo del Angel	695723ba43	Two features useful for ancient DNA processing. Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly. Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (on the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert. If this adaptor is not removed, data will not be alignable. There are third party tools that remove adaptor and potentially merge read pairs, but they are cumbersome to use and require precise knowledge of the library construction and adaptor sequence. -- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence. -- Unit tests added for trimming operation. -- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data. -- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs). Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced. -- Added -noPrior argument to StandardCallerArgumentCollection. -- Added option not to fill priors if such argument is set. -- Added an integration test.	2013-03-09 18:18:13 -05:00
Yossi Farjoun	baad965a57	- Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed. - This was needed since samples with spaces in their names are regularly found in the picard pipeline. - Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly) - Cleaned up the unit tests using a @DataProvider (I'm in love...). - Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)	2013-03-07 13:04:24 -05:00
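To illustrate why delimiting by tab only matters for sample IDs that contain spaces, here is a minimal sketch. It is not the loadContaminationFile parser itself: the class and method names are hypothetical, and the two-column layout (sample ID, then a contamination fraction) is assumed for the example.

```java
import java.util.Arrays;

/** Hypothetical sketch contrasting tab-only and whitespace splitting; not the GATK parser. */
final class ContaminationLineSketch {

    /** Tab-only delimiting keeps a sample ID with embedded spaces in one field. */
    static String[] splitOnTab(String line) {
        return line.split("\t");
    }

    /** Whitespace delimiting breaks such a sample ID into several fields. */
    static String[] splitOnWhitespace(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        // Assumed layout: <sample ID> TAB <contamination fraction>
        String line = "Sample One\t0.02";
        System.out.println(Arrays.toString(splitOnTab(line)));        // [Sample One, 0.02]
        System.out.println(Arrays.toString(splitOnWhitespace(line))); // [Sample, One, 0.02]
    }
}
```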
Eric Banks	3759d9dd67	Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine. * ReadTransformers can say they must be first, must be last, or don't care. * By default, none of the existing ones care about ordering except BQSR (must be first). * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR. * The engine now orders the read transformers up front before applying iterators. * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully). * Added unit tests.	2013-03-06 12:38:59 -05:00
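The ordering scheme described above (must be first, must be last, or don't care, with a failure when two enabled transformers make the same exclusive claim) can be sketched roughly as follows. This is an illustration under assumptions, not the engine's code: every type and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of ordering read transformers by a declared constraint. */
final class TransformerOrderingSketch {

    /** Declaration order doubles as the sort order. */
    enum OrderingConstraint { MUST_BE_FIRST, DO_NOT_CARE, MUST_BE_LAST }

    static final class Transformer {
        final String name;
        final OrderingConstraint constraint;

        Transformer(String name, OrderingConstraint constraint) {
            this.name = name;
            this.constraint = constraint;
        }
    }

    /** Returns transformer names in application order, or fails on incompatible constraints. */
    static List<String> order(List<Transformer> enabled) {
        int first = 0;
        int last = 0;
        for (Transformer t : enabled) {
            if (t.constraint == OrderingConstraint.MUST_BE_FIRST) first++;
            if (t.constraint == OrderingConstraint.MUST_BE_LAST) last++;
        }
        if (first > 1 || last > 1) {
            // Two transformers cannot both be first (or both be last).
            throw new IllegalStateException("incompatible read transformer ordering constraints");
        }
        List<Transformer> sorted = new ArrayList<>(enabled);
        // List.sort is stable, so transformers that don't care keep their relative order.
        sorted.sort(Comparator.comparing((Transformer t) -> t.constraint));
        List<String> names = new ArrayList<>();
        for (Transformer t : sorted) {
            names.add(t.name);
        }
        return names;
    }
}
```

With BAQ declared as DO_NOT_CARE and BQSR as MUST_BE_FIRST, `order` places BQSR ahead of BAQ whatever the input order, which is the behavior the forum bug report called for.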
Eric Banks	78721ee09b	Added new walker to split MNPs into their allelic primitives (SNPs). * Can be extended to complex alleles at some point. * Currently only works for bi-allelics (documented). * Added unit and integration tests.	2013-03-05 23:16:42 -05:00
Eric Banks	bbbaf9ad20	Revert push from stable (I forgot that pushing from stable overwrites current unstable changes)	2013-03-05 09:06:02 -05:00
Eric Banks	a037423225	Merged bug fix from Stable into Unstable	2013-03-05 09:03:48 -05:00
Eric Banks	7e1bfd6a7c	Included an accidental change from unstable into the previous push	2013-03-05 09:03:31 -05:00
Eric Banks	bd4e4f4ee3	Merged bug fix from Stable into Unstable	2013-03-04 23:24:44 -05:00
Eric Banks	b715218bfe	Fix for mismatching indel quals error: need to adjust for softclips just like we do for bases and normal quals.	2013-03-04 23:23:18 -05:00
Ryan Poplin	ce7554e9d6	Merged bug fix from Stable into Unstable	2013-03-04 12:36:04 -05:00
Ryan Poplin	0697594778	Active regions that don't contain any usable reads should just be skipped over instead of throwing an IllegalStateException.	2013-03-04 12:35:40 -05:00
Mark DePristo	42d3919ca4	Expanded functionality for writing BAMs from HaplotypeCaller -- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV. -- Previous functionality maintained, with bug fixes -- Haplotype BAM writing code now lives in utils -- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes. -- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes -- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best -- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars -- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils -- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization. -- Added extensive docs to HaplotypeCaller on how to use this capability -- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC -- Cleaned up SWPairwiseAlignment. Refactored out the big main and supplementary static methods. Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW -- Integration test to make sure we can actually write a BAM for each mode. This test only ensures that the code runs and doesn't exception out. It doesn't actually enforce any MD5s -- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype. 
Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case -- Writes out haplotypes for both all haplotype and called haplotype mode -- Haplotype writers now get the active region call, regardless of whether an actual call was made. Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter	2013-03-03 12:07:29 -05:00
David Roazen	c5c99c8339	Split long-running integration test classes into multiple classes This is to facilitate the current experiment with class-level test suite parallelism. It's our hope that with these changes, we can get the runtime of the integration test suite down to 20 minutes or so. -UnifiedGenotyper tests: these divided nicely into logical categories that also happened to distribute the runtime fairly evenly -UnifiedGenotyperPloidy: these had to be divided arbitrarily into two classes in order to halve the runtime -HaplotypeCaller: turns out that the tests for complex and symbolic variants make up half the runtime here, so merely moving these into a separate class was sufficient -BiasedDownsampling: most of these tests use excessively large intervals that likely can't be reduced without defeating the goals of the tests. I'm disabling these tests for now until they can either be redesigned to use smaller intervals around the variants of interest, or refactored into unit tests (creating a JIRA for Yossi for this task)	2013-03-01 13:55:23 -05:00
depristo	cac3f80c64	Merge pull request #73 from broadinstitute/eb_remove_nested_hashmap_GSA-732 Replace uses of NestedHashMap with NestedIntegerArray.	2013-02-28 05:19:56 -08:00
Eric Banks	d2904cb636	Update docs for RTC.	2013-02-27 14:56:44 -05:00
Eric Banks	69b8173535	Replace uses of NestedHashMap with NestedIntegerArray. * Removed from codebase NestedHashMap since it is unused and untested. * Integration tests change because the BQSR CSV is now sorted automatically. * Resolves GSA-732	2013-02-27 14:03:39 -05:00
Alec Wysoker	c8368ae2a5	Eliminate 7-element arrays in BaseCounts and BaseAndQualsCount and replace with in-line primitive attributes. This is ugly but reduces heap overhead, and changes are localized. When used in conjunction with Mauricio's FastUtil changes it saves an additional 9% or so of execution time.	2013-02-27 12:49:56 -05:00
David Roazen	6466463d5a	Merged bug fix from Stable into Unstable	2013-02-26 21:54:54 -05:00
David Roazen	12a3d7ecad	Fix licenses on files modified in 2.4-1	2013-02-26 21:53:17 -05:00
David Roazen	a53b4a7521	Merged bug fix from Stable into Unstable	2013-02-26 21:41:13 -05:00
David Roazen	65d31ba4ad	Fix runtime public -> protected dependencies in the test suite -replace unnecessary uses of the UnifiedGenotyper by public integration tests with PrintReads -move NanoSchedulerIntegrationTest to protected, since it's completely dependent on the UnifiedGenotyper	2013-02-26 21:19:12 -05:00
depristo	93205154b5	Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776 Eb fix pairhmm unittest gsa 776	2013-02-26 11:56:58 -08:00
Eric Banks	734353e9df	Merge pull request #60 from broadinstitute/mc_fastutil_GSATDG-83 Brought all of ReduceReads to fastutils	2013-02-26 11:56:41 -08:00
David Roazen	8b29030467	Change default downsampling coverage target for the HaplotypeCaller to 250 -was previously set to 30, which seems far too aggressive given that with ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any pileup returned by LIBS -250 is a more conservative default used by the UG -can adjust down/up later based on further experiments (GSA-699 will remain open) -verified with Ryan that all integration test differences are either innocent or represent an improvement GSA-699	2013-02-26 09:33:25 -05:00
Eric Banks	396b7e0933	Fixed the intermittent PairHMM unit test failure. The issue here is that the OptimizedLikelihoodTestProvider uses the same basic underlying class as the BasicLikelihoodTestProvider and we were using the BasicTestProvider functionality to pull out tests of that class; so if the optimized tests were run first we were unintentionally running those same tests again with the basic ones (but expecting different results).	2013-02-25 15:05:13 -05:00
Eric Banks	7519484a38	Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods.	2013-02-25 15:04:23 -05:00
Ryan Poplin	89e2943dd1	The maximum kmer length is derived from the reads. -- This is done to take advantage of longer reads which can produce less ambiguous haplotypes -- Integration tests change for HC and BiasedDownsampling	2013-02-25 14:40:25 -05:00
Mauricio Carneiro	0ff3343282	Addressing Eric's comments -- added @param docs to the new variables -- made all variables final -- switched to string builder instead of String for performance. GSATDG-83	2013-02-25 13:33:47 -05:00
Mauricio Carneiro	9e5a31b595	Brought all of ReduceReads to fastutils -- Added unit tests to ReduceReads name compression -- Updated reduce reads walker for unit testing GSATDG-83	2013-02-23 22:53:23 -05:00
Ryan Poplin	6a639c8ffc	Replace Smith-Waterman alignment with the bubble traversal. -- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph. -- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime. -- Refactoring graph functions into a new DeBruijnAssemblyGraph class. -- Bug fix in path.getBases(). -- Adding validation code to the assembly engine. -- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class. -- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes. -- Added ability to ignore bubbles that are too divergent from the reference -- Max kmer can't be bigger than the extension size. -- Reverse the order that we create the assembly graphs so that the bigger kmers are used first. -- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment. -- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation. -- Updating HaplotypeCaller and BiasedDownsampling integration tests. -- Rebased everything into one commit as requested by Eric -- improvements to the bubble traversal are coming as a separate push	2013-02-22 15:42:16 -05:00
Mauricio Carneiro	e3f01673e1	Implementation of the find and diagnose Queue script -- Added 'uncovered intervals' output for FindCoveredIntervals -- updated scala script to make use of it.	2013-02-22 10:19:01 -05:00
Ryan Poplin	62e14f5b58	Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores.	2013-02-21 14:34:17 -05:00
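The overflow mentioned above follows from Java's signed 8-bit byte, whose range is -128 to 127, while mapping qualities can be larger than 127. The sketch below only demonstrates the narrowing cast. The commit message does not say how LikelihoodCalculationEngine was fixed, so the clamping helper is one possible guard shown for illustration, and all names are hypothetical.

```java
/** Hypothetical sketch of a mapping quality overflowing when narrowed to a byte. */
final class MappingQualityCastSketch {

    /** The buggy pattern: an int value above 127 wraps around to a negative byte. */
    static byte narrowed(int mappingQuality) {
        return (byte) mappingQuality;
    }

    /** One possible guard (an assumption, not the actual fix): cap before narrowing. */
    static byte clamped(int mappingQuality) {
        return (byte) Math.min(mappingQuality, Byte.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(narrowed(60));   // 60, small values are unaffected
        System.out.println(narrowed(255));  // -1, overflowed
        System.out.println(clamped(255));   // 127
    }
}
```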
Eric Banks	6996a953a8	Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample). These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons): * Don't unnecessarily overload Allele.getBases() in the Haplotype class. * Haplotype.getBases() was calling clone() on the byte array. * Added a constructor to Allele (and Haplotype) that takes in an Allele as input. * It makes a copy of the given allele without having to go through the validation of the bases (since the Allele has already been validated). * Rev'ed the variant jar accordingly. For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.	2013-02-21 10:14:11 -05:00
Eric Banks	551d33686c	Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761 Reduce memory footprint of SyntheticRead by replacing several Lists with...	2013-02-20 04:49:07 -08:00
Eric Banks	9dfdb9528b	Merge pull request #49 from broadinstitute/gda_hidden_ug_args Hide arguments related to reference sample operation in UG - for interna...	2013-02-19 16:18:32 -08:00
Eric Banks	0055a6f1cd	Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774 Fix to the Indel Realigner bug described in GSA-774	2013-02-19 16:16:48 -08:00
Guillermo del Angel	5a0a9bc488	Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished.	2013-02-19 19:06:42 -05:00
Mauricio Carneiro	371ea2f24c	Fixed IndelRealigner reference length bug (GSA-774) -- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases -- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases -- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them) -- pulled out ReadBin class so it can be testable -- added unit tests for ReadBin with soft-clips -- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads GSA-774 #resolve	2013-02-19 16:00:36 -05:00
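The TreeSet -> PriorityQueue switch above hinges on how the two collections treat elements that a comparator reports as equal: a TreeSet keeps only one of them, while a PriorityQueue keeps both. The sketch below shows that difference with a stand-in comparator on start position. It does not use Picard's SAMRecordCoordinateComparator or the ConstrainedMateFixer, and the Read class is hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

/** Hypothetical sketch: distinct elements that compare as equal in a TreeSet vs a PriorityQueue. */
final class EqualByComparatorSketch {

    static final class Read {
        final String name;
        final int start;

        Read(String name, int start) {
            this.name = name;
            this.start = start;
        }
    }

    /** Stand-in comparator: two different reads at the same start compare as equal. */
    static final Comparator<Read> BY_START = Comparator.comparingInt((Read r) -> r.start);

    static int treeSetSize() {
        TreeSet<Read> set = new TreeSet<>(BY_START);
        set.add(new Read("readA", 100));
        set.add(new Read("readB", 100)); // treated as a duplicate, so one read is lost
        return set.size();
    }

    static int priorityQueueSize() {
        PriorityQueue<Read> queue = new PriorityQueue<>(BY_START);
        queue.add(new Read("readA", 100));
        queue.add(new Read("readB", 100)); // ties are allowed, both reads are kept
        return queue.size();
    }

    public static void main(String[] args) {
        System.out.println(treeSetSize());       // 1
        System.out.println(priorityQueueSize()); // 2
    }
}
```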
Alec Wysoker	ab75e053da	Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static class that contains the attributes that were scattered across the several Lists.	2013-02-19 15:33:33 -05:00

772 Commits (0018af0c0af3100d220315cc0b21b76b86f0e415)