gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	379a9841ce	Various bug fixes for recent Reduce Reads additions plus solution implemented for low MQ reads. 1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites use binomial prob and at high coverage sites revert to using the old base proportions. Then we get the best of both worlds. As a note, coverage refers to just the individual base counts and not the entire pileup. 2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We now use a PriorityQueue instead of a TreeSet to allow for such cases. 3. Each consensus keeps track of its own number of softclipped bases. There was no reason that that number should be shared between them. 4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now. Don't lose that information. Maybe we'll decide to change this in the future, but for now we are conservative. 5. Also implemented various small performance optimizations based on profiling. Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.	2013-04-24 18:18:50 -04:00
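The read-loss bug in item 2 above can be reproduced in miniature. The sketch below is not the SlidingWindow code; `ReadStub`, `EqualEndDemo` and the comparator are hypothetical stand-ins. It only shows that a TreeSet treats two elements that compare equal as duplicates, while a PriorityQueue keeps both.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical stand-in for a read; only the alignment end matters here.
class ReadStub {
    final String name;
    final int alignmentEnd;

    ReadStub(String name, int alignmentEnd) {
        this.name = name;
        this.alignmentEnd = alignmentEnd;
    }
}

class EqualEndDemo {
    // Assumed ordering: reads are compared by alignment end only.
    static final Comparator<ReadStub> BY_END = Comparator.comparingInt(r -> r.alignmentEnd);

    // A TreeSet silently rejects any element that compares equal to one it already holds.
    static int sizeInTreeSet(ReadStub... reads) {
        TreeSet<ReadStub> set = new TreeSet<>(BY_END);
        for (ReadStub r : reads) {
            set.add(r);
        }
        return set.size();
    }

    // A PriorityQueue keeps every element, including ties under the comparator.
    static int sizeInPriorityQueue(ReadStub... reads) {
        PriorityQueue<ReadStub> queue = new PriorityQueue<>(BY_END);
        for (ReadStub r : reads) {
            queue.add(r);
        }
        return queue.size();
    }

    public static void main(String[] args) {
        ReadStub mate1 = new ReadStub("mate1", 100);
        ReadStub mate2 = new ReadStub("mate2", 100);
        System.out.println("TreeSet size: " + sizeInTreeSet(mate1, mate2));
        System.out.println("PriorityQueue size: " + sizeInPriorityQueue(mate1, mate2));
    }
}
```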
MauricioCarneiro	45fec382e7	Merge pull request #180 from broadinstitute/mc_diagnosetargets_missing_targets DiagnoseTargets Global Refactor	2013-04-24 14:54:55 -07:00
Mauricio Carneiro	367f0c0ac1	Split class names into stratification and metrics Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has its own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric. Metrics are generalized by the Metric interface. (for generic access) Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)	2013-04-24 14:15:49 -04:00
Ryan Poplin	80131ac996	Adding the 1000G_phase1.snps.high_confidence callset to the GATK resource bundle for use in the April 2013 updated best practices.	2013-04-24 11:41:32 -04:00
Guillermo del Angel	2ab270cf3f	Corner case fix to General Ploidy SNP likelihood model. -- In case there are no informative bases in a pileup but pileup isn't empty (like when all bases have Q < min base quality) the GLs were still computed (but were all zeros) and fed to the exact model. Now, mimic case of diploid Gl computation where GLs are only added if # good bases > 0 -- I believe general case where only non-informative GLs are fed into AF calc model is broken and yields bogus QUAL, will investigate separately.	2013-04-23 21:13:18 -04:00
Mauricio Carneiro	8f8f339e4b	Abstract class for the statistics Addressing the code duplication issue raised by Mark.	2013-04-23 18:02:27 -04:00
Mauricio Carneiro	38662f1d47	Limiting access to the DT classes * Make most classes final, others package local * Move to diagnostics.diagnosetargets package * Aggregate statistics and walker classes on the same package for simplified visibility. * Make status list a LinkedList instead of a HashSet	2013-04-23 14:01:43 -04:00
Ryan Poplin	cb4ec3437a	After debate reverting SW parameter changes temporarily while we explore global SW plans.	2013-04-23 13:32:06 -04:00
Mauricio Carneiro	fdd16dc6f9	DiagnoseTargets refactor A plugin enabled implementation of DiagnoseTargets Summarized Changes: ------------------- * move argument collection into Thresholder object * make thresholder object private member of all statistics classes * rework the logic of the mate pairing thresholds * update unit and integration tests to reflect the new behavior * Implements Locus Statistic plugins * Extend Locus Statistic plugins to determine sample status * Export all common plugin functionality into utility class * Update tests accordingly [fixes #48465557]	2013-04-22 23:53:10 -04:00
Mauricio Carneiro	eb6308a0e4	General DiagnoseTargets documentation cleanup * remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps. * add gatkdocs to all the parameters * clean up the logic on callable status a bit (still need to be re-worked into a plugin system) * update integration tests	2013-04-22 23:53:09 -04:00
Mauricio Carneiro	b3c0abd9e8	Remove REF_N status from DiagnoseTargets This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply. [fixes #48442663]	2013-04-22 23:53:09 -04:00
Mauricio Carneiro	2b923f1568	fix for DiagnoseTargets multiple filter output Problem ------- Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval Solution -------- Only allow low median coverage check if there are reads [fixes #48442675]	2013-04-22 23:53:09 -04:00
Mauricio Carneiro	cf7afc1ad4	Fixed "skipped intervals" bug on DiagnoseTargets Problem ------- Diagnose targets was skipping intervals when they were not covered by any reads. Solution -------- Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of the BAM file. Summarized Changes ------------------ * Outputs all intervals it iterates over, even if uncovered * Outputs leftover intervals in the end of the traversal * Updated integration tests [fixes #47813825]	2013-04-22 23:53:09 -04:00
Mark DePristo	be66049a6f	Bugfix for CommonSuffixSplitter -- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases. In this case we'd eliminate the reference source vertex. Fixed by checking for this condition and aborting the simplification -- Update MD5s, including minor improvements	2013-04-21 19:37:01 -04:00
Mark DePristo	f0e64850da	Two sensitivity / specificity improvements to the haplotype caller -- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC. Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data -- Reduce the min usable base qual to 8 by default in the HC. In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL. This is a bit too aggressive of a requirement, so I lowered it to 8. -- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller NA12878 MEM chr20:10-11 Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL branch SNPS 1216 0 2 194 0 branch INDELS 312 2 13 71 7 master SNPS 1214 0 4 194 1 master INDELS 309 2 16 71 10 -- Update MD5s in the integration tests to reflect these two new changes	2013-04-17 12:32:31 -04:00
Eric Banks	5bce0e086e	Refactored binomial probability code in MathUtils. * Moved redundant code out of UGEngine * Added overloaded methods that assume p=0.5 for speed efficiency * Added unit test for the binomialCumulativeProbability method	2013-04-16 18:19:07 -04:00
Eric Banks	df189293ce	Improve compression in Reduce Reads by incorporating probabilistic model and global het compression The Problem: Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most regions in an exome) causes RR (with default settings) to consider it a variant region. This seriously hurts compression performance. The Solution: 1. We now use a probabilistic model for determining whether we can create a consensus (in other words, whether we can error correct a site) instead of the old ratio threshold. We calculate the cumulative binomial probability of seeing the given ratio and trigger consensus creation if that pvalue is lower than the provided threshold (0.01 by default, so rather conservative). 2. We also allow het compression globally, not just at known sites. So if we cannot create a consensus at a given site then we try to perform het compression; and if we cannot perform het compression then we just don't reduce the variant region. This way very wonky regions stay uncompressed, regions with one errorful read get fully compressed, and regions with one errorful locus get het compressed. Details: 1. -minvar is now deprecated in favor of -min_pvalue. 2. Added integration test for bad pvalue input. 3. -known argument still works to force het compression only at known sites; if it's not included then we allow het compression anywhere. Added unit tests for this. 4. This commit includes fixes to het compression problems that were revealed by systematic qual testing. Before finalizing het compression, we now check for insertions or other variant regions (usually due to multi-allelics) which can render a region incompressible (and we back out if we find one). We were checking for excessive softclips before, but now we add these tests too. 5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus out instead of backing out all of them. 6. We no longer create a mini read at the stop of the variant window for het compression. Instead, we allow it to be part of the next global consensus. 7. The coverage test is no longer run systematically on all integration tests because the quals test supersedes it. The systematic quals test is now much stricter in order to catch bugs and edge cases (very useful!). 8. Each consensus (both the normal and filtered) keeps track of its own mapping qualities (previously, the MQ for a consensus was affected by good and bad bases/reads). 9. We now completely ignore low quality bases, unless they are the only bases present in a pileup. This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15. 10. Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start with insertions from a header. Note that this commit must come serially before the next commit in which I am refactoring the binomial prob code in MathUtils (which is failing and slow).	2013-04-16 18:19:06 -04:00
Ryan Poplin	e0dfe5ca14	Restore the read filter function in the HaplotypeCaller.	2013-04-16 12:01:30 -04:00
Geraldine Van der Auwera	e176fc3af1	Merge pull request #159 from broadinstitute/md_bqsr_ion Trivial BQSR bug fixes and improvement	2013-04-16 08:54:47 -07:00
Ryan Poplin	936f4da1f6	Merge pull request #166 from broadinstitute/md_hc_persample_haplotypes Select the haplotypes we move forward for genotyping per sample, not poo...	2013-04-16 08:46:56 -07:00
Mark DePristo	17982bcbf8	Update MD5s for VQSR header change	2013-04-16 11:45:45 -04:00
Mark DePristo	067d24957b	Select the haplotypes we move forward for genotyping per sample, not pooled -- The previous algorithm would compute the likelihood of each haplotype pooled across samples. This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes. The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects. After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller. -- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast. -- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected. -- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871. -- Remove old pooled likelihood model. It's worse than the current version in both single and multiple samples: 1000G EUR samples: 10Kb per sample: 7.17 minutes pooled: 7.36 minutes Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL per_sample SNPS 50 0 5 8 1 per_sample INDELS 6 0 7 2 1 pooled SNPS 49 0 6 8 1 pooled INDELS 5 0 8 2 1 100 kb per sample: 140.00 minutes pooled: 145.27 minutes Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL per_sample SNPS 144 0 22 28 1 per_sample INDELS 28 1 16 9 11 pooled SNPS 143 0 23 28 1 pooled INDELS 27 1 17 9 11 java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug haplotypes from samples: 8 seconds haplotypes from pools: 8 seconds java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug haplotypes from samples: 173.32 seconds haplotypes from pools: 167.12 seconds	2013-04-16 09:42:03 -04:00
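The per-sample voting step described in 067d24957b can be sketched as follows. This is an illustration only, not the HaplotypeCaller code: the class and method names are hypothetical, each sample's most likely haplotype pair is taken as already computed, and a homozygous pair is assumed to contribute a single vote.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

class HaplotypeVoteDemo {
    // bestPairPerSample: for each sample, the two haplotypes of its most likely pair.
    static List<String> select(List<String[]> bestPairPerSample, int maxHaplotypesInPopulation) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String[] pair : bestPairPerSample) {
            // Assumption: one vote per distinct haplotype per sample.
            for (String hap : new LinkedHashSet<>(Arrays.asList(pair))) {
                votes.merge(hap, 1, Integer::sum);
            }
        }
        // Rank by vote count, highest first; List.sort is stable, so ties keep first-seen order.
        List<String> ranked = new ArrayList<>(votes.keySet());
        ranked.sort((a, b) -> Integer.compare(votes.get(b), votes.get(a)));

        // Keep 2 * nSamples + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
        int limit = Math.min(2 * bestPairPerSample.size() + 1, maxHaplotypesInPopulation);
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {"H1", "H2"});
        pairs.add(new String[] {"H1", "H3"});
        pairs.add(new String[] {"H1", "H1"});
        System.out.println(select(pairs, 2));
    }
}
```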
Mark DePristo	5a74a3190c	Improvements to the VariantRecalibrator R plots -- VariantRecalibrator now emits plots with denormalized values (original values) instead of their normalized ((x - mu) / sigma) values, which helps to understand the distribution of values that are good and bad	2013-04-16 09:09:51 -04:00
Mark DePristo	564fe36d22	VariantRecalibrator's VQSR.vcf now contains NEG/POS labels -- It's useful to know which sites have been used in the training of the model. The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE -- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels	2013-04-16 09:09:47 -04:00
Mauricio Carneiro	9bfa5eb70f	Quick optimization to the PairHMM Problem -------- the logless HMM scale factor (to avoid double under-flows) was 10^300. Although this serves the purpose this value results in a complex mantissa that further complicates cpu calculations. Solution --------- initialize with 2^1020 (2^1023 is the max value), and adjust the scale factor accordingly.	2013-04-14 23:25:33 -04:00
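One way to see the difference between the two scale factors in 9bfa5eb70f is to inspect their IEEE 754 bit patterns. The sketch below is an illustration, not the PairHMM code; `ScaleFactorDemo` and `mantissaBits` are hypothetical names. A power of two has an all-zero fraction field, so scaling by it only shifts the exponent, whereas 10^300 carries a non-zero fraction.

```java
class ScaleFactorDemo {
    // Extract the 52 fraction (mantissa) bits of a double.
    static long mantissaBits(double d) {
        return Double.doubleToLongBits(d) & 0x000FFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        double powerOfTwo = Math.scalb(1.0, 1020); // 2^1020, exactly representable
        double powerOfTen = 1e300;                 // nearest double to 10^300

        System.out.println("2^1020 fraction bits: " + Long.toHexString(mantissaBits(powerOfTwo)));
        System.out.println("1e300  fraction bits: " + Long.toHexString(mantissaBits(powerOfTen)));

        // Any result computed with the scale factor has to be corrected by
        // subtracting log10(scale) when converting back to log10 space.
        System.out.println("log10(2^1020) = " + Math.log10(powerOfTwo));
    }
}
```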
Mark DePristo	3144eae51c	UnifiedGenotyper bugfix: don't create haplotypes with 0 bases -- The PairHMM no longer allows us to create haplotypes with 0 bases. The UG indel caller used to create such haplotypes. Now we assign -Double.MAX_VALUE likelihoods to such haplotypes. -- Add integration test to cover this case, along with private/testdata BAM -- [Fixes #47523579]	2013-04-13 14:57:55 -04:00
Mauricio Carneiro	f11c8d22d4	Updating java 7 md5's to java 6 md5's	2013-04-13 08:21:48 -04:00
Mark DePristo	b32457be8d	Merge pull request #163 from broadinstitute/mc_hmm_caching_again Fix another caching issue with the PairHMM	2013-04-12 12:34:49 -07:00
Mauricio Carneiro	403f9de122	Fix another caching issue with the PairHMM The Problem ---------- Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result. Solution -------- The HaplotypeCaller only initializes the PairHMM once and then feeds it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes. Summarized Changes ------------------ * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0) * Updated related MD5's Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9	2013-04-12 14:52:45 -04:00
Mark DePristo	0e627bce93	Slight update to Path SW parameters. -- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.	2013-04-12 12:43:52 -04:00
Mark DePristo	50cdffc61f	Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons Key improvement --------------- -- The haplotype caller was producing unstable calls when comparing the following two haplotypes: ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA in which the alt and ref haplotypes differ in having an indel at both the start and end of the bubble. The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble. ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls. I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches. -- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat systematically that we are finding many types of expected mutational events. -- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all General code cleanup -------------------- -- Move Smith-Waterman to its own package in utils -- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly. Deprecated old call to inline constants. This makes it easier to group all of the SW parameters into a single object for callers -- Update users of SW code to use new Parameter class -- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected -- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default. Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage	2013-04-11 18:22:55 -04:00
Mark DePristo	74196ff7db	Trivial BQSR bug fixes and improvement -- Ensure that BQSR works properly for an Ion Torrent BAM. (Added integration test and bam) -- Improve the error message when a unknown platform is found (integration test added)	2013-04-11 17:08:35 -04:00
Ryan Poplin	a507381a33	Updating BQSR RecalibrationEngine to work correctly with empty BQSR tables. -- Previously would crash when a scatter/gather interval contained no usable data. -- Added unit test to cover this case.	2013-04-11 16:27:59 -04:00
Mark DePristo	fb86887bf2	Fast algorithm for determining which kmers are good in a read -- old algorithm was O(kmerSize * readLen) for each read. New algorithm is O(readLen) -- Added real unit tests for the addKmersFromReads to the graph. Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.	2013-04-11 09:54:22 -04:00
Mark DePristo	bf42be44fc	Fast DeBruijnGraph creation using the kmer counter -- The previous creation algorithm used the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to the graph add edge kmer1 -> kmer2 in the graph, if it's not present (does check) update edge count by 1 if kmer1 -> kmer2 already existed in the graph -- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)). This is actually pretty expensive because get and add edges is expensive in jgrapht. -- The new approach uses the following algorithm: for each kmer1 -> kmer2 in each read add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap for each kmer pair 1 and 2 in the hash counter add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map update edge count by count from map if kmer1 -> kmer2 already existed in the graph -- This algorithm ensures that we add far fewer edges -- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself. -- Overall runtimes are greatly reduced using this algorithm	2013-04-10 17:10:59 -04:00
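The counting step in bf42be44fc can be sketched as below. This is a simplified illustration, not the DeBruijnAssembler code: `KmerPairCounter` is a hypothetical name, and the pair kmer1 -> kmer2 is keyed by the (k+1)-mer that spans both, which is an assumption made here for brevity. The graph is then touched once per distinct pair instead of once per occurrence.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KmerPairCounter {
    // Count every adjacent kmer pair across all reads in a hash map.
    // Key: the (k+1)-mer covering kmer1 = s[0, k) and kmer2 = s[1, k+1).
    static Map<String, Integer> countPairs(String[] reads, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String read : reads) {
            for (int i = 0; i + k + 1 <= read.length(); i++) {
                counts.merge(read.substring(i, i + k + 1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] reads = {"ACGTA", "ACGTA"};
        int k = 3;
        // One graph update per distinct pair, carrying the multiplicity from the map.
        for (Map.Entry<String, Integer> e : countPairs(reads, k).entrySet()) {
            String kmer1 = e.getKey().substring(0, k);
            String kmer2 = e.getKey().substring(1);
            System.out.println("addEdge " + kmer1 + " -> " + kmer2 + " x" + e.getValue());
        }
    }
}
```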
Ryan Poplin	850be5e9da	Bug fix in SWPairwiseAlignment. -- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.	2013-04-10 16:04:37 -04:00
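The failure mode described in 850be5e9da is a general one: a running maximum seeded with zero never updates when every candidate is negative. The sketch below is a minimal illustration with hypothetical names, not the SWPairwiseAlignment code, and it works on a flat array rather than a scoring matrix.

```java
class MaxScoreDemo {
    // Buggy pattern: the running maximum starts at zero, so an all-negative
    // input never updates it and no position is reported (-1).
    static int argMaxSeededAtZero(int[] scores) {
        int best = 0;
        int bestIndex = -1;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > best) {
                best = scores[i];
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    // Fixed pattern: seed with the smallest possible value so any score can win.
    static int argMaxSeededAtMin(int[] scores) {
        int best = Integer.MIN_VALUE;
        int bestIndex = -1;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > best) {
                best = scores[i];
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        int[] allNegative = {-5, -2, -9};
        System.out.println("seeded at zero: " + argMaxSeededAtZero(allNegative));
        System.out.println("seeded at MIN_VALUE: " + argMaxSeededAtMin(allNegative));
    }
}
```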
Mark DePristo	b115e5c582	Critical bugfix for CommonSuffixSplitter to avoid infinite loops -- The previous version would enter into an infinite loop in the case where we have a graph that looks like: X -> A -> B Y -> A -> B So that the incoming vertices of B all have the same sequence. This would cause us to remodel the graph endlessly by extracting the common sequence A and rebuilding exactly the same graph. Fixed and unit tested -- Additionally add a max to the number of simplification cycles that are run (100), which will throw an error and write out the graph for future debugging. So the GATK will always error out, rather than just go on forever -- After 5 rounds of simplification we start keeping a copy of the previous graph, and then check if the current graph is actually different from the previous graph. Equals here means that all vertices have equivalents in both graphs, as do all edges. If the two graphs are equal we stop simplifying. It can be a bit expensive but it only happens when we end up cycling due to the structure of the graph. -- Added a unittest that goes into an infinite loop (found empirically in running the CEU trio) and confirmed that the new approach aborts out correctly -- #resolves GSA-924 -- See https://jira.broadinstitute.org/browse/GSA-924 for more details -- Update MD5s due to change in assembly graph construction	2013-04-09 16:19:26 -04:00
Mark DePristo	51954ae3e5	HaplotypeCaller doesn't support EXACT_GENERAL_PLOIDY model -- HC now throws a UserException if this model is provided. Documented this option as not being supported in the HC in the docs for EXACT_GENERAL_PLOIDY	2013-04-09 15:18:42 -04:00
Mark DePristo	33ecec535d	Turn off the LD merging code by default -- It's just too hard to interpret the called variation when we merge variants via LD. -- Can now be turned on with -mergeVariantsViaLD -- Update MD5s	2013-04-09 10:08:06 -04:00
Mark DePristo	21410690a2	Address reviewer comments	2013-04-08 12:48:20 -04:00
Mark DePristo	caf15fb727	Update MD5s to reflect new HC algorithms and parameter values	2013-04-08 12:48:16 -04:00
Mark DePristo	6d22485a4c	Critical bugfix to ReduceRead functionality of the GATKSAMRecord -- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5]. The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding. Now GATKSAMRecord decodes the tag into the proper depths vector so that getReducedCounts() returns what one reasonably expects it to, and getReducedCount(i) merely looks up the value at i. Added unit test to ensure this behavior going forward. -- Changed the name of setReducedCounts() to setReducedCountsTag() as this function assumes that counts have already been encoded in the tag format.	2013-04-08 12:47:50 -04:00
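The encoding implied by the example in 6d22485a4c (tag [10, 5, -1, -5] for depths [10, 15, 9, 5]) is consistent with the first element holding an absolute depth and every later element holding an offset from that first value. The sketch below decodes under that assumption; `ReducedCountsDemo` and `decode` are hypothetical names, and this is not the GATKSAMRecord implementation.

```java
import java.util.Arrays;

class ReducedCountsDemo {
    // Assumed encoding: tag[0] is the first depth; tag[i] for i > 0 is
    // (depth[i] - depth[0]). Decoding adds the first value back.
    static int[] decode(int[] tag) {
        int[] depths = new int[tag.length];
        if (tag.length == 0) {
            return depths;
        }
        depths[0] = tag[0];
        for (int i = 1; i < tag.length; i++) {
            depths[i] = tag[0] + tag[i];
        }
        return depths;
    }

    public static void main(String[] args) {
        int[] tag = {10, 5, -1, -5};
        System.out.println(Arrays.toString(decode(tag)));
    }
}
```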
Mark DePristo	5a54a4155a	Change key Haplotype default parameter values -- Extension increased to 200 bp -- Min prune factor defaults to 0 -- LD merging enabled by default for complex variants, only when there are 10+ samples for SNP + SNP merging -- Active region trimming enabled by default	2013-04-08 12:47:50 -04:00
Mark DePristo	3a19266843	Fix residual merge conflicts	2013-04-08 12:47:50 -04:00
Mark DePristo	9c7a35f73f	HaplotypeCaller no longer creates haplotypes that involve cycles in the SeqGraph -- The kbest paths algorithm now takes an explicit set of starting and ending vertices, which is conceptually cleaner and works for either the cycle or no-cycle models. Allowing cycles can be re-enabled with an HC command line switch.	2013-04-08 12:47:50 -04:00
Mark DePristo	5545c629f5	Rename Utils to GraphUtils to avoid conflicts with the sting.Utils class; fix broken unit test in SharedVertexSequenceSplitterUnitTest	2013-04-08 12:47:49 -04:00
Mark DePristo	15461567d7	HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype -- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes. This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype. All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).	2013-04-08 12:47:49 -04:00
Mark DePristo	9b5c55a84a	LikelihoodCalculationEngine will now only use reads longer than the minReadLength, which is currently fixed at 20 bp	2013-04-08 12:47:49 -04:00
Mark DePristo	af593094a2	Major improvements to HC that trims down active regions before genotyping -- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension). Radically speeds up calculations when using large active region extensions. The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extend as possible. The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes -- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter. The previous error corrector was just broken (conceptually) and was disabled by default in the engine. Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java -- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler	2013-04-08 12:47:49 -04:00
Mark DePristo	4d389a8234	Optimizations for HC infrastructure -- outgoingVerticesOf and incomingVerticesOf return a list not a set now, as the corresponding values must be unique since our super directed graph doesn't allow multiple edges between vertices -- Make DeBruijnGraph, SeqGraph, SeqVertex, and DeBruijnVertex all final -- Cache HashCode calculation in BaseVertex -- Better docs before the pruneGraph call	2013-04-08 12:47:49 -04:00

684 Commits (43f1746eb912c8086bac3323a93146e41a39eba4)