gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	a1093ad230	Optimization for ActiveRegion.removeAll -- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element. New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one. -- Update HaplotypeCaller filterReadsInRegion to use a Set not a List. -- Expanded the unit tests a bit for ActiveRegion.removeAll	2013-05-21 16:18:57 -04:00
Eric Banks	1f3624d204	Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing	2013-05-21 11:35:05 -04:00
Valentin Ruano Rubio	71bbb25c9e	Merge pull request #231 from broadinstitute/md_combinevariants_bugfix CombineVariants no longer adds PASS to unfiltered records	2013-05-20 14:28:20 -07:00
Mark DePristo	62fc88f92e	CombineVariants no longer adds PASS to unfiltered records -- [Delivers #49876703] -- Add integration test and test file -- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!	2013-05-20 16:53:51 -04:00
Ryan Poplin	507853c583	Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result. -- The troule stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.	2013-05-20 14:29:04 -04:00
Eric Banks	8a442d3c9f	@Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating.	2013-05-17 10:04:10 -04:00
Yossi Farjoun	9234a0efcd	Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs Bug fixes and missing interval functionality for Diagnose Targets While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred that unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule). It is with these cavets that I accept the pull request.	2013-05-16 09:25:54 -07:00
Chris Hartl	6da0aed30f	Update GCIT md5s to account for trivial changes to description strings	2013-05-14 19:45:30 -04:00
Yossi Farjoun	409a202492	Merge pull request #214 from broadinstitute/chartl_genotype_concordance_diploid_and_OGC Add overall genotype concordance to the genotype concordance tool. In ad...	2013-05-14 14:19:54 -07:00
Mauricio Carneiro	adcbf947bf	Update MD5s and the Diagnose Target scala script	2013-05-13 12:06:17 -04:00
Mauricio Carneiro	9eceae793a	Tool to manipulate intervals outside the GATK Performs basic set operations on intervals like union, intersect and difference between two or more intervals. Useful for techdev and QC purposes.	2013-05-13 11:56:24 -04:00
Mauricio Carneiro	3dbb86b052	Outputting missing intervals in DiagnoseTargets Problem ------ Diagnose Targets identifies holes in the coverage of a targetted experiment, but it only reports them doesn't list the actual missing loci Solution ------ This commit implements an optional intervals file output listing the exact loci that did not pass filters Itemized changes -------------- * Cache callable statuses (to avoid recalculation) * Add functionality to output missing intervals * Implement new tool to qualify the missing intervals (QualifyMissingIntervals) by gc content, size, type of missing coverage and origin (coding sequence, intron, ...)	2013-05-13 11:51:56 -04:00
Mauricio Carneiro	1466396a31	Diagnose target is outputting intervals out of order Problem ------- When the interval had no reads, it was being sent to the VCF before the intervals that just got processed, therefore violating the sort order of the VCF. Solution -------- Use a linked hash map, and make the insertion and removal all happen in one place regardless of having reads or not. Since the input is ordered, the output has to be ordered as well. Itemized changes -------------- * Clean up code duplication in LocusStratification and SampleStratification * Add number of uncovered sites and number of low covered sites to the VCF output. * Add new VCF format fields * Fix outputting multiple status when threshold is 0 (ratio must be GREATER THAN not equal to the threshold to get reported) [fixes #48780333] [fixes #48787311]	2013-05-13 11:50:22 -04:00
Mark DePristo	b4f482a421	NanoScheduled ActiveRegionTraversal and HaplotypeCaller -- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results -- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk -- Tests for NanoScheduling -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data -- Some misc. code cleanup of HaplotypeCaller -- Analysis scripts to assess performance of nano scheduled HC -- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.	2013-05-13 11:09:02 -04:00
Eric Banks	2f5ef6db44	New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points. * This version inherits from the original SW implementation so it can use the same matrix creation method. * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the right thing for indels at the edges of the alignments. * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy. * Lots of systematic testing added for this implementation. * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.	2013-05-13 09:36:39 -04:00
Mark DePristo	111e8cef0f	Merge pull request #219 from broadinstitute/eb_rr_multisample_fix Fix bug in Reduce Reads that arises in multi-sample mode.	2013-05-09 15:31:14 -07:00
Eric Banks	8b9c6aae3e	Fix bug in Reduce Reads that arises in multi-sample mode. * bitset could legitimately be in an unfinished state but we were trying to access it without finalizing. * added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode. * tests were easiest to implement as integration tests (this was a really complicated case).	2013-05-08 23:23:51 -04:00
Mark DePristo	fa8a47ceef	Replace DeBruijnAssembler with ReadThreadingAssembler Problem ------- The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp). This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant. Solution -------- The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers. The algorithm operates as follows: identify all non-unique kmers of size K among all reads and the reference for each sequence (ref and read): find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist for each base in the sequence from the starting vertex kmer: look at the existing outgoing nodes of current vertex V. If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue If no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well with even very short kmers. This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below: -- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes) -- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph. -- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used. -- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway. -- The old DebruijnAssembler is still available with a command line option -- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples. -- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes -- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler Infrastructure changes / improvements ------------------------------------- -- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler -- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers -- Refactored active region trimming code into ActiveRegionTrimmer class -- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default -- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions. -- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations. -- Many many more unit tests throughout the entire assembler, and in random utilities	2013-05-08 21:41:42 -04:00
Eric Banks	d242f1bba3	Secondary alignments were not handled correctly in IndelRealigner * This is emerging now because BWA-MEM produces lots of reads that are not primary alignments * The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it was getting confused by these secondary alignments * Added unit test to cover this case	2013-05-06 19:09:10 -04:00
Eric Banks	b53336c2d0	Added hidden mode for BQSR to force all read groups to be the same one. * Very useful for debugging sample-specific issues * This argument got lost in the transition from BQSR v1 to v2 * Added unit test to cover this case	2013-05-06 19:09:10 -04:00
Chris Hartl	6ff74deac7	Add overall genotype concordance to the genotype concordance tool. In addition, protect from non-diploid genotypes, which can cause very strange behavior. Update MD5 sums. As expected, md5 changes are consistent with the genotype concordance field being added to each output.	2013-05-06 13:06:30 -04:00
chartl	98021db264	Merge pull request #208 from broadinstitute/yf_fix_molten_GenotypeConcordance - Fixed a small bug in the printout of molten data in GenotypeConcordanc...	2013-05-06 08:42:06 -07:00
Guillermo del Angel	874dc8f9c1	Re-fix md5's that changed due to conflicting pushes	2013-05-03 14:59:16 -04:00
Mark DePristo	f42bb86bdd	e# This is a combination of 2 commits. Only try to clip adaptors when both reads of the pair are on opposite strands -- Read pairs that have unusual alignments, such as two reads both oriented like: <----- <----- where previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs. This caused us to clip extra good bases from reads. -- Update MD5s due change in adaptor clipping, which add some coverage in some places	2013-05-03 11:19:14 -04:00
Mark DePristo	2bcbdd469f	leftAlignCigarSequentially now supports haplotypes with insertions and deletions where the deletion allele was previously removed by the leftAlignSingleIndel during it's cleanup phase.	2013-05-03 09:32:05 -04:00
Guillermo del Angel	0c30a5ebc6	Rev'd up Picard to get PL fix: PLs were saturated to 32767 (Short.MAX_VALUE) when converting from GL to integers. Increase capping to Integer.MAX_VALUE (2^31-1) which should be enough for reasonable sites now. Integration tests change because some tests have some hyper-deep pileups where this case was hit	2013-05-02 16:31:43 -04:00
Yossi Farjoun	4b8b411b92	- Fixed a small bug in the printout of molten data in GenotypeConcordance Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc. This was only a problem in the text, _not_ the actual numbers, which were outputted correctly. - Updated MD5's after looking at diffs to verify that the change is what I expected.	2013-05-02 09:16:07 -04:00
David Roazen	f3c94a3c87	Update expected test output for Java 7 -Changes in Java 7 related to comparators / sorting produce a large number of innocuous differences in our test output. Updating expectations now that we've moved to using Java 7 internally. -Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent intermittent failures.	2013-05-01 16:18:01 -04:00
Eric Banks	58424e56be	Setting the reduce reads count tag was all wrong in a previous commit; fixing. RR counts are represented as offsets from the first count, but that wasn't being done correctly when counts are adjusted on the fly. Also, we were triggering the expensive conversion and writing to binary tags even when we weren't going to write the read to disk. The code has been updated so that unconverted counts are passed to the GATKSAMRecord and it knows how to encode the tag correctly. Also, there are now methods to write to the reduced counts array without forcing the conversion (and methods that do force the conversion). Also: 1. counts are now maintained as ints whenever possible. Only the GATKSAMRecord knows about the internal encoding. 2. as discussed in meetings today, we updated the encoding so that it can now handle a range of values that extends to 255 instead of 127 (and is backwards compatible). 3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.	2013-04-30 13:45:42 -04:00
Guillermo del Angel	20d3137928	Fix for indel calling with UG in presence of reduced reads: When a read is long enough so that there's no reference context available, the reads gets clipped so that it falls again within the reference context range. However, the clipping is incorrect, as it makes the read end precisely at the end of the reference context coordinates. This might lead to a case where a read might span beyond the haplotype if one of the candidate haplotypes is shorter than the reference context (As in the case e.g. with deletions). In this case, the HMM will not work properly and the likelihood will be bad, since "insertions" at end of reads when haplotype is done will be penalized and likelihood will be much lower than it should. -- Added check to see if read spans beyond reference window MINUS padding and event length. This guarantees that read will always be contained in haplotype. -- Changed md5's that happen when long reads from old 454 data have their likelihoods changed because of the extra base clipping.	2013-04-29 19:33:02 -04:00
Mark DePristo	0387ea8df9	Bugfix for ReadClipper with ReducedReads -- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts. Updated the hardClip operation to handle reduce reads, and added a unit test to make sure this works properly. Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read -- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously	2013-04-29 11:12:09 -04:00
Mark DePristo	5dd73ba2d1	Merge pull request #198 from broadinstitute/mc_reduce_reads_ds_doc Updates GATKDocs for ReduceReads downsampling	2013-04-27 05:49:47 -07:00
Mauricio Carneiro	76e997895e	Updates GATKDocs for ReduceReads downsampling [fixes #48258295]	2013-04-26 23:33:44 -04:00
Guillermo del Angel	4168aaf280	Add feature to specify Allele frequency priors by command line when calling variants. Use case: The default AF priors used (infinite sites model, neutral variation) is appropriate in the case where the reference allele is ancestral, and the called allele is a derived allele. Most of the times this is true but in several population studies and in ancient DNA analyses this might introduce reference biases, and in some other cases it's hard to ascertain what the ancestral allele is (normally requiring to look up homologous chimp sequence). Specifying no prior is one solution, but this may introduce a lot of artifactual het calls in shallower coverage regions. With this option, users can specify what the prior for each AC should be according to their needs, subject to the restrictions documented in the code and in GATK docs. -- Updated ancient DNA single sample calling script with filtering options and other cleanups. -- Added integration test. Removed old -noPrior syntax.	2013-04-26 19:06:39 -04:00
Mark DePristo	759c531d1b	Merge pull request #197 from broadinstitute/dr_disable_snpeff_version_check Add support for snpEff "GATK compatibility mode" (-o gatk)	2013-04-26 13:55:14 -07:00
David Roazen	7d90bbab08	Add support for snpEff "GATK compatibility mode" (-o gatk) -Do not throw an exception when parsing snpEff output files generated by not-officially-supported versions of snpEff, PROVIDED that snpEff was run with -o gatk -Requested by the snpEff author -Relevant integration tests updated/expanded	2013-04-26 15:47:15 -04:00
Mark DePristo	071fd67d55	Merge pull request #193 from broadinstitute/eb_contamination_fixing_for_reduced_reads Eb contamination fixing for reduced reads	2013-04-26 09:48:45 -07:00
Mark DePristo	92a6c7b561	Merge pull request #195 from broadinstitute/eb_exclude_sample_file_bug_in_select_variants Fixed bug reported on the forum where using the --exclude_sample_file ar...	2013-04-26 09:47:38 -07:00
Eric Banks	360e2ba87e	Fixed bug reported on the forum where using the --exclude_sample_file argument in SV was giving bad results. Added integration test. https://www.pivotaltracker.com/s/projects/793457/stories/47399245	2013-04-26 12:23:11 -04:00
Eric Banks	021adf4220	WTF - I thought we had disabled the randomized dithering of rank sum tests for integration tests?! Well, it wasn't done so I went ahead and did so. Lots of MD5 changes accordingly.	2013-04-26 11:24:05 -04:00
Eric Banks	ba2c3b57ed	Extended the allele-biased down-sampling functionality to handle reduced reads. Note that this works only in the case of pileups (i.e. coming from UG); allele-biased down-sampling for RR just cannot work for haplotypes. Added lots of unit tests for new functionality.	2013-04-26 11:23:17 -04:00
Mark DePristo	d20be41fee	Bugfix for FragmentUtils.mergeOverlappingPairedFragments -- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences. If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference. Updated the code to first clip the adapter sequences from the incoming fragments -- Update MD5s	2013-04-25 11:11:15 -04:00
Eric Banks	379a9841ce	Various bug fixes for recent Reduce Reads additions plus solution implemented for low MQ reads. 1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites use binomial prob and at high coverage sites revert to using the old base proportions. Then we get the best of both worlds. As a note, coverage refers to just the individual base counts and not the entire pileup. 2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We now use a PriorityQueue instead of a TreeSet to allow for such cases. 3. Each consensus keeps track of its own number of softclipped bases. There was no reason that that number should be shared between them. 4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now. Don't lose that information. Maybe we'll decide to change this in the future, but for now we are conservative. 5. Also implemented various small performance optimizations based on profiling. Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.	2013-04-24 18:18:50 -04:00
MauricioCarneiro	45fec382e7	Merge pull request #180 from broadinstitute/mc_diagnosetargets_missing_targets DiagnoseTargets Global Refactor	2013-04-24 14:54:55 -07:00
Mauricio Carneiro	367f0c0ac1	Split class names into stratification and metrics Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has it's own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric. Metrics are generalized by the Metric interface. (for generic access) Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)	2013-04-24 14:15:49 -04:00
Ryan Poplin	80131ac996	Adding the 1000G_phase1.snps.high_confidence callset to the GATK resource bundle for use in the April 2013 updated best practices.	2013-04-24 11:41:32 -04:00
Guillermo del Angel	2ab270cf3f	Corner case fix to General Ploidy SNP likelihood model. -- In case there are no informative bases in a pileup but pileup isn't empty (like when all bases have Q < min base quality) the GLs were still computed (but were all zeros) and fed to the exact model. Now, mimic case of diploid Gl computation where GLs are only added if # good bases > 0 -- I believe general case where only non-informative GLs are fed into AF calc model is broken and yields bogus QUAL, will investigate separately.	2013-04-23 21:13:18 -04:00
Mauricio Carneiro	8f8f339e4b	Abstract class for the statistics Addressing the code duplication issue raised by Mark.	2013-04-23 18:02:27 -04:00
Mauricio Carneiro	38662f1d47	Limiting access to the DT classes * Make most classes final, others package local * Move to diagnostics.diagnosetargets package * Aggregate statistics and walker classes on the same package for simplified visibility. * Make status list a LinkedList instead of a HashSet	2013-04-23 14:01:43 -04:00
Ryan Poplin	cb4ec3437a	After debate reverting SW parameter changes temporarily while we explore global SW plans.	2013-04-23 13:32:06 -04:00

1 2 3 4 5 ...

726 Commits (a1093ad230f1c24e4b9c2bf0e622d43ff2138e05)