So, the compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which holds the 10 values; all other statistics and the regular PLs are computed as before.
Note that the integration test had to be disabled, as the BCF2 codec apparently doesn't support writing genotype fields other than PL, DP, AD, GQ, FT, and GT.
Problem
-------
Qualify Missing Intervals only accepted GATK-formatted interval files for its coding sequence and bait parameters.
Solution
-------
There is no reason for such a limitation, so I removed all the code that did the parsing and used IntervalUtils instead (therefore, it now handles any type of interval file that the GATK can handle).
ps: Also added an average depth column to the output
- Added integration test to show that providing a contamination value and providing same value via a file results in the same VCF
- overrode default contamination value in test
1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout.
2. Updates to the KB assessment functionality:
a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.
3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived
the DeBruijn assembler.
4. Improvements to the likelihoods for the HC
a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need
to divide e by 3 because the observed base could have come from any of the non-observed alleles.
b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is
dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more
than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
turned into Q0s for now.
c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real
sites with off kilter allele balances and we occasionally lose them to contamination removal.
5. Improved the dangling tail merging implementation.
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction. This fixes an annoying bug where you'd perfectly assemble the sequence into the reference graph but then return a null graph, and then increase your kmer size, because null was also used to indicate assembly failure
-- Output format looks like:
20 10026072 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026073 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026074 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,121
20 10026075 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026076 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026077 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217
20 10026079 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,240
20 10026080 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,268
20 10026081 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,267
We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values. Currently these are calculated as ref vs. any non-ref value (mismatch or insertion), but they don't yet properly account for alignment uncertainty.
-- Can be enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning each read to its most likely haplotype and then evaluating the resulting pileups over the active region interval. The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures. Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for the HC
-- Control max indel size for the reference confidence model from the command line. Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
-- The previous code would adapter-clip before reverting soft clips; because we only clip the adapter where it's actually aligned (i.e., not in the soft clips), we were not removing adapter bases unless at least 1 bp of the adapter was aligned to the reference. Terrible.
-- Removed the broken logic of determining whether a read adaptor is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
I "fixed" this once before but instead of testing with unit tests I used integration tests.
Bad decision.
The proper fix is in now, with a bonafide unit test included.
Previous fixes and tests only covered trailing soft clips. Now that up-front
hard clipping is working properly, though, we were failing on those in the tool.
Added a patch for this as well as a separate test, independent of the soft clips,
to make sure that it's working properly.
This time we don't accidentally drop reads (phew), but this bug does cause us not to
update the alignment start of the mate. Fixed and added unit test to cover it.
-- Added the experimental LikelihoodRankSum, which required slightly more detailed access to the information managed by the base class, so added an overloaded getElementForRead that also provides access to the MostLikelyAllele class
-- Added base class default implementation of getElementForPileupElement() which returns null, indicating that the pileup version isn't supported.
-- Added @Override to many of the RankSum classes for safety's sake
-- Updates to GeneralCallingPipeline: annotate sites with dbSNP IDs,
-- R script to assess the value of annotations for VQSR
-- The VR, when the model is bad, may evaluate log10sumlog10 where some of the values in the vector are NaN. This case is now trapped in the VR and handled as before: the model is flagged as having failed and evaluation continues.
-- Currently we don't support writing a BAM file from the haplotype caller when nct is enabled. Check in initialize if this is the case, and throw a UserException
-- Previous version emitted command lines that look like:
##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."
the new version provides additional information on when the GATK was run and the GATK version in a nicer format:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">
-- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">
-- Removed the ProtectedEngineFeaturesIntegrationTest
-- Actual unit tests for these features!
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table
Implementation:
* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
Implementation details:
* Added tool class *.AnalyzeCovariates
* Added convenient addAll method to Utils to be able to add elements of an array.
* Added parameter comparison methods to the RecalibrationArgumentCollection class in order to verify that multiple input recalibration reports are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
-- Changed default HMM model.
-- Removed check.
-- Changed md5's: PLs in the high 100s change by a point or two due to the new implementation.
-- Resulting performance improvement is about 30 to 50% less runtime when using -glm INDEL.
-- numPruningSamples allows one to specify that the minPruning factor must be met by this many samples for a path to be considered good (e.g. seen twice in three samples). By default this is just one sample.
-- adding unit test to test this new functionality
-- When doing cross-species comparisons and studying population history or ancient DNA data, SOME measure of confidence that doesn't depend on the reference base is needed at every single site, even in a naive per-site SNP mode. Old versions of the GATK provided GQ and PL values at reference sites, but these were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, that, if enabled, will:
a) Emit all 3 ALT snp alleles in the ALT column.
b) Emit all corresponding 10 PL values.
It's up to the user to process these PL values downstream to make sense of them. Note that, in order to follow the VCF spec, the QUAL field of a reference call with non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields.
-- Tweaks and fixes to processing pipelines for Reich lab.
1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads.
2. The rank sum tests should not be using reduced reads (because they do not represent distinct observations).
3. Fixed a massive bug in the BaseQualityRankSumTest annotation! It was not using the base qualities but rather
the read likelihoods?!
Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values.
Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and
makes insignificant distributions look significant (so it can falsely cause the filtering of good sites).
Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.
-- The previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE. So if you had an A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field. This obviously would cause problems for indels, and so the tool was blowing up in this case. Now the filter field is set to BGL_SET_TO_MONOMORPHIC and the info field annotation OriginalAltAllele is set to A instead. This works in general with any type of allele.
-- Here's an example output line from the previous and current versions:
old: 20 64150 rs7274499 C . 3041.68 BGL_RM_WAS_A AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88
new: 20 64062 . G . 100.39 BGL_SET_TO_MONOMORPHIC AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15
-- update MD5s to reflect these changes
-- [delivers #50847721]
-- Now table looks like:
Name VariantType AssessmentType Count
variant SNPS TRUE_POSITIVE 1220
variant SNPS FALSE_POSITIVE 0
variant SNPS FALSE_NEGATIVE 1
variant SNPS TRUE_NEGATIVE 150
variant SNPS CALLED_NOT_IN_DB_AT_ALL 0
variant SNPS HET_CONCORDANCE 100.00
variant SNPS HOMVAR_CONCORDANCE 99.63
variant INDELS TRUE_POSITIVE 273
variant INDELS FALSE_POSITIVE 0
variant INDELS FALSE_NEGATIVE 15
variant INDELS TRUE_NEGATIVE 79
variant INDELS CALLED_NOT_IN_DB_AT_ALL 2
variant INDELS HET_CONCORDANCE 98.67
variant INDELS HOMVAR_CONCORDANCE 89.58
-- Rewrote / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B, but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics, which is the convention in the KB. Added lots of unit tests for these functions (from 0 previously); see the sketch after this list.
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:
20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6
Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
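A minimal sketch of the BEST_MATCH assignment described above (illustrative only; a generic allele type stands in for the real allele class, using java.util collections): keep each original allele that survives the subsetting, and fall back to the reference allele otherwise.

    static <A> List<A> bestMatchGenotype(final List<A> originalGT, final A refAllele, final Set<A> keptAlleles) {
        final List<A> newGT = new ArrayList<A>();
        for (final A allele : originalGT)
            newGT.add(keptAlleles.contains(allele) ? allele : refAllele);
        return newGT;   // A/B stays A/B when subsetting to {A,B}; becomes A/A when subsetting to {A,C}
    }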
-- Merging overlapping fragments turns out to be a bad idea. In the case where you can safely merge the reads you only gain a small amount of overlapping kmers, so the potential gains are relatively small. That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, where you artificially construct reads that look like the reference but may actually carry a larger true insertion w.r.t. the reference. Because this problem isn't limited to repetitive sequence, but in principle could occur in any sequence, it's just not safe to do this merging. Best to leave haplotype construction to the assembly graph.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs
and don't let those pass through.
Added tests for this implementation that check the exact SW results and that the correct edges are added.
The principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer, while correct sequence will be supported by many reads and so will look like common k-mers. So, the algorithm has 3 main steps:
1. K-mer graph buildup.
For each read in an active region, a map from k-mers to the number of times they have been seen is built.
2. Building correction map.
All "rare" k-mers that are sparse (by default, seen only once), get mapped to k-mers that are good (by default, seen at least 20 times but this is a CL argument), and that lie within a given Hamming distance (by default, =1). This map can be empty (i.e. k-mers can be uncorrectable).
3. Correction proposal
For each constituent k-mer of each read, if this k-mer is rare and maps to a good k-mer, get differing base positions in k-mer and add these to a list of corrections for each base in each read. Then, correct read at positions where correction proposal is unanimous and non-empty.
The algorithm defaults are chosen to be very stringent and conservative in the correction: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in read if all correction proposals are congruent.
By default, the algorithm is disabled but can be enabled in the HaplotypeCaller via the -readErrorCorrect CL option. However, at this point it's about 3x-10x more expensive, so it needs to be optimized if it's to be used.
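A minimal sketch of steps 1 and 2 above (illustrative names, not the actual error-corrector API; the quadratic search over k-mers is kept only for clarity):

    import java.util.*;

    final class KmerCorrectionSketch {
        // step 1 + step 2: count k-mers across all reads in the active region, then map each rare
        // k-mer to a trusted k-mer within Hamming distance 1; k-mers with no such neighbor stay uncorrectable.
        static Map<String, String> buildCorrectionMap(final List<String> reads, final int kmerSize,
                                                      final int maxRareCount, final int minTrustedCount) {
            final Map<String, Integer> counts = new HashMap<String, Integer>();
            for (final String read : reads)
                for (int i = 0; i + kmerSize <= read.length(); i++) {
                    final String kmer = read.substring(i, i + kmerSize);
                    final Integer seen = counts.get(kmer);
                    counts.put(kmer, seen == null ? 1 : seen + 1);
                }

            final Map<String, String> corrections = new HashMap<String, String>();
            for (final Map.Entry<String, Integer> rare : counts.entrySet()) {
                if (rare.getValue() > maxRareCount) continue;            // not rare, nothing to correct
                for (final Map.Entry<String, Integer> trusted : counts.entrySet()) {
                    if (trusted.getValue() >= minTrustedCount
                            && hammingDistance(rare.getKey(), trusted.getKey()) == 1) {
                        corrections.put(rare.getKey(), trusted.getKey());
                        break;
                    }
                }
            }
            return corrections;
        }

        private static int hammingDistance(final String a, final String b) {
            int d = 0;
            for (int i = 0; i < a.length(); i++)
                if (a.charAt(i) != b.charAt(i)) d++;
            return d;
        }
    }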
Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones.
This was the cause of at least one FN where there were Ns at a SNP position.
Problem:
The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers.
Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final
sequence graph. Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions.
Solution:
We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them. If (and only if) we
can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate
a graph without cycles or attempt too many iterations.
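As a rough illustration, the cycle check is just a depth-first search looking for back edges; if every default-kmer graph has a cycle, we retry with a larger kmer. This sketch is self-contained and illustrative, not the actual graph code:

    import java.util.*;

    final class CycleCheckSketch {
        // returns true if the directed graph (vertex -> outgoing vertices) contains a cycle
        static boolean hasCycle(final Map<Integer, List<Integer>> graph) {
            final Set<Integer> finished = new HashSet<Integer>();
            final Set<Integer> onStack = new HashSet<Integer>();
            for (final Integer v : graph.keySet())
                if (dfsFindsBackEdge(v, graph, finished, onStack)) return true;
            return false;
        }

        private static boolean dfsFindsBackEdge(final Integer v, final Map<Integer, List<Integer>> graph,
                                                final Set<Integer> finished, final Set<Integer> onStack) {
            if (onStack.contains(v)) return true;    // back edge: revisiting a vertex on the current path
            if (finished.contains(v)) return false;  // already fully explored, no cycle through here
            onStack.add(v);
            final List<Integer> outgoing = graph.get(v);
            if (outgoing != null)
                for (final Integer w : outgoing)
                    if (dfsFindsBackEdge(w, graph, finished, onStack)) return true;
            onStack.remove(v);
            finished.add(v);
            return false;
        }
    }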
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele. This matching algorithm determines how rsIDs are added back into the VariantContext we want to annotate, as well as determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as it's now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
The previous behavior was to process reads with N CIGAR operators as they are, despite the fact that many of the tools do not actually support that operator, so results became unpredictable.
Now, if there is some read with the N operator, the engine returns a user exception. The error message indicates what the problem is (including the offending read and mapping position) and gives a couple of alternatives that the user can take in order to move forward:
a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)
b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)
Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.
Implementation:
* Added filterReadsWithNCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class as a common parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
- Memoized MathUtils' cumulative binomial probability function (see the sketch after this list).
- Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
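A minimal memoization sketch (illustrative, not the MathUtils implementation), assuming a fixed success probability of 0.5 so results can be keyed on (n, k) alone:

    import java.util.HashMap;
    import java.util.Map;

    final class CumulativeBinomialSketch {
        private static final Map<Long, Double> CACHE = new HashMap<Long, Double>();

        // P(X <= k) for X ~ Binomial(n, 0.5), computed once per (n, k) and then reused
        static double cdf(final int n, final int k) {
            final long key = ((long) n << 32) | k;
            Double cached = CACHE.get(key);
            if (cached == null) {
                double sum = 0.0;
                for (int i = 0; i <= k; i++)
                    sum += Math.exp(logChoose(n, i) + n * Math.log(0.5));
                cached = sum;
                CACHE.put(key, cached);
            }
            return cached;
        }

        // log of the binomial coefficient C(n, k)
        private static double logChoose(final int n, final int k) {
            double result = 0.0;
            for (int i = 1; i <= k; i++)
                result += Math.log(n - k + i) - Math.log(i);
            return result;
        }
    }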
-- Created a new annotation DepthPerSampleHC that is by default on in the HaplotypeCaller
-- The depth for the HC is the sum of the informative alleles at this site. It's not perfect (as we cannot differentiate between reads that align over the event but aren't informative vs. those that aren't even close) but it's a pretty good proxy and it matches with the AD field (i.e., sum(AD) = DP).
-- Update MD5s
-- delivers [#48240601]
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n called < n potential), we could end up trimming the alleles down, which would result in a mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
-- Ultimately this was caused by overly aggressive merging of CommonSuffixMerger. In the case where you have this graph:
ACT [ref source] -> C
G -> ACT -> C
we would merge into
G -> ACT -> C
which would linearize into
GACTC
Causing us to add bases to the reference source node that couldn't be recovered. The solution was to ensure that CommonSuffixMerger only operates when all nodes to be merged aren't source nodes themselves.
-- Added a convenient argument to the haplotype caller (captureAssemblyFailureBAM) that will write out the exact reads that went into a failed assembly run to a BAM file (called AssemblyFailure.BAM). This can be used to rerun the haplotype caller to reproduce the exact error, which can be hard to do in regions of deep coverage, where the downsampler state determines the exact reads going into assembly and therefore running on a sub-interval may not reproduce the error
-- Did some misc. cleanup of code while debugging
-- [delivers #50917729]
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
-- The previous implementation attempted to be robust to this, but not all cases were handled properly. Added a helper function updateInde() that bounds the update to be within the range of the indel array, and cleaned up the logic of how the method works. The previous behavior was inconsistent across read fwd/rev strand, so that the indel cigars at the end of a read were put at the start of the read if the read was on the forward strand but not if it was on the reverse strand. Everything is now consistent, as can be seen in the symmetry of the unit tests:
tests.add(new Object[]{"1D3M", false, EventType.BASE_DELETION, new int[]{0,0,0}});
tests.add(new Object[]{"1M1D2M", false, EventType.BASE_DELETION, new int[]{1,0,0}});
tests.add(new Object[]{"2M1D1M", false, EventType.BASE_DELETION, new int[]{0,1,0}});
tests.add(new Object[]{"3M1D", false, EventType.BASE_DELETION, new int[]{0,0,1}});
tests.add(new Object[]{"1D3M", true, EventType.BASE_DELETION, new int[]{1,0,0}});
tests.add(new Object[]{"1M1D2M", true, EventType.BASE_DELETION, new int[]{0,1,0}});
tests.add(new Object[]{"2M1D1M", true, EventType.BASE_DELETION, new int[]{0,0,1}});
tests.add(new Object[]{"3M1D", true, EventType.BASE_DELETION, new int[]{0,0,0}});
tests.add(new Object[]{"4M1I", false, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
tests.add(new Object[]{"3M1I1M", false, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
tests.add(new Object[]{"2M1I2M", false, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
tests.add(new Object[]{"1M1I3M", false, EventType.BASE_INSERTION, new int[]{1,0,0,0,0}});
tests.add(new Object[]{"1I4M", false, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
tests.add(new Object[]{"4M1I", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
tests.add(new Object[]{"3M1I1M", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,1}});
tests.add(new Object[]{"2M1I2M", true, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
tests.add(new Object[]{"1M1I3M", true, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
tests.add(new Object[]{"1I4M", true, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
-- delivers #50445353
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not
clips should be ignored. Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
since low dcov values can result in problematic downsampling artifacts for locus-based
traversals.
-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
controls the number of reads per alignment start position, and even a dcov value of 1 might
be safe/desirable in some circumstances.
-Also reorganize the global downsampling defaults so that they are specified as annotations
to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
DownsamplingMethod class.
-The default downsampling settings have not been changed: they are still -dcov 1000
for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
-- Started by Mark. Finished up by Ryan.
-- GGA mode still respected the glm argument for the SNP and INDEL models, so you would silently fail to genotype indels at all if -glm INDEL wasn't provided; the sites would still be emitted, so you'd see records in the VCF but all the calls would be no-calls.
-- https://www.pivotaltracker.com/story/show/48924339 for more information
-- [resolves#48924339]
Problem
--------
Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided
Solution
--------
Make it NOT default to stdout
[Delivers #50386741]
BandedHMM
---------
-- An implementation of a linear runtime, linear memory usage, banded, logless PairHMM. Though about 50% faster than the current PairHMM, this implementation will be superseded by the GraphHMM when it becomes available. The implementation is being archived for future reference
Useful infrastructure changes
-----------------------------
-- Split PairHMM into an N2MemoryPairHMM that allows smarter implementations not to allocate the double[][] matrices if they don't want to, which was previously occurring in the base class PairHMM
-- Added functionality (controlled by a private static boolean) to write out likelihood call information to a file from inside LikelihoodCalculationEngine for use in unit or performance testing. Added an example of 100kb of data to private/testdata. Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
-- The previous version took a Collection<GATKSAMRecord> to remove and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf(), which searches through the list calling equals() on each element. The new version takes a set and uses an iterator on the list to remove() any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one (see the sketch below).
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
-- [Delivers #49876703]
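A minimal sketch of the new removal strategy (a generic type stands in for GATKSAMRecord): a single pass over the list with an iterator and O(1) set-membership tests, instead of ArrayList.removeAll().

    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;

    final class RemoveAllSketch {
        static <T> void removeAll(final List<T> reads, final Set<T> readsToRemove) {
            for (final Iterator<T> it = reads.iterator(); it.hasNext(); )
                if (readsToRemove.contains(it.next()))   // set lookup instead of list.indexOf()/equals() scans
                    it.remove();
        }
    }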
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!
Bug fixes and missing interval functionality for Diagnose Targets
While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred either unit tests OR a big comment stating that the code is untested (and thus broken, by Mark's rule).
It is with these caveats that I accept the pull request.
Problem
------
Diagnose Targets identifies holes in the coverage of a targeted experiment, but it only reports them and doesn't list the actual missing loci
Solution
------
This commit implements an optional intervals file output listing the exact loci that did not pass filters
Itemized changes
--------------
* Cache callable statuses (to avoid recalculation)
* Add functionality to output missing intervals
* Implement new tool to qualify the missing intervals (QualifyMissingIntervals) by GC content, size, type of missing coverage, and origin (coding sequence, intron, ...)
Problem
-------
When the interval had no reads, it was being sent to the VCF before the intervals that just got processed, therefore violating the sort order of the VCF.
Solution
--------
Use a linked hash map, and make the insertion and removal all happen in one place regardless of having reads or not. Since the input is ordered, the output has to be ordered as well.
Itemized changes
--------------
* Clean up code duplication in LocusStratification and SampleStratification
* Add number of uncovered sites and number of low covered sites to the VCF output.
* Add new VCF format fields
* Fix outputting multiple statuses when the threshold is 0 (the ratio must be GREATER THAN, not equal to, the threshold to get reported)
[fixes#48780333]
[fixes#48787311]
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made the HaplotypeCaller NanoScheduled, requiring misc. changes in the map/reduce types so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
-- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
-- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
* This version inherits from the original SW implementation so it can use the same matrix creation method.
* A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
right thing for indels at the edges of the alignments.
* Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
* Lots of systematic testing added for this implementation.
* NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
* bitset could legitimately be in an unfinished state but we were trying to access it without finalizing.
* added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode.
* tests were easiest to implement as integration tests (this was a really complicated case).
Problem
-------
The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from the max read length in the interval down to a kmer of 11, in increments of 6 bp). This need to build many kmer graphs arose because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference, and (2) needed shorter kmers to be sensitive to differences from the reference near the edges of reads, which would often be lost when there was a chain of longer kmers that started before and after the variant.
Solution
--------
The read threading assembler uses fixed kmer sizes; in this implementation, by default, two graphs with kmer sizes of 10 and 25. The algorithm operates as follows:
identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
    find a unique starting position of the sequence in the graph by matching to a unique kmer, or start a new source node if none exists
    for each base in the sequence from the starting vertex kmer:
        look at the existing outgoing nodes of the current vertex V. If the base in the sequence matches the suffix of an outgoing vertex N, thread the sequence to N, and continue
        If no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue
        If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue
This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well with even very short kmers.
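A minimal, self-contained sketch of the threading loop described above (illustrative names, not the actual ReadThreadingAssembler classes); edge multiplicities and per-sample bookkeeping are omitted, and a sequence with no unique kmer at all is simply skipped:

    import java.util.*;

    final class ReadThreadingSketch {
        static final class Vertex {
            final String kmer;
            final Map<Character, Vertex> outgoing = new HashMap<Character, Vertex>();  // next base -> vertex
            Vertex(final String kmer) { this.kmer = kmer; }
        }

        private final int kmerSize;
        private final Set<String> nonUniqueKmers;                  // kmers seen more than once among all sequences
        private final Map<String, Vertex> uniqueKmerIndex = new HashMap<String, Vertex>();

        ReadThreadingSketch(final int kmerSize, final Set<String> nonUniqueKmers) {
            this.kmerSize = kmerSize;
            this.nonUniqueKmers = nonUniqueKmers;
        }

        void thread(final String sequence) {
            // find a unique starting kmer to anchor this sequence in the graph
            int start = 0;
            while (start + kmerSize <= sequence.length()
                    && nonUniqueKmers.contains(sequence.substring(start, start + kmerSize)))
                start++;
            if (start + kmerSize > sequence.length()) return;
            Vertex current = getOrCreate(sequence.substring(start, start + kmerSize));

            // extend one base at a time from the starting vertex
            for (int i = start + kmerSize; i < sequence.length(); i++) {
                final char base = sequence.charAt(i);
                final String nextKmer = sequence.substring(i - kmerSize + 1, i + 1);
                Vertex next = current.outgoing.get(base);                // existing outgoing vertex with matching suffix?
                if (next == null) next = uniqueKmerIndex.get(nextKmer);  // otherwise merge into the unique vertex for this kmer
                if (next == null) next = create(nextKmer);               // otherwise create a new (possibly duplicate-kmer) vertex
                current.outgoing.put(base, next);
                current = next;
            }
        }

        private Vertex getOrCreate(final String kmer) {
            final Vertex existing = uniqueKmerIndex.get(kmer);
            return existing != null ? existing : create(kmer);
        }

        private Vertex create(final String kmer) {
            final Vertex v = new Vertex(kmer);
            if (!nonUniqueKmers.contains(kmer)) uniqueKmerIndex.put(kmer, v);  // only unique kmers are indexed for merging
            return v;
        }
    }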
This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes; the key changes are highlighted below:
-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path is < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that depends only on the sequence of bases in the path, unlike the previous approach, which depended on both the bases and the way the path was decomposed into vertices, which in turn depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample, up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation. The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 of evidence in favor of its 5-mismatch haplotype compared to the reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute is Q30. Controllable via a command line argument, defaulting to a Q60 rate (see the sketch after this list). Results from 20:10-11 mb for the branch are consistent with the previous behavior, but this does help in cases where you have rare, very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler
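A minimal sketch of one way such a cap can be applied to a read's per-haplotype log10 likelihoods, assuming it is implemented as a floor relative to the read's best haplotype likelihood (hypothetical helper, not the actual LikelihoodCalculationEngine code):

    // no haplotype (including the reference) may lose by more than rate/10 in log10 space for a single read
    static void capLog10Likelihoods(final double[] log10Likelihoods, final double phredScaledGlobalReadMismappingRate) {
        double best = Double.NEGATIVE_INFINITY;
        for (final double l : log10Likelihoods) best = Math.max(best, l);
        final double floor = best - phredScaledGlobalReadMismappingRate / 10.0;
        for (int i = 0; i < log10Likelihoods.length; i++)
            log10Likelihoods[i] = Math.max(log10Likelihoods[i], floor);
    }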
Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to implement only an assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37 and assembles synthetic reads over intervals from 10-11 mb, with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added a generic interface called, not surprisingly, SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
* This is emerging now because BWA-MEM produces lots of reads that are not primary alignments
* The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it
was getting confused by these secondary alignments
* Added unit test to cover this case
Only try to clip adaptors when both reads of the pair are on opposite strands
-- Read pairs that have unusual alignments, such as two reads both oriented like:
<-----
<-----
were previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs. This caused us to clip extra good bases from reads (see the sketch below).
-- Update MD5s due to the change in adaptor clipping, which adds some coverage in some places
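A sketch of the orientation guard for adaptor clipping, using the standard SAM record flag accessors (package name as in the samtools Java library of this era; the surrounding clipping code is omitted):

    import net.sf.samtools.SAMRecord;

    final class AdaptorClippingGuardSketch {
        // only attempt adaptor clipping when the insert-size calculation is meaningful,
        // i.e. both reads of the pair are mapped and on opposite strands
        static boolean pairIsProperlyOriented(final SAMRecord read) {
            return read.getReadPairedFlag()
                    && !read.getReadUnmappedFlag()
                    && !read.getMateUnmappedFlag()
                    && read.getReadNegativeStrandFlag() != read.getMateNegativeStrandFlag();
        }
    }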
Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.
- Updated MD5's after looking at diffs to verify that the change is what I expected.
-Changes in Java 7 related to comparators / sorting produce a large number
of innocuous differences in our test output. Updating expectations now
that we've moved to using Java 7 internally.
-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
intermittent failures.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly. Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.
The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly. Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).
Also:
1. counts are now maintained as ints whenever possible. Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible); see the sketch below.
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
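One common way to get the extended 0-255 range out of a signed Java byte while staying compatible with values written under the old 0-127 scheme is to mask on read; this is an illustrative sketch with hypothetical helper names:

    final class UnsignedCountSketch {
        // encode: values 0..255 fit in one byte (128..255 simply wrap to negative byte values)
        static byte encodeCount(final int count) {
            if (count < 0 || count > 255) throw new IllegalArgumentException("count out of range: " + count);
            return (byte) count;
        }

        // decode: masking restores the unsigned value; bytes written under the old 0..127 encoding decode unchanged
        static int decodeCount(final byte encoded) {
            return encoded & 0xFF;
        }
    }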
-- Added check to see if read spans beyond reference window MINUS padding and event length. This guarantees that read will always be contained in haplotype.
-- Changed md5's that happen when long reads from old 454 data have their likelihoods changed because of the extra base clipping.
-- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts. Updated the hardClip operation to handle reduce reads, and added a unit test to make sure this works properly. Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read
-- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously
Use case:
The default AF priors used (infinite sites model, neutral variation) are appropriate in the case where the reference allele is ancestral and the called allele is a derived allele.
Most of the time this is true, but in several population studies and in ancient DNA analyses this might introduce reference biases, and in some other cases it's hard to ascertain what the ancestral allele is (normally requiring looking up the homologous chimp sequence).
Specifying no prior is one solution, but this may introduce a lot of artifactual het calls in shallower coverage regions.
With this option, users can specify what the prior for each AC should be according to their needs, subject to the restrictions documented in the code and in GATK docs.
-- Updated ancient DNA single sample calling script with filtering options and other cleanups.
-- Added integration test. Removed old -noPrior syntax.
-Do not throw an exception when parsing snpEff output files
generated by not-officially-supported versions of snpEff,
PROVIDED that snpEff was run with -o gatk
-Requested by the snpEff author
-Relevant integration tests updated/expanded
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.
Added lots of unit tests for new functionality.
-- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences. If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference. Updated the code to first clip the adapter sequences from the incoming fragments
-- Update MD5s
1. Using the cumulative binomial probability was not working at high coverage sites (because p-values quickly
got out of hand), so instead we use a hybrid system for determining significance: at low coverage sites
use the binomial probability, and at high coverage sites revert to using the old base proportions (see the
sketch below). Then we get the best of both worlds. As a note, coverage refers to just the individual base counts and not the entire pileup.
2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had
the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We
now use a PriorityQueue instead of a TreeSet to allow for such cases.
3. Each consensus keeps track of its own number of softclipped bases. There was no reason that that number
should be shared between them.
4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now. Don't lose that
information. Maybe we'll decide to change this in the future, but for now we are conservative.
5. Also implemented various small performance optimizations based on profiling.
Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.
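A minimal sketch of the hybrid decision in (1); the thresholds, the coverage cutoff, and the binomialTailProbability() helper are hypothetical, not the actual SlidingWindow code:

    // significance test for a single base count within a consensus column
    static boolean isSignificant(final int baseCount, final int totalCount,
                                 final double pValueThreshold, final double minProportion,
                                 final int highCoverageCutoff) {
        if (totalCount <= highCoverageCutoff) {
            // low coverage: use the cumulative binomial probability
            // binomialTailProbability is a hypothetical helper: P(X >= baseCount) for X ~ Binomial(totalCount, 0.5)
            return binomialTailProbability(totalCount, baseCount) < pValueThreshold;
        }
        // high coverage: p-values degenerate, so fall back to the old base-proportion rule
        return baseCount / (double) totalCount >= minProportion;
    }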
Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has its own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric.
Metrics are generalized by the Metric interface. (for generic access)
Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)
-- In the case where there are no informative bases in a pileup but the pileup isn't empty (e.g. when all bases have Q < the min base quality), the GLs were still computed (but were all zeros) and fed to the exact model. Now we mimic the diploid GL computation, where GLs are only added if # good bases > 0
-- I believe the general case where only non-informative GLs are fed into the AF calc model is broken and yields a bogus QUAL; will investigate separately.
* Make most classes final, others package local
* Move to diagnostics.diagnosetargets package
* Aggregate statistics and walker classes on the same package for simplified visibility.
* Make status list a LinkedList instead of a HashSet
A plugin enabled implementation of DiagnoseTargets
Summarized Changes:
-------------------
* move argument collection into Thresholder object
* make thresholder object private member of all statistics classes
* rework the logic of the mate pairing thresholds
* update unit and integration tests to reflect the new behavior
* Implements Locus Statistic plugins
* Extend Locus Statistic plugins to determine sample status
* Export all common plugin functionality into utility class
* Update tests accordingly
[fixes#48465557]
* remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps.
* add gatkdocs to all the parameters
* clean up the logic on callable status a bit (still need to be re-worked into a plugin system)
* update integration tests
This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply.
[fixes#48442663]
Problem
-------
Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval
Solution
--------
Only allow low median coverage check if there are reads
[fixes#48442675]
Problem
-------
Diagnose targets was skipping intervals when they were not covered by any reads.
Solution
--------
Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of the BAM file.
Summarized Changes
------------------
* Outputs all intervals it iterates over, even if uncovered
* Outputs leftover intervals in the end of the traversal
* Updated integration tests
[fixes#47813825]
-- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases. In this case we'd eliminate the reference source vertex. Fixed by checking for this condition and aborting the simplification
-- Update MD5s, including minor improvements
-- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC. Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data
-- Reduce the min usable base qual to 8 by default in the HC. In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL. This is a bit too aggressive of a requirement, so I lowered it to 8.
-- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller
NA12878 MEM chr20:10-11
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
branch SNPS 1216 0 2 194 0
branch INDELS 312 2 13 71 7
master SNPS 1214 0 4 194 1
master INDELS 309 2 16 71 10
-- Update MD5s in the integration tests to reflect these two new changes
* Moved redundant code out of UGEngine
* Added overloaded methods that assume p=0.5 for speed efficiency
* Added unit test for the binomialCumulativeProbability method