gatk-3.8

Commit Graph

Author	SHA1	Message	Date
amilev	28a8d74290	Merge pull request #293 from broadinstitute/md_catvariants CatVariants accepts reference files ending in any standard extension	2013-06-19 08:36:58 -07:00
Mark DePristo	15171c07a8	CatVariants accepts reference files ending in any standard extension -- [resolves #49339235] Make CatVariants accept reference files ending in .fa (not only .fasta)	2013-06-19 11:10:36 -04:00
MauricioCarneiro	6a5502c94a	Merge pull request #289 from broadinstitute/md_fix_bq Bugfix: defaultBaseQualities actually works now	2013-06-18 11:58:39 -07:00
delangel	1c400e8f8e	Merge pull request #291 from broadinstitute/gda_new_hmm_in_ug Swapping in logless Pair HMM for default usage with UG:	2013-06-18 07:07:57 -07:00
Guillermo del Angel	f176c854c6	Swapping in logless Pair HMM for default usage with UG: -- Changed default HMM model. -- Removed check. -- Changed md5's: PL's in the high 100s change by a point or two due to new implementation. -- Resulting performance improvement is about 30 to 50% less runtime when using -glm INDEL.	2013-06-18 10:06:27 -04:00
Mark DePristo	4c482eb0f0	Merge pull request #290 from broadinstitute/rp_pruning_priority_queue Adding new pruning parameter to ReadThreadingAssembler	2013-06-17 17:16:00 -07:00
Ryan Poplin	8511c4385c	Adding new pruning parameter to ReadThreadingAssembler -- numPruningSamples allows one to specify that the minPruning factor must be met by this many samples for a path to be considered good (e.g. seen twice in three samples). By default this is just one sample. -- adding unit test to test this new functionality	2013-06-17 16:46:40 -04:00
delangel	a6a58cbc78	Merge pull request #288 from broadinstitute/gda_more_ancient_dna_fixes Feature requested by Reich lab and Paavo lab in Leipzig for ancient DNA ...	2013-06-17 13:04:21 -07:00
Mark DePristo	cb5b1c3c34	Create README.md	2013-06-17 16:03:45 -03:00
Mark DePristo	7b22467148	Bugfix: defaultBaseQualities actually works now -- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly. Added integration tests to ensure this continues to work. -- [delivers #49538319]	2013-06-17 14:37:27 -04:00
Guillermo del Angel	f6025d25ae	Feature requested by Reich lab and Paavo lab in Leipzig for ancient DNA processing: -- When doing cross-species comparisons and studying population history and ancient DNA data, having SOME measure of confidence is needed at every single site that doesn't depend on the reference base, even in a naive per-site SNP mode. Old versions of GATK provided GQ and some wrong PL values at reference sites but these were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, that, if enabled will: a) Emit all 3 ALT snp alleles in the ALT column. b) Emit all corresponding 10 PL values. It's up to the user to process these PL values downstream to make sense of these. Note that, in order to follow VCF spec, the QUAL field in a reference call when there are non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields. -- Tweaks and fixes to processing pipelines for Reich lab.	2013-06-17 13:21:09 -04:00
Mark DePristo	fce448cc9e	Merge pull request #287 from broadinstitute/md_gzip_vcf_nt Bugfix: allow gzip VCF output in multi-threaded GATK output	2013-06-17 09:39:37 -07:00
Mark DePristo	b69d210255	Bugfix: allow gzip VCF output in multi-threaded GATK output -- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is. Added integrationtest to ensure this behavior works going forward. -- [delivers #47399279]	2013-06-17 12:39:18 -04:00
delangel	485ceb1e12	Merge pull request #283 from broadinstitute/md_beagleoutput Simpler FILTER and info field encoding for BeagleOutputToVCF	2013-06-17 09:31:03 -07:00
Mark DePristo	5b1a472d2c	Merge pull request #286 from broadinstitute/eb_add_tiers_to_KBconsensus Added 2 new fields to the MongoVariantContext: confidence and isComplex.	2013-06-17 08:38:57 -07:00
Mark DePristo	ee78927bdb	Merge pull request #279 from broadinstitute/eb_make_rms_mq_work_with_rr Fixes to several of the annotations for reduced reads (and other issues)...	2013-06-16 09:48:19 -07:00
Eric Banks	e48f754478	Fixes to several of the annotations for reduced reads (and other issues). 1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads. 2. The rank sum tests should not be using reduced reads (because they do not represent distinct observations). 3. Fixed a massive bug in the BaseQualityRankSumTest annotation! It was not using the base qualities but rather the read likelihoods?! Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values. Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and makes insignificant distributions look significant (so it can falsely cause the filtering of good sites). Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.	2013-06-16 01:18:20 -04:00
Eric Banks	9ec71bba26	Added 2 new fields to the MongoVariantContext: confidence and isComplex. IsComplex will be used to designate calls as representing complex events which have multiple correct allele representations. Then call sets can get points for including them but will not get penalized for missing them (because they may have used a different representation). This is currently the biggest bane when trying to characterize FNs these days. The confidence will be used to refactor the consensus making algorithm for the truth status of the NA12878 KB. The previous version allowed for 2 tiers: reviews and everything else. But that is problematic when some of the input sets are of higher quality than others because when they disagree the calls become discordant and we lose that information. The new framework will allow each call to have its own associated confidence. Then when determining the consensus truth status we probabilistically calculate it from the various confidences, so that nothing is hard coded in anymore. Note that I added some unit tests to ensure the outcome that I expect for various scenarios and then implemented a very rough version of the estimator that successfully produced those outcomes. HOWEVER, THIS IS NOT COMPLETE AND NEITHER FUNCTIONALITY IS HOOKED UP AT ALL. Rather, this is an interim commit. The only goal here is to get these fields added to the MVC for the upcoming release so that Jacob (who prefers to work with stable) can add the necessary functionality to IGV for us.	2013-06-16 00:31:16 -04:00
droazen	4151753718	Merge pull request #285 from broadinstitute/dr_james_warren_fasta_suffix_bugfix deducing dictionary path should not use global find and replace	2013-06-14 16:57:10 -07:00
James Warren	f46f7d9b23	deducing dictionary path should not use global find and replace Signed-off-by: David Roazen <droazen@broadinstitute.org>	2013-06-14 19:15:27 -04:00
Mark DePristo	52677429a0	Merge pull request #284 from broadinstitute/dr_fewer_stranded_temp_files Reduce number of leftover temp files in GATK runs	2013-06-14 13:06:28 -07:00
Mark DePristo	1677a0a458	Simpler FILTER and info field encoding for BeagleOutputToVCF -- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE. So if you had an A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field. This obviously would cause problems for indels, and so the tool was blowing up in this case. Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead. This works in general with any type of allele. -- Here's an example output line from the previous and current versions: old: 20 64150 rs7274499 C . 3041.68 BGL_RM_WAS_A AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88 new: 20 64062 . G . 100.39 BGL_SET_TO_MONOMORPHIC AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15 -- update MD5s to reflect these changes -- [delivers #50847721]	2013-06-14 15:56:13 -04:00
David Roazen	d167292688	Reduce number of leftover temp files in GATK runs -WalkerTest now deletes .idx files on exit -ArtificialBAMBuilder now deletes .bai files on exit -VariantsToBinaryPed walker now deletes its temp files on exit	2013-06-14 15:56:03 -04:00
Mark DePristo	b72880cc94	Merge pull request #282 from broadinstitute/md_gatklogs_gitversions Use git hash to lookup versions when necessary in analyzeRunReports.py	2013-06-14 12:39:54 -07:00
Mark DePristo	20bb4902a3	Use git hash to lookup versions when necessary in analyzeRunReports.py	2013-06-14 15:31:25 -04:00
Mark DePristo	50ea098c11	Merge pull request #281 from broadinstitute/md_gatklogs Update utilities to get GATKRunReports	2013-06-14 10:00:16 -07:00
Ryan Poplin	c4e508a71f	Merge pull request #275 from broadinstitute/md_fragment_with_pcr Improvements to HaplotypeCaller and NA12878 KB	2013-06-14 09:32:26 -07:00
Mark DePristo	a057f37331	Update utilities to get GATKRunReports -- Critical bugfix: the GATK run reports magically changed names from something like GATK-run-report to GATKRunReport in GATK 2.4. All GATK logs from 2.4 onwards were being eaten by the scripts that download logs, so the GATK usage is actually much much higher than our logs have suggested. Looking forward to seeing some real numbers. Unfortunately the error occurred so early in the downloading process that we actually deleted away these logs, so they cannot be recovered -- Added a step in the downloader that archives the raw, unprocessed files so we can recover from such problems in the future -- The s3 download scripts now download to /local/dev/GATKLogs so will only work on gsa4, but this is ok as this is better than taking forever to get the logs to the isilon. -- Turn off some crazy debugging output from the downloader that was actually masking me from seeing the issue each night -- Make analyzeRunReports.py robust to svn version abominations -- Use python-2.6 in runGATKReport.csh	2013-06-14 10:17:32 -04:00
droazen	ac346a93ba	Merge pull request #278 from broadinstitute/md_gatk_version_in_vcf Emit the GATK version number in the VCF header	2013-06-13 13:22:20 -07:00
Mark DePristo	908183aba7	Merge pull request #277 from broadinstitute/dr_fix_com_sun_dependency Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only	2013-06-13 13:12:45 -07:00
David Roazen	f9c986be74	Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only Problem: Classes in com.sun.javadoc.* are non-standard. Since we can't depend on their availability for all users, the GATK proper should not have any runtime dependencies on this package. Solution: -Isolate com.sun.javadoc.* dependencies in a DocletUtils class for use only by doclets. The only users who need to run our doclets are those who compile from source, and they should be competent enough to figure out how to resolve a missing com.sun.* dependency. -HelpUtils now contains no com.sun.javadoc.* dependencies and can be safely used by walkers/other tools. -Added comments with instructions on when it is safe to use DocletUtils vs. HelpUtils [delivers #51450385] [delivers #50387199]	2013-06-13 15:52:41 -04:00
Mark DePristo	74f311c973	Emit the GATK version number in the VCF header -- Looks like ##GATKVersion=2.5-159-g3f91d93 in the VCF header line -- delivers [#51595305]	2013-06-13 15:46:16 -04:00
Mark DePristo	d93bed5d61	Merge pull request #276 from broadinstitute/md_gatkreport_cleanup Remove STANDARD option from GATKRunReport	2013-06-13 12:40:57 -07:00
Mark DePristo	6232db3157	Remove STANDARD option from GATKRunReport -- AWS is now the default. Removed old code that referred to the STANDARD type. Deleted unused variables and functions.	2013-06-13 15:18:28 -04:00
Mark DePristo	dd5674b3b8	Add genotyping accuracy assessment to AssessNA12878 -- Now table looks like: Name VariantType AssessmentType Count variant SNPS TRUE_POSITIVE 1220 variant SNPS FALSE_POSITIVE 0 variant SNPS FALSE_NEGATIVE 1 variant SNPS TRUE_NEGATIVE 150 variant SNPS CALLED_NOT_IN_DB_AT_ALL 0 variant SNPS HET_CONCORDANCE 100.00 variant SNPS HOMVAR_CONCORDANCE 99.63 variant INDELS TRUE_POSITIVE 273 variant INDELS FALSE_POSITIVE 0 variant INDELS FALSE_NEGATIVE 15 variant INDELS TRUE_NEGATIVE 79 variant INDELS CALLED_NOT_IN_DB_AT_ALL 2 variant INDELS HET_CONCORDANCE 98.67 variant INDELS HOMVAR_CONCORDANCE 89.58 -- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB. Add lots of unit tests for this function (from 0 previously) -- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record: 20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6 Indicating that the call was a HET but the expected result was HOM_VAR -- Forbid subsetting of diploid genotypes to just a single allele. -- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG	2013-06-13 15:05:32 -04:00
Mark DePristo	33720b83eb	No longer merge overlapping fragments from HaplotypeCaller -- Merging overlapping fragments turns out to be a bad idea. In the case where you can safely merge the reads you only gain a small amount of overlapping kmers, so the potential gains are relatively small. That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, and you artificially construct reads that look like the reference but actually may carry a larger true insertion w.r.t. the reference. Because this problem isn't limited to repetitive sequence, but in principle could occur in any sequence, it's just not safe to do this merging. Best to leave haplotype construction to the assembly graph.	2013-06-13 15:05:32 -04:00
droazen	fb5143a590	Merge pull request #274 from broadinstitute/md_s3_only GATKRunReport no longer tries to use the Broad filesystem destination, r...	2013-06-13 11:32:31 -07:00
Mark DePristo	dd6e252373	GATKRunReport no longer tries to use the Broad filesystem destination, rather it goes unconditionally to S3	2013-06-13 13:33:10 -04:00
Mark DePristo	c837d67b2f	Merge pull request #273 from broadinstitute/rp_readIsPoorlyModelled Relaxing the constraints on the readIsPoorlyModelled function.	2013-06-13 08:40:24 -07:00
Mark DePristo	2833325d31	Merge pull request #272 from broadinstitute/rp_hc_bam_writer_uninformative_reads HC bam writer now sets the read to MQ0 if it isn't informative	2013-06-13 08:08:45 -07:00
Ryan Poplin	f44efc27ae	Relaxing the constraints on the readIsPoorlyModelled function. -- Turns out we were aggressively throwing out borderline-good reads.	2013-06-13 11:06:23 -04:00
Ryan Poplin	d5f0848bd5	HC bam writer now sets the read to MQ0 if it isn't informative -- Makes visualization of read evidence easier in IGV.	2013-06-13 10:11:54 -04:00
Eric Banks	17d3ccb03b	Merge pull request #270 from broadinstitute/rp_reference_haplotype_mismatch_bug Fixing bug with dangling tails in which the tail connects all the way ba...	2013-06-12 11:03:48 -07:00
Ryan Poplin	d1f397c711	Fixing bug with dangling tails in which the tail connects all the way back to the reference source node. -- List of vertices can't contain a source node.	2013-06-12 12:23:01 -04:00
Mark DePristo	b2dc7095ab	Merge pull request #267 from broadinstitute/dr_reducereads_downsampling_fix Exclude reduced reads from elimination during downsampling	2013-06-11 13:52:28 -07:00
David Roazen	95b5f99feb	Exclude reduced reads from elimination during downsampling Problem: -Downsamplers were treating reduced reads the same as normal reads, with occasionally catastrophic results on variant calling when an entire reduced read happened to get eliminated. Solution: -Since reduced reads lack the information we need to do position-based downsampling on them, best available option for now is to simply exempt all reduced reads from elimination during downsampling. Details: -Add generic capability of exempting items from elimination to the Downsampler interface via new doNotDiscardItem() method. Default inherited version of this method exempts all reduced reads (or objects encapsulating reduced reads) from elimination. -Switch from interfaces to abstract classes to facilitate this change, and do some minor refactoring of the Downsampler interface (push implementation of some methods into the abstract classes, improve names of the confusing clear() and reset() methods). -Rewrite TAROrderedReadCache. This class was incorrectly relying on the ReservoirDownsampler to preserve the relative ordering of items in some circumstances, which was behavior not guaranteed by the API and only happened to work due to implementation details which no longer apply. Restructured this class around the assumption that the ReservoirDownsampler will not preserve relative ordering at all. -Add disclaimer to description of -dcov argument explaining that coverage targets are approximate goals that will not always be precisely met. -Unit tests for all individual downsamplers to verify that reduced reads are exempted from elimination	2013-06-11 16:16:26 -04:00
Ryan Poplin	e1fd3dff9a	Merge pull request #268 from broadinstitute/eb_calling_accuracy_improvements_to_HC Eb calling accuracy improvements to hc	2013-06-11 11:18:51 -07:00
Eric Banks	b63cbd8cc9	Merge pull request #266 from broadinstitute/gda_read_error_correction_new Gda read error correction new	2013-06-11 10:42:06 -07:00
Eric Banks	2c3c680eb7	Misc changes and cleanup from all previous commits in this push. 1. By default, do not include the UG CEU callset for assessment. 2. Updated md5s that are different now with all the HC changes.	2013-06-11 12:53:11 -04:00
Eric Banks	dadcfe296d	Reworking of the dangling tails merging code. We now run Smith-Waterman on the dangling tail against the corresponding reference tail. If we can generate a reasonable, low entropy alignment then we trigger the merge to the reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs and don't let those pass through. Added tests for this implementation that checks exact SW results and correct edges added.	2013-06-11 12:53:04 -04:00
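
Commit f46f7d9b23 in the log above says that deducing the dictionary path should not use a global find and replace, and 15171c07a8 mentions reference files ending in .fa as well as .fasta. The following is a minimal Python sketch of that idea, not the GATK code itself: only the trailing extension is swapped for .dict. The function names and the extension list are assumptions made for illustration.

```python
# Hypothetical extension list; only .fa and .fasta are named in the log above.
FASTA_EXTENSIONS = (".fasta", ".fa")


def dict_path_for(reference_path):
    """Return the dictionary path by replacing only the final extension."""
    for ext in FASTA_EXTENSIONS:
        if reference_path.endswith(ext):
            return reference_path[: -len(ext)] + ".dict"
    raise ValueError("unrecognized reference extension: " + reference_path)


def dict_path_global_replace(reference_path):
    """Fragile variant: replaces every occurrence, including directory names."""
    return reference_path.replace(".fasta", ".dict")


if __name__ == "__main__":
    path = "/data/hg19.fasta.d/ref.fasta"
    print(dict_path_for(path))             # suffix-only replacement
    print(dict_path_global_replace(path))  # also rewrites the directory name
```

With a path whose directory name happens to contain ".fasta", the global replacement rewrites the directory as well, while the suffix-only version touches just the file extension.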
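
Commit f6025d25ae in the log above states that -allSitePLs emits all 3 ALT SNP alleles and the corresponding 10 PL values. The count follows from the number of unordered diploid genotypes over 4 alleles (REF plus 3 ALT), n(n+1)/2 = 10. The sketch below illustrates that arithmetic and the genotype ordering defined by the VCF specification, where genotype j/k with j <= k sits at index k(k+1)/2 + j. The function names are hypothetical and this is not taken from the UnifiedGenotyper source.

```python
def num_diploid_genotypes(num_alleles):
    """Number of unordered diploid genotypes over num_alleles alleles."""
    return num_alleles * (num_alleles + 1) // 2


def pl_index(j, k):
    """Index of genotype j/k in the PL array, per the VCF ordering rule."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def genotype_order(alleles):
    """List genotypes in PL order; alleles[0] is taken to be the reference."""
    n = len(alleles)
    ordered = [None] * num_diploid_genotypes(n)
    for k in range(n):
        for j in range(k + 1):
            ordered[pl_index(j, k)] = alleles[j] + "/" + alleles[k]
    return ordered


if __name__ == "__main__":
    # Reference base A with the three alternate SNP alleles C, G, T.
    print(num_diploid_genotypes(4))
    print(genotype_order(["A", "C", "G", "T"]))
```

Downstream tools that consume these PL values can use the same index rule to look up the likelihood for a particular genotype.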

12498 Commits (28a8d742903dfe3c20d0d854a6168af8d96edc96)
