gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	fdfe4e41d5	Better GATK version and command line output -- Previous version emitted command lines that look like: ##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..." the new version provides additional information on when the GATK was run and the GATK version in a nicer format: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ..."> -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff"> ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff"> -- Removed the ProtectedEngineFeaturesIntegrationTest -- Actual unit tests for these features!	2013-06-20 11:19:13 -04:00
Mark DePristo	0672ac5032	Fix public / protected dependency	2013-06-19 19:42:09 -04:00
Valentin Ruano-Rubio	1f8282633b	Removed plots generation from the BaseRecalibration software Improved AnalyzeCovariates (AC) integration test. Renamed AC test files ending with .grp to .table Implementation: * Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly. * Fixed BQSRIntegrationTest to work with new changes	2013-06-19 14:47:56 -04:00
Valentin Ruano-Rubio	08f92bb6f9	Added AnalyzeCovariates tool to generate BQSR assessment quality plots. Implementation details: * Added tool class .AnalyzeCovariates Added convenient addAll method to Utils to be able to add elements of an array. * Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple input recalibration reports are compatible and comparable. * Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output. * Added an integration test class.	2013-06-19 14:38:02 -04:00
Mark DePristo	fb114e34fe	Merge pull request #295 from broadinstitute/dr_remove_PrintReads_ds_argument PrintReads: remove -ds argument	2013-06-19 10:55:10 -07:00
droazen	573ecadecc	Merge pull request #294 from broadinstitute/dr_handle_zero_length_cigar_elements SAMDataSource: always consolidate cigar strings into canonical form	2013-06-19 10:32:22 -07:00
David Roazen	51ec5404d4	SAMDataSource: always consolidate cigar strings into canonical form -Collapses zero-length and repeated cigar elements, neither of which can necessarily be handled correctly by downstream code (like LIBS). -Consolidation is done before read filters, because not all read filters behave correctly with non-consolidated cigars. -Examined other uses of consolidateCigar() throughout the GATK, and found them to not be redundant with the new engine-level consolidation (they're all on artificially-created cigars in the HaplotypeCaller and SmithWaterman classes) -Improved comments in SAMDataSource.applyDecoratingIterators() -Updated MD5s; differences were examined and found to be innocuous -Two tests: -Unit test for ReadFormattingIterator -Integration test for correct handling of zero-length cigar elements by the GATK engine as a whole	2013-06-19 13:29:01 -04:00
David Roazen	23ee192d5e	PrintReads: remove -ds argument -This argument was completely redundant with the engine-level -dfrac argument. -Could produce unintended consequences if used in conjunction with engine-level downsampling arguments.	2013-06-19 13:22:44 -04:00
David Roazen	0be788f0f9	Fix typo in snpEff documentation	2013-06-19 13:15:24 -04:00
Chris Hartl	af275fdf10	Extend the documentation of GenotypeConcordance to include notes about Monomorphic and Filtered VCF records. Address Geraldine's comments - information on moltenization and explanation of fields Fix paren	2013-06-19 12:01:58 -04:00
Mark DePristo	15171c07a8	CatVariants accepts reference files ending in any standard extension -- [resolves #49339235] Make CatVariants accept reference files ending in .fa (not only .fasta)	2013-06-19 11:10:36 -04:00
Mark DePristo	7b22467148	Bugfix: defaultBaseQualities actually works now -- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly. Added integration tests to ensure this continues to work. -- [delivers #49538319]	2013-06-17 14:37:27 -04:00
Mark DePristo	b69d210255	Bugfix: allow gzip VCF output in multi-threaded GATK output -- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is. Added integrationtest to ensure this behavior works going forward. -- [delivers #47399279]	2013-06-17 12:39:18 -04:00
delangel	485ceb1e12	Merge pull request #283 from broadinstitute/md_beagleoutput Simpler FILTER and info field encoding for BeagleOutputToVCF	2013-06-17 09:31:03 -07:00
James Warren	f46f7d9b23	deducing dictionary path should not use global find and replace Signed-off-by: David Roazen <droazen@broadinstitute.org>	2013-06-14 19:15:27 -04:00
Mark DePristo	1677a0a458	Simpler FILTER and info field encoding for BeagleOutputToVCF -- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE. So if you had an A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field. This obviously would cause problems for indels, and so the tool was blowing up in this case. Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead. This works in general with any type of allele. -- Here's an example output line from the previous and current versions: old: 20 64150 rs7274499 C . 3041.68 BGL_RM_WAS_A AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88 new: 20 64062 . G . 100.39 BGL_SET_TO_MONOMORPHIC AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15 -- update MD5s to reflect these changes -- [delivers #50847721]	2013-06-14 15:56:13 -04:00
David Roazen	d167292688	Reduce number of leftover temp files in GATK runs -WalkerTest now deletes .idx files on exit -ArtificialBAMBuilder now deletes .bai files on exit -VariantsToBinaryPed walker now deletes its temp files on exit	2013-06-14 15:56:03 -04:00
Ryan Poplin	c4e508a71f	Merge pull request #275 from broadinstitute/md_fragment_with_pcr Improvements to HaplotypeCaller and NA12878 KB	2013-06-14 09:32:26 -07:00
droazen	ac346a93ba	Merge pull request #278 from broadinstitute/md_gatk_version_in_vcf Emit the GATK version number in the VCF header	2013-06-13 13:22:20 -07:00
Mark DePristo	908183aba7	Merge pull request #277 from broadinstitute/dr_fix_com_sun_dependency Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only	2013-06-13 13:12:45 -07:00
David Roazen	f9c986be74	Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only Problem: Classes in com.sun.javadoc.* are non-standard. Since we can't depend on their availability for all users, the GATK proper should not have any runtime dependencies on this package. Solution: -Isolate com.sun.javadoc.* dependencies in a DocletUtils class for use only by doclets. The only users who need to run our doclets are those who compile from source, and they should be competent enough to figure out how to resolve a missing com.sun.* dependency. -HelpUtils now contains no com.sun.javadoc.* dependencies and can be safely used by walkers/other tools. -Added comments with instructions on when it is safe to use DocletUtils vs. HelpUtils [delivers #51450385] [delivers #50387199]	2013-06-13 15:52:41 -04:00
Mark DePristo	74f311c973	Emit the GATK version number in the VCF header -- Looks like ##GATKVersion=2.5-159-g3f91d93 in the VCF header line -- delivers [#51595305]	2013-06-13 15:46:16 -04:00
Mark DePristo	6232db3157	Remove STANDARD option from GATKRunReport -- AWS is now the default. Removed old code that referred to the STANDARD type. Deleted unused variables and functions.	2013-06-13 15:18:28 -04:00
Mark DePristo	dd5674b3b8	Add genotyping accuracy assessment to AssessNA12878 -- Now table looks like: Name VariantType AssessmentType Count variant SNPS TRUE_POSITIVE 1220 variant SNPS FALSE_POSITIVE 0 variant SNPS FALSE_NEGATIVE 1 variant SNPS TRUE_NEGATIVE 150 variant SNPS CALLED_NOT_IN_DB_AT_ALL 0 variant SNPS HET_CONCORDANCE 100.00 variant SNPS HOMVAR_CONCORDANCE 99.63 variant INDELS TRUE_POSITIVE 273 variant INDELS FALSE_POSITIVE 0 variant INDELS FALSE_NEGATIVE 15 variant INDELS TRUE_NEGATIVE 79 variant INDELS CALLED_NOT_IN_DB_AT_ALL 2 variant INDELS HET_CONCORDANCE 98.67 variant INDELS HOMVAR_CONCORDANCE 89.58 -- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB. Add lots of unit tests for these functions (from 0 previously) -- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record: 20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6 Indicating that the call was a HET but the expected result was HOM_VAR -- Forbid subsetting of diploid genotypes to just a single allele. -- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG	2013-06-13 15:05:32 -04:00
Mark DePristo	dd6e252373	GATKRunReport no longer tries to use the Broad filesystem destination, rather it goes unconditionally to S3	2013-06-13 13:33:10 -04:00
Ryan Poplin	f44efc27ae	Relaxing the constraints on the readIsPoorlyModelled function. -- Turns out we were aggressively throwing out borderline-good reads.	2013-06-13 11:06:23 -04:00
Mark DePristo	b2dc7095ab	Merge pull request #267 from broadinstitute/dr_reducereads_downsampling_fix Exclude reduced reads from elimination during downsampling	2013-06-11 13:52:28 -07:00
David Roazen	95b5f99feb	Exclude reduced reads from elimination during downsampling Problem: -Downsamplers were treating reduced reads the same as normal reads, with occasionally catastrophic results on variant calling when an entire reduced read happened to get eliminated. Solution: -Since reduced reads lack the information we need to do position-based downsampling on them, best available option for now is to simply exempt all reduced reads from elimination during downsampling. Details: -Add generic capability of exempting items from elimination to the Downsampler interface via new doNotDiscardItem() method. Default inherited version of this method exempts all reduced reads (or objects encapsulating reduced reads) from elimination. -Switch from interfaces to abstract classes to facilitate this change, and do some minor refactoring of the Downsampler interface (push implementation of some methods into the abstract classes, improve names of the confusing clear() and reset() methods). -Rewrite TAROrderedReadCache. This class was incorrectly relying on the ReservoirDownsampler to preserve the relative ordering of items in some circumstances, which was behavior not guaranteed by the API and only happened to work due to implementation details which no longer apply. Restructured this class around the assumption that the ReservoirDownsampler will not preserve relative ordering at all. -Add disclaimer to description of -dcov argument explaining that coverage targets are approximate goals that will not always be precisely met. -Unit tests for all individual downsamplers to verify that reduced reads are exempted from elimination	2013-06-11 16:16:26 -04:00
Eric Banks	dadcfe296d	Reworking of the dangling tails merging code. We now run Smith-Waterman on the dangling tail against the corresponding reference tail. If we can generate a reasonable, low entropy alignment then we trigger the merge to the reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs and don't let those pass through. Added tests for this implementation that checks exact SW results and correct edges added.	2013-06-11 12:53:04 -04:00
Mark DePristo	1c03ebc82d	Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP -- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure -- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map -- Cleanup misc. code as it came up -- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC	2013-06-10 16:20:31 -04:00
Mark DePristo	0d593cff70	Refactor rsID and overlap detection in VariantOverlapAnnotator utility class -- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele. This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field. -- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation -- Added unit tests for this VariantOverlapAnnotator class -- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator -- Now requires strict allele matching, without any option to just use site annotation.	2013-06-10 15:51:13 -04:00
Valentin Ruano-Rubio	96073c3058	This commit addresses JIRA issue GSA-948: Prevent users from doing the wrong thing with RNA-Seq data and the GATK. The previous behavior was to process reads with N CIGAR operators as they are, even though many of the tools do not actually support this operator and results become unpredictable. Now if there is some read with the N operator, the engine returns a user exception. The error message indicates what the problem is (including the offending read and mapping position) and gives a couple of alternatives that the user can take in order to move forward: a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC) b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL) Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring. Implementation: * Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work. * Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants. * Added ReadFilterTest class as a common parent for ReadFilter test cases. * Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class. * Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality. * Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used. * Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used. * Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior. * Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior	2013-06-10 10:44:42 -04:00
Michael McCowan	00c06e9e52	Performance improvements: - Memoized MathUtil's cumulative binomial probability function. - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.	2013-06-09 11:26:52 -04:00
Mark DePristo	34bdf20132	Bugfix for bad AD values in UG/HC -- In the case where we have multiple potential alternative alleles and we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles. -- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future. -- delivers [#50897077]	2013-06-05 17:48:41 -04:00
Mark DePristo	e19c24f3ee	Bugfix for HaplotypeCaller error: Only one of refStart or refStop must be < 0, not both -- This occurred because we were reverting reads with soft clips that would produce reads with negative (or 0) alignment starts. From such reads we could end up with adaptor starts that were negative and that would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which would revert and adaptor clip reads). -- We now hard clip away reverted soft clipped bases that fall before the 1-based contig start in revertSoftClippedBases. -- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString) -- Added unit tests for reverting soft clipped bases that create a read before the contig -- [delivers #50892431]	2013-06-04 10:33:46 -04:00
Ryan Poplin	ab40f4af43	Break out the GGA kmers and the read kmers into separate functions for the DeBruijn assembler. -- Added unit test for new function.	2013-06-03 14:00:35 -04:00
Mark DePristo	6555361742	Fix error in merging code in HC -- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested. -- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases. -- delivers #50655421	2013-05-31 16:29:29 -04:00
Mark DePristo	4b206a3540	Check that -compress arguments are within range 0-9 -- Although the original bug report was about SplitSamFile it actually was an engine wide error. The two places that provide compression to the BAM write now check the validity of the compress argument via a static method in ReadUtils -- delivers #49531009	2013-05-31 15:29:02 -04:00
Eric Banks	a96f48bc39	Merge pull request #249 from broadinstitute/rp_hc_gga_mode New implementation of the GGA mode in the HaplotypeCaller	2013-05-31 10:54:50 -07:00
droazen	a665d759cd	Merge pull request #251 from broadinstitute/md_mapq_reassign Command-line read filters are now applied before Walker default filters	2013-05-31 09:05:24 -07:00
Ryan Poplin	b5b9d745a7	New implementation of the GGA mode in the HaplotypeCaller -- We now inject the given alleles into the reference haplotype and add them to the graph. -- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode. -- This unifies how Smith-Waterman is performed between discovery and GGA modes. -- Misc minor cleanup in several places.	2013-05-31 10:35:36 -04:00
Chris Hartl	199476eae1	Three squashed commits: 1) Add in checks for input parameters in MathUtils method. I was careful to use the bottom-level methods whenever possible, so that parameters don't needlessly go through multiple checks (so for instance, the parameters n and k for a binomial aren't checked on log10binomial, but rather in the log10binomialcoefficient subroutine). This addresses JIRA GSA-767 Unit tests pass (we'll let bamboo deal with the integrations) 2) Address reviewer comments (change UserExceptions to IllegalArgumentExceptions). 3) .isWellFormedDouble() tests for infinity and not strictly positive infinity. Allow negative-infinity values for log10sumlog10 (as these just correspond to p=0). After these commits, unit and integration tests now pass, and GSA-767 is done. rebase and fix conflict: public/java/src/org/broadinstitute/sting/utils/MathUtils.java	2013-05-31 00:26:50 -04:00
Mark DePristo	b16de45ce4	Command-line read filters are now applied before Walker default filters -- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 before the BQSR filters them out with MappingQualityUnassignedFilter. -- delivers #50222251	2013-05-30 16:54:18 -04:00
Ryan Poplin	61af37d0d2	Create a new normalDistributionLog10 function that is unit tested for use in the VQSR.	2013-05-30 16:00:08 -04:00
Mark DePristo	56b14be4bc	Merge pull request #247 from broadinstitute/eb_fix_RR_negative_header_problem Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads.	2013-05-29 18:10:19 -07:00
Eric Banks	a5a68c09fa	Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads. The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not clips should be ignored. Also rebased so that return type is no longer a Pair. Added unit test to cover this situation.	2013-05-29 16:41:01 -04:00
David Roazen	eb206e9f71	Fix confusing log output from the engine -ReadShardBalancer was printing out an extra "Loading BAM index data for next contig" message at traversal end, which was confusing users and making the GATK look stupid. Suppress the extraneous message, and reword the log messages to be less confusing. -Improve log message output when initializing the shard iterator in GenomeAnalysisEngine. Don't mention BAMs when there are none, and say "Preparing for traversal" rather than mentioning the meaningless-for-users concept of "shard strategy" -These log messages are needed because the operations they surround might take a while under some circumstances, and the user should know that the GATK is actively doing something rather than being hung.	2013-05-29 16:17:04 -04:00
David Roazen	a7cb599945	Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage -Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200, since low dcov values can result in problematic downsampling artifacts for locus-based traversals. -Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals controls the number of reads per alignment start position, and even a dcov value of 1 might be safe/desirable in some circumstances. -Also reorganize the global downsampling defaults so that they are specified as annotations to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the DownsamplingMethod class. -The default downsampling settings have not been changed: they are still -dcov 1000 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.	2013-05-29 12:07:12 -04:00
Mark DePristo	d167743852	Archived banded logless PairHMM BandedHMM --------- -- An implementation of a linear runtime, linear memory usage banded logless PairHMM. Though about 50% faster than current PairHMM, this implementation will be superseded by the GraphHMM when it becomes available. The implementation is being archived for future reference Useful infrastructure changes ----------------------------- -- Split PairHMM into a N2MemoryPairHMM that allows smarter implementation to not allocate the double[][] matrices if they don't want, which was previously occurring in the base class PairHMM -- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for using in unit or performance testing. Added example of 100kb of data to private/testdata. Can be easily read in with the PairHMMTestData class. -- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.	2013-05-22 12:24:00 -04:00
delangel	925232b0fc	Merge pull request #236 from broadinstitute/md_simple_hc_performance_improvements 3 simple performance improvements for HaplotypeCaller	2013-05-22 07:58:28 -07:00
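
Commit 51ec5404d4 above describes consolidating cigar strings into canonical form by collapsing zero-length and repeated cigar elements. The idea can be illustrated with a minimal sketch. This is not the GATK's own consolidateCigar() code (GATK is written in Java); the function name consolidate_cigar and the choice of Python are assumptions made purely for illustration, and the order of the two steps (drop zero-length elements first, then merge neighbours) is one reasonable reading of the commit message.

```python
import re
from itertools import groupby

# One cigar element: a length followed by a SAM operator character.
_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def consolidate_cigar(cigar: str) -> str:
    """Hypothetical helper: return a canonical form of a cigar string.

    Zero-length elements are dropped first, then adjacent elements that
    share an operator are merged into a single element.
    """
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid cigar string: {cigar!r}")

    elements = [(int(length), op) for length, op in _ELEMENT.findall(cigar)]

    # Step 1: remove zero-length elements such as "0I".
    nonzero = [(length, op) for length, op in elements if length > 0]

    # Step 2: merge runs of adjacent elements with the same operator.
    merged = [
        (sum(length for length, _ in run), op)
        for op, run in groupby(nonzero, key=lambda element: element[1])
    ]
    return "".join(f"{length}{op}" for length, op in merged)
```

Dropping the zero-length elements before merging matters: in a cigar such as 10M0I5M the two M elements only become adjacent once the empty insertion is removed, so the result is 15M.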


3621 Commits (fdfe4e41d5d8c92fad74f56e654992f3a97ab602)