gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Geraldine Van der Auwera	19a4bf9ff0	made AR an Advanced argument to discourage basic users from fiddling with it	2013-08-14 14:46:56 -04:00
Geraldine Van der Auwera	a09831489b	Disabled emission of doc URLs for external codecs to avoid broken links	2013-08-10 10:04:04 -07:00
Geraldine Van der Auwera	4d20c71e09	Improvements to various gatkdocs - Make -rod required - Document that contaminationFile is currently not functional with HC - Document liftover process more clearly - Document VariantEval combinations of ST and VE that are incompatible - Added a caveat about using MVLR from HC and UG. - Added caveat about not using -mte with -nt - Clarified masking options - Fixed docs based on Eric's comments	2013-08-10 10:01:31 -07:00
Mark DePristo	7aba5a2f9f	Several improvements to AssessNA12878 and KB -- Bugfix for BAMs containing reads without real (M,I,D,N) operators. Simply needed to set validation stringency to SILENT in the read. Added a BadCigar filter to the SAMRecord stream anyway -- Add capture all sites mode to AssessNA12878: will write all sites to the badSites VCF, regardless of whether they are bad. It's useful if you essentially want to annotate a VCF with KB information for later analysis, such as computing ROC curves -- Add ignore filters mode to AssessNA12878: will as expected treat all sites in the input VCF calls as PASS, even if the site has a FILTER field setting -- Add minPNonRef argument to AssessNA12878: this will consider a site not called even if the NA12878 genotype is not 0/0 if the PLs are present and the PL for 0/0 isn't greater than this value. It allows us to easily differentiate low confidence non-ref sites obtained via multi-sample calling from highly confident non-ref calls that might be real TP or FPs	2013-08-07 08:08:37 -04:00
Mauricio Carneiro	285ab2ac62	Better caching for the HaplotypeCaller Problem ------- Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless. Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will cluster haplotypes of different lengths together, therefore wasting tons of re-compute. Solution ------- Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.	2013-08-02 01:27:29 -04:00
Yossi Farjoun	284176cd7b	moved SnpEffUtilUnitTest to public tree	2013-07-30 17:51:40 -04:00
droazen	b8709b1942	Merge pull request #332 from broadinstitute/st_fpga_hmm FPGA support for PairHMM	2013-07-30 14:21:21 -07:00
Joseph Rose	d2860a5486	Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose)	2013-07-30 15:38:32 -04:00
Chris Hartl	464a5b229d	Add <pre> tags to the Genotype Concordance docs. Tables were not being displayed properly.	2013-07-29 15:48:17 -07:00
Geraldine Van der Auwera	3063d82797	Fixed example in CallableLoci gatkdoc	2013-07-26 15:51:31 -04:00
Geraldine Van der Auwera	fc4a8b1dd0	Fixed example in DoC gatkdoc	2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera	660b075900	Added deprecation notice for SomaticIndelDetector	2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera	5ad99c362d	Added caveat to gatkdocs for MAPQ read transformers & cleaned up AB annotation gatkdocs	2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera	0ea3f8ca58	Added function to gatkdocs to specify what VCF field an annotation goes in (INFO or FORMAT)	2013-07-26 15:51:30 -04:00
Ryan Poplin	8c205dda1b	Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line.	2013-07-26 10:22:43 -04:00
Louis Bergelson	7c43b5f26a	Adding LibraryReadFilter. --Moving LibraryReadFilter which has been part of Mutect into gatk public. --Added an additional check for null values.	2013-07-26 09:32:14 -04:00
Mauricio Carneiro	31ab0824b1	quick indentation fixes to FPGA code	2013-07-24 14:09:49 -04:00
Eric Banks	6df43f730a	Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes. Having them as bytes caused problems for downstream programmers who had data with high MQs.	2013-07-23 23:47:15 -04:00
David Roazen	605a5ac2e3	GATK engine: add ability to do on-the-fly BAM file sample renaming at runtime -User must provide a mapping file via new --sample_rename_mapping_file argument. Mapping file must contain a mapping from absolute bam file path to new sample name (format is described in the docs for the argument). -Requires that each bam file listed in the mapping file contain only one sample in their headers (they may contain multiple read groups for that sample, however). The engine enforces this, and throws a UserException if on-the-fly renaming is requested for a multi-sample bam. -Not all bam files for a traversal need to be listed in the mapping file. -On-the-fly renaming is done as the VERY first step after creating the SAMFileReaders in SAMDataSource (before the headers are even merged), to prevent possible consistency issues. -Renaming is done ONCE at traversal start for each SAMReaders resource creation in the SAMResourcePool; this effectively means once per -nt thread -Comprehensive unit/integration tests Known issues: -if you specify the absolute path to a bam in the mapping file, and then provide a path to that same bam to -I using SYMLINKS, the renaming won't work. The absolute paths will look different to the engine due to the symlink being present in one path and not in the other path. GSA-974 #resolve	2013-07-18 15:48:42 -04:00
David Roazen	c15751e41e	SAMReaderID: fix bug with hash code and equals() method -Two SAMReaderIDs that pointed at the same underlying bam file through a relative vs. an absolute path were not being treated as equal, and had different hash codes. This was causing problems in the engine, since SAMReaderIDs are often used as the keys of HashMaps. -Fix: explicitly use the absolute path to the encapsulated bam file in hashCode() and equals() -Added tests to ensure this doesn't break again	2013-07-15 13:57:00 -04:00
sathibault	0a8f75b953	Merge branch 'master' into st_fpga_hmm Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java	2013-07-15 08:17:32 -05:00
Eric Banks	b16c7ce050	A whole slew of improvements to the Haplotype Caller and related code. 1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout. 2. Updates to the KB assessment functionality: a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call. b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling. 3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived the DeBruijn assembler. 4. Improvements to the likelihoods for the HC a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need to divide e by 3 because the observed base could have come from any of the non-observed alleles. b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are turned into Q0s for now. c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real sites with off-kilter allele balances and we occasionally lose them to contamination removal. 5. Improved the dangling tail merging implementation.	2013-07-12 10:09:10 -04:00
Valentin Ruano Rubio	ac77a4c699	Merge pull request #316 from broadinstitute/md_filter_counting Bugfix for counting of applied filters	2013-07-08 10:58:47 -07:00
Eric Banks	921f551426	AnalyzeCovariates is no longer a deprecated tool.	2013-07-08 09:48:12 -04:00
Eric Banks	5f5c90e65c	Fix bug introduced recently in the VariantAnnotator where only the last -comp was being annotated at a site. Trivial fix, added integration test to cover it.	2013-07-05 00:04:52 -04:00
Mark DePristo	3db02e5ef1	Merge pull request #315 from broadinstitute/md_ref_conf_hc Reference confidence model for the haplotype caller	2013-07-02 13:04:33 -07:00
Mark DePristo	7be01777f6	Bugfix for incPos in GenomeLoc -- Shouldn't have taken a GenomeLoc as an argument, as it's an instance method, not a public static	2013-07-02 15:46:49 -04:00
Mark DePristo	e3e8631ff5	Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences -- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction. Fixing an annoying bug where you'd perfectly assemble the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because null was also used to indicate assembly failure -- Output format looks like: 20 10026072 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026073 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119 20 10026074 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,121 20 10026075 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119 20 10026076 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026077 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217 20 10026079 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,240 20 10026080 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,268 20 10026081 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,267 We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values. Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty. -- Can be enabled for single samples with --emitRefConfidence (-ERC). -- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval. The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads -- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures. Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class. -- Includes GVCF writer -- Add 1 mb of WEx data to private/testdata -- Integration tests for reference model output for WGS and WEx data -- Emit GQ block information into VCF header for GVCF mode -- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC -- Control max indel size for the reference confidence model from the command line. Increase default to 10 -- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest -- Unit tests for ReferenceConfidenceModel -- Unit tests for new MathUtils functions	2013-07-02 15:46:38 -04:00
Mark DePristo	41aba491c0	Critical bugfix for adapter clipping in HaplotypeCaller -- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference. Terrible. -- Removed the broken logic of determining whether a read adaptor is too long. -- Doesn't require isProperPairFlag to be set for a read to be adapter clipped -- Update integration tests for new adapter clipping code	2013-07-02 15:46:36 -04:00
David Roazen	cdea744b95	Improve -dcov documentation to address recent user confusion -Explicitly state that -dcov does not produce an unbiased random sampling from all available reads at each locus, and that instead it tries to maintain an even representation of reads from all alignment start positions (which, of course, is a form of bias) -Recommend -dfrac for users who want a true across-the-board unbiased random sampling	2013-07-02 15:33:28 -04:00
Mark DePristo	9df58314ab	Bugfix for counting of applied filters -- Because LocusWalkers have multiple filtering streams, each counting filtering independent, and the close() function set calling setFilter on the global result, not on the private counter, which is incorporated into the global (thereby incrementing the counts of each filter). -- [delivers #52667213]	2013-07-01 21:09:48 -04:00
David Roazen	31827022db	Fix pipeline tests that were not respecting the pipeline test dry run setting There are a few pipeline test classes that do not run Queue, but are classified as pipeline tests because they submit farm jobs. Make these unconventional pipeline tests respect the pipeline test dry run setting.	2013-06-28 15:27:17 -04:00
Scott Thibault	82dcdc01c0	Merge branch 'master' into st_fpga_hmm Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java	2013-06-28 10:13:05 -05:00
David Roazen	94294ed6c4	Move DownsampleReadsQC walker to private	2013-06-25 15:48:44 -04:00
Eric Banks	165b936fcd	Fixing the 'header is negative' problem in Reduce Reads... again. Previous fixes and tests only covered trailing soft-clips. Now that up front hard-clipping is working properly though, we were failing on those in the tool. Added a patch for this as well as a separate test independent of the soft-clips to make sure that it's working properly.	2013-06-24 14:06:21 -04:00
Mark DePristo	fdfe4e41d5	Better GATK version and command line output -- Previous version emitted command lines that look like: ##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..." the new version provides additional information on when the GATK was run and the GATK version in a nicer format: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ..."> -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff"> ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff"> -- Removed the ProtectedEngineFeaturesIntegrationTest -- Actual unit tests for these features!	2013-06-20 11:19:13 -04:00
Mark DePristo	0672ac5032	Fix public / protected dependency	2013-06-19 19:42:09 -04:00
Valentin Ruano-Rubio	1f8282633b	Removed plots generation from the BaseRecalibration software Improved AnalyzeCovariates (AC) integration test. Renamed AC test files ending with .grp to .table Implementation: * Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly. * Fixed BQSRIntegrationTest to work with new changes	2013-06-19 14:47:56 -04:00
Valentin Ruano-Rubio	08f92bb6f9	Added AnalyzeCovariates tool to generate BQSR assessment quality plots. Implementation details: * Added tool class .AnalyzeCovariates Added convenient addAll method to Utils to be able to add elements of an array. * Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple input recalibration reports are compatible and comparable. * Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output. * Added an integration test class.	2013-06-19 14:38:02 -04:00
Mark DePristo	fb114e34fe	Merge pull request #295 from broadinstitute/dr_remove_PrintReads_ds_argument PrintReads: remove -ds argument	2013-06-19 10:55:10 -07:00
droazen	573ecadecc	Merge pull request #294 from broadinstitute/dr_handle_zero_length_cigar_elements SAMDataSource: always consolidate cigar strings into canonical form	2013-06-19 10:32:22 -07:00
David Roazen	51ec5404d4	SAMDataSource: always consolidate cigar strings into canonical form -Collapses zero-length and repeated cigar elements, neither of which can necessarily be handled correctly by downstream code (like LIBS). -Consolidation is done before read filters, because not all read filters behave correctly with non-consolidated cigars. -Examined other uses of consolidateCigar() throughout the GATK, and found them to not be redundant with the new engine-level consolidation (they're all on artificially-created cigars in the HaplotypeCaller and SmithWaterman classes) -Improved comments in SAMDataSource.applyDecoratingIterators() -Updated MD5s; differences were examined and found to be innocuous -Two tests: -Unit test for ReadFormattingIterator -Integration test for correct handling of zero-length cigar elements by the GATK engine as a whole	2013-06-19 13:29:01 -04:00
David Roazen	23ee192d5e	PrintReads: remove -ds argument -This argument was completely redundant with the engine-level -dfrac argument. -Could produce unintended consequences if used in conjunction with engine-level downsampling arguments.	2013-06-19 13:22:44 -04:00
David Roazen	0be788f0f9	Fix typo in snpEff documentation	2013-06-19 13:15:24 -04:00
Chris Hartl	af275fdf10	Extend the documentation of GenotypeConcordance to include notes about Monomorphic and Filtered VCF records. Address Geraldine's comments - information on moltenization and explanation of fields Fix paren	2013-06-19 12:01:58 -04:00
Mark DePristo	15171c07a8	CatVariants accepts reference files ending in any standard extension -- [resolves #49339235] Make CatVariants accept reference files ending in .fa (not only .fasta)	2013-06-19 11:10:36 -04:00
Mark DePristo	7b22467148	Bugfix: defaultBaseQualities actually works now -- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly. Added integration tests to ensure this continues to work. -- [delivers #49538319]	2013-06-17 14:37:27 -04:00
Mark DePristo	b69d210255	Bugfix: allow gzip VCF output in multi-threaded GATK output -- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is. Added integrationtest to ensure this behavior works going forward. -- [delivers #47399279]	2013-06-17 12:39:18 -04:00
delangel	485ceb1e12	Merge pull request #283 from broadinstitute/md_beagleoutput Simpler FILTER and info field encoding for BeagleOutputToVCF	2013-06-17 09:31:03 -07:00
James Warren	f46f7d9b23	deducing dictionary path should not use global find and replace Signed-off-by: David Roazen <droazen@broadinstitute.org>	2013-06-14 19:15:27 -04:00
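
The ordering described in commit 285ab2ac62 (sort haplotypes by length, then lexicographically) can be sketched as follows. This is a minimal illustration and not the GATK implementation: the class name `HaplotypeOrder`, the method `sortForCache`, and the use of plain base strings in place of GATK haplotype objects are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; haplotypes are modeled as plain base strings,
// not as GATK Haplotype objects.
class HaplotypeOrder {

    // Compare by length first, then lexicographically within equal lengths,
    // so that haplotypes of the same length end up adjacent to each other.
    static final Comparator<String> LENGTH_THEN_LEXICOGRAPHIC = (a, b) -> {
        int byLength = Integer.compare(a.length(), b.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortForCache(List<String> haplotypes) {
        List<String> sorted = new ArrayList<>(haplotypes);
        sorted.sort(LENGTH_THEN_LEXICOGRAPHIC);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> haplotypes = Arrays.asList("ACGT", "AC", "AAGT", "AT");
        System.out.println(sortForCache(haplotypes)); // prints [AC, AT, AAGT, ACGT]
    }
}
```

With a purely lexicographic sort the same input would come out as AAGT, AC, ACGT, AT, so the length changes at every step; with the length-first order it changes only once, which is the situation the commit message says keeps the cache from being wiped.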


3659 Commits (e7152e10f7252bac06c700b0d83377e67585d4fb)