gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	df6499e58c	Bug fix for RR: stop (incorrectly) pulling the MQ out of the SAMRecord as a byte instead of an int. For reads with high MQs (greater than max byte) the MQ was being treated as negative and failing the min MQ filter. Added unit test. Delivers PT#61567540.	2013-11-27 18:55:03 -05:00
Chris Hartl	1f777c4898	Introducing the latest-and-greatest in genotyping: CalculatePosteriors. CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution. Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows: while not converged: * AC = [external AC] + [sample AC] * Prior = DirichletMultinomial[AC] * Posteriors = [sample GL + Prior] * sample AC = MLEAC(Posteriors) This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common use case. Fully unit tested. Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.	2013-11-27 13:00:45 -05:00
Eric Banks	0fac4fb3b6	Make the reference model calculation work with reduced reads. It's just a matter of using PileupElement.getRepresentativeCount() instead of '++'.	2013-11-21 10:53:33 -05:00
Eric Banks	adb77b406f	Fixed poor implementation of isRefSource() and isRefSink() among others. There was already a note in the code about how wrong the implementation was. The bad code was causing a single-node graph to get cleaned up into nothing when pruning tails. Delivers PT #61069820.	2013-11-21 10:53:27 -05:00
Ami Levy-Moonshine	e6ef37de1d	Add an option to filter the read bases that are taken into account for the covered intervals. For that, two new arguments were added: minBaseQuality and minMappingQuality	2013-11-18 17:29:32 -05:00
MauricioCarneiro	7f08250870	Merge pull request #417 from broadinstitute/bt_pairhmm_api_cleanup2 Improve the PairHMM API for better FPGA integration	2013-11-14 10:47:07 -08:00
bradtaylor	e40a07bb58	Improve the PairHMM API for better FPGA integration Motivation: The API was different between the regular PairHMM and the FPGA-implementation via CnyPairHMM. As a result, the LikelihoodCalculationEngine had to account for this. The goal is to change the API to be the same for all implementations, and make it easier to access. PairHMM: PairHMM now accepts a list of reads and a map of alleles/haplotypes and returns a PerReadAlleleLikelihoodMap. Added a new primary method that loops over the reads and haplotypes, extracts qualities, and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method. Did not alter that method, or its subcompute method, at all. PairHMM also now handles its own (re)initialization, so users don't have to worry about that. CnyPairHMM: Added that same new primary access method to this FPGA class. Method overrides the default implementation in PairHMM. Walks through a list of reads. Individual-read quals and the full haplotype list are fed to batchAdd(), as before. However, instead of waiting for every read to get added, and then walking through the reads again to extract results, we just get the haplotype-results array for each read as soon as it is generated, and pack it into a perReadAlleleLikelihoodMap for return. The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not. LikelihoodCalculationEngine: The functionality to loop through the reads and haplotypes and get individual log10-likelihoods was moved to the PairHMM, and so removed from here. However, this class does need to retain the ability to pre-process the reads, and post-process the resulting likelihoods map. Those features were separated from running the HMM and refactored into their own methods. Commented out the (unused) system for finding best N haplotypes for genotyping. PairHMMIndelErrorModel: Changes similar to those in the LCE were made. However, in this case the haplotypes are modified based on each individual read, so the read-list we feed into the HMM only has one read.	2013-11-14 09:45:33 -05:00
Geraldine Van der Auwera	dac3dbc997	Improved gatkdocs for InbreedingCoefficient, ReduceReads, ErrorRatePerCycle Clarified caveat for InbreedingCoefficient Cleaned up docstrings for ReduceReads Brushed up doc for ErrorRatePerCycle	2013-11-13 14:33:04 -05:00
Eric Banks	0e3d83d1ef	Merge pull request #413 from broadinstitute/rp_qd_and_qual_updates_in_ref_model_pipeline Improvements to the reference model pipeline.	2013-11-05 06:33:17 -08:00
Eric Banks	09dfaf1a68	Merge pull request #416 from broadinstitute/mc_quick_fixes_to_cser_pipeline Add interpretation to QualifyMissingIntervals	2013-11-05 06:08:13 -08:00
Ryan Poplin	b22c9c2cb4	Improvements to the reference model pipeline. -- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants) -- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.	2013-11-01 17:58:25 -04:00
Mauricio Carneiro	5ed47988b8	Changed the parameter names from cds to baits Making the usage more clear since the parameter is being used over and over to define baited regions. Updated the headers accordingly and made it more readable.	2013-10-24 17:15:56 -04:00
Chris Hartl	9d932e8c60	Merged bug fix from Stable into Unstable	2013-10-10 14:31:33 -04:00
Chris Hartl	6f46d1187a	Remember to copy the integration test changes as well as the walker changes	2013-10-10 14:30:37 -04:00
Mauricio Carneiro	5d6421494b	Fix mismatching number of columns in report Quick fix the missing column header in the QualifyMissingIntervals report. Adding a QScript for the tool as well as a few minor updates to the GATKReportGatherer.	2013-10-09 14:38:15 -04:00
Mauricio Carneiro	63ace685c9	add unit tests	2013-10-04 11:44:07 -04:00
Mauricio Carneiro	839b918f58	Length metric updates to QualifyMissingIntervals * add a metric for the length of the overlapping interval as per CSER request * standardized the distance metrics to be positive when fully overlapping and the longest off-target tail (as a negative number) when not overlapping * add gatkdocs to the tool (finally!)	2013-10-04 10:18:13 -04:00
Geraldine Van der Auwera	9f7fa247f6	Disable VQSR tranche plots in INDEL mode	2013-09-30 17:14:37 -04:00
Ryan Poplin	ef1d58b7ff	Bugfix for hom ref records that aren't GVCF blocks.	2013-09-29 19:19:26 -04:00
Geraldine Van der Auwera	27808d336a	Minor clarifications regarding ignoreFilter argument	2013-09-26 13:13:53 -04:00
Geraldine Van der Auwera	a9fa5206ee	Merge pull request #399 from broadinstitute/eb_update_docs_for_DepthPerSampleHC Updated docs for DepthPerSampleHC to deliver PT #54237024.	2013-09-25 15:20:19 -07:00
Ryan Poplin	f362597f69	Merge pull request #400 from broadinstitute/mm_bugfix_combine_variants_implicit_casting Bug fix: annotation values are parsed as Doubles when they should be parsed as Integers due to implicit conversion.	2013-09-25 11:47:17 -07:00
Michael McCowan	5113e21437	Bug fix: annotation values are parsed as Doubles when they should be parsed as Integers due to implicit conversion. * Updated expected test data in which an integer annotation (MQ0) was formatted as a double.	2013-09-25 13:12:02 -04:00
Eric Banks	2783c84c6b	Updated docs for DepthPerSampleHC to deliver PT #54237024.	2013-09-24 22:32:19 -04:00
Eric Banks	d6992d1263	Updated docs to tell users not to use PCR indel error modeling for PCR free data.	2013-09-23 15:48:47 -04:00
Mauricio Carneiro	5bbad75402	Changing max coverage threshold Because Integer.MAX_VALUE is not unit testable	2013-09-20 18:54:40 -04:00
Geraldine Van der Auwera	175388de1d	Merge pull request #396 from broadinstitute/mc_dt_excessive_coverage_defaults Updating excessive coverage default parameter & docs+test for QualifyMissingIntervals	2013-09-20 15:12:16 -07:00
Mauricio Carneiro	5e2ffc74fc	Automated interpretation for QualifyMissingIntervals * add a new column to do what I have been doing manually for every project, understand why we got no usable coverage in that interval * add unit tests -- this tool is now public, we need tests. * slightly better docs -- in an effort to produce better docs for this tool	2013-09-20 16:47:12 -04:00
Mauricio Carneiro	74639463b9	Updating excessive coverage default parameter Most people don't care about excessive coverage (unless you're very particular about your analysis). Therefore the best possible default value for this is Integer.MAX_VALUE so it doesn't get in the way. Itemized Changes: * change maximumCoverage threshold to Integer.MAX_VALUE [delivers #57353620]	2013-09-19 23:07:20 -04:00
MauricioCarneiro	014bc4269e	Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation Add Array Logless PairHMM	2013-09-08 20:16:53 -07:00
Ryan Poplin	3503050a39	Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller -- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller. -- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median -- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs -- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model -- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions -- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings -- We only realign reads in the reference model if the read is informative for a particular haplotype over another -- GVCF blocks will now track and output the minimum PLs over the block -- MD5 changes! -- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny -- GVCF tests: from HC changes above and adding in active region trimming	2013-09-06 16:56:34 -04:00
Ryan Poplin	add17dc463	Merge pull request #388 from broadinstitute/eb_change_record_size_mismatch_to_user_error Changed the error for the record size mismatch in the genotyping engine ...	2013-08-30 10:29:54 -07:00
Eric Banks	ea0deb1bb2	Changed the error for the record size mismatch in the genotyping engine to be a user error since it is possible to reach this state with input VCFs that contain the same event multiple times (and it's not something we want to handle in the code).	2013-08-30 12:18:19 -04:00
Louis Bergelson	4473b0065e	adding a check for the UNAVAILABLE case of GenotypeType in CountVariants	2013-08-29 17:27:00 -04:00
bradtaylor	0435bbe38f	Retrieved PairHMM benchmarks from archive and made improvements PairHMMSyntheticBenchmark and PairHMMEmpiricalBenchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely. Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords". I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data. It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to: PairHMMBandedEmpiricalBenchmark PairHMMTestData BandedLoglessPairHMMUnitTest I decided against this as the banded benchmark and unit test are archived.	2013-08-28 17:23:35 -04:00
bradtaylor	86fe9fae76	Changes to Array PairHMM to address review comments Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.	2013-08-28 17:23:29 -04:00
bradtaylor	3671e41b0c	Add Array Logless PairHMM A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version. Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix. This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations. We also had to adjust the classes that employ the pairHMM: LikelihoodCalculationEngine (used by HaplotypeCaller) PairHMMIndelErrorModel (used by indel genotyping classes) Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection. The latter affects classes: ErrorModel GeneralPloidyIndelGenotypeLikelihoodsCalculationModel IndelGenotypeLikelihoodsCalculationModel ... all of which use the pairHMM via PairHMMIndelErrorModel	2013-08-28 17:21:23 -04:00
Ryan Poplin	7479152977	Merged bug fix from Stable into Unstable	2013-08-28 12:40:25 -04:00
Ryan Poplin	6bda569666	One of the log10sumLog10s in the VQSR was missed in a previous bug fix. Thanks to Mike McCowan for spotting this one.	2013-08-28 12:40:08 -04:00
Geraldine Van der Auwera	ed465cd2a5	Fixed a few typos and clarified some doc points.	2013-08-26 17:33:17 -04:00
David Roazen	42d771f748	Remove org.apache.commons.collections.IteratorUtils dependency from the test suite -This was a dependency of the test suite, but not the GATK proper, which caused problems when running the test suite on the packaged GATK jar at release time -Use GATKVCFUtils.readVCF() instead	2013-08-21 19:44:02 -04:00
Eric Banks	d4dc5ba04a	Fixed bug in PhaseByTransmission where it was completely dropping multi-allelic records. Added test to make sure this is no longer happening.	2013-08-21 15:46:57 -04:00
Eric Banks	6663d48ffe	Merge pull request #381 from broadinstitute/mm_rev_picard_to_get_tribble_updates Adaptations to accommodate Tribble API changes.	2013-08-19 18:31:02 -07:00
Michael McCowan	c3a933ce84	Adaptations to accommodate Tribble API changes, consisting mostly of the following. * Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable. * Galvanizing for generic types. * Test fixups, mostly to pass around LineIterators instead of LineReaders. * New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances. * New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).	2013-08-19 15:52:47 -04:00
Eric Banks	17eb7b49fe	Adding ability to use Ryan's PCR error modeling to the Haplotype Caller. There is now a command-line option to set the model to use in the HC. Incorporated Ryan's current (unmerged) branch in for most of these changes. Because small changes to the math can have drastic effects, I decided not to let users tweak the calculations themselves. Instead they can select either NONE, CONSERVATIVE (the default), or AGGRESSIVE. Note that any base insertion/deletion qualities from BQSR are still used. Also, note that the repeat unit x repeat length approach gave very poor results against the KB, so it is not included as an option here.	2013-08-16 01:53:04 -04:00
Eric Banks	69e78efeae	Merge pull request #366 from broadinstitute/gg_gatkdocfixes More gatkdoc fixes	2013-08-13 04:52:03 -07:00
Eric Banks	bcf9a1cda5	Merge pull request #370 from broadinstitute/rp_dont_output_filtered_variants_in_VQSR Adding mode to VQSR to not output variant records that are filtered out ...	2013-08-12 12:01:50 -07:00
Ryan Poplin	a45011d7e7	Adding mode to VQSR to not output variant records that are filtered out after applying the recalibration. Necessary for 1000G calling.	2013-08-12 11:22:59 -04:00
Ryan Poplin	59f56bef30	Cleaning up help text for the -numBad argument.	2013-08-12 09:51:56 -04:00
Geraldine Van der Auwera	4d20c71e09	Improvements to various gatkdocs - Make -rod required - Document that contaminationFile is currently not functional with HC - Document liftover process more clearly - Document VariantEval combinations of ST and VE that are incompatible - Added a caveat about using MVLR from HC and UG. - Added caveat about not using -mte with -nt - Clarified masking options - Fixed docs based on Eric's comments	2013-08-10 10:01:31 -07:00
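
Commit df6499e58c above describes reads with high mapping qualities failing the min MQ filter because the MQ was read as a byte instead of an int. The following is a minimal sketch of that class of bug, not the actual GATK code: the class name, method names, and the example values (MQ 200, threshold 20) are assumptions chosen for illustration. In Java, narrowing an int above 127 to a byte wraps it to a negative number, so the comparison against the threshold fails.

```java
// Illustrative sketch only; all names here are hypothetical, not taken from GATK.
final class MappingQualityNarrowingDemo {

    // Buggy variant: narrowing the int MQ to a byte wraps values above 127
    // into the negative range, so a high MQ looks like a very low one.
    static boolean passesMinMqAsByte(int mappingQuality, int minMappingQuality) {
        byte narrowed = (byte) mappingQuality; // e.g. 200 becomes -56
        return narrowed >= minMappingQuality;
    }

    // Fixed variant: keep the value as an int so high MQs compare correctly.
    static boolean passesMinMqAsInt(int mappingQuality, int minMappingQuality) {
        return mappingQuality >= minMappingQuality;
    }

    public static void main(String[] args) {
        int mq = 200;   // assumed example value above Byte.MAX_VALUE (127)
        int minMq = 20; // assumed example threshold
        System.out.println("as byte: value=" + (byte) mq
                + " passes=" + passesMinMqAsByte(mq, minMq));
        System.out.println("as int:  value=" + mq
                + " passes=" + passesMinMqAsInt(mq, minMq));
    }
}
```

With these assumed values the byte variant rejects the read and the int variant accepts it. Both variants agree for MQs up to 127.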


882 Commits (df6499e58c0e1a8d48a8d53c6522452b57c65cc9)
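
Commit 1f777c4898 above describes CalculatePosteriors as combining sample genotype likelihoods with a Dirichlet-Multinomial prior built from panel allele counts, for ploidy 2. The sketch below works through that idea under stated assumptions and is not the GATK implementation: the class and method names are hypothetical, the Dirichlet parameters are passed in directly as `alpha` (whether pseudocounts are added to the observed counts is left to the caller), and genotypes are indexed in the usual VCF order (0/0, 0/1, 1/1, 0/2, ...). For two draws, the Dirichlet-Multinomial gives alpha_i(alpha_i + 1) / (A(A + 1)) for a homozygous genotype and 2 alpha_i alpha_j / (A(A + 1)) for a heterozygous one, where A is the sum of all alpha.

```java
// Illustrative sketch only; names are hypothetical and not taken from GATK.
final class DirichletMultinomialPriorSketch {

    // Prior over unordered diploid genotypes, given Dirichlet parameters alpha
    // (one per allele). Genotype (j, k) with j <= k is stored at k*(k+1)/2 + j.
    static double[] diploidGenotypePriors(double[] alpha) {
        double total = 0.0;
        for (double a : alpha) {
            total += a;
        }
        double norm = total * (total + 1.0);
        int n = alpha.length;
        double[] priors = new double[n * (n + 1) / 2];
        for (int k = 0; k < n; k++) {
            for (int j = 0; j <= k; j++) {
                int idx = k * (k + 1) / 2 + j;
                priors[idx] = (j == k)
                        ? alpha[k] * (alpha[k] + 1.0) / norm   // homozygous
                        : 2.0 * alpha[j] * alpha[k] / norm;    // heterozygous
            }
        }
        return priors;
    }

    // Normalized posteriors from log10 genotype likelihoods and linear-scale priors.
    // The maximum log10 likelihood is subtracted first for numerical stability.
    static double[] posteriors(double[] log10Likelihoods, double[] priors) {
        double max = Double.NEGATIVE_INFINITY;
        for (double gl : log10Likelihoods) {
            max = Math.max(max, gl);
        }
        double[] post = new double[priors.length];
        double sum = 0.0;
        for (int g = 0; g < priors.length; g++) {
            post[g] = Math.pow(10.0, log10Likelihoods[g] - max) * priors[g];
            sum += post[g];
        }
        for (int g = 0; g < priors.length; g++) {
            post[g] /= sum;
        }
        return post;
    }

    public static void main(String[] args) {
        double[] priors = diploidGenotypePriors(new double[] {3.0, 1.0});
        double[] post = posteriors(new double[] {-1.0, 0.0, -1.0}, priors);
        System.out.println(java.util.Arrays.toString(priors));
        System.out.println(java.util.Arrays.toString(post));
    }
}
```

The sketch multiplies in linear space, which is equivalent to the log-space sum "sample GL + Prior" in the commit message. It does not guard against a zero normalizer, which can occur when every genotype supported by the likelihoods has zero prior.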