gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Karthik Gururaj	f6ea25b4d1	Parallel version of the JNI for the PairHMM The JNI treats shared memory as critical memory and doesn't allow any parallel reads or writes to it until the native code finishes. This is not a problem per se; it is the right thing to do, but we need to enable -nct when running the haplotype caller and with it have multiple native PairHMM running for each map call. Move to copy-based memory sharing, where the JNI simply copies the memory over to C++ and then has no blocked critical memory when running, allowing -nct to work. This version is slightly (almost unnoticeably) slower with -nct 1, but scales better with -nct 2-4 (we haven't tested anything beyond that because we know the GATK falls apart with higher levels of parallelism). * Make VECTOR_LOGLESS_CACHING the default implementation for PairHMM. * Changed version number in pom.xml under public/VectorPairHMM * VectorPairHMM can now be compiled using gcc 4.8.x * Modified define-* to get rid of gcc warnings for extra tokens after #undefs * Added a Linux kernel version check for AVX - gcc's __builtin_cpu_supports function does not check whether the kernel supports AVX or not. * Updated PairHMM profiling code to update and print numbers only in single-thread mode * Edited README.md, pom.xml and Makefile for users to pass path to gcc 4.8.x if necessary * Moved all cpuid inline assembly to single function Changed info message to clog from cinfo * Modified version in pom.xml in VectorPairHMM from 3.1 to 3.2 * Deleted some unnecessary code * Modified C++ sandbox to print per interval timing	2014-05-02 19:12:48 -04:00
Valentin Ruano-Rubio	d563072282	Fix for CombineGVCFs and GenotypeGVCFs recurrent exception about missing PLs Story: https://www.pivotaltracker.com/story/show/68220438 Changes: - PL-less input genotypes are now uncalled and so non-variant sites when combining GVCFs. - HC GVCF/BP_RESOLUTION Mode now outputs non-variant sites in sites covered by deletions. - Fixed existing tests Test: - HaplotypeCallerGVCFIntegrationTest - ReferenceConfidenceModelUnitTest - CombineGVCFsIntegrationTest	2014-05-02 09:21:06 -04:00
Ryan Poplin	41d3069213	When we subset PLs because Alleles are removed during genotyping we also need to subset AD.	2014-04-28 15:52:26 -04:00
Ryan Poplin	06dbe74a23	Merge pull request #609 from kcibul/kc_cancersimreads extended SimulateReadsForVariants to optionally use the AF field to indi...	2014-04-28 13:31:56 -04:00
Ami Levy-Moonshine	13dd755468	Create a new read transformer that refactors NDN cigar elements into one N element. story: https://www.pivotaltracker.com/story/show/69648104 description: This read transformer will refactor cigar strings that contain N-D-N elements to one N element (with total length of the three refactored elements). This is intended primarily for users of RNA-Seq data handling programs such as TopHat2. Currently we consider that the internal N-D-N motif is illegal and we error out when we encounter it. By refactoring the cigar string of those specific reads, users of TopHat and other tools can circumvent this problem without affecting the rest of their dataset. edit: address review comments - change the tool's name and change the tool to be a readTransformer instead of a read filter	2014-04-28 11:29:00 -04:00
Michael McCowan	8290d3c8ac	Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming.	2014-04-22 11:07:13 -04:00
MauricioCarneiro	f03e5ffeb1	Merge pull request #604 from broadinstitute/vrr_hc_omniploidy_general_api Disentangle UG and HC Genotyper engines.	2014-04-20 07:43:23 -04:00
Valentin Ruano-Rubio	7455ac9796	Addressed revisions	2014-04-19 16:48:48 -04:00
Ryan Poplin	a9a48f2459	Merge pull request #607 from broadinstitute/mm_bugfix_raise_mathutils_n_ceiling Support more samples in math utilities.	2014-04-17 13:32:34 -04:00
Joel Thibault	1ab50f4ba8	CatVariants now handles BCF and Block-Compressed VCF [Delivers #67461500]	2014-04-17 12:31:38 -04:00
Kristian Cibulskis	7115cadbd8	extended SimulateReadsForVariants to optionally use the AF field to indicate allele fraction of the simulated event, useful in cancer and other variable ploidy use cases	2014-04-16 16:20:02 -04:00
Joel Thibault	4c74319578	Update for Picard refactoring which improves block-compressed VCF reading [Delivers #69215404]	2014-04-16 14:39:23 -04:00
Joel Thibault	f98df5c071	Integration test for the file extensions CatVariants should handle	2014-04-16 13:25:47 -04:00
Joel Thibault	bdd7024d00	Integration test for block-compressed VCF reading	2014-04-16 13:09:40 -04:00
Joel Thibault	ce770b032a	Move execAndCheck() to ProcessController	2014-04-16 13:09:40 -04:00
Joel Thibault	b197618d13	This comment is no longer true	2014-04-15 15:42:39 -04:00
Mike	f0732d386c	Support more samples in math utilities. - Amend `MathUtils`' constants so that they support calling on more than 70,000 samples (up to 100,000 instead).	2014-04-14 12:05:38 -04:00
Valentin Ruano-Rubio	08203b516e	Disentangle UG and HC Genotyper engines. Description: Transforms a delegation dependency from HC to UG genotyping engine into reuse by inheritance, where HC and UG engines inherit from a common superclass GenotyperEngine that implements the common parts. As a side effect, some of the code is now clearer and redundant code has been removed. Changes have a few consequences for the end user. HC now has a few more user arguments, those that control the functionality that HC was borrowing directly from UGE. Added -ploidy argument although it is constrained to be 2 for now. Added -out_mode EMIT_ALL_SITES|EMIT_VARIANTS_ONLY ... Added -allSitePLs flag. Stories: https://www.pivotaltracker.com/story/show/68017394 Changes: - Moved (HC's) GenotyperEngine to HaplotypeCallerGenotyperEngine (HCGE). Then created an engine superclass GenotypingEngine (GE) that contains common parts between HCGE and the UG counterpart 'UnifiedGenotypingEngine' (UGE). Simplified the code and applied the template pattern to accommodate small differences in behaviour between both caller engines. (There is still room for improvement though). - Moved inner classes and enums to top-level components for various reasons, including giving them shorter and simpler names. - Created a HomoSapiens class for human-specific constants; even if they are good defaults for most users we need to clearly identify the human assumption across the code if we want to make GATK work with any species in general; i.e. any reference to HomoSapiens, except as a default value for a user argument, should smell. - Fixed a bug deep in the genotyping calculation: we were taking fixed values for SNP and indel heterozygosity to be the human defaults, ignoring user arguments. - GenotypingLikehooldCalculationCModel.Model to Gen.Like.Calc.*Model.Name; not a definitive solution though, as names are often used in conditionals that perhaps should be member methods of the GenLikeCalc classes. - Renamed LikelihoodCalculationEngine to ReadLikelihoodCalculationEngine to distinguish them clearly from Genotype likelihood calculation engines. - Changed copy by explicit argument listing to a clone/reflection solution for casting between genotyper argument collection classes. - Created GenotypeGivenAllelesUtils to collect methods needed nearly exclusively by the GGA mode. Tests: - StandardCallerArgumentCollectionUnitTest (check copy by cloning/reflection). - All existing integration and unit tests for modified classes.	2014-04-13 03:09:55 -04:00
Joel Thibault	c84126205b	Test that stdout redirects and log files do not affect output	2014-04-09 13:52:42 -04:00
Joel Thibault	1103fd231a	Better exception message	2014-04-09 10:51:45 -04:00
Eric Banks	b07c0a6b4c	Merge pull request #594 from broadinstitute/dr_vcf_sample_renaming Extend on-the-fly sample renaming feature to vcfs	2014-04-08 11:47:45 -04:00
David Roazen	af6a897479	Extend on-the-fly sample renaming feature to vcfs -Only works with single-sample vcfs -As with bams, the user must provide a file mapping the absolute path to each vcf whose samples are to be renamed to the new sample name for that vcf. The argument is the same as for bams: --sample_rename_mapping_file, and the mapping file may contain a mix of bam and vcf files should the user wish. -It's an error to attempt to remap the sample names of a multi-sample or sites-only vcf -Implemented at the codec level at the instant the vcf header is first read in to minimize the chances of downstream code examining vcf headers/records before renaming occurs. -Integration tests are in sting, unit tests are in picard -Rev picard et. al. to 1.111.1902	2014-04-08 11:07:00 -04:00
Eric Banks	ad336375dc	Merge pull request #590 from broadinstitute/vrr_validate_variants_unused_alleles_fix Addresses issue with strict validation on GVCF files.	2014-04-07 22:10:49 -04:00
Valentin Ruano-Rubio	5afcc8e05f	Change in the command line interface of ValidateVariants. Following reviewers' comments the command line interface has been simplified. All extra strict validations are performed by default (as before) and the user has to indicate which ones he/she does not want to use with --validationTypeToExclude. Before, he/she was able to indicate the only ones to apply with --validationType but that has been removed. Stories: - https://www.pivotaltracker.com/story/show/68725164 Changes: - Removed validateType argument. - Improved documentation. - Added some warning log messages on suspicious argument combinations. Tests: - ValidateVariantsIntegrationTest#*	2014-04-07 16:27:11 -04:00
Ryan Poplin	7d11b4d5f1	Balancing training classes between SNP/Indel and TP/FP. -- This results in much more consistent distribution of LOD scores for SNPs and Indels. -- Removing genotype summary stats since they are now produced by default. -- Added functionality to specify certain subsets of the training data to be used in Tranche file generation, -good:tranche=true set.vcf	2014-04-07 15:23:53 -04:00
MauricioCarneiro	84861fa10a	Merge pull request #587 from broadinstitute/eb_actually_fail_on_reduced_bams Make sure to fail in all cases where the BAM being used was created by ReduceReads.	2014-04-04 17:27:57 -04:00
Laura Gauthier	ff25b656e1	Added check to make sure file passed in with sample IDs is valid (used in SelectVariants) -- throws UserException. Corresponding test checks for UserException.	2014-04-04 15:38:50 -04:00
Valentin Ruano-Rubio	18deeec6b0	Addresses issue with strict validation on GVCF files. More concretely, Picard's strict VCF validation does not like that there are alternative alleles that are not participating in any genotype call across samples. This is an issue with GVCF in the single-sample pipeline where this is certainly expected with <NON_REF> and other relatively unlikely alleles. To solve this issue we allow the user to exclude some of the strict validations using a new argument --validationTypeToExclude. In order to avoid the validation issue with GVCF the user needs to add the following to the command line: '--validationTypeToExclude ALLELES' Story: https://www.pivotaltracker.com/story/show/68725164 Changes: - Added validateTypeToExclude argument to ValidateVariants walker. - Implemented the selective exclusion of validation types. - Added new info and improved existing documentation of the ValidateVariants walker. Tests: - ValidateVariantsIntegrationTest#testUnusedAlleleError - ValidateVariantsIntegrationTest#testUnusedAlleleFix	2014-04-04 14:37:10 -04:00
Laura Gauthier	06d78ba068	Expanded documentation to include description of which callsets are being compared in what order and more definitions	2014-04-04 10:35:53 -04:00
Eric Banks	a3d55b3341	Make sure to fail in all cases where the BAM being used was created by ReduceReads. In some cases, the program records were being removed from the BAM headers by the GATK engine before we applied the check for reduced reads (so we did not fail appropriately). Pushed up the check to happen before the PG tags are modified and added a unit test to ensure it stays that way. It turns out that some UG tests still used reduced bams so I switched to use different ones. Based on reviewer feedback, made it more generic so that it's easy to add new unsupported tools.	2014-04-03 16:52:41 -04:00
Eric Banks	0b73573abc	Slightly modifying the way to use the IUPAC ambiguity codes in the FastaAlternateReferenceMaker. Previously it required you to create a single sample VCF and then to pass that in to the tool, but Geraldine convinced me that this was a pain for users (because they usually have multi-sample VCFs). Instead now you can pass in a multi-sample VCF and specify which sample's genotypes should be used for the IUPAC encoding. Therefore the argument changed from '--useIUPAC' to '--use_IUPAC_sample NA12878'.	2014-04-02 21:34:25 -04:00
Valentin Ruano-Rubio	84711b8e90	Fixed bug using GraphBased due to infinite likelihoods resulting from the calculation of alignment cost of very long insertions or deletions (done in linear scale) Stories: https://www.pivotaltracker.com/story/show/66263868 Bug: The problem was due to the way we were calculating the fixed penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the portion of read or haplotype deletion as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0 in linear-scale computations that is transformed into an infinity in log scale. Changes: - Change to use log10 scale to calculate those penalties. - Minor addition to .gitignore to hide ./public/external-example/target which is generated by the building process.	2014-04-01 16:14:52 -04:00
Joel Thibault	70fe7f72f1	Return a TabixIndexCreator for appropriate file types [Fixes #68291082]	2014-03-31 16:15:34 -04:00
Joel Thibault	ab5634cbac	Test that a Tabix index is created for block-compressed output formats - Replace .idx and .tbi with appropriate constants	2014-03-31 14:36:48 -04:00
Joel Thibault	a2d40c84ba	Keep the list of zipped suffixes in sync with Variant	2014-03-31 14:36:41 -04:00
Ryan Poplin	6566dd6ca9	Fix for dropping of reference sample depth in the DP annotation. -- In the case of hierarchical merge we can't assume that we have only one genotype. -- Removed use of deprecated VC annotation access functions.	2014-03-24 14:01:50 -04:00
Ryan Poplin	69eaf7c82d	Merge pull request #577 from broadinstitute/eb_minor_fixes_for_fragment_utils Fixed docs for method and fixed the edge case optimization to properly u...	2014-03-21 14:01:44 -04:00
Eric Banks	0d82a70633	Fixed docs for method and fixed the edge case optimization to properly use equals() on Integers. Shouldn't affect actual results at all.	2014-03-20 15:55:09 -04:00
Eric Banks	3b1c337401	Have CombineVariants throw a UserError when trying to combine GVCFs from the HaplotypeCaller. Was previously throwing an IllegalArgumentException (in the wrong place in the code). Error message tells users to use CombineGVCFs.	2014-03-19 19:11:40 -04:00
David Roazen	e549f4a9d2	Fix typo in UtilsUnitTest data provider name This is currently my leading suspect for the cause of the intermittent NoSuchElementException errors on master, since the maven surefire plugin seems unable to handle errors in TestNG DataProviders without blowing up.	2014-03-18 11:52:29 -04:00
David Roazen	4ba72d43cf	Re-enable GATKRunReportUnitTest This test is not, as I had initially thought, the cause of the maven errors. Our master branch is failing intermittently regardless of whether this test is enabled or disabled. This reverts commit 45fc9ff515eec8d676b64a04fb34fb357492ff84.	2014-03-18 09:53:41 -04:00
David Roazen	afa6abe554	Temporarily disable GATKRunReportUnitTest in unstable while maven issues are worked out This test passes when run individually, as part of the commit tests, or as part of the package tests. However, when running the unit tests in isolation it causes maven/surefire to throw a NoSuchElementException. This is clearly a maven/surefire bug or configuration issue. I will re-enable this test on a branch as Khalid and I try to work through it.	2014-03-18 01:28:28 -04:00
David Roazen	2d8653f493	Update pom versions to mark the start of GATK 3.2 development	2014-03-18 01:18:59 -04:00
David Roazen	a6a41c777c	Update pom versions for 3.1	2014-03-18 01:09:29 -04:00
David Roazen	d5e38ec39b	Move GATKRunReport tests from private to public -Hide AWS downloader credentials in a private properties file -Remove references to private ActiveRegion walker Allows phone home functionality to be tested at release time when we are running tests on the release jar.	2014-03-17 18:29:40 -04:00
Eric Banks	2e34ff7692	Merge pull request #563 from broadinstitute/aw_refactor_tribble GATK changes to conform to Tribble refactoring as part improving Tabix s...	2014-03-17 13:35:46 -04:00
Eric Banks	dabdd0a0fd	Remove unused and unnecessary argument	2014-03-17 12:28:27 -04:00
Alec Wysoker	0369f93b24	GATK changes to conform to Tribble refactoring as part improving Tabix support in Tribble (among other things). 1. Enable on-the-fly indexing for vcf.gz. 2. Handle on-the-fly indexing where file to be indexed is not a regular file, thus index should not be created. 3. Add method setProgressLogger to all SAMFileWriter implementations. 4. Revved picard to 1.109.1722 5. IndelRealigner md5s change because the MC tag is added to records now. Fixed up and signed off by ebanks.	2014-03-17 11:56:22 -04:00
Valentin Ruano-Rubio	2e964c59b4	Improved criteria to select best haplotypes out from the assembly graph. Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge multiplicity sum across their path in the assembly graph. The edge multiplicity is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely maps to some vertex up-stream from the edge and the following base calls extend across that edge to vertices downstream from it. Although higher multiplicities obviously correlate with haplotype probability, this criterion falls short in some regards, of which the most relevant is: Because it is evaluated in the condensed seq-graph (as opposed to uncompressed read-threading-graphs), it is biased toward haplotypes that have more short-sequence vertices ( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly a result of how we modify the edge multiplicities when we merge vertices from a linear chain. This pull-request addresses the problem by changing to a new scoring schema based on likelihood estimates: Each haplotype's likelihood can be calculated as the multiplication of the likelihood of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly graph is calculated as its multiplicity divided by the sum of the multiplicities of edges that share the same source vertex. This pull-request addresses the following stories: https://www.pivotaltracker.com/story/show/66691418 https://www.pivotaltracker.com/story/show/64319760 Change Summary: 1. Change to the new scoring schema. 2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring. 3. Graph transformations have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0 edges assuming that they are in fact 0.5)	2014-03-14 18:37:01 -04:00
Eric Banks	ffaf92f871	Added new functionality to the FastaAlternateReferenceMaker to have it output IUPAC codes for het sites. Enable it with the new --useIUPAC argument. Added both unit and integration tests for the new functionality - and fixed up the existing tests once I was in there.	2014-03-12 14:31:57 -04:00
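
The likelihood-based haplotype scoring described in commit 2e964c59b4 can be sketched roughly as follows. This is an illustrative Python sketch, not the GATK implementation: the function names, the toy graph, the use of log10 sums in place of a product, and the handling of 0-multiplicity edges as 0.5 (following the remark in that commit message) are all assumptions made for the example.

```python
import math
from collections import defaultdict


def edge_log10_likelihoods(edges, zero_as=0.5):
    """Map each (source, target) edge to log10(multiplicity / total out of source).

    `edges` is a dict {(source, target): multiplicity}. Edges with multiplicity 0
    are counted as `zero_as` (an assumption taken from the commit message).
    """
    adjusted = {}
    totals = defaultdict(float)
    for (src, dst), mult in edges.items():
        m = mult if mult > 0 else zero_as
        adjusted[(src, dst)] = m
        totals[src] += m
    return {e: math.log10(m / totals[e[0]]) for e, m in adjusted.items()}


def path_log10_likelihood(path, edge_scores):
    """Sum the per-edge log10 likelihoods along a path given as a vertex list."""
    return sum(edge_scores[(a, b)] for a, b in zip(path, path[1:]))


# Hypothetical toy graph: two alternative paths from "ref" to "end".
edges = {("ref", "A"): 9, ("ref", "B"): 3, ("A", "end"): 9, ("B", "end"): 3}
scores = edge_log10_likelihoods(edges)
candidates = [["ref", "A", "end"], ["ref", "B", "end"]]
best = max(candidates, key=lambda p: path_log10_likelihood(p, scores))
```

Because each edge is normalised by the total multiplicity leaving its source vertex, an edge that is the only way out of a vertex contributes a likelihood of 1, so splitting a linear chain into more vertices does not change a path's score.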

105 Commits (fb8578ec8e0ffb9a708d4eb0fb9d19f7ca0be7ba)