gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	9009c1e996	Merge pull request #715 from broadinstitute/vrr_disable_physical_phasing_for_nondiploid_hc Disable physical phasing for non-diploid HC calling.	2014-08-23 20:58:51 -04:00
Valentin Ruano-Rubio	6695aeafd9	Disable physical phasing for non-diploid HC calling. Story: https://www.pivotaltracker.com/story/show/77452256 Changes: If ploidy != 2, disable physical phasing and log an info message to let the user know. Tests: Change md5s affected by this change.	2014-08-23 10:52:07 -04:00
Phillip Dexheimer	931890915f	Add the --sample_name argument to HaplotypeCaller * This is a shortcut for people who have multi-sample BAMs but would like to use GVCF mode. Rather than creating single-sample BAMs with PrintReads, one could use the --sample_name argument to HaplotypeCaller to specify the single sample to make calls on * Completes PT 73075482	2014-08-22 23:22:03 -04:00
Valentin Ruano-Rubio	fc5ce4b662	Created the stand-alone AC and AF annotation AlleleCountBySample Story: https://www.pivotaltracker.com/story/show/77250524 Changes: - Remove the annotating code in GeneralPloidyExactAFCalc (GPEAFC) class. - Added the asAlleleList to GenotypeAlleleCounts class and get (GPEAFC) to use that instead of implementing its own (nicer and more reusable code). - Removed the explicit addition of AlleleCountBySample fields to the VCF header by the walker initialize - Added utility methods in Utils to wrap and int[] array into a List<Integer>, and double[] array into a List<Double> efficiently. Test: - Added unit-testing for asAlleleList in GenotypeAlleleCountsUnitTest (within testFirst and testNext). - Added unit-testing for new methods in Utils : asList(int[]) and asList(double[]) - Changed UG General Ploidy test to add explicitly those annotations. - Non-trivial changes in integration tests involving non-diploid runs (namelly haploid and tetraploid) as they are not showing those annotations anylonger, so the MD5s have been changed accordingly.	2014-08-22 20:33:25 -04:00
Eric Banks	36bdfa3918	Merge pull request #712 from broadinstitute/eb_physical_phasing_bug_PT77248992 Fixing bug in the physical phasing code, found by Valentin.	2014-08-21 15:25:51 -04:00
Eric Banks	b1cb6196be	Fixing bug in the physical phasing code, found by Valentin. It turns out that there can be some really complex situations even with a single sample where there are lots of unphasable hets around a hom. Previously we were trying to phase each of the hets against the hom, but that wasn't correct. Instead we now detect that situation and don't attempt to phase anything. Added a unit test to cover this situation.	2014-08-21 15:24:09 -04:00
Laura Gauthier	9a5da41dd4	Add bells and whistles for Genotype Refinement Pipeline New annotation for low= and high-confidence de novos (only annotates biallelics) FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations Restrict population priors based on discovered allele count to be valid for 10 or more samples.	2014-08-21 11:20:40 -04:00
Valentin Ruano-Rubio	d31c5536aa	Fixed the bug first by indicating the actual possible number of alternatives alleles considering the extra <NON_REF> and second by resizing the StateTracker capacity when invoked by GeneralPloidyExactAFCalc deep within its implementation of computeLog10PNonRef which is ultimatelly what get rids of the exception. Story: https://www.pivotaltracker.com/story/show/74471252	2014-08-20 14:42:42 -04:00
Laura Gauthier	b512c7eac9	Refactor StrandBiasTest (using template method) and add warnings for when annotations may not be calculated successfully. VariantAnnotator/FS behavior changes slightly: VA used to output zeros for FS if there was no strand bias info, now skips FS output (but will still show FS in header)	2014-08-20 08:18:53 -04:00
Valentin Ruano-Rubio	8d9a55ae60	Moving new omniploidy likelihood calculation classes to their final package (as far as this pull-request is concerned) in org.broadinstitute.gatk.tools.walkers.genotyper	2014-08-19 11:54:29 -04:00
Valentin Ruano-Rubio	611b7f25ea	Adds unit-test and integration test for new omniploidy likelihood calculation components Added md5 to HaplotypeCallerIntegrationTest.testHaplotypeCallerSingleSampleWithDbsnp	2014-08-19 11:53:19 -04:00
Valentin Ruano-Rubio	9ee9da36bb	Generalize the calculation of the genotype likelihoods in HC to cope with haploid and multiploidy Changes in several walker to use new sample, allele closed lists and new GenotypingEngine constructors signatures Rebase adoption of new calculation system in walkers	2014-08-19 11:53:06 -04:00
Valentin Ruano-Rubio	f08dcbc160	Added the genotype likelihoods model interface and implementation for the random speciment sample from an infinite population with homogeneous ploidy accross samples.	2014-08-19 11:50:13 -04:00
Valentin Ruano-Rubio	4f993e8dbe	Added read-likelihoods array base structure to substitute existing Map-of-Map-of-Maps.	2014-08-19 11:50:12 -04:00
Valentin Ruano-Rubio	242cd0e58f	Added genotype allele counts and likelihood calculator utilities for arbitrary ploidy and number of alleles	2014-08-19 11:50:12 -04:00
Valentin Ruano-Rubio	b0a4cb9f0c	Added close sample and allele list data-structures and utility classes	2014-08-19 11:50:12 -04:00
Eric Banks	d3f06024f8	Updated the physical phasing in the Haplotype Caller to address requests from ATGU. 1. It is now turned on by default 2. It now phases homozygous variants 3. Most importantly, it also phases variants that are always on opposite haplotypes Changed the INFO keys to be PID and PGT, as described in the header.	2014-08-18 14:38:29 -04:00
Eric Banks	7e0c326e1c	Merge pull request #706 from broadinstitute/vrr_reduce_hc_integration_test_time Reduce intervals of integration tests in HaplotypeCallerIntegrationTest ...	2014-08-15 17:37:57 -04:00
Valentin Ruano-Rubio	2f79042dee	Reduce intervals of integration tests in HaplotypeCallerIntegrationTest class Story: https://www.pivotaltracker.com/story/show/74858854 Changes: Intervals have been shrunk so that the test run in 15s or less.	2014-08-15 14:20:10 -04:00
Eric Banks	eb84091702	Update the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.	2014-08-14 15:34:09 -04:00
Ryan Poplin	3a9a78c785	Removing an assumption that ADs were in the same order if the number of alleles matched. This happens for example when one sample is C->T and another sample is C->G.	2014-08-13 13:26:40 -04:00
Eric Banks	27193c5048	Merge pull request #700 from broadinstitute/eb_phase_HC_variants_PT74816060 Initial implementation of functionality to add physical phasing informat...	2014-08-13 12:30:32 -04:00
Eric Banks	4512940e87	Initial implementation of functionality to add physical phasing information to the output of the HaplotypeCaller. If any pair of variants occurs on all used haplotypes together, then we propagate that information into the gVCF. Can be enabled with the --tryPhysicalPhasing argument.	2014-08-13 12:25:31 -04:00
Valentin Ruano-Rubio	b39508cd15	ReadLikelihoods class introduction final changes before merging Stories: https://www.pivotaltracker.com/story/show/70222086 https://www.pivotaltracker.com/story/show/67961652 Changes: Done some changes that I missed in relation with making sure that all PairHMM implentations use the same interface; as a consequence we were running always the standard PairHMM. Fixed some additional bugs detected when running it on full wgs single sample and exom multi sample data set. Updated some integration test md5s.	2014-08-11 17:47:25 -04:00
Valentin Ruano-Rubio	9a9a68409e	ReadLikelihoods class introduction final changes before merging Stories: https://www.pivotaltracker.com/story/show/70222086 https://www.pivotaltracker.com/story/show/67961652 Changes: Done some changes that I missed in relation with making sure that all PairHMM implentations use the same interface; as a consequence we were running always the standard PairHMM. Fixed some additional bugs detected when running it on full wgs single sample and exom multi sample data set. Updated some integration test md5s. Fixing GraphBased bugs with new master code Fixed ReadLikelihoods.changeReads difficult to spot bug. Changed PairHMM interface to fix a bug Fixed missing changes for various PairHMM implementations to get them to use the new structure. Fixed various bugs only detectable when running with full sample(s). Believe to have fixed the lack of annotations in UG runs Fixed integrationt test MD5s Updating some md5s Fixed yet another md5 probably left out by mistake	2014-08-11 17:46:28 -04:00
Valentin Ruano-Rubio	0b472f6bff	Added new test to verify the functionality of ReadLikelihoods.java and its use in HC. Updated existing integration test md5s. Stories: https://www.pivotaltracker.com/story/show/70222086 https://www.pivotaltracker.com/story/show/67961652	2014-08-11 17:46:28 -04:00
Valentin Ruano-Rubio	2914ecb585	Change the Map-of-maps-of-maps for an array based implementation ReadLikelihoods to hold read likelihoods. The array structure should be faster to populate and query (no properly benchmarked) and reduce memory footprint considerably. Nevertheless removing PairHMM factor (using likelihoodEngine Random) it only achieves a speed up of 15% in some example WGS dataset i.e. there are other bigger bottle necks in the system. Bamboo tests also seem to run significantly faster with this change. Stories: https://www.pivotaltracker.com/story/show/70222086 https://www.pivotaltracker.com/story/show/67961652 Changes: - ReadLikelihoods added to substitute Map<String,PerSampleReadLikelihoods> - Operation that involve changes in full sets of ReadLikelihoods have been moved into that class. - Simplified a bit the code that handles the downsampling of reads based on contamination Caveats: - Still we keep Map<String,PerReadAlleleLikelihoodsMap> around to pass to annotators..., didn't feel like change the interface of so many public classes in this pull-request.	2014-08-11 17:46:28 -04:00
Ryan Poplin	c56e493f98	Merge pull request #622 from broadinstitute/ldg_SORanalysis Add StrandOddsRatio to default annotations produced by GenotypeGVCFs	2014-08-11 09:45:27 -04:00
Tim Fennell	5695f22da8	Changed the default GVCF Q Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution for homref PLs and ADs at lower confidences and somewhat higher resolution at higher confidences.	2014-08-08 14:31:35 -04:00
Laura Gauthier	35de598e4b	Modify StrandOddsRatio calculation to take on lower values in cases where reference +/- reads are skewed but alt reads are not. Add SOR to default annotations produced by GenotypeGVCFs. Add jitter to minimum SOR values	2014-08-07 12:09:19 -04:00
Laura Gauthier	f532f1f843	Fix nullPointerException	2014-08-07 10:13:17 -04:00
Laura Gauthier	74affcc077	Update inbreeding coefficient calculation to give a better estimate for multialleleic sites Add unit test for compound het and for multiallelic hets	2014-08-07 08:12:47 -04:00
Eric Banks	b9486f5b4d	Merge pull request #693 from broadinstitute/ldg_SORfromHC Allow SOR to be calculated from HC	2014-08-06 21:48:09 -04:00
Phillip Dexheimer	593663d9b6	Improved detection of missing argument values In particular, it was possible to specify arguments for Files or Compound types without values Added a special "none" value for annotations, since a bare "-A" is no longer allowed Delivers PT 71792842 and 59360374	2014-08-05 20:31:31 -04:00
Laura Gauthier	5533199402	Allow SOR to be calculated from HC Refactor StrandBiasTest classes	2014-08-01 20:47:58 -04:00
Ryan Poplin	63b3f7dfd3	Fixing typos in AnalyzeCovariates	2014-07-31 10:36:18 -04:00
Valentin Ruano-Rubio	750eb4b5a6	Add diploid only support message to HaplotypeCaller Story: https://www.pivotaltracker.com/story/show/73440292 Changes: - Just add the conditional in HaplotypeCaller#initialize Testing: - Nothing added, checked locally, trivial change that would eventually be removed anyway.	2014-07-29 17:05:36 -04:00
David Roazen	0798a4b768	Update pom versions to mark the start of GATK 3.3 development	2014-07-17 12:09:33 -04:00
David Roazen	323f22f852	Update pom versions for the 3.2 release	2014-07-17 12:06:22 -04:00
Eric Banks	98d88eb07e	Fixed IndexOutOfBounds error associated with tail merging. Don't expand out source nodes for tail merging, since that's a head merging action only. This shows up as a bug only because we now allow merging tails against non-reference paths.	2014-07-17 12:04:22 -04:00
Geraldine Van der Auwera	a6f632874b	Various documentation improvements - Edited intervals merging docs for correctness & clarity - Edited VQSR arg docs and made mode required (+added -mode SNP to VQSR tests) - Moved PaperGenotyper to Toy Walkers to declutter the actually useful docs - Moved GenotypeGVCFs to Variant Discovery category and clarified a few points - Clarified that the -resource argument depends on using the -V:tag format - Clarified how the pcr indel model works - Added caveat for -U ALLOW_N_CIGAR_READS - Added MathJax support for displaying equations in GATKDocs - Updated HC example commands and caveats	2014-07-14 12:03:03 -04:00
droazen	db53d096c9	Merge pull request #684 from broadinstitute/ks_add_cofoja_to_gatk_packages Added cofoja to the gatk packages for tests to pass.	2014-07-14 11:15:49 -04:00
Eric Banks	ecefcb383d	Disable the complex variant merging for now, as requested by ATGU	2014-07-11 17:27:40 -04:00
Khalid Shakir	c7e357eb59	Added cofoja to the gatk packages for tests to pass.	2014-07-11 23:19:42 +08:00
droazen	b8751ad598	Merge pull request #680 from broadinstitute/ldg_VQSRscript Update VQSR Rnd BQSR script generation code for compatibility with late...	2014-07-11 10:16:37 -04:00
Eric Banks	1d97b4a191	Improved tail merging: now tails can be merged to branches that are not entirely reference. This is useful for e.g. cases where there are SNPs on insertions. Before tails were forced to be merged (incorrectly) only to a reference node, but now they can be merged to any path in the graph from which they directly branch. Also, I've transferred over Ryan's code to refuse to process kmer sizes such that there are non-unique kmers in the reference sequence with them.	2014-07-10 08:57:01 -04:00
Ryan Poplin	5eee065133	Merge pull request #674 from broadinstitute/rp_improve_genotyping Improvements to genotyping accuracy.	2014-07-09 16:03:09 -04:00
Laura Gauthier	99026eb51b	Update VQSR Rnd BQSR script generation code for compatibility with latest ggplot version. Update queueJobReport.R and public/gsalib/src/R/R/gsa.variantqc.utils.R also	2014-07-09 15:36:58 -04:00
Ryan Poplin	74a7674d70	Improvements to genotyping accuracy. -- Global mismapping penalty was only applied to the reference haplotype. This led to problems with overlapping events, mostly STR haplotypes. Now the penalty is applied to every haplotype. -- We subset the reads down to only those which overlap the event (after assembly based realignment) for likelihood calculations.	2014-07-09 13:11:07 -04:00
David Roazen	719e685759	Remove junit imports in the test suite	2014-07-09 12:09:27 -04:00
Eric Banks	bad7865078	When converting a haplotype to a set of variants we now check for cases that are overly complex. In these cases, where the alignment contains multiple indels, we output a single complex variant instead of the multiple partial indels. We also re-enable dangling tail recovery by default.	2014-07-01 14:18:59 -04:00
Ryan Poplin	e14bff212d	SB tables should be created even if the ref or alt columns have no counts. This is so that FS/SOR will still be calculated when the variant is extremely high or low frequency. -- Removed long running HC integration test... sorry	2014-06-30 15:19:15 -04:00
Ryan Poplin	0127799cba	Reads are now realigned to the most likely haplotype before being used by the annotations. -- AD,DP will now correspond directly to the reads that were used to construct the PLs -- RankSumTests, etc. will use the bases from the realigned reads instead of the original alignments -- There is now no additional runtime cost to realign the reads when using bamout or GVCF mode -- bamout mode no longer sets the mapping quality to zero for uninformative reads, instead the read will not be given an HC tag	2014-06-30 10:35:50 -04:00
Phillip Dexheimer	06d619e9aa	Removed redundant SelectVariantsIntegrationTest, merged it's only test into protected version	2014-06-24 18:59:59 -04:00
Eric Banks	2df2a153e6	Merge pull request #658 from broadinstitute/ldg_PbyTwithPriors Updated CalculateGenotypePosteriors to compute genotype posteriors using...	2014-06-18 15:04:39 -04:00
Laura Gauthier	2356d5d63f	Updated CalculateGenotypePosteriors to compute genotype posteriors using likelihoods from all members of the trio. (Right now it only works if all members of the trio are called.) Takes posteriors as input, defaulting to PLs Added annotations for possible de novos for us in full genotype refinement pipeline Added family priors to CGP integration test. Changed CGP to use PP tag instead of GP tag because posteriors are Phred-scaled. Updated CGP integration test md5s to reflect change.	2014-06-18 11:17:15 -04:00
Phillip Dexheimer	2e78815055	Added missing arguments to GenotypeGVCFs - New arguments are nda, hets, indelHeterozygosity, stand_call_conf, stand_emit_conf, ploidy, and maxAltAlleles - Addresses PT 70110918 - To do this, moved those arguments out of the StandardCallerArgumentCollection into a new GenotypeCalculationArgumentCollection, which is now included as a member of SCAC	2014-06-16 08:10:54 -04:00
droazen	3079755b4c	Merge pull request #646 from broadinstitute/ks_disable_distribution_with_private Add maven -Pgsadev flag to build private jars only	2014-06-11 11:00:31 -04:00
Khalid Shakir	f082572593	If passed -Pgsadev, don't build the distribution package.	2014-06-10 23:33:33 -04:00
Valentin Ruano Rubio	db96891d4b	Merge pull request #638 from broadinstitute/vrr_createTempFile_testfix Changed File.createTempFile to BaseTest.createTempFile calls Test	2014-05-29 10:15:05 -04:00
Valentin Ruano-Rubio	07567fdae3	Removed debug code outputing files not removed after VM exists in ReadThreadingLikelihoodCalculationEngineUnitTest. Notice however that this should not be the cause of resent problems as the code was desactivated.	2014-05-28 19:03:25 -04:00
Valentin Ruano-Rubio	e0c221470c	Changed File.createTempFile to BaseTest.createTempFile	2014-05-28 18:59:48 -04:00
EvolvedMicrobe	ef7531d4a5	Merge pull request #640 from broadinstitute/IntegerSWImplementation Change SmithWaterman to use integers instead of doubles.	2014-05-28 15:10:05 -04:00
Nigel Delaney	cc45e62e8e	Change SmithWaterman to use integers instead of doubles.	2014-05-28 13:13:14 -04:00
droazen	ac52fa581a	Merge pull request #644 from broadinstitute/ks_queue_test_temp_fix Disabled ExampleUG Queue Tests, fixed internal extensions dependency.	2014-05-28 11:29:08 -04:00
Phillip Dexheimer	c15e6fcc0e	Refactored the static lookup arrays in MathUtils (log10Cache, log10FactorialCache, jacobianLogTable) -They are now only computed when necessary -Log10Cache is dynamically resizable, either by calling get() on an out-of-range value or by calling ensureCacheContains -Log10FactorialCache and JacobianLogTable are initialized to a fixed size on first access and are not resizable -Addresses PT 69124396	2014-05-27 22:27:57 -04:00
Eric Banks	b77589696e	Merge pull request #643 from broadinstitute/rp_remove_hwp Removing HWP from GenotypeSummaries because of integer overflow issues w...	2014-05-27 17:21:19 -04:00
Khalid Shakir	6c9e68ef41	Disabled ExampleUG Queue Tests, fixed internal extensions dependency. EUG tests disabled due to new protected qscript directory path, post GATK artifact splitting.	2014-05-27 16:16:53 -04:00
David Roazen	74b51c5c7a	Improve test suite tmp file cleanup -Make BaseTest.createTempFile() mark any possible corresponding index files for deletion on exit -Make WalkerTest mark shadow BCF files and auxiliary for deletion on exit -Make VariantRecalibrationWalkersIntegrationTest mark PDF files for deletion on exit	2014-05-27 13:41:44 -04:00
Ryan Poplin	b24cff780b	Removing HWP from GenotypeSummaries because of integer overflow issues with 91K samples. Removing CCC because it is redundant.	2014-05-27 10:14:49 -04:00
Ryan Poplin	ec7c4ea2ba	Unfortunately dangling tail recovery is dangerous in exome data. Turning it off by default for now. -- disabling HC+VA integration test because, as noted in the comments, it keeps switching PairHMM implementations and giving different results at a particular site used in that particular test	2014-05-23 14:33:44 -04:00
Valentin Ruano-Rubio	979ab0453e	Moved GlobalEdgeGreedySWPairwiseAlignment to the archive	2014-05-23 01:48:48 -04:00
Valentin Ruano-Rubio	7c8a1ae892	Fix for SW to make double comparisons with a tolerance Stories: - https://www.pivotaltracker.com/story/show/69577868 Changes: - Added a epsilon difference tolerance in weight comparisons. Tests: - Added HaplotypeCallerIntegrationTest#testDifferentIndelLocationsDueToSWExactDoubleComparisonsFix - Updated md5 due to minor likelihood changes. - Disabled a test for PathUtils.calculateCigar since does not work and is unclear what is causing the error (needs original author input)	2014-05-23 01:48:48 -04:00
Khalid Shakir	b7e98bdae9	Fixed GATK docs artifact, moved protected ExampleUG tests.	2014-05-22 21:03:55 -04:00
Ryan Poplin	581843d994	Minor updates to HC docs.	2014-05-20 10:01:11 -04:00
Khalid Shakir	88d7e23c44	After talking with Mauricio and Karthik, updated MD5s and added a note about PairHMM causing test variability.	2014-05-19 17:36:41 -04:00
Karthik Gururaj	972a82d386	Changed 'sting' to 'gatk' in the VectorLoglessPairHMM classes and the C++ code	2014-05-19 17:36:41 -04:00
Khalid Shakir	3939971d78	After renaming the packages, instead of updating the JNI library used for testing bwa, moving the classes to the archive. NOTE: The migrated READEME.md has been added that will allow others to possibly ressurect this code as needed.	2014-05-19 17:36:41 -04:00
Khalid Shakir	2c854e554a	Refactored maven directories and java packages replacing "sting" with "gatk". To reduce merge conflicts, this commit modifies contents of files, while file renamings are in previous commit. See previous commit message for list of changes.	2014-05-19 17:36:39 -04:00
Khalid Shakir	4e6d43d003	Refactored maven directories and java packages replacing "sting" with "gatk". To reduce merge conflicts, this commit only renames files, while file modifications are in next commit. Some updates/fixes here are actually included in the next commit. = Maven updates Moved artifacts to new package names: * private/queue-private -> private/gatk-queue-private * private/gatk-private -> private/gatk-tools-private * public/gatk-package -> protected/gatk-package-distribution * public/queue-package -> protected/gatk-queue-package-distribution * protected/gatk-protected -> protected/gatk-tools-protected * public/queue-framework -> public/gatk-queue * public/gatk-framework -> public/gatk-tools-public New poms for new artifacts and packages: * private/gatk-package-internal * private/gatk-queue-package-internal * private/gatk-queue-extensions-internal * protected/gatk-queue-extensions-distribution * public/gatk-engine Updated references to StingText.properties to GATKText.properties. Updated ant-bridge.sh to use gatk.* properties instead of sting.. = Engine updates Renaming files containing engine parts from o.b.gatk.tools to o.b.gatk.engine. Changed package references from tools to engine for CommandLineGATK, GenomeAnalysisEngine, ReadMetrics, ReadProperties, and WalkerManager. Changed package reference tools.phonehome to engine.phonehome. Renamed classes Sting* to GATK, such as ReviewedGATKException. = Test updates Moved gatk example resources. Moved test engine files from tools to engine packages. Moved resources for phonehome to proper package. Moved test classes under o.b.gatk into packages: * o.b.g.utils.{BaseTest,ExampleToCopyUnitTest,GATKTextReporter,MD5DB,MD5Mismatch,TestNGTestTransformer} * o.b.g.engine.walkers.WalkerTest Updated package names in DependencyAnalyzerOutputLoaderUnitTest's data. = Queue updates Moving queue scripts to location where generated extensions can be used. Renamed .q to .scala, updating licenses previously missed by git hooks. Moved queue extensions to new artifact gatk-queue-extensions. Fixed import statments frequently merge-conflicting on FullProcessingPipeline.scala. = BWA Added README on how to obtain and include bwa as a library. Updated libbwa build. Fixed packaged names under bwa/java implementation. Updated contents of BWCAligner native implementation. = Other fixes Don't duplicate the resource bundle entries by both unpacking and appending. (partial fix) Staged engine and utils poms to build GATKText.properties, once Utils random generator dependency on GATK engine is fixed. Re-enabled custom testng listeners/reporters and moved testng dependencies to the gatk-root. Updated comments referencing Sting with GATK. Moved a couple untangled classes from gatk-tools-public to gatk-utils and gatk-engine.	2014-05-19 16:43:47 -04:00
Khalid Shakir	67e44985b1	Java/Scala imports updated for new package names. Fourth of four commits for picard/htsjdk package rename.	2014-05-08 19:13:31 +08:00
Laura Gauthier	bf7b97393e	Add ability to output to a file discordant loci and their respective genotypes for each sample	2014-05-07 10:12:45 -04:00
MauricioCarneiro	f03a12263a	Merge pull request #625 from broadinstitute/intel_updateCell_inlined (Optional) Inlined the code from updateCell	2014-05-07 10:11:09 -04:00
Karthik Gururaj	d9c489f928	Removed scary warning messages for VectorPairHMM	2014-05-06 10:59:24 -07:00
Karthik Gururaj	fb8578ec8e	Inlined the code from updateCell - helps Java JIT to detect hotspots and produce good native code	2014-05-06 10:37:10 -07:00
Karthik Gururaj	f6ea25b4d1	Parallel version of the JNI for the PairHMM The JNI treats shared memory as critical memory and doesn't allow any parallel reads or writes to it until the native code finishes. This is not a problem per se it is the right thing to do, but we need to enable -nct when running the haplotype caller and with it have multiple native PairHMM running for each map call. Move to a copy based memory sharing where the JNI simply copies the memory over to C++ and then has no blocked critical memory when running, allowing -nct to work. This version is slightly (almost unnoticeably) slower with -nct 1, but scales better with -nct 2-4 (we haven't tested anything beyond that because we know the GATK falls apart with higher levels of parallelism * Make VECTOR_LOGLESS_CACHING the default implementation for PairHMM. * Changed version number in pom.xml under public/VectorPairHMM * VectorPairHMM can now be compiled using gcc 4.8.x * Modified define-* to get rid of gcc warnings for extra tokens after #undefs * Added a Linux kernel version check for AVX - gcc's __builtin_cpu_supports function does not check whether the kernel supports AVX or not. * Updated PairHMM profiling code to update and print numbers only in single-thread mode * Edited README.md, pom.xml and Makefile for users to pass path to gcc 4.8.x if necessary * Moved all cpuid inline assembly to single function Changed info message to clog from cinfo * Modified version in pom.xml in VectorPairHMM from 3.1 to 3.2 * Deleted some unnecessary code * Modified C++ sandbox to print per interval timing	2014-05-02 19:12:48 -04:00
Valentin Ruano-Rubio	d563072282	Fix for CombineGVCFs and GenotypeGVCFs recurrent exception about missing PLs Story: https://www.pivotaltracker.com/story/show/68220438 Changes: - PL-less input genotypes are now uncalled and so non-variant sites when combining GVCFs. - HC GVCF/BP_RESOLUTION Mode now outputs non-variant sites in sites covered by deletions. - Fixed existing tests Test: - HaplotypeCallerGVCFIntegrationTest - ReferenceConfidenceModelUnitTest - CombineGVCFsIntegrationTest	2014-05-02 09:21:06 -04:00
Ryan Poplin	41d3069213	When we subset PLs because Alleles are removed during genotyping we also need to subset AD.	2014-04-28 15:52:26 -04:00
Ryan Poplin	06dbe74a23	Merge pull request #609 from kcibul/kc_cancersimreads extended SimulateReadsForVariants to optionally use the AF field to indi...	2014-04-28 13:31:56 -04:00
Ami Levy-Moonshine	13dd755468	create a new read transformer that refactor NDN cigar elements to one N element. story: https://www.pivotaltracker.com/story/show/69648104 description: This read transformer will refactor cigar strings that contain N-D-N elements to one N element (with total length of the three refactored elements). This is intended primarily for users of RNA-Seq data handling programs such as TopHat2. Currently we consider that the internal N-D-N motif is illegal and we error out when we encounter it. By refactoring the cigar string of those specific reads, users of TopHat and other tools can circumvent this problem without affecting the rest of their dataset. edit: address review comments - change the tool's name and change the tool to be a readTransformer instead of read filter	2014-04-28 11:29:00 -04:00
Ryan Poplin	221b999cb0	GenotypeGVCF was pulling the headers from all input rods including DBsnp. Now it pulls from just the input variant rods.	2014-04-25 13:16:28 -04:00
Laura Gauthier	9f3cbb2ef1	Improvements to CalculateGenotypePosteriors and CalibrateGenotypeLikelihoods CalculateGenotypePosteriors now only computes posterior probs for SNP sites with SNP priors (other sites have flat priors applied) CalibrateGenotypeLikelihoods had originally applied HOM_REF/HET/HOM_VAR frequencies in callset as priors before empirical quality analysis. Now has option (-noPriors) to not apply/apply flat priors. Also takes in new external probabilities files, such as those generated by CGP, from which the genotype posterior probability qualities will be read. Integration test was changed to account for new SNP-only behavior and default behavior to not use missing priors. (Also, new numRefIfMissing is 0, which should only matter in cases using few samples when you probably don't want to be doing that anyway!)	2014-04-24 08:49:42 -04:00
Valentin Ruano-Rubio	e610373169	Fixed integration test problems from previous premature merge	2014-04-20 17:11:51 -04:00
Valentin Ruano-Rubio	4e5850966a	Reengineer engine constructors	2014-04-19 17:58:14 -04:00
Valentin Ruano-Rubio	7455ac9796	Addressed revisions	2014-04-19 16:48:48 -04:00
Kristian Cibulskis	6b9e38c8bb	incorporated comments from review, made variables final, made AF paramater hidden, and added bounds checking to AF value	2014-04-16 19:29:25 -04:00
Kristian Cibulskis	7115cadbd8	extended SimulateReadsForVariants to optionally use the AF field to indicate allele fraction of the simulated event, useful in cancer and other variable ploidy use cases	2014-04-16 16:20:02 -04:00
Valentin Ruano-Rubio	08203b516e	Disentangle UG and HC Genotyper engines. Description: Transforms a delegation dependency from HC to UG genotyping engine into a reusage by inhertance where HC and UG engines inherit from a common superclass GenotyperEngine that implements the common parts. A side-effect some of the code is now more clear and redundant code has been removed. Changes have a few consequence for the end user. HC has now a few more user arguments, those that control the functionality that HC was borrowing directly from UGE. Added -ploidy argument although it is contraint to be 2 for now. Added -out_mode EMIT_ALL_SITES\|EMIT_VARIANTS_ONLY ... Added -allSitePLs flag. Stories: https://www.pivotaltracker.com/story/show/68017394 Changes: - Moved (HC's) GenotyperEngine to HaplotypeCallerGenotyperEngine (HCGE). Then created a engine superclass class GenotypingEngine (GE) that contains common parts between HCGE and the UG counterpart 'UnifiedGenotypingEngine' (UGE). Simplified the code and applied the template pattern to accomodate for small diferences in behaviour between both caller engines. (There is still room for improvement though). - Moved inner classes and enums to top-level components for various reasons including making them shorter and simpler names to refer to them. - Create a HomoSpiens class for Human specific constants; even if they are good default for most users we need to clearly identify the human assumption across the code if we want to make GATK work with any species in general; i.e. any reference to HomoSapiens, except as a default value for a user argument, should smell. - Fixed a bug deep in the genotyping calculation we were taking on fixed values for snp and indel heterozygisity to be the default for Human ignoring user arguments. - GenotypingLikehooldCalculationCModel.Model to Gen.Like.Calc.*Model.Name; not a definitive solution though as names are used often in conditionals that perhaps should be member methods of the GenLikeCalc classes. - Renamed LikelihoodCalculationEngine to ReadLikelihoodCalculationEngine to distinguish them clearly from Genotype likelihood calculation engines. - Changed copy by explicity argument listing to a clone/reflexion solution for casting between genotypers argument collection classes. - Created GenotypeGivenAllelesUtils to collect methods needed nearly exclusively by the GGA mode. Tests : - StandardCallerArgumentCollectionUnitTest (check copy by cloning/reflexion). - All existing integration and unit tests for modified classes.	2014-04-13 03:09:55 -04:00
Khalid Shakir	a6b0754990	After comments from @nh13, updated latest picard and setMateInfo call.	2014-04-08 15:22:45 -04:00
Khalid Shakir	3047d6ff32	BQSRGatherer handles missing read groups from some input files. [#68720468 ]	2014-04-08 23:58:54 +08:00
Eric Banks	ad336375dc	Merge pull request #590 from broadinstitute/vrr_validate_variants_unused_alleles_fix Addresses issue with strict validation on GVCF files.	2014-04-07 22:10:49 -04:00
Valentin Ruano-Rubio	5afcc8e05f	Change in the command line interface of ValidateVariants. Following reviewers comments the command line interface has been simplified. All extra strict validations are performed by default (as before) and the user has to indicate which one he/she does not want to use with --validationTypeToExclude. Before he/she was able to indicate the only ones to apply with --validationType but that has been scrapped out. Stories: - https://www.pivotaltracker.com/story/show/68725164 Changes: - Removed validateType argument. - Improved documentation. - Added some warnning log message on suspicious argument combinations. Tests: - ValidateVariantsIntegrationTest#*	2014-04-07 16:27:11 -04:00
Ryan Poplin	f058224b3e	Adding GenotypeSummaries as INFO field annotations. -- This is needed so the ref model pipeline can cut down to sites-only files without losing these useful statistics. -- Added new unit test to test this info field annotation. -- GenotypeGVCF integration tests change because new annotations are present in the output	2014-04-06 11:50:10 -04:00
MauricioCarneiro	84861fa10a	Merge pull request #587 from broadinstitute/eb_actually_fail_on_reduced_bams Make sure to fail in all cases where the BAM being used was created by ReduceReads.	2014-04-04 17:27:57 -04:00
Valentin Ruano-Rubio	18deeec6b0	Addresses issue with strict validation on GVCF files. More concretelly Picard's strict VCF validation does not like that there is alternative alleles that are not participating in any genotype call across samples. This is an issue with GVCF in the single-sample pipeline where this is certainly expected with <NON_REF> and other relative unlikely alleles. To solve this issue we allow the user to exclude some of the strict validations using a new argument --validationTypeToExclude. In order to avoid the validation issue with GVCF the user needs to add the following to the command line: '--validationTypeToExclude ALLELES' Story: https://www.pivotaltracker.com/story/show/68725164 Changes: - Added validateTypeToExclude argument to ValidateVariants walker. - Implemented the selective exclusion of validation types. - Added new info and improved existing documentation of the ValidateVariants walker. Tests: - ValidateVariantsIntegrationTest#testUnusedAlleleError - ValidateVariantsIntegrationTest#testUnusedAlleleFix	2014-04-04 14:37:10 -04:00
Eric Banks	7174f8cfeb	IndelRealigner throws a user error when it encounters reads with I operators greater than the number of read bases. Added test to ensure it works.	2014-04-03 18:16:24 -04:00
Eric Banks	a3d55b3341	Make sure to fail in all cases where the BAM being used was created by ReduceReads. In some cases, the program records were being removed from the BAM headers by the GATK engine before we applied the check for reduced reads (so we did not fail appropriately). Pushed up the check to happen before the PG tags are modified and added a unit test to ensure it stays that way. It turns out that some UG tests still used reduced bams so I switched to use different ones. Based on reviewer feedback, made it more generic so that it's easy to add new unsupported tools.	2014-04-03 16:52:41 -04:00
Eric Banks	0b73573abc	Slightly modifying the way to use the IUPAC ambiguity codes in the FastaAlternateReferenceMaker. Previously it required you to create a single sample VCF and then to pass that in to the tool, but Geraldine convinced me that this was a pain for users (because they usually have multi-sample VCFs). Instead now you can pass in a multi-sample VCF and specify which sample's genotypes should be used for the IUPAC encoding. Therefore the argument changed from '--useIUPAC' to '--use_IUPAC_sample NA12878'.	2014-04-02 21:34:25 -04:00
Valentin Ruano-Rubio	84711b8e90	Fixed bug using GraphBased due to infinite likelihoods resulting from the calculation of alignment cost of very long insertion or deletions (done in linear scale) Stories: https://www.pivotaltracker.com/story/show/66263868 Bug: The problem was due to the way we were calculating the fix penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the portion or read or haplotype deletion as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0 in in linear scale computations that ins transformed into an infinity in log scale. Changes: - Change to use log10 scale for calculate those penalties. - Minor addition of .gitignore to hide ./public/external-example/target which is generated by the building process.	2014-04-01 16:14:52 -04:00
Eric Banks	821fbe7260	Merge pull request #582 from broadinstitute/vrr_hc_bugfixes_dangling_heads Fix loss of key alternative haplotypes due to a change on threading star...	2014-03-31 10:42:08 -04:00
Joel Thibault	2049eb1658	Rev Picard 1.110.1763 - SamPairUtils migrated in Picard r1737 - Revert IndelRealigner changes made in commit 4f4b85 -- Those changes were based on Picard revision 1722 to net/sf/picard/sam/SamPairUtil.java -- Picard revision 1723 reverts these changes, so we also revert to match	2014-03-30 09:33:57 -04:00
Valentin Ruano-Rubio	258b2bce28	Fix loss of key alternative haplotypes due to a change on threading start policy required when recovering dangling heads. Story: - https://www.pivotaltracker.com/story/show/67601310 Change: - Unless recover-danging-heads is active, the threading starting location policy is the original one. i.e. just at already existing unique kmer vertices. Tests: - HaplotypeCallerIntegrationTest#testMissingKeyAlternativeHaplotypesBugFix	2014-03-29 22:40:26 -04:00
Ryan Poplin	6566dd6ca9	Fix for dropping of reference sample depth in the DP annotation. -- In the case of hierarchical merge we can't assume that we have only one genotype. -- Removed use of deprecated VC annotation access functions.	2014-03-24 14:01:50 -04:00
Eric Banks	32a96e3ab3	Fix for reads that are all insertions (e.g. 50I) and causing the IndelRealigner to error out.	2014-03-21 15:01:34 -04:00
Eric Banks	7c8ce3cd6a	Several improvements to GenotypeGVCFs: --includeNonVariantSites now actually works and we propagate AD to hom ref samples	2014-03-20 00:35:54 -04:00
Eric Banks	824983af1d	Enable CombineGVCFs to process gVCFs that were created with basepair resolution.	2014-03-19 19:23:05 -04:00
Eric Banks	3b1c337401	Have CombineVariants throw a UserError when trying to combine GVCFs from the HaplotypeCaller. Was previously throwing an IllegalArgumentException (in the wrong place in the code). Error message tells users to use CombineGVCFs.	2014-03-19 19:11:40 -04:00
Valentin Ruano-Rubio	905b6066b2	Reduce runtime of very long integration test	2014-03-18 21:48:13 -04:00
David Roazen	2d8653f493	Update pom versions to mark the start of GATK 3.2 development	2014-03-18 01:18:59 -04:00
David Roazen	a6a41c777c	Update pom versions for 3.1	2014-03-18 01:09:29 -04:00
Alec Wysoker	0369f93b24	GATK changes to conform to Tribble refactoring as part improving Tabix support in Tribble (among other things). 1. Enable on-the-fly indexing for vcf.gz. 2. Handle on-the-fly indexing where file to be indexed is not a regular file, thus index should not be created. 3. Add method setProgressLogger to all SAMFileWriter implementations. 4. Revved picard to 1.109.1722 5. IndelRealigner md5s change because the MC tag is added to records now. Fixed up and signed off by ebanks.	2014-03-17 11:56:22 -04:00
Eric Banks	34c697bf12	Merge pull request #554 from broadinstitute/bh_SOR_new_annotation Bh sor new annotation	2014-03-17 10:58:13 -04:00
Laura Gauthier	40c13d446a	Added documentation category for CalculateGenotypePosteriors	2014-03-17 10:36:19 -04:00
Valentin Ruano-Rubio	2e964c59b4	Improved criteria to select best haplotypes out from the assembly graph. Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge multiplicity sum across their path in the assembly graph. The edge mulitplicity is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely map to some vertex up-stream from the edge and the following base calls extend across that edge to vertices downstream from it. Despite that it is obvious that higher multiplicties correlated with haplotype probability this criterion fails short in some regards of which the most relevant is: As it is evaluated in condensed seq-graph (as supposed to uncompressed read-threading-graphs) it is bias to haplotypes that have more short-sequence vetices ( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly result of how we modify the edge multiplicities when we merge vertices from a linear chain. This pull-request addresses the problem by changing to a new scoring schema based in likelihood estimates: Each haplotype's likelihood can be calculated as the multiplication of the likelihood of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly graph is calculated as its multiplicity divide by the sum of multiplicity of edges that share the same source vertex. This pull-request addresses the following stories: https://www.pivotaltracker.com/story/show/66691418 https://www.pivotaltracker.com/story/show/64319760 Change Summary: 1. Change to the new scoring schema. 2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring. 3. Graph transformation have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0 edges assuming that they are in fact 0.5)	2014-03-14 18:37:01 -04:00
Bertrand Haas	82108d110f	New abstract class StrandBiasTest() with old sub-class FisherStrand() and new sub-class StrandOddsRatio(). Latter is test based on symmetric odds ratio more appropriate than Fisher exact test when number of samples is large. https://www.pivotaltracker.com/story/show/66087886	2014-03-14 18:33:21 -04:00
Eric Banks	7c7ff90266	Merge pull request #558 from broadinstitute/rp_vqsr_nondeterminism_fix Fix for non-determinism in the VQSR with very large data sets	2014-03-12 14:35:51 -04:00
Eric Banks	ffaf92f871	Added new functionality to the FastaAlternateReferenceMaker to have it output IUPAC codes for het sites. Enable it with the new --useIUPAC argument. Added both unit and integration tests for the new functionality - and fixed up the exising tests once I was in there.	2014-03-12 14:31:57 -04:00
Ryan Poplin	907d1d6160	Fix for non-determinism in the VQSR with very large data sets	2014-03-12 10:25:12 -04:00
ldgauthier	4e74e77e74	Merge pull request #555 from broadinstitute/eb_add_option_to_CGVCFs_for_all_sites_GVCF Added an option to CombineGVCFs to create basepair resolution gVCFs from...	2014-03-12 10:01:18 -04:00
David Roazen	c67ced5f3b	Emit a warning whenever the VectorLoglessPairHMM is used	2014-03-12 09:55:35 -04:00
Eric Banks	d697e0144f	Added an option to CombineGVCFs to create basepair resolution gVCFs from banded ones. Use the --convertToBasePairResolution argument to enable this functionality.	2014-03-12 01:32:51 -04:00
Ryan Poplin	34d11fe40c	Added the consensus mode used for the 1000 Genomes Project to the HaplotypeCaller. -- All the provided alleles are added to the assembly graph as potential haplotypes but they aren't forcibly genotyped like in GGA mode. -- Added integration test for this mode	2014-03-11 09:56:35 -04:00
droazen	8b53567dc7	Merge pull request #553 from broadinstitute/dr_rename_pipeline_tests Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests	2014-03-10 21:36:45 -04:00
David Roazen	78562c14bb	Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests -These tests are really integration tests for Queue rather than generalized pipeline tests, so it makes sense to call them QueueTests. -Rename test classes and maven build targets, and update shell scripts to reflect new naming.	2014-03-10 21:24:03 -04:00
David Roazen	7c34f05082	Merge remote-tracking branch 'origin/master' into intel	2014-03-10 14:07:36 -04:00
David Roazen	5a6aa54673	Revert "Update HaplotypeCaller and VariantAnnotator test MD5s" This reverts commit 7faa44d576b06d7aef29562e82590a7855f216f4.	2014-03-10 14:06:51 -04:00
David Roazen	e7d6db033b	Revert "Revert "Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING"" This reverts commit c8a34749e631b92214a57bba162c6e0d849425f1.	2014-03-10 14:05:51 -04:00
David Roazen	f070583f29	Update HaplotypeCaller and VariantAnnotator test MD5s There are a few innocuous test failures on this branch -- updating MD5s after reviewing the differences in output	2014-03-07 10:54:27 -05:00
Karthik Gururaj	6e98e9e589	Removed g_haplotype* global variables in native code so that it works with multi-threading in Java. Modified VectorLoglessPairHMM.java so that jniInitializeRegion and jniFinalizeRegion are empty	2014-03-06 22:08:35 -08:00
David Roazen	3f3df90412	Revert "Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING" This reverts commit cef03f089fb3f131f3a77664b71feaec51a74cc8.	2014-03-06 10:15:35 -05:00
David Roazen	9df59bd8cc	Update pom versions to mark the start of GATK 3.1 development	2014-03-06 00:05:58 -05:00
David Roazen	34edcb8ddf	Update pom versions for the 3.0 release	2014-03-05 23:37:21 -05:00
David Roazen	53895e15cd	Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING	2014-03-05 19:26:37 -05:00
Eric Banks	d3de6413c9	Move warnings to debug logging status because they will definitely scare users	2014-03-05 15:05:21 -05:00
Karthik Gururaj	51b8ea5d59	Reset version	2014-03-05 11:19:08 -08:00
Karthik Gururaj	b9afe800ae	Merge correction	2014-03-05 10:06:45 -08:00
Karthik Gururaj	8fcbf9272c	Merge branch 'intel_pairhmm' of /data/broad/gsa-unstable into intel_pairhmm Conflicts: protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java public/VectorPairHMM/src/main/c++/Sandbox.java	2014-03-05 09:35:50 -08:00
Intel Repocontact	d81116eb1d	Added vectorized PairHMM implementation by Mohammad and Mustafa into the Maven build of GATK. C++ code has PAPI calls for reading hardware counters Followed Khalid's suggestion for packing libVectorLoglessCaching into the jar file with Maven Native library part of git repo 1. Renamed directory structure from public/c++/VectorPairHMM to public/VectorPairHMM/src/main/c++ as per Khalid's suggestion 2. Use java.home in public/VectorPairHMM/pom.xml to pass environment variable JRE_HOME to the make process. This is needed because the Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among others). Assuming that the Maven build process uses a JDK (and not just a JRE), the variable java.home points to the JRE inside maven. 3. Dropped all pretense at cross-platform compatibility. Removed Mac profile from pom.xml for VectorPairHMM Moved JNI_README 1. Added the catch UnsatisfiedLinkError exception in PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING in case the native library could not be loaded. Made VECTOR_LOGLESS_CACHING as the default implementation. 2. Updated the README with Mauricio's comments 3. baseline.cc is used within the library - if the machine supports neither AVX nor SSE4.1, the native library falls back to un-vectorized C++ in baseline.cc. 4. pairhmm-1-base.cc: This is not part of the library, but is being heavily used for debugging/profiling. Can I request that we keep it there for now? In the next release, we can delete it from the repository. 5. I agree with Mauricio about the ifdefs. I am sure you already know, but just to reassure you the debug code is not compiled into the library (because of the ifdefs) and will not affect performance. 1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java 2. Committing the right set of files after rebase Added public license text to all C++ files Added license to Makefile Add package info to Sandbox.java Conflicts: protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/DebugJNILoglessPairHMM.java protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/JNILoglessPairHMM.java protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/VectorLoglessPairHMM.java public/VectorPairHMM/src/main/c++/.gitignore public/VectorPairHMM/src/main/c++/LoadTimeInitializer.cc public/VectorPairHMM/src/main/c++/LoadTimeInitializer.h public/VectorPairHMM/src/main/c++/Makefile public/VectorPairHMM/src/main/c++/Sandbox.cc public/VectorPairHMM/src/main/c++/Sandbox.h public/VectorPairHMM/src/main/c++/Sandbox.java public/VectorPairHMM/src/main/c++/Sandbox_JNIHaplotypeDataHolderClass.h public/VectorPairHMM/src/main/c++/Sandbox_JNIReadDataHolderClass.h public/VectorPairHMM/src/main/c++/baseline.cc public/VectorPairHMM/src/main/c++/define-double.h public/VectorPairHMM/src/main/c++/define-float.h public/VectorPairHMM/src/main/c++/define-sse-double.h public/VectorPairHMM/src/main/c++/define-sse-float.h public/VectorPairHMM/src/main/c++/headers.h public/VectorPairHMM/src/main/c++/jnidebug.h public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.cc public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.h public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.cc public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.h public/VectorPairHMM/src/main/c++/pairhmm-template-kernel.cc public/VectorPairHMM/src/main/c++/pairhmm-template-main.cc public/VectorPairHMM/src/main/c++/run.sh public/VectorPairHMM/src/main/c++/shift_template.c public/VectorPairHMM/src/main/c++/utils.cc public/VectorPairHMM/src/main/c++/utils.h public/VectorPairHMM/src/main/c++/vector_function_prototypes.h	2014-03-05 09:30:29 -08:00
Laura Gauthier	43fdd38342	Add error handling to CalculateGenotypePosteriors to catch multiallelic variants with wrong number of ACs -- throws UserException; added tests in PosteriorLikelihoodsUtilsUnitTests Add error handling to CalculateGenotypePosteriors for cases where MLEAC>AN; add tests in PosteriorLikelihoodsUtilsUnitTests Add unit tests to confirm that CalculateGenotypePosteriors has the ability to switch genotypes for four cases	2014-03-05 12:03:18 -05:00
Laura Gauthier	7f9f58dbd1	Added hidden flag to GenotypeConcordance to output sites of discordant genotypes (to System.out) Revised ConcondanceMetrics tests to adapt to change Added comments to PosteriorLikelihoodsUtils	2014-03-05 12:03:18 -05:00
Karthik Gururaj	2648b41398	Added vectorized PairHMM implementation by Mohammad and Mustafa into the Maven build of GATK. C++ code has PAPI calls for reading hardware counters Followed Khalid's suggestion for packing libVectorLoglessCaching into the jar file with Maven Native library part of git repo 1. Renamed directory structure from public/c++/VectorPairHMM to public/VectorPairHMM/src/main/c++ as per Khalid's suggestion 2. Use java.home in public/VectorPairHMM/pom.xml to pass environment variable JRE_HOME to the make process. This is needed because the Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among others). Assuming that the Maven build process uses a JDK (and not just a JRE), the variable java.home points to the JRE inside maven. 3. Dropped all pretense at cross-platform compatibility. Removed Mac profile from pom.xml for VectorPairHMM Moved JNI_README 1. Added the catch UnsatisfiedLinkError exception in PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING in case the native library could not be loaded. Made VECTOR_LOGLESS_CACHING as the default implementation. 2. Updated the README with Mauricio's comments 3. baseline.cc is used within the library - if the machine supports neither AVX nor SSE4.1, the native library falls back to un-vectorized C++ in baseline.cc. 4. pairhmm-1-base.cc: This is not part of the library, but is being heavily used for debugging/profiling. Can I request that we keep it there for now? In the next release, we can delete it from the repository. 5. I agree with Mauricio about the ifdefs. I am sure you already know, but just to reassure you the debug code is not compiled into the library (because of the ifdefs) and will not affect performance. 1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java 2. Committing the right set of files after rebase Added public license text to all C++ files Added license to Makefile Add package info to Sandbox.java	2014-03-05 08:31:24 -08:00
Intel Repocontact	6aa67a2585	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2014-03-05 08:14:32 -08:00
Intel Repocontact	1de2d2546e	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2014-03-05 00:04:21 -08:00
Valentin Ruano-Rubio	69bf2b3247	Added a more efficient implementation of the KBest haplotype finder code (CONT.) Changes: 1. Addressed review comments on new K-best haplotype assembly graph finder. 2. Generalize KBestHaplotypeFinder to deal with multiple source and sink vertices. 3. Updated test to use KBestHaplotypeFinder instead of KBestPaths 4. Retired KBestPaths to the archive. 5. Small improvements to the code and documentation.	2014-03-04 23:22:27 -05:00
Valentin Ruano-Rubio	7acf2eb0e7	Added a more efficient implementation of the KBest haplotype finder code. Story: https://www.pivotaltracker.com/story/show/66238286 Changes: 1. Created a new k-best haplotype search implementation in class KBestHaplotypeFinder. 2. Changed HC code to use the new implementation. This seems to fix the original problem without causing significant changes in outputs using some empirical data test cases 3. Moved haplotype's cigar calculation code from Path to CigarUtils; need that in order to gain independence from Path in some parts of the code. In any case that seems like a more natural location for that functionality.	2014-03-04 12:22:14 -05:00
Eric Banks	b99bf85ec8	Fixed bug where dangling tail merging occasionally created a cycle in the graph. Added unit tests to cover this case. Delivers PT#66690470.	2014-03-03 22:42:56 -05:00
Eric Banks	4d69af189e	Minor change: make the --dontUseSoftClippedBases @Advanced instead of @Hidden	2014-03-03 15:59:32 -05:00
Eric Banks	fa65716fe9	Added code to retrieve dangling heads from the read threading graph (previously we were rescuing just the tails). The purpose of this is to be able to call SNPs that fall at the beginning of a capture region (or exon). Before, the read threading code would only start threading from the first kmer that matched the reference. But that means that, in the case of a SNP at the beginning of an exome, it wouldn't start threading the read until after the SNP position - so we'd lose the SNP. For now, this is still very experimental. It works well for RNAseq data, but does introduce FPs in normal exomes. I know why this is and how to fix it, but it requires a much larger fix to the HC: the HC needs to pass all reads and bases to the annotation engine (like UG does) instead of just the high quality ones. So for now, the head merging is disabled by default. As per reviewer comments, I moved the head and tail merging code out into their own class.	2014-03-03 15:59:26 -05:00
amilev	cecdd2f2c5	Merge pull request #539 from broadinstitute/eb_hard_clip_exon_overhangs_for_ami Add the capability to the N-cigar splitter to also hard-clip off overhan...	2014-03-03 12:23:11 -05:00
Eric Banks	6c872308d8	Add the capability to the N-cigar splitter to also hard-clip off overhangs based on observed split positions. We use a "manager" to keep track of observed splits and previous reads. This can be extended/modified in the future to try to salvage those overhangs instead of hard-clipping them and/or try other possible strategies. Added unit tests and more integration tests.	2014-03-02 21:10:34 -05:00
Eric Banks	22ad18b919	Moving Reduce Reads to the archive. The GATK now fails with a user error if you try to run with a reduced bam. (I added a unit test for that; everything else here is just the removal of all traces of RR)	2014-03-02 02:03:14 -05:00
Karthik Gururaj	1b395a871a	1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java 2. Committing the right set of files after rebase	2014-02-28 16:08:28 -08:00
Karthik Gururaj	37526dfad5	1. Added the catch UnsatisfiedLinkError exception in PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING in case the native library could not be loaded. Made VECTOR_LOGLESS_CACHING as the default implementation. 2. Updated the README with Mauricio's comments 3. baseline.cc is used within the library - if the machine supports neither AVX nor SSE4.1, the native library falls back to un-vectorized C++ in baseline.cc. 4. pairhmm-1-base.cc: This is not part of the library, but is being heavily used for debugging/profiling. Can I request that we keep it there for now? In the next release, we can delete it from the repository. 5. I agree with Mauricio about the ifdefs. I am sure you already know, but just to reassure you the debug code is not compiled into the library (because of the ifdefs) and will not affect performance.	2014-02-28 08:59:55 -08:00
Karthik Gururaj	0fe843bfd9	Followed Khalid's suggestion for packing libVectorLoglessCaching into the jar file with Maven	2014-02-26 11:47:42 -08:00
Karthik Gururaj	15fe244e4b	Now has PAPI values	2014-02-26 11:47:42 -08:00
Intel Repocontact	e32e9e6af6	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2014-02-26 11:47:01 -08:00
Intel Repocontact	ff2a972ab5	Merge branch 'master' of github.com:broadinstitute/gsa-unstable Conflicts: .gitignore	2014-02-25 20:56:28 -08:00
Khalid Shakir	f02ce6eca7	Added tests for cleaning up scattered .bai files, and using the log directory. Re-added import java.io.File for BamGatherFunction. Other cleanup to resolve scala syntax warnings from intellij. Moved Example UG script to from protected to public.	2014-02-26 02:11:28 +08:00
Eric Banks	0f30df0356	Stopgap procedure to rescue Fisher Strand for cases where there's lots of data. This commit consists of 2 main changes: 1. When the strand table gets too large, we normalize it down to values that are more reasonable. 2. We don't include a particular sample's contribution unless the total ref and alt counts are at least 2 each; this is a heuristic method for dealing only with hets. MD5s change as expected. Hopefully we'll have a more robust implementation for GATK 3.1.	2014-02-25 01:04:27 -05:00
droazen	e8ea9f58d3	Merge pull request #531 from broadinstitute/ks_build_patches Build patches	2014-02-24 15:13:16 -05:00
Valentin Ruano-Rubio	0b3a70b8c1	Fix for a bug a bug in (Assembly Graph) Routes. The slicePrefix method functionality was broken. Story: https://www.pivotaltracker.com/story/show/64595624 Changes: 1. Fixed the bug. 2. Added unit test to check on the method functionality. 3. Added a integration test to verify the bug has been fixed in a empirical data reprudible case.	2014-02-24 10:54:39 -05:00
Khalid Shakir	7e516b294f	Replaced local drmaa and Jama artifacts with versions from maven central. Removed unused caliper binary from local repo.	2014-02-22 01:21:35 +08:00
Valentin Ruano-Rubio	463af7143f	Activate reverse allele trimming in GVCF Story: https://www.pivotaltracker.com/s/projects/1007536 Changes: 1. HC's GenotypingEngine now invokes reverseAlleleTrimming on GVCF variant output lines. 2. GenotypeGVCFs also reverse trim after regenotyping as some alt. alleles are dropped (observed in real-data).	2014-02-20 03:17:24 -05:00
Eric Banks	53a7d5cbae	Fixing a bug in the GVCF writer. The writer was never resetting the pointer to the end of the last non-ref VariantContext that it saw. This was fine except when it jumped to a new contig - and a lower position on that contig - where it thought that it was still part of that previous non-ref VariantContext so wouldn't emit a reference block. Therefore, ref blocks were missing from the beginnings of all chromosomes (except chr1). Added unit test to cover this case.	2014-02-20 02:33:43 -05:00
Valentin Ruano-Rubio	c167fb5fdf	Fixing GenotypesGVCF. Bug uncovered by some untrimmed alleles in the single sample pipeline output. Notice however does not fix the untrimmed alleles in general. Story: https://www.pivotaltracker.com/story/show/65481104 Changes: 1. Fixed the bug itself. 2. Fixed non-working tests (sliently skipped due to exception in dataProvider).	2014-02-19 14:20:39 -05:00
Ryan Poplin	43c20264b0	Initial commit of the random forest classifier.	2014-02-17 13:07:27 -05:00
droazen	688792c5b0	Merge pull request #520 from broadinstitute/jt_fix_failing_tests_post_maven Fix for the Array Out of Bounds test error	2014-02-14 14:02:17 -05:00
Eric Banks	3724d4e5f3	Various small fixes for CalculateGenotypePosteriors based on feedback from guys in Ben Neale's group. Note that this tool is still a work in progress and very experimental, so isn't 100% stable. Most of the features are untested (both by people and by unit/integration tests) because Chris Hartl implemented it right before he left, and we're going to need to add tests at some point soon. I added a first integration test in this commit, but it's just a start. The fixes include: 1. Stop having the genotyping code strip out AD values. It doesn't make sense that it should do this so I don't know why it was doing that at all. Updated GenotypeGVCFs so that it doesn't need to manually recover them anymore. This also helps CalculateGenotypePosteriors which was losing the AD values. Updated code in LeftAlignAndTrimVariants to strip out PLs and AD, since it wasn't doing that before. Updated the integration test for that walker to include such data. 2. Chris was calling Math.pow directly on the normalized posteriors which isn't safe. Instead, the normalization routine itself can revert back to log scale in a safe manner so let's use it. Also, renamed the variable to posteriorProbabilities (and not likelihoods). 3. Have CGP update the AC/AF/AN counts after fixing GTs.	2014-02-14 13:48:14 -05:00
Joel Thibault	cb7ad01202	Re-enable the relevant tests	2014-02-14 12:34:08 -05:00
Joel Thibault	c8a5007c85	Add a comment to the method where the error appears	2014-02-14 11:40:22 -05:00
Joel Thibault	ec16439387	Clear the ReadCovariates keysCache before runs of individual Unit Tests - normal runs have a constant covariate count, so this is not necessary	2014-02-14 10:41:28 -05:00
Eric Banks	7095a60c8e	Merge pull request #516 from broadinstitute/dr_reenable_tests_failing_due_to_java_update Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation	2014-02-13 21:05:18 -05:00
David Roazen	4b4b93ad1b	Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation After extensive detective work, Joel determined that these tests were failing due to changes in the implementation of Math.pow() in newer versions of Java 1.7. All GSA members should ensure that they're using a JDK that is at least as current as the one in the Java-1.7 dotkit on the Broad servers (build 1.7.0_51-b13).	2014-02-12 16:08:16 -05:00
Joel Thibault	cc9477aedb	Minimal test for the multi-allelic reordering bug	2014-02-12 13:38:32 -05:00
Eric Banks	300b474c96	Several improvements to the single sample combining steps. 1. updated QualByDepth not to use AD-restricted depth if it is zero. Added unit test this change. 2. Fixed small bug in CombineGVCFs where spanning deletions were not being treated consistently throughout. Added test for this situation. 3. Make sure GenotypeGVCFs puts in the required headers. Updated test files to make sure this is covered. 4. Have GenotypeGVCFs propagate up the MLEAC/AF (which were getting clobbered out). Tests updated to account for this.	2014-02-12 10:15:12 -05:00
David Roazen	95e1402d21	Add ability to run *KnowledgeBaseTests to maven Run with: mvn verify -Dsting.knowledgebasetests.skipped=false	2014-02-11 14:08:24 -05:00
Eric Banks	303a60c8c6	Adding smarts to the QD annotation: when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Added new unit tests for QualByDepth.	2014-02-11 12:56:49 -05:00
Eric Banks	2e36dd9001	Refactoring of CombineGVCFs to make it run a lot faster. Creating new VariantContexts each time we broke up a block was very expensive because we break up blocks so often. Also, calling into GATKVariantContextUtils.simpleMerge was really hurting performance. MD5 changes because we no longer propogate any INFO fields (except for END) for reference blocks; the tests have the now unused BLOCK_SIZE field that now get dropped.	2014-02-11 03:18:52 -05:00
Eric Banks	abef6cfcb6	Removing parameters that were incorrectly copied over from RegenotypeVariants.	2014-02-08 23:44:32 -05:00
Eric Banks	659a9f0e79	Removing the test for BLOCK_SIZE since we no longer emit it	2014-02-08 21:28:07 -05:00
Valentin Ruano-Rubio	bf630abe88	Fixed nocall (./.) without PLs bug in GVCF output Story: https://www.pivotaltracker.com/story/show/65388246 Additional changes and notes: 1. The fix consist in forcing the output of all PLs by setting the standard flag for that '-allSitePLs'. 2. BP_RESOLUTION was handled differently to GVCF in some aspect that should be common. That has been fixed.	2014-02-07 19:30:26 -05:00
Karthik Gururaj	20a46e4098	Check only for SSE 4.1 (rather than SSE 4.2) when trying to use the SSE implementation of PairHMM	2014-02-07 15:19:55 -08:00
Karthik Gururaj	dc44b64ad8	1. Added support for building the PairHMM vector library into build.xml. The library is compiled using makefile and copied into the directory: build/java/classes/org/broadinstitute/sting/utils/pairhmm/ 2. Bundled the library into StingUtils.jar. Unpacked and loaded at runtime without the need to set java.library.path Caveats: Platform independence has probably been thrown out of the window. Assumptions: a. make command exists at /usr/bin/make b. rsync command exists at /usr/bin/rsync c. icc is in the PATH of the user	2014-02-07 13:13:59 -08:00
Eric Banks	d689f61005	Fixed up some of the genotype-level annotations being propogated in the single sample HC pipeline. 1. AD values now propogate up (they weren't before). 2. MIN_DP gets transferred over to DP and removed. 3. SB gets removed after FS is calculated. Also, added a bunch of new integration tests for GenotypeGVCFs.	2014-02-07 12:47:54 -05:00
Eric Banks	67ed0d2403	The UG engine can return a null VC if there are tons of alt alleles, causing Tim's merge jobs to fail. Pushing the null check up so that it doesn't error out in such cases.	2014-02-07 12:41:20 -05:00
Valentin Ruano-Rubio	4a3c8e68fa	Fixed out of order non-variant gVCF entries when trimming is active. Story: https://www.pivotaltracker.com/story/show/65319564	2014-02-07 11:03:26 -05:00
Eric Banks	eb463b505d	Remove a whole bunch of unused annotations from gVCF output. AC,AF,AN,FS,QD - they'll all be recomputed later. BLOCK_SIZE and MIN_GQ were not necessary. I also made the StrandBiasBySample annotation forced on when in gVCF mode. It turns out that its output wasn't compatible with BCF so I patched it (and the variant jar too).	2014-02-07 08:49:36 -05:00
Eric Banks	2648219c42	Implementation of a hierarchical merger for gVCFs, called CombineGVCFs. This tool will take any number of gVCFs and create a merged gVCF (as opposed to GenotypeGVCFs which produces a standard VCF). Added unit/integration tests and fixed up GATK docs.	2014-02-07 08:49:18 -05:00
Eric Banks	71b47a6148	Rename CombineReferenceCalculationVariants to GenotypeGVCFs	2014-02-06 15:46:19 -05:00
Khalid Shakir	3848159086	Added a set of serial tests to gatk/queue packages, which runs all tests under their package in one TestNG execution. New properties to disable regenerating example resources artifact when each parallel test runs under packagetest. Moved collection of packagetest parameters from shell scripts into maven profiles. Fixed necessity of test-utils jar by removing incorrect dependenciesToScan element during packagetests. When building picard libraries, run clean first. Fixed tools jar dependency in picard pom. Integration tests properly use the ant-bridge.sh test.debug.port variable, like unit tests.	2014-02-06 08:25:38 -05:00
Valentin Ruano Rubio	988e3b4890	Merge pull request #487 from broadinstitute/vrr_reference_model_with_trimming Get gVCF to work without --dontTrimActiveRegions	2014-02-05 22:52:17 -05:00
Valentin Ruano-Rubio	98ffcf6833	Get gVCF to work without --dontTrimActiveRegions Story: https://www.pivotaltracker.com/story/show/65048706 https://www.pivotaltracker.com/story/show/65116908 Changes: ActiveRegionTrimmer in now an argument collection and it returns not only the trimmed down active region but also the non-variant containing flanking regions HaplotypeCaller code has been simplified significantly pushing some functionality two other classes like ActiveRegion and AssemblyResultSet. Fixed a problem with the way the trimming was done causing some gVCF non-variant records no have conservative 0,0,0 PLs	2014-02-05 22:50:45 -05:00
Ryan Poplin	693bfac341	Bug fix for missing annotations in CombineReferenceCalculationVariants. They were being dropped in the handoff between engines in a couple of places. -- Updated single sample pipeline test data using Valentin's files and re-enabled CRCV tests	2014-02-05 12:58:48 -05:00
Eric Banks	91bdf069d3	Some updates to CRCV. 1. Throw a user error when the input data for a given genotype does not contain PLs. 2. Add VCF header line for --dbsnp input 3. Need to check that the UG result is not null 4. Don't error out at positions with no gVCFs (which is possible when using a dbSNP rod)	2014-02-05 10:12:37 -05:00
Joel Thibault	9eaee8c73c	Integration test for the -nt race condition corrupting AD and PL fields	2014-02-04 22:04:27 -05:00
David Roazen	1de7a27471	Disable an additional test that is runtime dependent on one of the temporarily-disabled tests	2014-02-04 16:07:58 -05:00
David Roazen	76086f30b7	Temporarily disable tests that started failing post-maven Joel is working on these failures in a separate branch. Since maven (currently! we're working on this..) won't run the whole test suite to completion if there's a failure early on, we need to temporarily disable these tests in order to allow group members to run tests on their branches again.	2014-02-04 15:31:24 -05:00
Khalid Shakir	857e6e0d6f	Bumped version to 2.8-SNAPSHOT, using new update_pom_versions.sh script.	2014-02-03 13:50:46 -05:00
Khalid Shakir	9ca3004fc3	Setting the test-utils' type to test-jar, such that the multi-module build uses testClasses instead of classes as a directory dependency.	2014-02-03 13:50:46 -05:00
Khalid Shakir	de13f41fc3	One step closer to a proper test-utils artifact. Using the maven-jar-plugin to create a test classifer, excluding actual tests, until we can properly separate the classes into separate artifacts/modules.	2014-02-03 13:50:46 -05:00
Khalid Shakir	caa76cdac4	Added maven pom.xmls for various artifacts.	2014-02-03 13:50:46 -05:00
Khalid Shakir	1e25a758f5	Moved files to maven directories. Here are the git moved directories in case other files need to be moved during a merge: git-mv private/java/src/ private/gatk-private/src/main/java/ git-mv private/R/scripts/ private/gatk-private/src/main/resources/ git-mv private/java/test/ private/gatk-private/src/test/java/ git-mv private/testdata/ private/gatk-private/src/test/resources/ git-mv private/scala/qscript/ private/queue-private/src/main/qscripts/ git-mv private/scala/src/ private/queue-private/src/main/scala/ git-mv protected/java/src/ protected/gatk-protected/src/main/java/ git-mv protected/java/test/ protected/gatk-protected/src/test/java/ git-mv public/java/src/ public/gatk-framework/src/main/java/ git-mv public/java/test/ public/gatk-framework/src/test/java/ git-mv public/testdata/ public/gatk-framework/src/test/resources/ git-mv public/scala/qscript/ public/queue-framework/src/main/qscripts/ git-mv public/scala/src/ public/queue-framework/src/main/scala/ git-mv public/scala/test/ public/queue-framework/src/test/scala/	2014-02-03 13:50:44 -05:00
Valentin Ruano-Rubio	89c4e57478	gVCF <NON_REF> in all vcf lines including variant ones when –ERC gVCF is requested. Changes: ------- <NON_REF> likelihood in variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for reach read is calculated as the second best likelihood amongst the reported alleles. When –ERC gVCF, stand_conf_emit and stand_conf_call are forcefully set to 0. Also dontGenotype is set to false for consistency sake. Integration test MD5 have been changed accordingly. Additional fix: -------------- Specially after adding the <NON_REF> allele, but also happened without that, QUAL values tend to go to 0 (very large integer number in log 10) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that combineGLs has been substituted by combineGLsPrecise that uses the log-sum-exp trick. In just a few cases this change results in genotype changes in integration tests but after double-checking using unit-test and difference between combineGLs and combineGLsPrecise in the affected integration test, the previous GT calls were either border-line cases and or due to the underflow.	2014-01-30 11:23:33 -05:00
Karthik Gururaj	0c63d6264f	1. Added synchronization block around loadLibrary in VectorLoglessPairHMM 2. Edited Makefile to use static libraries where possible	2014-01-27 15:34:58 -08:00
Karthik Gururaj	85a748860e	1. Added more profiling code 2. Modified JNI_README	2014-01-27 14:32:44 -08:00
Valentin Ruano-Rubio	748d2fdf92	Added Integration test to verify the bugs are not there anymore as reported in pivotracker	2014-01-26 23:29:31 -05:00
Karthik Gururaj	018e9e2c5f	1. Cleaned up code 2. Split into DebugJNILoglessPairHMM and VectorLoglessPairHMM with base class JNILoglessPairHMM. DebugJNILoglessPairHMM can, in principle, invoke any other child class of JNILoglessPairHMM. 3. Added more profiling code for Java parts of LoglessPairHMM	2014-01-26 19:18:12 -08:00
Valentin Ruano-Rubio	9e7bf75e89	Fix for the PairHMM transition probability miscalculation. Problem: matchToMatch transition calculation was wrong resulting in transition probabilites coming out of the Match state that added more than 1. Reports: https://www.pivotaltracker.com/s/projects/793457/stories/62471780 https://www.pivotaltracker.com/s/projects/793457/stories/61082450 Changes: The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies. MatchToMatch transtion calculation has been fixed and implemented in PairHMMModel. Affected integration test md5 have been updated, there were no differences in GT fields and example differences always implied small changes in likelihoods that is what is expected.	2014-01-26 16:30:36 -05:00
Karthik Gururaj	81bdfbd00d	Temporary commit before moving to new native library	2014-01-24 16:29:35 -08:00
Karthik Gururaj	936e9e175e	1. Converted q,i,d,c in C++ from int* to char* 2. Use clock_gettime to measure performance 3. Disabled OpenMP 4. Moved LoadTimeInitializer to different file	2014-01-22 22:57:32 -08:00
Karthik Gururaj	733a84e4f9	Added support to transfer haplotypes once per region to the JNI Re-use transferred haplotypes (stored in GlobalRef) across calls to computeLikelihoods	2014-01-22 10:52:41 -08:00
Karthik Gururaj	88c08e78e7	1. Inserted #define in sandbox pairhmm-template-main.cc 2. Wrapped _mm_empty() with ifdef SIMD_TYPE_SSE 3. OpenMP disabled 4. Added code for initializing PairHMM's data inside initializePairHMM - not used yet	2014-01-21 09:57:14 -08:00
Ryan Poplin	bdd06ebfc2	Merge pull request #478 from broadinstitute/eb_generalize_hc_values_as_args Pulled out some hard-coded values from the read-threading and isActive c...	2014-01-21 09:01:54 -08:00
Karthik Gururaj	7180c392af	1. Integrated Mohammad's SSE4.2 code, Mustafa's bug fix and code to fix the SSE compilation warning. 2. Added code to dynamically select between AVX, SSE4.2 and normal C++ (in that order) 3. Created multiple files to compile with different compilation flags: avx_function_prototypes.cc is compiled with -xAVX while sse_function_instantiations.cc is compiled with -xSSE4.2 flag. 4. Added jniClose() and support in Java (HaplotypeCaller, PairHMMLikelihoodCalculationEngine) to call this function at the end of the program. 5. Removed debug code, kept assertions and profiling in C++ 6. Disabled OpenMP for now.	2014-01-20 08:03:42 -08:00
Eric Banks	9e858270d7	Moving this test up one level to where it actually belongs.	2014-01-19 02:33:11 -05:00
Eric Banks	64d5bf650e	Pulled out some hard-coded values from the read-threading and isActive code of the HC, and made them into a single argument. In unifying the arguments it was clear that the values were inconsistent throughout the code, so now there's a single value that is intended to be more liberal in what it allows in (in an attempt to increase sensitivity). Very little code actually changes here, but just about every md5 in the HC integration tests are different (as expected). Added another integration test for the new argument. To be used by David R to test his per-branch QC framework: does this commit make the HC look better against the KB?	2014-01-19 01:15:13 -05:00
Karthik Gururaj	25aecb96e0	Added support for dynamic selection between AVX and un-vectorized C++, still to include SSE code from Mohammad. Debug flags turned on in this commit.	2014-01-18 11:07:23 -08:00
Karthik Gururaj	f1c772ceea	Same log message as before - forgot -a option 1. Moved computeLikelihoods from PairHMM to native implementation 2. Disabled debug - debug code still left (hopefully, not part of bytecode) 3. Added directory PairHMM_JNI in the root which holds the C++ library that contains the PairHMM AVX implementation. See PairHMM_JNI/JNI_README first	2014-01-16 21:40:04 -08:00
Karthik Gururaj	e8a5022777	1. Added support for JNI integration for LoglessCaching PairHMM AVX implementation. 2. Contains lots of debug code 3. Only invokes JNI for subComputeReadLikelihoodGivenHaplotypeLog10	2014-01-15 11:07:09 -08:00
Eric Banks	de56134579	Fixed up and refactored what seems to be a useful private tool to create simulated reads around a VCF. It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now. Since I thought that it might prove useful to others, I moved it to protected and added integration tests. GERALDINE: NEW TOOL ALERT!	2014-01-15 13:49:31 -05:00
Eric Banks	9f1ab0087a	Added in a check for what would be an empty allele after trimming.	2014-01-15 11:04:19 -05:00
Ryan Poplin	201ad398ac	Merge pull request #473 from broadinstitute/eb_fix_qd_indel_normalization The QD normalization for indels was busted and is now fixed.	2014-01-14 08:56:19 -08:00
Eric Banks	e4fdc5ac44	Merge pull request #474 from broadinstitute/eb_fix_haplotype_resolver_PT63333488 Fixing the Haplotype Resolver so that it doesn't complain about missing header lines	2014-01-14 07:36:53 -08:00
Eric Banks	fd511d12a2	Fixing the Haplotype Resolver so that it doesn't complain about missing header lines. The code comments very clearly state that INFO fields shouldn't be propagated into the output, but someone must have accidentally changed it afterwards. This is just a simple one-line fix to make sure the code adhered to the comments. Delivers #63333488.	2014-01-13 22:47:43 -05:00
Geraldine Van der Auwera	8fcad6680b	Assorted fixes and improvements to gatkdocs -Added docs for ERC mode in HC -Move RecalibrationPerformance walker since to private since it is experimental and unsupported -Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them -Improved error msg for conflict between per-interval aggregation and -nt -Minor clean up in exception docs -Added Toy Walkers category for devs and dev supercat (to build out docs for developers) -Added more detailed info to GenotypeConcordance doc based on Chris forum post -Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples) -Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it) -Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)	2014-01-13 17:46:22 -05:00
Eric Banks	c7e08965d0	The QD normalization for indels was busted and is now fixed. It is true that indels of length > 1 have higher QUALS than those of length = 1. But for the HC those QUALS are not that much higher, and it doesn't continue scaling up as the indels get larger. So we no longer normalize by indel length (which massively over-penalizes larger events and effectively drops their QD to 0). For the UG the previous normalization also wasn't perfect. Now we divide the indel length by a factor of 3 to make sure that QD is consistent over the range of indel lengths. Integration tests change because QD is different for indels. Also, got permission from Valentin to archive a failing test that no longer applies. Thanks to Kurt on the GATK forum for pointing this all out.	2014-01-13 15:23:36 -05:00
droazen	7cd304fb41	Merge pull request #470 from broadinstitute/mf_new_RBP Mf new rbp	2014-01-13 08:46:27 -08:00
MauricioCarneiro	50cd6781b3	Merge pull request #465 from broadinstitute/eb_improvements_to_ref_confidence_merger Improvements to ref confidence merger	2014-01-08 10:51:01 -08:00
Eric Banks	f172c349f6	Adding the functionality to enable users to input a file of VCFs for -V. To do this I have added a RodBindingCollection which can represent either a VCF or a file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so that one can intermix VCFs and VCF lists. For VariantContext tags with a list, by default the tags for the -V argument are applied unless overridden by the individual line. In other words, any given line can have either one token (the file path) or two tokens (the new tags and the file path). For example: foo.vcf VCF,name=bar bar.vcf Note that a VCF list file name must end with '.list'. Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.	2014-01-08 00:45:00 -05:00
Eric Banks	c133909d32	Fixed edge condition in the realigner where a realigned read can sometimes get partially aligned off the end of the contig. Now we ignore such reads (which is much easier than trying to figure out when to soft-clip). Added unit test.	2014-01-08 00:37:28 -05:00
Menachem Fromer	e33d3dafc6	Add documentation for RBP, and also update the MD5 for the tests now that the output uses HP tags instead of '\|', which is now reserved for trio-based phasing	2014-01-03 12:04:47 -05:00
Menachem Fromer	d1275651ae	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2014-01-03 01:13:40 -05:00
Ryan Poplin	856c1f87c1	Allow for additional input data to be used in the VQSR for clustering but don't carry it forward into the output VCF file. -- New -a argument in the VQSR for specifying additional data to be used in the clustering -- New NA12878KB walker which creates ROC curves by partitioning the data along VQSLOD and calculating how many KB TP/FP's are called.	2014-01-02 14:46:04 -05:00
amilev	f81a38f596	Merge pull request #446 from broadinstitute/ami-RNAseq-tools Write a new tool for spliting reads that have N cigar string.	2014-01-01 21:06:25 -08:00
MauricioCarneiro	1223345726	Merge pull request #459 from broadinstitute/eb_fix_bad_hmm_clipping Fixed up edge condition for clipping long reads in the HMM.	2014-01-01 20:00:34 -08:00
Ami Levy-Moonshine	6da53aea09	Write a new tool for spliting reads that have N cigar string. For example, this tool can be used for processing bowtie RNA-seq data. Each read with k N-cigar elemments is plit to k+1 reads. The split is done by hard clipping the bases rest of the bases. In order to do it, few changes were introduced to some other clipping methods: - make a segnificant change in ClippingOp.hardClip() that prevent the spliting of read with cigar: 1M2I1N1M3I. - change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns create unitTests for that walker: - change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication - move some useful methods from ReadClipperTestUtils to CigarUtils create integration test for that class small change in a comment in FullProcessingPipeline last commit: Address review comments: - move to protected under walkers/rnaseq - change the read splitting methods to be more readable and more efficiant - change (minor changes) some methods in ReadClipper to allow the changes in split reads - add (minor change) one method to CigarUtils to allow the changes in split reads - change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar - address the rest of the review comments (minor changes) - fix ReadUtilsUnitTest.testReadWithNs acoording to the defult behaviour of getReadCoordinateForReferenceCoordinate (in case of refernce index that fall into deletion, return the read index of the base before the deletion). - add another test to ReadUtilsUnitTest.testReadWithNs - Allow the user to print the split positions (not working proparly currently)	2014-01-01 22:21:36 -05:00
Eric Banks	bb4c4b1fcd	Fixed up edge condition for clipping long reads in the HMM. MD5s change because some reads were incorrectly getting clipped before. [delivers #62584746]	2014-01-01 19:05:09 -05:00
Mauricio Carneiro	d52bd44867	Move CompareBAMs to private This is a tool that we use internally validate the ReduceReads development. I think it should be private. There is no need to improve docs. [delivers #54703398]	2014-01-01 14:33:23 -05:00
Eric Banks	9665f75ad4	Don't fail in annotations if the wrong tools are calling them, just silently skip them. This is important for cases when users want to use annotation groups (like all experimental annotations).	2013-12-31 23:45:21 -05:00
Eric Banks	83e09b1f64	Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline. Basically, it does 3 things (as opposed to having to call into 3 separate walkers): 1. merge the records at any given position into a single one with all alleles and appropriate PLs 2. re-genotype the record using the exact AF calculation model 3. re-annotate the record using the VariantAnnotatorEngine In the course of this work it became clear that we couldn't just use the simpleMerge() method used by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new utility method to handle this merging and pulled any related code out of CombineVariants. I tried to clean up a lot of that code, but ultimately that's out of the scope of this project. Added unit tests for correctness testing. Integration tests cannot be used yet because the HC doesn't output correct gVCFs.	2013-12-31 12:07:56 -05:00
Menachem Fromer	48ef7a1a2f	Merge remote-tracking branch 'origin/master' into mf_new_RBP	2013-12-19 10:42:20 -05:00
Valentin Ruano-Rubio	5db520c6fa	Fixed issue > 0 log likelihoods using GraphBased likelihood engine reported by Mauricio Added some integration test to check on the fix	2013-12-13 11:19:57 -05:00
Eric Banks	ab33db625f	Merge pull request #449 from broadinstitute/eb_move_calc_posteriors_to_protected Moved CalculatePosteriors from private to protected, in preparation for 3.0	2013-12-07 22:18:46 -08:00
Eric Banks	f1970b923e	Moved CalculatePosteriors from private to protected, in preparation for 3.0. Renamed it CalculateGenotypePosteriors. Also, moved the utility code to a proper utility class instead of where Chris left it. No actual code modifications made in this commit.	2013-12-08 00:08:34 -05:00
David Roazen	932cd3ada7	Fix 3rd-party library dependency issues in the HC/PairHMM tests In general, test classes cannot use 3rd-party libraries that are not also dependencies of the GATK proper without causing problems when, at release time, we test that the GATK jar has been packaged correctly with all required dependencies. If a test class needs to use a 3rd-party library that is not a GATK dependency, write wrapper methods in the GATK utils/* classes, and invoke those wrapper methods from the test class.	2013-12-06 13:16:55 -05:00
Eric Banks	e022db4690	Added docs for the minPruning argument in the HC	2013-12-05 11:50:56 -05:00
Geraldine Van der Auwera	3ab2f4edb2	Fixed documentation for -deletions argument in the UAC	2013-12-04 19:55:24 -05:00
amilev	0d94019bd6	Merge pull request #434 from broadinstitute/mc_dt_gccontent Add GC Content to DiagnoseTargets	2013-12-04 09:42:26 -08:00
Joel Thibault	5fe0531b4d	Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy	2013-12-03 23:12:14 -05:00
Mauricio Carneiro	701ede2817	Add GC Content to DiagnoseTargets	2013-12-03 23:04:40 -05:00
Eric Banks	6bee6a1b53	Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles. Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info. Now the PLs and AD get correctly selected down. While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just broke it out into smaller chunks as per our best practices. Added unit tests and updated integration tests. Addressed reviews.	2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio	0f99778a59	Adding Graph-based likelihood ratio calculation to HC To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line. New HC Options (both Advanced and Hidden): ========================================== --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM) Specifies what engine should be used to generate read vs haplotype likelihoods. PairHMM : standard full-PairHMM approach. GraphBased : using the assembly graph to accelarate the process. Random : generate random likelihoods - used for benchmarking purposes only. --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN) It idicates how to merge haplotypes produced using different kmerSizes. Only has effect when used in combination with (--likelihooCalculationEngine GraphBased) COMBO_MIN : use the smallest kmerSize with all haplotypes. COMBO_MAX : use the larger kmerSize with all haplotypes. MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it. MAX_ONLY : use the larger kmerSize with haplotypes asembled using it. Major code changes: =================== * Introduce multiple likelihood calculation engines (before there was just one). * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class. * Added yet another PairHMM implementation with a different API in order to spport local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype). Major components: ================ * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution of the graph-based likelihood approach. * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals to calcualte the likelihoods using the graph as an scafold. * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one used by GraphBasedLikelihoodCalculationEngineInstance to do its work. * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is used by GraphBasedLikelihoodCalcuationEngineInstance to do its work. Remove mergeCommonChains from HaplotypeGraph creation Fixed bamboo issues with HaplotypeGraphUnitTest Fixed probrems with HaplotypeCallerIntegrationTest Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest Fixed ReadThreadingLikelihoodCalculationEngine issues Moved event-block iteration outside GraphBasedEngineInstance Removed unecessary parameter from ReadAnchoring constructor. Fixed test problem Added a bit more documentation to EventBlockSearchEngine Fixing some private - protected dependency issues Further refactoring making GraphBasedInstance and HaplotypeGraph slimmer. Addressed last pull request commit comments Fixed FastLoglessPairHMM public -> protected dependency Fixed probrem with HaplotypeGraph unit test Adding Graph-based likelihood ratio calculation to HC To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line. New HC Options (both Advanced and Hidden): ========================================== --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM) Specifies what engine should be used to generate read vs haplotype likelihoods. PairHMM : standard full-PairHMM approach. GraphBased : using the assembly graph to accelarate the process. Random : generate random likelihoods - used for benchmarking purposes only. --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN) It idicates how to merge haplotypes produced using different kmerSizes. Only has effect when used in combination with (--likelihooCalculationEngine GraphBased) COMBO_MIN : use the smallest kmerSize with all haplotypes. COMBO_MAX : use the larger kmerSize with all haplotypes. MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it. MAX_ONLY : use the larger kmerSize with haplotypes asembled using it. Major code changes: =================== * Introduce multiple likelihood calculation engines (before there was just one). * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class. * Added yet another PairHMM implementation with a different API in order to spport local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype). Major components: ================ * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution of the graph-based likelihood approach. * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals to calcualte the likelihoods using the graph as an scafold. * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one used by GraphBasedLikelihoodCalculationEngineInstance to do its work. * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is used by GraphBasedLikelihoodCalcuationEngineInstance to do its work. Remove mergeCommonChains from HaplotypeGraph creation Fixed bamboo issues with HaplotypeGraphUnitTest Fixed probrems with HaplotypeCallerIntegrationTest Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest Fixed ReadThreadingLikelihoodCalculationEngine issues Moved event-block iteration outside GraphBasedEngineInstance Removed unecessary parameter from ReadAnchoring constructor. Fixed test problem Added a bit more documentation to EventBlockSearchEngine Fixing some private - protected dependency issues Further refactoring making GraphBasedInstance and HaplotypeGraph slimmer. Addressed last pull request commit comments Fixed FastLoglessPairHMM public -> protected dependency Fixed probrem with HaplotypeGraph unit test	2013-12-02 19:37:19 -05:00
Eric Banks	84ddfb41b5	Merge pull request #438 from broadinstitute/rp_vqsr_num_bad_stability_fixes_and_runtime_optimizations Various VQSR optimizations in runtime and stability.	2013-12-02 08:37:37 -08:00
Ryan Poplin	6a922e7aca	Merge pull request #435 from broadinstitute/eb_fix_ug_bug_for_long_deletions Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads.	2013-12-02 08:09:23 -08:00
Ryan Poplin	b57054c63c	Various VQSR optimizations in both runtime and accuracy. -- For very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build the Gaussian mixture model. -- Annotations are ordered by the difference in means between known and novel instead of by their standard deviation. -- Removed the training set quality score threshold. -- Now uses 2 gaussians by default for the negative model. -- Num bad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores. -- Model plots are now generated much faster. -- Stricter threshold for determining model convergence. -- All VQSR integration tests change because of these changes to the model. -- Add test for downsampling of training data.	2013-11-29 13:04:46 -05:00
Eric Banks	df6499e58c	Bug fix for RR: stop (incorrectly) pulling the MQ out of the SAMRecord as a byte instead of an int. For reads with high MQs (greater than max byte) the MQ was being treated as negative and failing the min MQ filter. Added unit test. Delivers PT#61567540.	2013-11-27 18:55:03 -05:00
Eric Banks	51d1a26725	Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads. His code was excessively clipping reads because it was looking at their cigar string instead of just the read length. This meant that it was basically impossible to call large deletions in UG even with perfect evidence in the reads (as reported by Craig D). Integration tests change because (IMO after looking at sites in IGV) reads with indels similar to the one being genotyped used to be given too much likelihood and now give less. Added unit tests for new methods.	2013-11-27 13:54:39 -05:00
Chris Hartl	1f777c4898	Introducing the latest-and-greatest in genotyping: CalculatePosteriors. CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution. Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows: while not converged: * AC = [external AC] + [sample AC] * Prior = DirichletMultinomial[AC] * Posteriors = [sample GL + Prior] * sample AC = MLEAC(Posteriors) This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase. Fully unit tested. Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.	2013-11-27 13:00:45 -05:00
Eric Banks	0fac4fb3b6	Make the reference model calculation work with reduced reads. It's just a matter of using PileupElement.getRepresentativeCount() instead of '++'.	2013-11-21 10:53:33 -05:00
Eric Banks	adb77b406f	Fixed poor implementation of isRefSource() and isRefSink() among others. There was already a note in the code about how wrong the implementation was. The bad code was causing a single-node graph to get cleaned up into nothing when pruning tails. Delivers PT #61069820.	2013-11-21 10:53:27 -05:00
Ami Levy-Moonshine	e6ef37de1d	Add an option to filter the read bases that are taking into account for the coveraged intervals. For that, new two arguments were added: minBaseQuality and minMappingQuality	2013-11-18 17:29:32 -05:00
MauricioCarneiro	7f08250870	Merge pull request #417 from broadinstitute/bt_pairhmm_api_cleanup2 Improve the PairHMM API for better FPGA integration	2013-11-14 10:47:07 -08:00
bradtaylor	e40a07bb58	Improve the PairHMM API for better FPGA integration Motivation: The API was different between the regular PairHMM and the FPGA-implementation via CnyPairHMM. As a result, the LikelihoodCalculationEngine had to use account for this. The goal is to change the API to be the same for all implementations, and make it easier to access. PairHMM PairHMM now accepts a list of reads and a map of alleles/haplotpes and returns a PerReadAlleleLikelihoodMap. Added a new primary method that loops the reads and haplotypes, extracts qualities, and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method. Did not alter that method, or its subcompute method, at all. PairHMM also now handles its own (re)initialization, so users don't have to worry about that. CnyPairHMM Added that same new primary access method to this FPGA class. Method overrides the default implementation in PairHMM. Walks through a list of reads. Individual-read quals and the full haplotype list are fed to batchAdd(), as before. However, instead of waiting for every read to get added, and then walking through the reads again to extract results, we just get the haplotype-results array for each read as soon as it is generated, and pack it into a perReadAlleleLikelihoodMap for return. The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not. LikelihoodCalculationEngine The functionality to loop through the reads and haplotypes and get individual log10-likelihoods was moved to the PairHMM, and so removed from here. However, this class does need to retain the ability to pre-process the reads, and post-process the resulting likelihoods map. Those features were separated from running the HMM and refactored into their own methods Commented out the (unused) system for finding best N haplotypes for genotyping. PairHMMIndelErrorModel Similar changes were made as to the LCE. However, in this case the haplotypes are modified based on each individual read, so the read-list we feed into the HMM only has one read.	2013-11-14 09:45:33 -05:00
Geraldine Van der Auwera	dac3dbc997	Improved gatkdocs for InbreedingCoefficient, ReduceReads, ErrorRatePerCycle Clarified caveat for InbreedingCoefficient Cleaned up docstrings for ReduceReads Brushed up doc for ErrorRatePerCycle	2013-11-13 14:33:04 -05:00
Eric Banks	0e3d83d1ef	Merge pull request #413 from broadinstitute/rp_qd_and_qual_updates_in_ref_model_pipeline Improvements to the reference model pipeline.	2013-11-05 06:33:17 -08:00
Eric Banks	09dfaf1a68	Merge pull request #416 from broadinstitute/mc_quick_fixes_to_cser_pipeline Add interpretation to QualifyMissingIntervals	2013-11-05 06:08:13 -08:00
Ryan Poplin	b22c9c2cb4	Improvements to the reference model pipeline. -- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants) -- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.	2013-11-01 17:58:25 -04:00
Mauricio Carneiro	5ed47988b8	Changed the parameter names from cds to baits Making the usage more clear since the parameter is being used over and over to define baited regions. Updated the headers accordingly and made it more readable.	2013-10-24 17:15:56 -04:00
Chris Hartl	9d932e8c60	Merged bug fix from Stable into Unstable	2013-10-10 14:31:33 -04:00
Chris Hartl	6f46d1187a	Remember to copy the integration test changes as well as the walker changes	2013-10-10 14:30:37 -04:00
Mauricio Carneiro	5d6421494b	Fix mismatching number of columns in report Quick fix the missing column header in the QualifyMissingIntervals report. Adding a QScript for the tool as well as a few minor updates to the GATKReportGatherer.	2013-10-09 14:38:15 -04:00
Mauricio Carneiro	63ace685c9	add unit tests	2013-10-04 11:44:07 -04:00
Mauricio Carneiro	839b918f58	Length metric updates to QualifyMissingIntervals * add a length of the overlaping interval metric as per CSER request * standardized the distance metrics to be positive when fully overlapping and the longest off-target tail (as a negative number) when not overlapping * add gatkdocs to the tool (finally!)	2013-10-04 10:18:13 -04:00
Geraldine Van der Auwera	9f7fa247f6	Disable VQSR tranche plots in INDEL mode	2013-09-30 17:14:37 -04:00
Ryan Poplin	ef1d58b7ff	Bugfix for hom ref records that aren't GVCF blocks.	2013-09-29 19:19:26 -04:00
Geraldine Van der Auwera	27808d336a	Minor clarifications regarding ignoreFilter argument	2013-09-26 13:13:53 -04:00
Geraldine Van der Auwera	a9fa5206ee	Merge pull request #399 from broadinstitute/eb_update_docs_for_DepthPerSampleHC Updated docs for DepthPerSampleHC to deliver PT #54237024.	2013-09-25 15:20:19 -07:00
Ryan Poplin	f362597f69	Merge pull request #400 from broadinstitute/mm_bugfix_combine_variants_implicit_casting Bug fix: annotation values ar parsed as Doubles when they should be parsed as Integers due to implicit conversion.	2013-09-25 11:47:17 -07:00
Michael McCowan	5113e21437	Bug fix: annotation values ar parsed as Doubles when they should be parsed as Integers due to implicit conversion. * Updated expected test data in which an integer annotation (MQ0) was formatted as a double.	2013-09-25 13:12:02 -04:00
Eric Banks	2783c84c6b	Updated docs for DepthPerSampleHC to deliver PT #54237024 .	2013-09-24 22:32:19 -04:00
Eric Banks	d6992d1263	Updated docs to tell users not to use PCR indel error modeling for PCR free data.	2013-09-23 15:48:47 -04:00
Mauricio Carneiro	5bbad75402	Changing max coverage threshold Because Integer.maxValue is not unit testable	2013-09-20 18:54:40 -04:00
Geraldine Van der Auwera	175388de1d	Merge pull request #396 from broadinstitute/mc_dt_excessive_coverage_defaults Updating excessive coverage default parameter & docs+test for QualifyMissingIntervals	2013-09-20 15:12:16 -07:00
Mauricio Carneiro	5e2ffc74fc	Automated interpretation for QualifyMissingIntervals * add a new column to do what I have been doing manually for every project, understand why we got no usable coverage in that interval * add unit tests -- this tool is now public, we need tests. * slightly better docs -- in an effort to produce better docs for this tool	2013-09-20 16:47:12 -04:00
Mauricio Carneiro	74639463b9	Updating excessive coverage default parameter most people don't care about excessive coverage (unless you're very particular about your analysis). Therefore the best possible default value for this is Integer.maxValue so it doesn't get in the way. Itemized Changes: * change maximumCoverage threshold to Integer.maxValue [delivers #57353620]	2013-09-19 23:07:20 -04:00
MauricioCarneiro	014bc4269e	Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation Add Array Logless PairHMM	2013-09-08 20:16:53 -07:00
Ryan Poplin	3503050a39	Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller -- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller. -- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median -- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs -- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model -- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions -- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings -- We only realign reads in the reference model if the read is informative for a particular haplotype over another -- GVCF blocks will now track and output the minimum PLs over the block -- MD5 changes! -- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny -- GVCF tests: from HC changes above and adding in active region trimming	2013-09-06 16:56:34 -04:00
Ryan Poplin	add17dc463	Merge pull request #388 from broadinstitute/eb_change_record_size_mismatch_to_user_error Changed the error for the record size mismatch in the genotyping engine ...	2013-08-30 10:29:54 -07:00
Eric Banks	ea0deb1bb2	Changed the error for the record size mismatch in the genotyping engine to be a user error since it is possible to reach this state with input VCFs that contain the same event multiple times (and it's not something we want to handle in the code).	2013-08-30 12:18:19 -04:00
Louis Bergelson	4473b0065e	adding a check for the UNAVAILABLE case of GenotypeType in CountVariants	2013-08-29 17:27:00 -04:00
bradtaylor	0435bbe38f	Retreived PairHMM benchmarks from archive and made improvements PairHMMSyntheticBenchmark and PairHMMEmpirical benchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely. Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside with the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords". I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data. It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to: PairHMMBandedEmpiricalBenchmark PairHMMTestData BandedLoglessPairHMMUnitTest I decided against this as the banded benchmark and unit test are archived.	2013-08-28 17:23:35 -04:00
bradtaylor	86fe9fae76	Changes to Array PairHMM to address review comments Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.	2013-08-28 17:23:29 -04:00
bradtaylor	3671e41b0c	Add Array Logless PairHMM A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version. Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix. This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations. We also had to adjust the classes that employ the pairHMM: LikelihoodCalculationEngine (used by HaplotypeCaller) PairHMMIndelErrorModel (used by indel genotyping classes) Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection. The latter affects classes: ErrorModel GeneralPloidyIndelGenotypeLikelihoodsCalculationModel IndelGenotypeLikelihoodsCalculationModel ... all of which use the pairHMM via PairHMMIndelErrorModel	2013-08-28 17:21:23 -04:00
Ryan Poplin	7479152977	Merged bug fix from Stable into Unstable	2013-08-28 12:40:25 -04:00
Ryan Poplin	6bda569666	One of the log10sumLog10s in the VQSR was missed in a previous bug fix. Thanks to Mike McCowan for spotting this one.	2013-08-28 12:40:08 -04:00
Geraldine Van der Auwera	ed465cd2a5	Fixed a few typos and clarified some doc points.	2013-08-26 17:33:17 -04:00
David Roazen	42d771f748	Remove org.apache.commons.collections.IteratorUtils dependency from the test suite -This was a dependency of the test suite, but not the GATK proper, which caused problems when running the test suite on the packaged GATK jar at release time -Use GATKVCFUtils.readVCF() instead	2013-08-21 19:44:02 -04:00
Eric Banks	d4dc5ba04a	Fixed bug in PhaseByTransmission where it was completely dropping multi-allelic records. Added test to make sure this is no longer happening.	2013-08-21 15:46:57 -04:00
Eric Banks	6663d48ffe	Merge pull request #381 from broadinstitute/mm_rev_picard_to_get_tribble_updates Adaptations to accomodate Tribble API changes.	2013-08-19 18:31:02 -07:00
Michael McCowan	c3a933ce84	Adaptations to accomodate Tribble API changes, comprising mostly of the following. * Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable. * Galvanizing fo generic types. * Test fixups, mostly to pass around LineIterators instead of LineReaders. * New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances. * New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).	2013-08-19 15:52:47 -04:00
Eric Banks	17eb7b49fe	Adding ability to use Ryan's PCR error modeling to the Haplotype Caller. There is now a command-line option to set the model to use in the HC. Incorporated Ryan's current (unmerged) branch in for most of these changes. Because small changes to the math can have drastic effects, I decided not to let users tweak the calculations themselves. Instead they can select either NONE, CONSERVATIVE (the default), or AGGRESSIVE. Note that any base insertion/deletion qualities from BQSR are still used. Also, note that the repeat unit x repeat length approach gave very poor results against the KB, so it is not included as an option here.	2013-08-16 01:53:04 -04:00
Eric Banks	69e78efeae	Merge pull request #366 from broadinstitute/gg_gatkdocfixes More gatkdoc fixes	2013-08-13 04:52:03 -07:00
Eric Banks	bcf9a1cda5	Merge pull request #370 from broadinstitute/rp_dont_output_filtered_variants_in_VQSR Adding mode to VQSR to not output variant records that are filtered out ...	2013-08-12 12:01:50 -07:00
Ryan Poplin	a45011d7e7	Adding mode to VQSR to not output variant records that are filtered out after applying the recalibration. Necessary for 1000G calling.	2013-08-12 11:22:59 -04:00
Ryan Poplin	59f56bef30	Cleaning up help text for the -numBad argument.	2013-08-12 09:51:56 -04:00
Geraldine Van der Auwera	4d20c71e09	Improvements to various gatkdocs - Make -rod required - Document that contaminationFile is currently not functional with HC - Document liftover process more clearly - Document VariantEval combinations of ST and VE that are incompatible - Added a caveat about using MVLR from HC and UG. - Added caveat about not using -mte with -nt - Clarified masking options - Fixed docs based on Erics comments	2013-08-10 10:01:31 -07:00
Mark DePristo	b7d1096ced	Added onlyEmitSamples argument to UnifiedGenotyper -- When provided, this argument causes us to only emit the selected samples into the VCF. No INFO field annotations (AC for example) or other features are modified. It's current primary use is for efficiently evaluating joint calling. -- Add integration test for onlyEmitSamples	2013-08-09 11:00:15 -04:00
Mark DePristo	ccf0df0fea	Misc. debugging functionality to FS calculation (disabled by default)	2013-08-08 12:06:23 -04:00
Mark DePristo	00f4d767e4	Merge pull request #364 from broadinstitute/md_vqsr_improvements Separate num Gaussians for + and - GMM in VQSR	2013-08-07 04:37:45 -07:00
Mark DePristo	c21402d4af	Separate num Gaussians for + and - GMM in VQSR -- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models. However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model. -- Now the maxGaussians argument only applies to the positive model -- This update builds a GMM for the negative model with a default 4 max gaussians (though this can be controlled via command line parameter) -- Removes the percentBadVariants argument. The only way to control how many variants are included in the negative model is with minNumBad -- Reduced the minNumBad argument default to 1000 from 2500 -- Update MD5s for VQSR. md5s changed significantly due to underlying changes in the default GMM model. Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected. -- minNumBad is now numBad -- Plot all negative training points as well, since this significantly changes our view of the GMM PDF	2013-08-07 07:36:50 -04:00
Mark DePristo	318f7e74e4	Better docs on the meaning of heterozygosity -- [delivers #53522209]	2013-08-07 07:27:45 -04:00
Mark DePristo	40bc7d6a9c	Bugfix for ReferenceConfidenceModel -- In the case where there's some variation to assembly and evaluate but the resulting haplotypes don't result in any called variants, the reference model would exception out with "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype". Now we detect this case and emit the standard no variation output. -- [delivers #54625060]	2013-08-06 16:00:32 -04:00
Ryan Poplin	a46f633bd6	Fix for the VQSR visualization script with the new ordering of annotations.	2013-08-02 19:10:45 -04:00
Mauricio Carneiro	285ab2ac62	Better caching for the HaplotypeCaller Problem ------- Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless. Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting tons of re-compute. Solution ------- Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.	2013-08-02 01:27:29 -04:00
Eric Banks	1e396af4d0	Two reduce reads updates/fixes: 1. Removing old legacy code that was capping the positional depth for reduced reads to 127. Unfortunately this cap affectively performs biased down-sampling and throws off e.g. FS numbers. Added end to end unit test that depth counts in RR can be higher than max byte. Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127. 2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads. Refactored it so that it can only remove non-consensus reads.	2013-08-01 14:34:59 -04:00
Ryan Poplin	4f3411f3d4	Max number of haplotypes to evaluate no longer grows unbounded with the number of samples. This is necessary for multi-sample calling projects with over 100 samples.	2013-07-31 10:48:55 -04:00
Yossi Farjoun	284176cd7b	moved SnpEffUtilUnitTest to public tree	2013-07-30 17:51:40 -04:00
droazen	b8709b1942	Merge pull request #332 from broadinstitute/st_fpga_hmm FPGA support for PairHMM	2013-07-30 14:21:21 -07:00
Joseph Rose	d2860a5486	Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose)	2013-07-30 15:38:32 -04:00
Mauricio Carneiro	7b731dd596	Removed native method call and fixed indentation.	2013-07-30 13:59:58 -04:00
Geraldine Van der Auwera	edbd17b8e0	Added note of caution to VQSR gatkdocs for option BOTH of recalibration mode	2013-07-26 15:51:29 -04:00
Ryan Poplin	f52196496d	Merge pull request #347 from broadinstitute/eb_more_dnagling_tail_improvements More specific fix for the dangling tail edge case with a single leading deletion.	2013-07-26 07:25:47 -07:00
Ryan Poplin	8c205dda1b	Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line.	2013-07-26 10:22:43 -04:00
Eric Banks	9372c5ef41	Merge pull request #334 from broadinstitute/mc_generic_input_for_qualify_missing_intervals QualifyMissingIntervals: support different formats	2013-07-25 12:39:26 -07:00
sathibault	71eb944e62	Adding CnyPairHMMUnitTest	2013-07-25 14:19:50 -05:00
Eric Banks	5dfa863caa	Fully stranded implementation of RR (plus bug fix for insertions and het compression). Now only filtered reads are unstranded. All consensus reads have strand, so that we emit 2 consensus reads in general now: one for each strand. This involved some refactoring of the sliding window which cleaned it up a lot. Also included is a bug fix: insertions downstream of a variant region weren't triggering a stop to the compression.	2013-07-25 14:48:53 -04:00
Eric Banks	0a2b5ddadf	More specific fix for the dangling tail edge case with a single leading deletion. The previous fix was too general (and therefore incorrect) and caused the HC to exception out. Added "unit" test for this exact case.	2013-07-25 12:24:46 -04:00
Mauricio Carneiro	31ab0824b1	quick indentation fixes to FPGA code	2013-07-24 14:09:49 -04:00
Eric Banks	6df43f730a	Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes. Having them as bytes caused problems for downstream programmers who had data with high MQs.	2013-07-23 23:47:15 -04:00
Guillermo del Angel	9dd109b79a	Last feature request from Reich/Paavo labs: the allSitePLs feature in UG worked but not quite filled requirements. What's needed is the ability to have all 10 PLs for EVERY site, regardless of whether they are variant or not. Previous version only emitted the 10 PLs in reference sites. Problem is that, if all PLs are emitted in all sites and every single site is quad-allelic (only way to have the PLs printed out in a valid way) then the ability to filter variants and to use the INFO fields may be compromised. So, compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which has the 10 values, but all other statistics and regular PLs are computed as before. Note that integration test had to be disabled, as the BCF2 codec apparently doesn't support writing into genotype fields other than PL,DP,AD,GQ,FT and GT.	2013-07-18 12:54:52 -04:00
Scott Thibault	5d198d3400	Added write to likelihoods.txt for batch hmm	2013-07-15 10:16:39 -05:00
sathibault	0a8f75b953	Merge branch 'master' into st_fpga_hmm Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java	2013-07-15 08:17:32 -05:00
Mauricio Carneiro	8c07614321	QualifyMissingIntervals: support different formats Problem ------- Qualify Missing Intervals only accepted GATK formatted interval files for it's coding sequence and bait parameters. Solution ------- There is no reason for such limitation, I erased all the code that did the parsing and used IntervalUtils to parse it (therefore, now it handles any type of interval file that the GATK can handle). ps: Also added an average depth column to the output	2013-07-12 17:32:53 -04:00
Yossi Farjoun	afcf7b96db	- Added per-sample AlleleBiasedDownsampling capability to HaplotypeCaller - Added integration test to show that providing a contamination value and providing same value via a file results in the same VCF - overrode default contamination value in test	2013-07-12 16:22:02 -04:00
Eric Banks	b16c7ce050	A whole slew of improvements to the Haplotype Caller and related code. 1. Some minor refactorings and claenup (e.g. removing unused imports) throughout. 2. Updates to the KB assessment functionality: a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call. b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling. 3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived the DeBruijn assembler. 4. Improvements to the likelihoods for the HC a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need to divide e by 3 because the observed base could have come from any of the non-observed alleles. b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are turned into Q0s for now. c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real sites with off kilter allele balances and we occasionally lose them to contamination removal. 5. Improved the dangling tail merging implementation.	2013-07-12 10:09:10 -04:00
sathibault	23fe3e449a	Revert "Fixed batching bug." This reverts commit 3e56c83d0eec7c374e5f187d1ef124d42ecc071e.	2013-07-11 11:30:37 -05:00
sathibault	7458b59bb3	Fixed batching bug.	2013-07-11 11:08:46 -05:00
Menachem Fromer	a8ea57df9e	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-07-10 16:44:35 -04:00
Guillermo del Angel	aba55dbb23	Moved some HC parameters related to active region extensions to command line arguments so that they're more easily modified. Some of these parameters need tinkering in order to call some large indels. See GSA-891 and subtasks for particular examples thereof.	2013-07-10 14:31:10 -04:00
Eric Banks	73fc7f6ab1	Reduce Reads output should never be expected to be sorted (hence the need to sort on disk) but for some reason it was with -nwayout mode.	2013-07-08 10:33:36 -04:00

... 5 6 7 8 9 ...

1463 Commits (e34ec0fbbba2dc1d424449ed30dc3b53e4ce6f6f)