Commit Graph

1055 Commits (6bba8d714799bc69b0538e38c76d405730d10aaa)

Author SHA1 Message Date
Valentin Ruano-Rubio 84711b8e90 Fixed bug using GraphBased due to infinite likelihoods resulting from the calculation of alignment cost of very long insertion or deletions (done in linear scale)
Stories:

  https://www.pivotaltracker.com/story/show/66263868

Bug:

  The problem was due to the way we were calculating the fix penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the portion
  or read or haplotype deletion as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0 in
  in linear scale computations that ins transformed into an infinity in log scale.

Changes:

  - Change to use log10 scale for calculate those penalties.
  - Minor addition of .gitignore to hide ./public/external-example/target which is generated by the building process.
2014-04-01 16:14:52 -04:00
Eric Banks 821fbe7260 Merge pull request #582 from broadinstitute/vrr_hc_bugfixes_dangling_heads
Fix loss of key alternative haplotypes due to a change on threading star...
2014-03-31 10:42:08 -04:00
Joel Thibault 2049eb1658 Rev Picard 1.110.1763
- SamPairUtils migrated in Picard r1737
- Revert IndelRealigner changes made in commit 4f4b85
-- Those changes were based on Picard revision 1722 to net/sf/picard/sam/SamPairUtil.java
-- Picard revision 1723 reverts these changes, so we also revert to match
2014-03-30 09:33:57 -04:00
Valentin Ruano-Rubio 258b2bce28 Fix loss of key alternative haplotypes due to a change on threading start policy required when recovering dangling heads.
Story:

  - https://www.pivotaltracker.com/story/show/67601310

Change:

  - Unless recover-danging-heads is active, the threading starting location policy is the original one. i.e. just at already existing unique kmer vertices.

Tests:

  - HaplotypeCallerIntegrationTest#testMissingKeyAlternativeHaplotypesBugFix
2014-03-29 22:40:26 -04:00
Ryan Poplin 6566dd6ca9 Fix for dropping of reference sample depth in the DP annotation.
-- In the case of hierarchical merge we can't assume that we have only one genotype.
-- Removed use of deprecated VC annotation access functions.
2014-03-24 14:01:50 -04:00
Eric Banks 32a96e3ab3 Fix for reads that are all insertions (e.g. 50I) and causing the IndelRealigner to error out. 2014-03-21 15:01:34 -04:00
Eric Banks 7c8ce3cd6a Several improvements to GenotypeGVCFs: --includeNonVariantSites now actually works and we propagate AD to hom ref samples 2014-03-20 00:35:54 -04:00
Eric Banks 824983af1d Enable CombineGVCFs to process gVCFs that were created with basepair resolution. 2014-03-19 19:23:05 -04:00
Eric Banks 3b1c337401 Have CombineVariants throw a UserError when trying to combine GVCFs from the HaplotypeCaller.
Was previously throwing an IllegalArgumentException (in the wrong place in the code).
Error message tells users to use CombineGVCFs.
2014-03-19 19:11:40 -04:00
Valentin Ruano-Rubio 905b6066b2 Reduce runtime of very long integration test 2014-03-18 21:48:13 -04:00
David Roazen 2d8653f493 Update pom versions to mark the start of GATK 3.2 development 2014-03-18 01:18:59 -04:00
David Roazen a6a41c777c Update pom versions for 3.1 2014-03-18 01:09:29 -04:00
Alec Wysoker 0369f93b24 GATK changes to conform to Tribble refactoring as part improving Tabix support in Tribble (among other things).
1. Enable on-the-fly indexing for vcf.gz.
2. Handle on-the-fly indexing where file to be indexed is not a regular file, thus index should not be created.
3. Add method setProgressLogger to all SAMFileWriter implementations.
4. Revved picard to 1.109.1722
5. IndelRealigner md5s change because the MC tag is added to records now.

Fixed up and signed off by ebanks.
2014-03-17 11:56:22 -04:00
Eric Banks 34c697bf12 Merge pull request #554 from broadinstitute/bh_SOR_new_annotation
Bh sor new annotation
2014-03-17 10:58:13 -04:00
Laura Gauthier 40c13d446a Added documentation category for CalculateGenotypePosteriors 2014-03-17 10:36:19 -04:00
Valentin Ruano-Rubio 2e964c59b4 Improved criteria to select best haplotypes out from the assembly graph.
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.

The edge *mulitplicity* is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely map to some vertex up-stream from the edge and the following base calls extend across that edge to vertices downstream from it.

Despite that it is obvious that higher multiplicties correlated with haplotype probability this criterion fails short in some regards of which the most relevant is:

As it is evaluated in condensed seq-graph (as supposed to uncompressed read-threading-graphs) it is bias to haplotypes that have more short-sequence vetices
  ( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly result of how we modify the edge multiplicities when we merge vertices from a linear chain.

This pull-request addresses the problem by changing to a new scoring schema based in likelihood estimates:

Each haplotype's likelihood can be calculated as the multiplication of the likelihood of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divide by the sum of multiplicity of edges that share the same source vertex.

This pull-request addresses the following stories:

https://www.pivotaltracker.com/story/show/66691418
https://www.pivotaltracker.com/story/show/64319760

Change Summary:

1. Change to the new scoring schema.
2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformation have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0 edges assuming that they are in fact 0.5)
2014-03-14 18:37:01 -04:00
Bertrand Haas 82108d110f New abstract class StrandBiasTest() with old sub-class FisherStrand() and new sub-class StrandOddsRatio(). Latter is test based on symmetric odds ratio more appropriate than Fisher exact test when number of samples is large.
https://www.pivotaltracker.com/story/show/66087886
2014-03-14 18:33:21 -04:00
Eric Banks 7c7ff90266 Merge pull request #558 from broadinstitute/rp_vqsr_nondeterminism_fix
Fix for non-determinism in the VQSR with very large data sets
2014-03-12 14:35:51 -04:00
Eric Banks ffaf92f871 Added new functionality to the FastaAlternateReferenceMaker to have it output IUPAC codes for het sites.
Enable it with the new --useIUPAC argument.
Added both unit and integration tests for the new functionality - and fixed up the
exising tests once I was in there.
2014-03-12 14:31:57 -04:00
Ryan Poplin 907d1d6160 Fix for non-determinism in the VQSR with very large data sets 2014-03-12 10:25:12 -04:00
ldgauthier 4e74e77e74 Merge pull request #555 from broadinstitute/eb_add_option_to_CGVCFs_for_all_sites_GVCF
Added an option to CombineGVCFs to create basepair resolution gVCFs from...
2014-03-12 10:01:18 -04:00
David Roazen c67ced5f3b Emit a warning whenever the VectorLoglessPairHMM is used 2014-03-12 09:55:35 -04:00
Eric Banks d697e0144f Added an option to CombineGVCFs to create basepair resolution gVCFs from banded ones.
Use the --convertToBasePairResolution argument to enable this functionality.
2014-03-12 01:32:51 -04:00
Ryan Poplin 34d11fe40c Added the consensus mode used for the 1000 Genomes Project to the HaplotypeCaller.
-- All the provided alleles are added to the assembly graph as potential haplotypes but they aren't forcibly genotyped like in GGA mode.
-- Added integration test for this mode
2014-03-11 09:56:35 -04:00
droazen 8b53567dc7 Merge pull request #553 from broadinstitute/dr_rename_pipeline_tests
Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests
2014-03-10 21:36:45 -04:00
David Roazen 78562c14bb Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests
-These tests are really integration tests for Queue rather than generalized
 pipeline tests, so it makes sense to call them QueueTests.

-Rename test classes and maven build targets, and update shell scripts
 to reflect new naming.
2014-03-10 21:24:03 -04:00
David Roazen 7c34f05082 Merge remote-tracking branch 'origin/master' into intel 2014-03-10 14:07:36 -04:00
David Roazen 5a6aa54673 Revert "Update HaplotypeCaller and VariantAnnotator test MD5s"
This reverts commit 7faa44d576b06d7aef29562e82590a7855f216f4.
2014-03-10 14:06:51 -04:00
David Roazen e7d6db033b Revert "Revert "Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING""
This reverts commit c8a34749e631b92214a57bba162c6e0d849425f1.
2014-03-10 14:05:51 -04:00
David Roazen f070583f29 Update HaplotypeCaller and VariantAnnotator test MD5s
There are a few innocuous test failures on this branch --
updating MD5s after reviewing the differences in output
2014-03-07 10:54:27 -05:00
Karthik Gururaj 6e98e9e589 Removed g_haplotype* global variables in native code so that it works
with multi-threading in Java.
Modified VectorLoglessPairHMM.java so that jniInitializeRegion and
jniFinalizeRegion are empty
2014-03-06 22:08:35 -08:00
David Roazen 3f3df90412 Revert "Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING"
This reverts commit cef03f089fb3f131f3a77664b71feaec51a74cc8.
2014-03-06 10:15:35 -05:00
David Roazen 9df59bd8cc Update pom versions to mark the start of GATK 3.1 development 2014-03-06 00:05:58 -05:00
David Roazen 34edcb8ddf Update pom versions for the 3.0 release 2014-03-05 23:37:21 -05:00
David Roazen 53895e15cd Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING 2014-03-05 19:26:37 -05:00
Eric Banks d3de6413c9 Move warnings to debug logging status because they will definitely scare users 2014-03-05 15:05:21 -05:00
Karthik Gururaj 51b8ea5d59 Reset version 2014-03-05 11:19:08 -08:00
Karthik Gururaj b9afe800ae Merge correction 2014-03-05 10:06:45 -08:00
Karthik Gururaj 8fcbf9272c Merge branch 'intel_pairhmm' of /data/broad/gsa-unstable into intel_pairhmm
Conflicts:
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java
	public/VectorPairHMM/src/main/c++/Sandbox.java
2014-03-05 09:35:50 -08:00
Intel Repocontact d81116eb1d Added vectorized PairHMM implementation by Mohammad and Mustafa into the Maven build of GATK.
C++ code has PAPI calls for reading hardware counters

Followed Khalid's suggestion for packing libVectorLoglessCaching into
the jar file with Maven

Native library part of git repo

1. Renamed directory structure from public/c++/VectorPairHMM to
public/VectorPairHMM/src/main/c++ as per Khalid's suggestion
2. Use java.home in public/VectorPairHMM/pom.xml to pass environment
variable JRE_HOME to the make process. This is needed because the
Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among
others). Assuming that the Maven build process uses a JDK (and not just
a JRE), the variable java.home points to the JRE inside maven.
3. Dropped all pretense at cross-platform compatibility. Removed Mac
profile from pom.xml for VectorPairHMM

Moved JNI_README

1. Added the catch UnsatisfiedLinkError exception in
PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING
in case the native library could not be loaded. Made
VECTOR_LOGLESS_CACHING as the default implementation.
2. Updated the README with Mauricio's comments
3. baseline.cc is used within the library - if the machine supports
neither AVX nor SSE4.1, the native library falls back to un-vectorized
C++ in baseline.cc.
4. pairhmm-1-base.cc: This is not part of the library, but is being
heavily used for debugging/profiling. Can I request that we keep it
there for now? In the next release, we can delete it from the
repository.
5. I agree with Mauricio about the ifdefs. I am sure you already know,
but just to reassure you the debug code is not compiled into the library
(because of the ifdefs) and will not affect performance.

1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java
2. Committing the right set of files after rebase

Added public license text to all C++ files

Added license to Makefile

Add package info to Sandbox.java

Conflicts:
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/DebugJNILoglessPairHMM.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/JNILoglessPairHMM.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/VectorLoglessPairHMM.java
	public/VectorPairHMM/src/main/c++/.gitignore
	public/VectorPairHMM/src/main/c++/LoadTimeInitializer.cc
	public/VectorPairHMM/src/main/c++/LoadTimeInitializer.h
	public/VectorPairHMM/src/main/c++/Makefile
	public/VectorPairHMM/src/main/c++/Sandbox.cc
	public/VectorPairHMM/src/main/c++/Sandbox.h
	public/VectorPairHMM/src/main/c++/Sandbox.java
	public/VectorPairHMM/src/main/c++/Sandbox_JNIHaplotypeDataHolderClass.h
	public/VectorPairHMM/src/main/c++/Sandbox_JNIReadDataHolderClass.h
	public/VectorPairHMM/src/main/c++/baseline.cc
	public/VectorPairHMM/src/main/c++/define-double.h
	public/VectorPairHMM/src/main/c++/define-float.h
	public/VectorPairHMM/src/main/c++/define-sse-double.h
	public/VectorPairHMM/src/main/c++/define-sse-float.h
	public/VectorPairHMM/src/main/c++/headers.h
	public/VectorPairHMM/src/main/c++/jnidebug.h
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.cc
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.h
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.cc
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.h
	public/VectorPairHMM/src/main/c++/pairhmm-template-kernel.cc
	public/VectorPairHMM/src/main/c++/pairhmm-template-main.cc
	public/VectorPairHMM/src/main/c++/run.sh
	public/VectorPairHMM/src/main/c++/shift_template.c
	public/VectorPairHMM/src/main/c++/utils.cc
	public/VectorPairHMM/src/main/c++/utils.h
	public/VectorPairHMM/src/main/c++/vector_function_prototypes.h
2014-03-05 09:30:29 -08:00
Laura Gauthier 43fdd38342 Add error handling to CalculateGenotypePosteriors to catch multiallelic variants with wrong number of ACs
-- throws UserException; added tests in PosteriorLikelihoodsUtilsUnitTests
Add error handling to CalculateGenotypePosteriors for cases where MLEAC>AN; add tests in PosteriorLikelihoodsUtilsUnitTests
Add unit tests to confirm that CalculateGenotypePosteriors has the ability to switch genotypes for four cases
2014-03-05 12:03:18 -05:00
Laura Gauthier 7f9f58dbd1 Added hidden flag to GenotypeConcordance to output sites of discordant genotypes (to System.out)
Revised ConcondanceMetrics tests to adapt to change
Added comments to PosteriorLikelihoodsUtils
2014-03-05 12:03:18 -05:00
Karthik Gururaj 2648b41398 Added vectorized PairHMM implementation by Mohammad and Mustafa into the Maven build of GATK.
C++ code has PAPI calls for reading hardware counters

Followed Khalid's suggestion for packing libVectorLoglessCaching into
the jar file with Maven

Native library part of git repo

1. Renamed directory structure from public/c++/VectorPairHMM to
public/VectorPairHMM/src/main/c++ as per Khalid's suggestion
2. Use java.home in public/VectorPairHMM/pom.xml to pass environment
variable JRE_HOME to the make process. This is needed because the
Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among
others). Assuming that the Maven build process uses a JDK (and not just
a JRE), the variable java.home points to the JRE inside maven.
3. Dropped all pretense at cross-platform compatibility. Removed Mac
profile from pom.xml for VectorPairHMM

Moved JNI_README

1. Added the catch UnsatisfiedLinkError exception in
PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING
in case the native library could not be loaded. Made
VECTOR_LOGLESS_CACHING as the default implementation.
2. Updated the README with Mauricio's comments
3. baseline.cc is used within the library - if the machine supports
neither AVX nor SSE4.1, the native library falls back to un-vectorized
C++ in baseline.cc.
4. pairhmm-1-base.cc: This is not part of the library, but is being
heavily used for debugging/profiling. Can I request that we keep it
there for now? In the next release, we can delete it from the
repository.
5. I agree with Mauricio about the ifdefs. I am sure you already know,
but just to reassure you the debug code is not compiled into the library
(because of the ifdefs) and will not affect performance.

1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java
2. Committing the right set of files after rebase

Added public license text to all C++ files

Added license to Makefile

Add package info to Sandbox.java
2014-03-05 08:31:24 -08:00
Intel Repocontact 6aa67a2585 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2014-03-05 08:14:32 -08:00
Intel Repocontact 1de2d2546e Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2014-03-05 00:04:21 -08:00
Valentin Ruano-Rubio 69bf2b3247 Added a more efficient implementation of the KBest haplotype finder code (CONT.)
Changes:

  1. Addressed review comments on new K-best haplotype assembly graph finder.
  2. Generalize KBestHaplotypeFinder to deal with multiple source and sink vertices.
  3. Updated test to use KBestHaplotypeFinder instead of KBestPaths
  4. Retired KBestPaths to the archive.
  5. Small improvements to the code and documentation.
2014-03-04 23:22:27 -05:00
Valentin Ruano-Rubio 7acf2eb0e7 Added a more efficient implementation of the KBest haplotype finder code.
Story:

  https://www.pivotaltracker.com/story/show/66238286

Changes:

  1. Created a new k-best haplotype search implementation in class KBestHaplotypeFinder.
  2. Changed HC code to use the new implementation.
  This seems to fix the original problem without causing significant changes in outputs using some empirical data test cases
  3. Moved haplotype's cigar calculation code from Path to CigarUtils; need that in order to gain independence from Path in some parts of the code.
     In any case that seems like a more natural location for that functionality.
2014-03-04 12:22:14 -05:00
Eric Banks b99bf85ec8 Fixed bug where dangling tail merging occasionally created a cycle in the graph.
Added unit tests to cover this case.  Delivers PT#66690470.
2014-03-03 22:42:56 -05:00
Eric Banks 4d69af189e Minor change: make the --dontUseSoftClippedBases @Advanced instead of @Hidden 2014-03-03 15:59:32 -05:00
Eric Banks fa65716fe9 Added code to retrieve dangling heads from the read threading graph (previously we were rescuing just the tails).
The purpose of this is to be able to call SNPs that fall at the beginning of a capture region (or exon).
Before, the read threading code would only start threading from the first kmer that matched the reference.  But
that means that, in the case of a SNP at the beginning of an exome, it wouldn't start threading the read until
after the SNP position - so we'd lose the SNP.

For now, this is still very experimental.  It works well for RNAseq data, but does introduce FPs in normal exomes.
I know why this is and how to fix it, but it requires a much larger fix to the HC: the HC needs to pass all reads
and bases to the annotation engine (like UG does) instead of just the high quality ones.  So for now, the head
merging is disabled by default.

As per reviewer comments, I moved the head and tail merging code out into their own class.
2014-03-03 15:59:26 -05:00