story:
https://www.pivotaltracker.com/story/show/69648104
description:
This read transformer will refactor cigar strings that contain N-D-N elements to one N element (with total length of the three refactored elements).
This is intended primarily for users of RNA-Seq data handling programs such as TopHat2.
Currently we consider that the internal N-D-N motif is illegal and we error out when we encounter it. By refactoring the cigar string of
those specific reads, users of TopHat and other tools can circumvent this problem without affecting the rest of their dataset.
edit: address review comments - change the tool's name and change the tool to be a readTransformer instead of read filter
CalculateGenotypePosteriors now only computes posterior probs for SNP sites with SNP priors
(other sites have flat priors applied)
CalibrateGenotypeLikelihoods had originally applied HOM_REF/HET/HOM_VAR frequencies in callset as priors before empirical quality analysis. Now has option (-noPriors) to not apply/apply flat priors. Also takes in new external probabilities files, such as those generated by CGP, from which the genotype posterior probability qualities will be read.
Integration test was changed to account for new SNP-only behavior and default behavior to not use missing priors.
(Also, new numRefIfMissing is 0, which should only matter in cases using few samples when you probably don't want to be doing that anyway!)
Description:
Transforms a delegation dependency from HC to UG genotyping engine into a reusage by inhertance where HC and UG engines inherit from a common superclass GenotyperEngine
that implements the common parts. A side-effect some of the code is now more clear and redundant code has been removed.
Changes have a few consequence for the end user. HC has now a few more user arguments, those that control the functionality that HC was borrowing directly from UGE.
Added -ploidy argument although it is contraint to be 2 for now.
Added -out_mode EMIT_ALL_SITES|EMIT_VARIANTS_ONLY ...
Added -allSitePLs flag.
Stories:
https://www.pivotaltracker.com/story/show/68017394
Changes:
- Moved (HC's) GenotyperEngine to HaplotypeCallerGenotyperEngine (HCGE). Then created a engine superclass class GenotypingEngine (GE) that contains common parts between HCGE and the UG counterpart 'UnifiedGenotypingEngine' (UGE). Simplified the code and applied the template pattern to accomodate for small diferences in behaviour between both caller
engines. (There is still room for improvement though).
- Moved inner classes and enums to top-level components for various reasons including making them shorter and simpler names to refer to them.
- Create a HomoSpiens class for Human specific constants; even if they are good default for most users we need to clearly identify the human assumption across the code if we want to make
GATK work with any species in general; i.e. any reference to HomoSapiens, except as a default value for a user argument, should smell.
- Fixed a bug deep in the genotyping calculation we were taking on fixed values for snp and indel heterozygisity to be the default for Human ignoring user arguments.
- GenotypingLikehooldCalculationCModel.Model to Gen.*Like.*Calc.*Model.Name; not a definitive solution though as names are used often in conditionals that perhaps should be member methods of the
GenLikeCalc classes.
- Renamed LikelihoodCalculationEngine to ReadLikelihoodCalculationEngine to distinguish them clearly from Genotype likelihood calculation engines.
- Changed copy by explicity argument listing to a clone/reflexion solution for casting between genotypers argument collection classes.
- Created GenotypeGivenAllelesUtils to collect methods needed nearly exclusively by the GGA mode.
Tests :
- StandardCallerArgumentCollectionUnitTest (check copy by cloning/reflexion).
- All existing integration and unit tests for modified classes.
-- This is needed so the ref model pipeline can cut down to sites-only files without losing these useful statistics.
-- Added new unit test to test this info field annotation.
-- GenotypeGVCF integration tests change because new annotations are present in the output
Stories:
https://www.pivotaltracker.com/story/show/66263868
Bug:
The problem was due to the way we were calculating the fix penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the portion
or read or haplotype deletion as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0 in
in linear scale computations that ins transformed into an infinity in log scale.
Changes:
- Change to use log10 scale for calculate those penalties.
- Minor addition of .gitignore to hide ./public/external-example/target which is generated by the building process.
- SamPairUtils migrated in Picard r1737
- Revert IndelRealigner changes made in commit 4f4b85
-- Those changes were based on Picard revision 1722 to net/sf/picard/sam/SamPairUtil.java
-- Picard revision 1723 reverts these changes, so we also revert to match
Story:
- https://www.pivotaltracker.com/story/show/67601310
Change:
- Unless recover-danging-heads is active, the threading starting location policy is the original one. i.e. just at already existing unique kmer vertices.
Tests:
- HaplotypeCallerIntegrationTest#testMissingKeyAlternativeHaplotypesBugFix
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.
The edge *mulitplicity* is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely map to some vertex up-stream from the edge and the following base calls extend across that edge to vertices downstream from it.
Despite that it is obvious that higher multiplicties correlated with haplotype probability this criterion fails short in some regards of which the most relevant is:
As it is evaluated in condensed seq-graph (as supposed to uncompressed read-threading-graphs) it is bias to haplotypes that have more short-sequence vetices
( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly result of how we modify the edge multiplicities when we merge vertices from a linear chain.
This pull-request addresses the problem by changing to a new scoring schema based in likelihood estimates:
Each haplotype's likelihood can be calculated as the multiplication of the likelihood of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divide by the sum of multiplicity of edges that share the same source vertex.
This pull-request addresses the following stories:
https://www.pivotaltracker.com/story/show/66691418https://www.pivotaltracker.com/story/show/64319760
Change Summary:
1. Change to the new scoring schema.
2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformation have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0 edges assuming that they are in fact 0.5)
-- All the provided alleles are added to the assembly graph as potential haplotypes but they aren't forcibly genotyped like in GGA mode.
-- Added integration test for this mode
C++ code has PAPI calls for reading hardware counters
Followed Khalid's suggestion for packing libVectorLoglessCaching into
the jar file with Maven
Native library part of git repo
1. Renamed directory structure from public/c++/VectorPairHMM to
public/VectorPairHMM/src/main/c++ as per Khalid's suggestion
2. Use java.home in public/VectorPairHMM/pom.xml to pass environment
variable JRE_HOME to the make process. This is needed because the
Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among
others). Assuming that the Maven build process uses a JDK (and not just
a JRE), the variable java.home points to the JRE inside maven.
3. Dropped all pretense at cross-platform compatibility. Removed Mac
profile from pom.xml for VectorPairHMM
Moved JNI_README
1. Added the catch UnsatisfiedLinkError exception in
PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING
in case the native library could not be loaded. Made
VECTOR_LOGLESS_CACHING as the default implementation.
2. Updated the README with Mauricio's comments
3. baseline.cc is used within the library - if the machine supports
neither AVX nor SSE4.1, the native library falls back to un-vectorized
C++ in baseline.cc.
4. pairhmm-1-base.cc: This is not part of the library, but is being
heavily used for debugging/profiling. Can I request that we keep it
there for now? In the next release, we can delete it from the
repository.
5. I agree with Mauricio about the ifdefs. I am sure you already know,
but just to reassure you the debug code is not compiled into the library
(because of the ifdefs) and will not affect performance.
1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java
2. Committing the right set of files after rebase
Added public license text to all C++ files
Added license to Makefile
Add package info to Sandbox.java
Conflicts:
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/DebugJNILoglessPairHMM.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/JNILoglessPairHMM.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/VectorLoglessPairHMM.java
public/VectorPairHMM/src/main/c++/.gitignore
public/VectorPairHMM/src/main/c++/LoadTimeInitializer.cc
public/VectorPairHMM/src/main/c++/LoadTimeInitializer.h
public/VectorPairHMM/src/main/c++/Makefile
public/VectorPairHMM/src/main/c++/Sandbox.cc
public/VectorPairHMM/src/main/c++/Sandbox.h
public/VectorPairHMM/src/main/c++/Sandbox.java
public/VectorPairHMM/src/main/c++/Sandbox_JNIHaplotypeDataHolderClass.h
public/VectorPairHMM/src/main/c++/Sandbox_JNIReadDataHolderClass.h
public/VectorPairHMM/src/main/c++/baseline.cc
public/VectorPairHMM/src/main/c++/define-double.h
public/VectorPairHMM/src/main/c++/define-float.h
public/VectorPairHMM/src/main/c++/define-sse-double.h
public/VectorPairHMM/src/main/c++/define-sse-float.h
public/VectorPairHMM/src/main/c++/headers.h
public/VectorPairHMM/src/main/c++/jnidebug.h
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.cc
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.h
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.cc
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.h
public/VectorPairHMM/src/main/c++/pairhmm-template-kernel.cc
public/VectorPairHMM/src/main/c++/pairhmm-template-main.cc
public/VectorPairHMM/src/main/c++/run.sh
public/VectorPairHMM/src/main/c++/shift_template.c
public/VectorPairHMM/src/main/c++/utils.cc
public/VectorPairHMM/src/main/c++/utils.h
public/VectorPairHMM/src/main/c++/vector_function_prototypes.h
-- throws UserException; added tests in PosteriorLikelihoodsUtilsUnitTests
Add error handling to CalculateGenotypePosteriors for cases where MLEAC>AN; add tests in PosteriorLikelihoodsUtilsUnitTests
Add unit tests to confirm that CalculateGenotypePosteriors has the ability to switch genotypes for four cases
C++ code has PAPI calls for reading hardware counters
Followed Khalid's suggestion for packing libVectorLoglessCaching into
the jar file with Maven
Native library part of git repo
1. Renamed directory structure from public/c++/VectorPairHMM to
public/VectorPairHMM/src/main/c++ as per Khalid's suggestion
2. Use java.home in public/VectorPairHMM/pom.xml to pass environment
variable JRE_HOME to the make process. This is needed because the
Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among
others). Assuming that the Maven build process uses a JDK (and not just
a JRE), the variable java.home points to the JRE inside maven.
3. Dropped all pretense at cross-platform compatibility. Removed Mac
profile from pom.xml for VectorPairHMM
Moved JNI_README
1. Added the catch UnsatisfiedLinkError exception in
PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING
in case the native library could not be loaded. Made
VECTOR_LOGLESS_CACHING as the default implementation.
2. Updated the README with Mauricio's comments
3. baseline.cc is used within the library - if the machine supports
neither AVX nor SSE4.1, the native library falls back to un-vectorized
C++ in baseline.cc.
4. pairhmm-1-base.cc: This is not part of the library, but is being
heavily used for debugging/profiling. Can I request that we keep it
there for now? In the next release, we can delete it from the
repository.
5. I agree with Mauricio about the ifdefs. I am sure you already know,
but just to reassure you the debug code is not compiled into the library
(because of the ifdefs) and will not affect performance.
1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java
2. Committing the right set of files after rebase
Added public license text to all C++ files
Added license to Makefile
Add package info to Sandbox.java
Changes:
1. Addressed review comments on new K-best haplotype assembly graph finder.
2. Generalize KBestHaplotypeFinder to deal with multiple source and sink vertices.
3. Updated test to use KBestHaplotypeFinder instead of KBestPaths
4. Retired KBestPaths to the archive.
5. Small improvements to the code and documentation.
Story:
https://www.pivotaltracker.com/story/show/66238286
Changes:
1. Created a new k-best haplotype search implementation in class KBestHaplotypeFinder.
2. Changed HC code to use the new implementation.
This seems to fix the original problem without causing significant changes in outputs using some empirical data test cases
3. Moved haplotype's cigar calculation code from Path to CigarUtils; need that in order to gain independence from Path in some parts of the code.
In any case that seems like a more natural location for that functionality.