Implementation overview:
Created a new Java class called VectorLoglessPairHMM which extends LoglessPairHMM and
overrides functions from both LoglessPairHMM and PairHMM.
1. Constructor: Calls the base class constructors, then loads the native library located in this
directory and calls an init function (with suffix 'jniInitializeClassFieldsAndMachineMask') in the
library to determine the field IDs for the members of the classes JNIReadDataHolder and
JNIHaplotypeDataHolder. The native code stores the field IDs (struct offsets) for the classes and
re-uses them for subsequent computations. Optionally, the user can disable the vector
implementation by using the 'mask' argument (see comments for a more detailed explanation).
2. When the library is loaded, it invokes the constructor of the class LoadTimeInitializer (because
a global variable g_load_time_initializer is declared in the library). This constructor
(LoadTimeInitializer.cc) can be used to perform various initializations. Currently, it initializes
two global function pointers to point to the function implementation that is supported on the
machine (AVX/SSE/un-vectorized) on which the program is being run. The two pointers are for float
and double respectively. The global function pointers are declared in utils.cc and are assigned in
the function initialize_function_pointers() defined in utils.cc and invoked from the constructor of
LoadTimeInitializer.
Other initializations in LoadTimeInitializer:
* ConvertChar::init - sets some masks for the vector implementation
* Enables FTZ (flush-to-zero) for performance
* Resets the stat counters to 0
* Initializes debug structs (which are never used in non-debug mode)
This initialization is done only once for the whole program.
3. initialize(): Initializes the region for PairHMM and passes the haplotype bases to the native
code through the JNIHaplotypeDataHolder class. Since the haplotype list is common across multiple
samples in computeReadLikelihoods(), the haplotype bases are passed to the native code once and
re-used across multiple samples.
4. computeLikelihoods(): Copies array references for readBases, quals, etc. to an array of
JNIReadDataHolder objects, invokes the JNI function to perform the computation, and updates the
likelihoodMap.
The JNI function copies the byte array references into an array of testcase structs and invokes the
compute_full_prob function through the function pointers initialized earlier.
The primary native function called is
Java_org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM_jniComputeLikelihoods. It uses
standard JNI calls to get and return data from/to the Java class VectorLoglessPairHMM. The last
argument to the function is the maximum number of OpenMP threads to use while computing PairHMM in
C++. This option is set when the native function call is made from JNILoglessPairHMM
computeLikelihoods; currently it is set to 12, an arbitrary value.
Note: OpenMP has been disabled for now because there are too few testcases per call to
computeLikelihoods() to justify multi-threading.
5. finalizeRegion(): Releases the haplotype arrays initialized in step 3 - should be called at the
end of every region (line 351 in PairHMMLikelihoodCalculationEngine).
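The load-time dispatch described in step 2 can be sketched roughly as follows. This is a minimal
illustration, not the code in LoadTimeInitializer.cc or utils.cc: the mask bit layout, the Isa enum,
select_isa() and the kernel_* stand-ins are all hypothetical, and CPU detection here relies on the
GCC/Clang builtin __builtin_cpu_supports, whereas the library may detect features differently.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#endif

// Hypothetical mask layout: one bit per vector ISA the caller allows.
constexpr std::uint64_t kAllowSse41 = 1ull << 0;
constexpr std::uint64_t kAllowAvx   = 1ull << 1;
constexpr std::uint64_t kAllowAll   = ~0ull;

enum class Isa { Scalar, Sse41, Avx };

// Pick the widest ISA that the CPU supports and the mask permits.
Isa select_isa(bool has_sse41, bool has_avx, std::uint64_t mask) {
    if (has_avx && (mask & kAllowAvx)) return Isa::Avx;
    if (has_sse41 && (mask & kAllowSse41)) return Isa::Sse41;
    return Isa::Scalar;
}

// Stand-in kernels (assumption): each returns a tag so the chosen variant is
// observable. The real variants would compute a likelihood instead.
template <typename T> Isa kernel_scalar(const T*) { return Isa::Scalar; }
template <typename T> Isa kernel_sse41(const T*)  { return Isa::Sse41; }
template <typename T> Isa kernel_avx(const T*)    { return Isa::Avx; }

// Two global function pointers, one per precision.
Isa (*g_compute_float)(const float*)   = nullptr;
Isa (*g_compute_double)(const double*) = nullptr;

void initialize_function_pointers(std::uint64_t mask) {
    bool has_sse41 = false, has_avx = false;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // needed when running inside a static constructor
    has_sse41 = __builtin_cpu_supports("sse4.1") != 0;
    has_avx   = __builtin_cpu_supports("avx") != 0;
#endif
    switch (select_isa(has_sse41, has_avx, mask)) {
        case Isa::Avx:
            g_compute_float  = kernel_avx<float>;
            g_compute_double = kernel_avx<double>;
            break;
        case Isa::Sse41:
            g_compute_float  = kernel_sse41<float>;
            g_compute_double = kernel_sse41<double>;
            break;
        case Isa::Scalar:
            g_compute_float  = kernel_scalar<float>;
            g_compute_double = kernel_scalar<double>;
            break;
    }
    assert(g_compute_float != nullptr && g_compute_double != nullptr);
}

// Constructed once when the library (or program) is loaded.
struct LoadTimeInitializer {
    LoadTimeInitializer() {
#if defined(__SSE__)
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  // FTZ for performance
#endif
        initialize_function_pointers(kAllowAll);
    }
};
LoadTimeInitializer g_load_time_initializer;
```

Calling initialize_function_pointers(0) afterwards forces the un-vectorized path in this sketch,
which mirrors how a mask could be used to disable the vector implementation.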
Note: Debug code has been moved to a separate class DebugJNILoglessPairHMM.java.
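The per-region lifecycle in steps 3 to 5 (store the haplotypes once, score several samples against
them, then release) can be illustrated with a small stand-alone sketch. It assumes nothing about the
real JNI layer: initialize_region(), compute_likelihoods(), finalize_region() and the Testcase
layout are hypothetical names, and toy_score() merely counts matching bases in place of
compute_full_prob.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Haplotype bases held on the native side for the current region.
std::vector<std::string> g_haplotypes;

// One (read, haplotype) pair to score; the pointers refer to caller-owned data.
struct Testcase {
    const std::string* read;
    const std::string* haplotype;
};

// Toy stand-in for compute_full_prob (assumption): counts positions where the
// bases match. This is NOT the PairHMM forward algorithm.
double toy_score(const Testcase& tc) {
    assert(tc.read != nullptr && tc.haplotype != nullptr);
    const std::size_t n = std::min(tc.read->size(), tc.haplotype->size());
    double matches = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if ((*tc.read)[i] == (*tc.haplotype)[i]) matches += 1.0;
    }
    return matches;
}

// Step 3: store the haplotypes once per region.
void initialize_region(std::vector<std::string> haplotypes) {
    g_haplotypes = std::move(haplotypes);
}

// Step 4: called once per sample; reuses the stored haplotypes.
// Result layout: row-major, reads x haplotypes.
std::vector<double> compute_likelihoods(const std::vector<std::string>& reads) {
    std::vector<Testcase> testcases;
    testcases.reserve(reads.size() * g_haplotypes.size());
    for (const std::string& read : reads) {
        for (const std::string& hap : g_haplotypes) {
            testcases.push_back(Testcase{&read, &hap});
        }
    }
    std::vector<double> scores;
    scores.reserve(testcases.size());
    for (const Testcase& tc : testcases) scores.push_back(toy_score(tc));
    return scores;
}

// Step 5: release the haplotypes at the end of the region.
void finalize_region() {
    g_haplotypes.clear();
    g_haplotypes.shrink_to_fit();
}
```

In this sketch a caller would invoke initialize_region() once, compute_likelihoods() once per
sample, and finalize_region() when the region is done; skipping the final call would leave the
previous region's haplotypes in place for the next one.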
Compiling:
Make sure you have icc (Intel C compiler) available. Currently, gcc does not seem to support all AVX
intrinsics.
The native library is called libVectorLoglessPairHMM.so.
Using Maven:
Type 'mvn install' in this directory. This builds the library (by invoking 'make') and copies the
native library to the directory:
${sting-utils.basedir}/src/main/resources/org/broadinstitute/sting/utils/pairhmm
The GATK Maven build process (when run) will bundle the library into the StingUtils jar file from
the copied directory.
Simple build:
cd src/main/c++
make
Running:
The default implementation of PairHMM is now VECTOR_LOGLESS_CACHING in HaplotypeCaller.java. To use
the Java version, use the command line argument "--pair_hmm_implementation LOGLESS_CACHING" (see
run.sh in src/main/c++).
The native library is bundled with the StingUtils jar file. When HaplotypeCaller is invoked, the
library is unpacked from the jar file, copied to the /tmp directory (with a unique id), and loaded
by the constructor of the Java class VectorLoglessPairHMM (if it has not been loaded already).
The bundled library can be overridden by passing the -Djava.library.path argument to the JVM with
the path to the library (see src/main/c++/run.sh for an example). If the library
libVectorLoglessPairHMM.so can be found in java.library.path, then it is loaded and the 'packed'
library is not used.