gatk-3.8/public/VectorPairHMM/README.md

73 lines
4.8 KiB
Markdown
Raw Normal View History

Implementation overview:
Created a new Java class called VectorLoglessPairHMM which extends LoglessPairHMM and
overrides functions from both LoglessPairHMM and PairHMM.
1. Constructor: Call base class constructors. Then, load the native library located in this
directory and call an init function (with suffix 'jniInitializeClassFieldsAndMachineMask') in the
library to determine fields ids for the members of classes JNIReadDataHolder and
JNIHaplotypeDataHolders. The native code stores the field ids (struct offsets) for the classes and
re-uses them for subsequent computations. Optionally, the user can disable the vector
implementation, by using the 'mask' argument (see comments for a more detailed explanation).
2. When the library is loaded, it invokes the constructor of the class LoadTimeInitializer (because
a global variable g_load_time_initializer is declared in the library). This constructor
(LoadTimeInitializer.cc) can be used to perform various initializations. Currently, it initializes
two global function pointers to point to the function implementation that is supported on the
machine (AVX/SSE/un-vectorized) on which the program is being run. The two pointers are for float
and double respectively. The global function pointers are declared in utils.cc and are assigned in
the function initialize_function_pointers() defined in utils.cc and invoked from the constructor of
LoadTimeInitializer.
Other initializations in LoadTimeInitializer:
* ConvertChar::init - sets some masks for the vector implementation
* FTZ for performance
* stat counters = 0
* debug structs (which are never used in non-debug mode)
This initialization is done only once for the whole program.
3. initialize(): To initialize the region for PairHMM. Pass haplotype bases to native code through
the JNIHaplotypeDataHolder class. Since the haplotype list is common across multiple samples in
computeReadLikelihoods(), we can pass the haplotype bases to the native code once and re-use across
multiple samples.
4. computeLikelihoods(): Copies array references for readBases/quals etc to array of
JNIReadDataHolder objects. Invokes the JNI function to perform the computation and updates the
likelihoodMap.
The JNI function copies the byte array references into an array of testcase structs and invokes the
compute_full_prob function through the function pointers initialized earlier.
The primary native function called is
Java_org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM_jniComputeLikelihoods. It uses
standard JNI calls to get and return data from/to the Java class VectorLoglessPairHMM. The last
argument to the function is the maximum number of OpenMP threads to use while computing PairHMM in
C++. This option is set when the native function call is made from JNILoglessPairHMM
computeLikelihoods - currently it is set to 12 (no logical reason).
Note: OpenMP has been disabled for now - insufficient #testcases per call to computeLikelihoods() to
justify multi-threading.
5. finalizeRegion(): Releases the haplotype arrays initialized in step 3 - should be called at the
end of every region (line 351 in PairHMMLikelihoodCalculationEngine).
Note: Debug code has been moved to a separate class DebugJNILoglessPairHMM.java.
Compiling:
Parallel version of the JNI for the PairHMM The JNI treats shared memory as critical memory and doesn't allow any parallel reads or writes to it until the native code finishes. This is not a problem *per se* it is the right thing to do, but we need to enable **-nct** when running the haplotype caller and with it have multiple native PairHMM running for each map call. Move to a copy based memory sharing where the JNI simply copies the memory over to C++ and then has no blocked critical memory when running, allowing -nct to work. This version is slightly (almost unnoticeably) slower with -nct 1, but scales better with -nct 2-4 (we haven't tested anything beyond that because we know the GATK falls apart with higher levels of parallelism * Make VECTOR_LOGLESS_CACHING the default implementation for PairHMM. * Changed version number in pom.xml under public/VectorPairHMM * VectorPairHMM can now be compiled using gcc 4.8.x * Modified define-* to get rid of gcc warnings for extra tokens after #undefs * Added a Linux kernel version check for AVX - gcc's __builtin_cpu_supports function does not check whether the kernel supports AVX or not. * Updated PairHMM profiling code to update and print numbers only in single-thread mode * Edited README.md, pom.xml and Makefile for users to pass path to gcc 4.8.x if necessary * Moved all cpuid inline assembly to single function Changed info message to clog from cinfo * Modified version in pom.xml in VectorPairHMM from 3.1 to 3.2 * Deleted some unnecessary code * Modified C++ sandbox to print per interval timing
2014-03-18 02:42:19 +08:00
The native library (called libVectorLoglessPairHMM.so) can be compiled with icc (Intel C compiler)
or gcc versions >= 4.8.1 that support AVX intrinsics. By default, the make process tries to invoke
icc. To use gcc, edit the file 'pom.xml' (in this directory) and enable the environment variables
USE_GCC,C_COMPILER and CPP_COMPILER (edit and uncomment lines 60-62).
Using Maven:
Type 'mvn install' in this directory - this will build the library (by invoking 'make') and copy the
native library to the directory
${sting-utils.basedir}/src/main/resources/org/broadinstitute/sting/utils/pairhmm
The GATK maven build process (when run) will bundle the library into the StingUtils jar file from
the copied directory.
Simple build:
cd src/main/c++
make
Running:
The default implementation of PairHMM is now VECTOR_LOGLESS_CACHING in HaplotypeCaller.java. To use
the Java version, use the command line argument "--pair_hmm_implementation LOGLESS_CACHING". (see
run.sh in src/main/c++).
The native library is bundled with the StingUtils jar file. When HaplotypeCaller is invoked, then
the library is unpacked from the jar file, copied to the /tmp directory (with a unique id) and
loaded by the Java class VectorLoglessPairHMM in the constructor (if it has not been loaded
already).
The default library can be overridden by using the -Djava.library.path argument (see
src/main/c++/run.sh for an example) for the JVM to pass the path to the library. If the library
libVectorLoglessPairHMM.so can be found in java.library.path, then it is loaded and the 'packed'
library is not used.