Is it possible to maintain one sources base to compile for CPU or GPU(make choice using building system)? Are there any pitfalls for this approach?
The Alpaka library could be a thing for you. The alpaka library is a header-only C++11 abstraction library for accelerator development. Its supports different accelerators like OpenMP, Boost.Fiber and CUDA. You need to implement your kernel one times. With template parameter you can choose your accelerator platform.
Related
Let's suppose that my chip doesn't support any API like keras, tensorflow or sklearn; however I need to implement a deep learning model in python.
Is it possible to make my training and testing model in python, then, I want to call the best model results for prediction with C++?
Where I mus save the resulted best model in order to be called in the next steps? Must I save it in the chip? Did I need to install tensorflow and keras in my chip in this case?
TERMINOLOGY
You seem to be confused about terminology. Here's a somewhat simplified overview.
Your chip is the hardware (CPU or GPU), and will include circuitry to support its instruction set (move data to/from local memory, perform math and logic operations, etc.). A CPU/GPU chip that cannot support your ML software is hard to visualize, and would not support Python or C++, either. The chip comes on a board, which includes a lot of peripheral connections, secondary memory, etc.
Then your operating system (basic software) is installed on the hardware. This OS manages resources: jobs, processes, memory allocation, etc. If there's a failure in support, it would be here, not in the chip. Finally, you install your desired applications (software tools, programs, etc.) as additions to the OS.
C++ and Python are two high-level languages, popular applications. These languages support Tensorflow and Keras (machine learning frameworks) and SciKit (scientific / statistical package; sklearn is the package name you import).
DIRECT ANSWER
Yes, you can write your NN in Python. Yes, you can call it from C++. Python depends on C/C++ libraries; there is a viable interface between the two.
There is no particular method you must use to save your model and call it later: if you're writing your own model in Python, you get to decide the storage format and location. All you need is to have your Python and C++ programs "agree" on the format. Since you're writing them both, then you can choose whatever works for you.
RECOMMENDATION
Don't write these yourself, unless you really want the exercise. Instead, install a framework (TensorFlow, Caffe, Neon, Torch, MXNet, Keras, ...). Then, simply follow the given tutorials to learn how to build, save, and restore your model.
I have been working on programming a convolutional back-propagation neural network recently and I have mainly been using Java to run the program and libGDX for the graphical visualizations. Through heavy research, I have found that to heavily increase performance and efficiency, I should preform the matrix calculations on the graphics card instead of on the CPU.
After looking through sources online, I found that the main way to preform such calculations on the graphics card was through OpenCl. After even more research, I discovered that my main two options for OpenCl support on Java was through LWJGL or JOCL.
libGDX was built on LWJGL, so my first instinct was to see if I could access that built in OpenCL support through the libGDX library, however, after looking around, I found nothing about this whatsoever!
My question is, can I access OpenCl through the libGDX library, and if so, how?
If I can't access LWJGL's OpenCl implementation, should I use JOCL to access GPU mathematical computations, or should I add a second library of LWJGL into my libGDX application?
Not sure if it's in Lwjgl2 in GDX, but I know the LibGDX Lwjgl3 implementation does not include it. But Lwjgl3 is broken up into modules, so you can add the OpenCL module in your Gradle project.
In "core" dependencies, add
compile "org.lwjgl:lwjgl-opencl:3.1.0"
What I don't know is if this OpenCL module has any dependencies on the core of Lwjgl3. If so, you might want to switch to the LibGDX Lwjgl3 backend. To switch to Lwjgl3, in the "desktop" dependencies, add 3 after lwjgl3, so:
compile "com.badlogicgames.gdx:gdx-backend-lwjgl3:$gdxVersion"
If you switch to Lwjgl3, you have to clean up some of the DesktopLauncher imports and class names, basically adding 3 after Lwjgl in the class names (scroll down here for instructions if you need them).
You may have to keep the version number in sync with the version of Lwjgl3 the LibGDX version is using.
I'm planning to use GPU to do an application with intensive matrix manipulation. I want to use the CUDA NVIDIA support. My only doubt is: is there any fallback support? I mean: if I use these libraries I've got the possibility to run the application in non-CUDA environment (without gpu support, of course)? I'd like to have the possibility to debug the application without the constraint to use that environment. I didn't find this information, any tips?
There is no fallback support built into the libraries (e.g. CUBLAS, CUSPARSE, CUFFT). You would need to have your code develop a check for an existing CUDA environment, and if it finds none, then develop your own code path, perhaps using alternate libraries. For example, CUBLAS functions can be mostly duplicated by other BLAS libraries (e.g. MKL). CUFFT functions can be largely replaced by other FFT libraries (e.g. FFTW).
How to detect a CUDA environment is covered in other SO questions. In a nutshell, if your application bundles (e.g. static-links) the CUDART library, then you can run a procedure similar to that in the deviceQuery sample code, to determine what GPUs (if any) are available.
I am using a commercial simulation software on Linux that does intensive matrix manipulation. The software uses Intel MKL by default, but it allows me to replace it with a custom BLAS/LAPACK library. This library must be a shared object (.so) library and must export both BLAS and LAPACK standard routines. The software requires the standard Fortran interface for all of them.
To verify that I can use a custom library, I compiled ATLAS and linked LAPACK (from netlib) inside it. The software was able to use my compiled ATLAS version without any problems.
Now, I want to make the software use cuBLAS in order to enhance the simulation speed. I was confronted by the problem that cuBLAS doesn't export the standard BLAS function names (they have a cublas prefix). Moreover, the library cuBLAS library doesn't include LAPACK routines.
I use readelf -a to check for the exported function.
On another hand, I tried to use MAGMA to solve this problem. I succeeded to compile and link it against all of ATLAS, LAPACK and cuBLAS. But still it doesn't export the correct functions and doesn't include LAPACK in the final shared object. I am not sure if this is the way it is supposed to be or I did something wrong during the build process.
I have also found CULA, but I am not sure if this will solve the problem or not.
Did anybody tried to get cuBLAS/LAPACK (or a proper wrapper) linked into a single (.so) exporting the standard Fortran interface with the correct function names? I believe it is conceptually possible, but I don't know how to do it!
Updated
As indicated by #talonmies, CUDA has provided a fortran thunking wrapper interface.
http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings
You should be able to run your application with it. But you probably will not get any performance improvement due to the mem alloc/copy issue described below.
Old
It may not easy. CUBLAS and other CUDA library interfaces assume all the data are already stored in device memory, however in your case, all the data are still in CPU RAM before calling.
You may have to write your own wrapper to deal with it like
void dgemm(...) {
copy_data_from_cpu_ram_to_gpu_mem();
cublas_dgemm(...);
copy_data_from_gpu_mem_to_cpu_ram();
}
On the other hand, you probably have noticed that every single BLAS call requires 2 data copies. This may introduce huge overhead and slow down the overall performance, unless most of your callings are BLAS 3 operations.
I am designing a multi-gpu cuda code but I still don't have the machinary to actually develop the code. So, until I do,
Do you know if there is someway to emulate a multiple gpu enviroment just by using one gpu?
I suppose that such a thing, if it exists, would be very limited but it would allow me to test my ideas until I get the hardware I want.
Thanks!
Something close can be approximated using the CUDA Driver API (cuCtxCreate, cuCtxSetCurrent). See CUDA C Programming Guide Appendix G.4 Interoperability between Runtime and Driver API. Before calling any cuda* functions use cuCtxCreate to create two contexts on the device. Use cuCtxSetCurrent in place of cudaSetDevice.