Mathematica and CUDA

Is it possible that built-in functions in Mathematica (like Minimize[expr,{x1,x2,...}]) will start to work via CUDA after installing the CUDA module for Mathematica?

I don't believe so, no. Mathematica's CUDALink module currently provides only a handful of GPU-accelerated functions: some basic image processing operations, BLAS-style linear algebra calls, Fourier transforms, and simple parallel reductions (argmin, argmax, and summation). There are also tools for integrating user-written CUDA code and for generating CUDA code symbolically. Outside of that, the rest of Mathematica's core functionality remains CPU-only.
You can see full details of current CUDA and OpenCL support here.
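That said, CUDALink does let you compile and run your own kernels from within Mathematica via CUDAFunctionLoad. Here is a minimal sketch of the kind of kernel you could load that way (the kernel name and the Wolfram-side call in the comment are illustrative, not anything shipped with CUDALink):

```cpp
// Minimal user-written kernel of the kind CUDALink can compile and expose
// to Mathematica. The Wolfram-side loading call would look roughly like
//   fun = CUDAFunctionLoad[src, "scaleVector", {...}, 256]
// (argument specification elided), after which fun[...] runs on the GPU.
extern "C" __global__ void scaleVector(double *v, double s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        v[i] *= s;                                  // element-wise scaling
}
```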

CUDA LAPACK libraries (CULA & MAGMA) as device functions [duplicate]

So I'm trying to see if I can get a significant speedup from using a GPU to solve small overdetermined systems of equations by solving a bunch of them at the same time. My current algorithm uses an LU decomposition function from the CULA Dense library and has to switch back and forth between the GPU and the CPU to initialize and run the CULA functions. I would like to be able to call the CULA functions from my CUDA kernels so that I don't have to jump back to the CPU and copy the data back. That would also let me launch multiple threads working on different data sets, so that multiple systems are solved concurrently. My question is: can I call CULA functions from device functions? I know it's possible with CUBLAS and some of the other CUDA libraries.
Thanks!
The short answer is no. The CULA library routines are designed to be called from host code, not device code.
Note that CULA have their own support forums here which you may be interested in.
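If the goal is simply to solve many small systems concurrently, the usual alternative is to keep the library calls on the host but batch them. Here is a rough sketch using cuBLAS's batched LU routines for square systems (an illustration of the host-side batching pattern, not CULA code; cuBLAS also has batched QR/least-squares routines such as cublas<t>geqrfBatched and cublas<t>gelsBatched for the overdetermined case):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Factor and solve `batch` independent n-by-n systems A_i * x_i = b_i with a
// single pair of host-side cuBLAS calls, instead of one dense solve per
// system. `hA` and `hB` are host arrays of `batch` device pointers, each
// pointing at a column-major n-by-n matrix / n-by-1 right-hand side that is
// already resident on the GPU. (Sketch: error checking omitted.)
void solveBatched(float **hA, float **hB, int n, int batch)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // The batched routines want the pointer arrays themselves on the device.
    float **dAarr, **dBarr;
    cudaMalloc(&dAarr, batch * sizeof(float *));
    cudaMalloc(&dBarr, batch * sizeof(float *));
    cudaMemcpy(dAarr, hA, batch * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB, batch * sizeof(float *), cudaMemcpyHostToDevice);

    int *dPiv, *dInfo;
    cudaMalloc(&dPiv, batch * n * sizeof(int));
    cudaMalloc(&dInfo, batch * sizeof(int));

    // LU-factor every matrix in the batch in one call ...
    cublasSgetrfBatched(handle, n, dAarr, n, dPiv, dInfo, batch);

    // ... then solve every right-hand side against its factors.
    int info = 0;
    cublasSgetrsBatched(handle, CUBLAS_OP_N, n, 1, dAarr, n, dPiv,
                        dBarr, n, &info, batch);

    cudaFree(dPiv);  cudaFree(dInfo);
    cudaFree(dAarr); cudaFree(dBarr);
    cublasDestroy(handle);
}
```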

Kernel mode GPGPU usage

Is it possible to run CUDA or OpenCL applications from a Linux kernel module?
I have found a project which is providing this functionality, but it needs a userspace helper in order to run CUDA programs. (https://code.google.com/p/kgpu/)
While this project already avoids redundant memory copying between user and kernel space I am wondering if it is possible to avoid the userspace completely?
EDIT:
Let me expand my question. I am aware that kernel components can only call the API provided by the kernel and other kernel components. So I am not looking to call the OpenCL or CUDA API directly.
The CUDA or OpenCL API ultimately has to call into the graphics driver to make its magic happen. Most probably this interface is completely non-standard, changes with every release, and so on...
But suppose that you have a compiled OpenCL or CUDA kernel that you would want to run. Do the OpenCL/CUDA userspace libraries do some heavy lifting before actually running the kernel or are they just lightweight wrappers around the driver interface?
I am also aware that the userspace helper is probably the best bet for doing this since calling the driver directly would most likely get broken with a new driver release...
The short answer is: no, you can't do this.
There is no way to call any code which relies on glibc from kernel space. That implies that there is no way of making CUDA or OpenCL API calls from kernel space, because those libraries rely on glibc and a host of other user space helper libraries and user space system APIs which are unavailable in kernel space. CUDA and OpenCL aren't unique in this respect -- it is why the whole of X11 runs in userspace, for example.
A userspace helper application working via a simple kernel module interface is the best you can do.
[EDIT]
The runtime components of OpenCL are not lightweight wrappers around a few syscalls that push a code payload onto the device. Among other things, they include a full just-in-time compilation toolchain (in fact, that is all that OpenCL supported until very recently), internal ELF code and object management, and a bunch of other things. There is very little likelihood that you could emulate this interface and functionality from within a driver at runtime.
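To make the "userspace helper working via a simple kernel module interface" idea concrete, here is a very rough sketch of the user-space side only: a helper that blocks on a character device exported by a hypothetical kernel module (the device path and the raw-float request format are invented purely for illustration; kgpu's real interface is different), runs a CUDA kernel on each request, and writes the result back for the kernel module to consume.

```cpp
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <vector>
#include <cstdio>

// Trivial GPU work: square each element of the request buffer.
__global__ void squareAll(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= v[i];
}

int main()
{
    // Hypothetical character device exported by the kernel module.
    int fd = open("/dev/gpu_helper", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    std::vector<float> buf(1024);
    float *dBuf;
    cudaMalloc(&dBuf, buf.size() * sizeof(float));

    for (;;) {
        // Block until the kernel module hands us a work request.
        ssize_t got = read(fd, buf.data(), buf.size() * sizeof(float));
        if (got <= 0) break;
        int n = static_cast<int>(got / sizeof(float));

        cudaMemcpy(dBuf, buf.data(), got, cudaMemcpyHostToDevice);
        squareAll<<<(n + 255) / 256, 256>>>(dBuf, n);
        cudaMemcpy(buf.data(), dBuf, got, cudaMemcpyDeviceToHost);

        // Hand the result back to kernel space.
        write(fd, buf.data(), got);
    }

    cudaFree(dBuf);
    close(fd);
    return 0;
}
```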

Is just-in-time (JIT) compilation of a CUDA kernel possible?

Does CUDA support JIT compilation of a CUDA kernel?
I know that OpenCL offers this feature.
I have some variables which do not change during runtime (i.e. they only depend on the input file), so I would like to define these values with a macro at kernel compile time (i.e. at runtime).
If I define these values manually at compile time, my register usage drops from 53 to 46, which greatly improves performance.
This became available with the NVRTC library in CUDA 7.0. With this library you can compile your CUDA code at runtime.
http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/
But what kind of advantage can you gain? Personally, I haven't found the advantages of dynamic compilation to be that dramatic.
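If you do want to try it, a minimal sketch of the NVRTC + driver API flow looks like this, with the input-file-dependent value injected as a -D macro at (runtime) compile time. Error checking is omitted and the kernel, macro, and file names are made up for illustration; build with something like nvcc runtime_compile.cpp -lnvrtc -lcuda.

```cpp
#include <nvrtc.h>
#include <cuda.h>
#include <string>
#include <vector>

// CUDA source with a value left as a macro, to be supplied at runtime.
const char *kSrc = R"(
extern "C" __global__ void scale(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= SCALE_FACTOR;   // SCALE_FACTOR injected via -D below
}
)";

int main()
{
    // In practice this value would come from parsing the input file.
    std::string define = "-DSCALE_FACTOR=2.5f";

    // 1. Compile the source string to PTX with NVRTC.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kSrc, "scale.cu", 0, nullptr, nullptr);
    const char *opts[] = { define.c_str() };
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // 2. Load the PTX with the driver API and launch the kernel.
    cuInit(0);
    CUdevice  dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule  mod;  cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");

    int n = 1024;
    CUdeviceptr dV;
    cuMemAlloc(&dV, n * sizeof(float));
    cuMemsetD32(dV, 0, n);
    void *args[] = { &dV, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();

    cuMemFree(dV);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```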
If it is feasible for you to use Python, you can use the excellent pycuda module to compile your kernels at runtime. Combined with a templating engine such as Mako, you will have a very powerful meta-programming environment that will allow you to dynamically tune your kernels for whatever architecture and specific device properties happen to be available to you (obviously some things will be difficult to make fully dynamic and automatic).
You could also consider just maintaining a few distinct versions of your kernel with different parameters, between which your program could choose at runtime based on whatever input you are feeding to it.
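The "few distinct versions" idea also works in plain CUDA C++ without any runtime compilation, for example by making the tuning value a template parameter and picking an instantiation at runtime (a sketch with arbitrary parameter values):

```cpp
#include <cuda_runtime.h>

// The tuning value is a compile-time template parameter, so nvcc can fold it
// into the generated code just as a -D macro would.
template <int SCALE>
__global__ void scale(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= SCALE;
}

// Host-side dispatch: pick one of the pre-built instantiations based on a
// value that is only known at runtime (e.g. read from the input file).
void launchScale(float *dV, int n, int runtimeScale)
{
    dim3 block(256), grid((n + block.x - 1) / block.x);
    switch (runtimeScale) {
        case 2:  scale<2><<<grid, block>>>(dV, n); break;
        case 4:  scale<4><<<grid, block>>>(dV, n); break;
        default: scale<1><<<grid, block>>>(dV, n); break;
    }
}
```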

Does CUDA use an interpreter or a compiler?

This is a bit of a silly question, but I'm wondering: does CUDA use an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of byte code and converted to native code by the underlying driver.
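You can see both stages in action when building a fat binary: nvcc embeds ready-made machine code (SASS) for the architectures you name plus PTX, and the driver JIT-compiles that PTX at runtime for GPUs the offline compiler never saw. A small sketch (the architecture numbers are just examples):

```cpp
// saxpy.cu -- compile with, for example:
//   nvcc saxpy.cu -o saxpy \
//        -gencode arch=compute_50,code=sm_50 \
//        -gencode arch=compute_50,code=compute_50
// The first -gencode embeds SASS for sm_50 GPUs; the second embeds PTX, which
// the driver JIT-compiles at runtime for any newer compute capability.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(y);
    return 0;
}
```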

How to accelerate a C++ code using CUDA

I did my physics simulation project using C++ and OpenGL in Visual Studio 10. Later I used OpenMP for CPU parallelization. Now I want to accelerate my C++ code with CUDA so that I can achieve higher performance. Is it possible to convert my code to CUDA or to any GPU device?
Cuda and C++ are different programming languages (even if they look syntactically similar) with different programming paradigms.
You'll have to recode, and perhaps even redesign, your project to take advantage of Cuda (or of OpenCL).
Actually, you'll need to identify the numerical kernels that might take advantage of your GPGPU and then recode those kernels (in Cuda, or in OpenCL); you'll also have to write some glue code to make it all work together.
You can determine which parts of your project can be parallelized and then reimplement these parts in Cuda. You can take a look at Fast N-Body Simulation with CUDA.
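To make "reimplementing a kernel" concrete, here is a deliberately simplified example: an OpenMP-style particle position update (shown in the comment) rewritten as a CUDA kernel plus the glue code to launch it. The loop body is invented for illustration; your real simulation kernels will of course differ.

```cpp
#include <cuda_runtime.h>

// CPU version, roughly what the OpenMP code might look like:
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i) {
//       vel[i] += acc[i] * dt;
//       pos[i] += vel[i] * dt;
//   }

// GPU version: one thread per particle.
__global__ void integrate(float *pos, float *vel, const float *acc,
                          float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        vel[i] += acc[i] * dt;   // update velocity
        pos[i] += vel[i] * dt;   // update position
    }
}

// Glue code: keep the arrays resident on the device across steps and only
// copy back whatever the OpenGL renderer actually needs.
void stepOnGpu(float *dPos, float *dVel, const float *dAcc, float dt, int n)
{
    dim3 block(256), grid((n + block.x - 1) / block.x);
    integrate<<<grid, block>>>(dPos, dVel, dAcc, dt, n);
    cudaDeviceSynchronize();
}
```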