Is just-in-time (jit) compilation of a CUDA kernel possible? - cuda

Does CUDA support JIT compilation of a CUDA kernel?
I know that OpenCL offers this feature.
I have some variables which are not changed during runtime (i.e. only depend on the input file), therefore I would like to define these values with a macro at kernel compile time (i.e at runtime).
If I define these values manually at compile time my register usage drops from 53 to 46, what greatly improves performance.

It became available with nvrtc library of cuda 7.0. By this library you can compile your cuda codes during runtime.
http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/
Bu what kind of advantages you can gain? In my view, i couldn't find so much dramatic advantages of dynamic compilation.

If it is feasible for you to use Python, you can use the excellent pycuda module to compile your kernels at runtime. Combined with a templating engine such as Mako, you will have a very powerful meta-programming environment that will allow you to dynamically tune your kernels for whatever architecture and specific device properties happen to be available to you (obviously some things will be difficult to make fully dynamic and automatic).
You could also consider just maintaining a few distinct versions of your kernel with different parameters, between which your program could choose at runtime based on whatever input you are feeding to it.

Related

Fallback support nvidia libraries

I'm planning to use GPU to do an application with intensive matrix manipulation. I want to use the CUDA NVIDIA support. My only doubt is: is there any fallback support? I mean: if I use these libraries I've got the possibility to run the application in non-CUDA environment (without gpu support, of course)? I'd like to have the possibility to debug the application without the constraint to use that environment. I didn't find this information, any tips?
There is no fallback support built into the libraries (e.g. CUBLAS, CUSPARSE, CUFFT). You would need to have your code develop a check for an existing CUDA environment, and if it finds none, then develop your own code path, perhaps using alternate libraries. For example, CUBLAS functions can be mostly duplicated by other BLAS libraries (e.g. MKL). CUFFT functions can be largely replaced by other FFT libraries (e.g. FFTW).
How to detect a CUDA environment is covered in other SO questions. In a nutshell, if your application bundles (e.g. static-links) the CUDART library, then you can run a procedure similar to that in the deviceQuery sample code, to determine what GPUs (if any) are available.

Kernel mode GPGPU usage

Is it possible to run CUDA or OpenCL applications from a Linux kernel module?
I have found a project which is providing this functionality, but it needs a userspace helper in order to run CUDA programs. (https://code.google.com/p/kgpu/)
While this project already avoids redundant memory copying between user and kernel space I am wondering if it is possible to avoid the userspace completely?
EDIT:
Let me expand my question. I am aware that kernel components can only call the API provided by the kernel and other kernel components. So I am not looking to call OpenCL or CUDA API directly.
CUDA or OpenCL API in the end has to call into the graphics driver in order to make its magic happen. Most probably this interface is completely non-standard, changing with every release and so on....
But suppose that you have a compiled OpenCL or CUDA kernel that you would want to run. Do the OpenCL/CUDA userspace libraries do some heavy lifting before actually running the kernel or are they just lightweight wrappers around the driver interface?
I am also aware that the userspace helper is probably the best bet for doing this since calling the driver directly would most likely get broken with a new driver release...
The short answer is, no you can't do this.
There is no way to call any code which relies on glibc from kernel space. That implies that there is no way of making CUDA or OpenCL API calls from kernel space, because those libraries rely on glibc and a host of other user space helper libraries and user space system APIs which are unavailable in kernel space. CUDA and OpenCL aren't unique in this respect -- it is why the whole of X11 runs in userspace, for example.
A userspace helper application working via a simple kernel module interface is the best you can do.
[EDIT]
The runtime components of OpenCL are not lightweight wrappers around a few syscalls to push a code payload onto the device. Amongst other things, they include a full just in time compilation toolchain (in fact that is all that OpenCL has supported until very recently), internal ELF code and object management and a bunch of other stuff. There is very little likelihood that you could emulate the interface and functionality from within a driver at runtime.

differences between virtual and real architecture of cuda

Trying to understand the differences between virtual and real architecture of cuda, and how the different configurations will affect the performance of the program, e.g.
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
...
The following explanation was given in NVCC manual,
GPU compilation is performed via an intermediate representation, PTX
([...]), which can
be considered as assembly for a virtual GPU architecture. Contrary to an actual graphics
processor, such a virtual GPU is defined entirely by the set of capabilities, or features,
that it provides to the application. In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
Hence, a nvcc compilation command always uses two architectures: a compute
architecture to specify the virtual intermediate architecture, plus a real GPU architecture
to specify the intended processor to execute on. For such an nvcc command to be valid,
the real architecture must be an implementation (someway or another) of the virtual
architecture. This is further explained below.
The chosen virtual architecture is more of a statement on the GPU capabilities that
the application requires: using a smallest virtual architecture still allows a widest range
of actual architectures for the second nvcc stage. Conversely, specifying a virtual
architecture that provides features unused by the application unnecessarily restricts the
set of possible GPUs that can be specified in the second nvcc stage.
But still don't quite get how the performance will be affected by different configurations (or, maybe only affect the selection of the physical GPU devices?). In particular, this statement is most confusing to me:
In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
The NVIDIA CUDA Compiler Driver NVCC User Guide Section on GPU Compilation provides a very thorough description of virtual and physical architecture and how the concepts are used in the build process.
The virtual architecture specifies the feature set that is targeted by the code. The table listed below shows some of the evolution of the virtual architecture. When compiling you should specify the lowest virtual architecture that has a sufficient feature set to enable the program to be executed on the widest range of physical architectures.
Virtual Architecture Feature List (from the User Guide)
compute_10 Basic features
compute_11 + atomic memory operations on global memory
compute_12 + atomic memory operations on shared memory
+ vote instructions
compute_13 + double precision floating point support
compute_20 + Fermi support
compute_30 + Kepler support
The physical architecture specifies the implementation of the GPU. This provides the compiler with the instruction set, instruction latency, instruction throughput, resource sizes, etc. so that the compiler can optimally translate the virtual architecture to binary code.
It is possible to specify multiple virtual and physical architecture pairs to the compiler and have the compiler back the final PTX and binary into a single binary. At runtime the CUDA driver will choose the best representation for the physical device that is installed. If binary code is not provided in the fatbinary file the driver can use the JIT runtime for the best PTX implementation.
"Virtual architecture" code will get compiled by a just-in-time compiler before being loaded on the device. AFAIK, it is the same compiler as the one NVCC invokes when building "physical architecture" code offline - so I don't know if there will be any differences in the resulting application performance.
Basically, every generation of the CUDA hardware is binary incompatible with previous generation - imagine next generation of Intel processors sporting ARM instruction set. This way, virtual architectures provide an intermediate representation of the CUDA application that can be compiled for compatible hardware. Every hardware generation introduces new features (e.g. atomics, CUDA Dynamic Parallelism) that require new instructions - that's why you need new virtual architectures.
Basically, if you want to use CDP you should compile for SM 3.5. You can compile it to device binary that will have assembly code for specific CUDA device generation or you can compile it to PTX code that can be compiled into device assembly for any device generation that provides these features.
The virtual architecture specifies what capabilities a GPU has and the real architecture specifies how it does it.
I can't think of any specific examples off hand. A (probably not correct) example may be a virtual GPU specifying the number of cores a card has, so code is generated targeting that number of cores, whereas the real card may have a few more for redundancy (or a few less due to manufacturing errors) and some methods of mapping to the cores that are actually in use, which can be placed on top of the more generic code generated in the first step.
You can think of the PTX code sort of like assembly code, which targets a certain architecture, which can then be compiled to machine code for a specific processor. Targeting the assembly code for the right kind of processor will, in general, generate better machine code.
well usually what nvidia writes as document causes people (including myself) to become more confused! (just me maybe!)
you are concerned with the performance, basically what this says is that don't be (probably) but you should.basically the GPU architecture is like nature. they run something on it and something happens. then they try to explain it. and then they feed it to you.
at the end should probably run some tests and see what configuration gives the best result.
the virtual architecture is what is designed to let you think freely. you should obey that, use as much as threads as you want, you can assign virtually everything as number of threads and blocks, doesn't matter, it will be translated to PTX and the device will run it.
the only problem is, if you assign more than 1024 threads per a single block you will get 0 s as the result, because the device(the real architecture) doesn't support it.
or for example your device support the CUDA 1.2, you can define double pointing variables in your code, but again you will get 0 s as the result because simply the device can't run it.
performance wise you have to know that every 32 thread (e.g. warps) have to access a single position in memory or else your access will be serialized and so on.
So I hope you've got the point by now, It is a relatively new science and GPU is a really sophisticated piece of hardware architecture, everybody is trying to make the best of it but it's a game of testing and a little knowledge of actual architecture behind CUDA. I suggest that search for GPU architecture and see how the virtual threads and thread blocks are actually implemented.

Dynamically detecting a CUDA enabled NVIDIA card and only then initializing the CUDA runtime: How to do?

I have an application which has an algorithm, accelerated with CUDA. There is also a standard CPU implementation of it. We plan to release this application for various platforms, so most of the time, there won't be a NVIDIA card to run the accelerated CUDA code. What I want is to first check whether the user's system has a CUDA enabled NVIDIA card and if it does, initializing the CUDA runtime after. If the system does not support CUDA, then I want to execute the CPU path. This question is very similar to mine, but I don't want to use any other libraries other than the plain CUDA runtime. OpenCL is an alternative, but there isn't enough time to implement an OpenCL version of the algorithm for the first release. Without any CUDA existence check, the program will surely crash since it can't find the needed .dll's for the CUDA runtime and we surely don't want that. So, I need advices on how to handle this initialization step.
Use the calls cudaGetDeviceCount and cudaGetDeviceProperties to find CUDA devices on the running system. First find out how many, then loop through all the available devices, and inspect the properties to decide which ones qualify. What I mean by "qualify" depends on your application. Do you want to require a certain compute capability? Or need a certain amount of memory? If there's more than one device, you might want to sort on some criteria, then set the device cudaSetDevice. If there are no devices, or none that are sufficient, then fall back on the CPU code path.
I'd also suggest having some mechanism to force CUDA mode off, in case some user's environment just doesn't work due to driver issues, or an old board, or something else. You can use a command-line option, or an environment variable, or whatever...
EDITING:
Regarding DLLs, you should package cudart[whatever].dll with your application. That will ensure that the program starts, and at least the CUDA query functions will operate.

Does CUDA use an interpreter or a compiler?

This is a bit of silly question, but I'm wondering if CUDA uses an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range
of conventional compiler options, such as for defining macros and
include/library paths, and for steering the compilation process.
Not just limited to cuda , shaders in directx or opengl are also complied to some kind of byte code and converted to native code by the underlying driver.