Can I use cuda without using nvcc on my host code? - cuda

I'm writing a single header library that executes a cuda kernel. I was wondering if there is a way to get around the <<<>>> syntax, or get C source output from nvcc?

You can avoid the host language extensions by using the CUDA driver API instead. It is a little more verbose and you will require a little more boilerplate code to manage the context, but it is not too difficult.
Conventionally, you would compile to PTX or a binary payload to load at runtime, however NVIDIA now also ship an experimental JIT CUDA C compiler library, libNVVM, which you could try if you want JIT from source.

Related

Trouble in knowing when the cuda code gets compiled?

I want to know, when the cuda code gets compiled? I mean is it possible to know the values of parameters of the cuda kernel which is given in the command line argument of host code run time? Is it possible to compile cuda code during run time of host code ?
In typical usage of the CUDA runtime API, CUDA device code gets compiled when you pass a file containing CUDA device code to nvcc, the CUDA compiler driver engine.
CUDA device code can/will be compiled at run-time using either the driver API or using the CUDA NVRTC mechanism. There is documentation for each of these approaches, CUDA sample codes for each of these approaches, and various questions here on the cuda SO tag for each.
When you use the CUDA driver API, the device source code you will present for compilation at run-time is in the form of PTX, a CUDA intermediate language.
For compilation of typical CUDA C++ device code at runtime, you would use the NVRTC mechanism.

Check if SIMD machine is generated for LLVM IR

I have a C++ program that uses the LLVM libraries to generate an LLVM IR module and it compiles and executes it.
The code uses vector types and I want to check if it translates to SIMD instructions correctly on my architecture.
How do I find this out? Is there a way to see the assembly code that is generated out of this IR?
You're probably looking for some combination of -emit-llvm which outputs IR instead of native assembly, and -S which outputs assembly instead of object files.

Kernel mode GPGPU usage

Is it possible to run CUDA or OpenCL applications from a Linux kernel module?
I have found a project which is providing this functionality, but it needs a userspace helper in order to run CUDA programs. (https://code.google.com/p/kgpu/)
While this project already avoids redundant memory copying between user and kernel space I am wondering if it is possible to avoid the userspace completely?
EDIT:
Let me expand my question. I am aware that kernel components can only call the API provided by the kernel and other kernel components. So I am not looking to call OpenCL or CUDA API directly.
CUDA or OpenCL API in the end has to call into the graphics driver in order to make its magic happen. Most probably this interface is completely non-standard, changing with every release and so on....
But suppose that you have a compiled OpenCL or CUDA kernel that you would want to run. Do the OpenCL/CUDA userspace libraries do some heavy lifting before actually running the kernel or are they just lightweight wrappers around the driver interface?
I am also aware that the userspace helper is probably the best bet for doing this since calling the driver directly would most likely get broken with a new driver release...
The short answer is, no you can't do this.
There is no way to call any code which relies on glibc from kernel space. That implies that there is no way of making CUDA or OpenCL API calls from kernel space, because those libraries rely on glibc and a host of other user space helper libraries and user space system APIs which are unavailable in kernel space. CUDA and OpenCL aren't unique in this respect -- it is why the whole of X11 runs in userspace, for example.
A userspace helper application working via a simple kernel module interface is the best you can do.
[EDIT]
The runtime components of OpenCL are not lightweight wrappers around a few syscalls to push a code payload onto the device. Amongst other things, they include a full just in time compilation toolchain (in fact that is all that OpenCL has supported until very recently), internal ELF code and object management and a bunch of other stuff. There is very little likelihood that you could emulate the interface and functionality from within a driver at runtime.

compiling ptx code on NVIDIA GPU?

I want to intercept at PTX level of opencl programs on NVIDIA GPU.
I imagine the routine would probably look like this.
First, I write an opencl program (both host and device code), using NVIDIA compiler to produce respective ptx code. Then I write what I want to do by modifying the PTX code (please don't ask why I didn't do this on the device C code - I have some reasons for it). But problem is, after being modified, how do I compile this PTX code to binary code?
You can use ptxas, which is included in the CUDA toolkit. It compiles .ptx into .cubin, which can then be loaded with the driver API.

Does CUDA use an interpreter or a compiler?

This is a bit of silly question, but I'm wondering if CUDA uses an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range
of conventional compiler options, such as for defining macros and
include/library paths, and for steering the compilation process.
Not just limited to cuda , shaders in directx or opengl are also complied to some kind of byte code and converted to native code by the underlying driver.