How to call a host function in a CUDA kernel? - cuda

As the following error implies, calling a host function ('rand') is not allowed in a kernel. Is there a solution if I really do need to do that?
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed

Unfortunately, you cannot call functions in device code that are not marked with the __device__ modifier. If you need random numbers in device code, look at the CUDA random number generation library, cuRAND: http://developer.nvidia.com/curand
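A minimal sketch of what using cuRAND's device API in place of rand() might look like. The names fill_uniform and device_uniform are hypothetical, error checking is omitted for brevity, and one generator state per thread is an assumption about the use case, not the only way to set it up:

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

// Each thread initialises its own generator state and draws one uniform float.
__global__ void fill_uniform(float* out, int n, unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState state;
    // Same seed for all threads, but a different subsequence per thread.
    curand_init(seed, i, 0, &state);
    out[i] = curand_uniform(&state);   // value in (0.0, 1.0]
}

// Hypothetical host helper: returns n samples generated on the device.
std::vector<float> device_uniform(int n, unsigned long long seed)
{
    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    fill_uniform<<<blocks, threads>>>(d_out, n, seed);

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling curand_init in every kernel launch is comparatively expensive; for repeated launches it is common to keep the states in global memory and initialise them once.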
If you have your own host function that you want to call from a kernel, use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add( int a, int b )
{
    return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the function are compiled: one callable by host code and another callable by device code. This is why the function can now be called from both host and device code.
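To make the two call paths concrete, here is a small sketch that calls the same add function once from host code and once from a kernel. The kernel add_kernel and the helper add_on_device are hypothetical names introduced for illustration, and error checking is left out:

```cuda
#include <cuda_runtime.h>

// Compiled twice by nvcc: once for the host, once for the device.
__host__ __device__ int add(int a, int b)
{
    return a + b;
}

// Uses the device-compiled version of add().
__global__ void add_kernel(int a, int b, int* result)
{
    *result = add(a, b);
}

// Hypothetical host helper: runs add() on the GPU and returns the result.
int add_on_device(int a, int b)
{
    int* d_result = nullptr;
    int h_result = 0;
    cudaMalloc(&d_result, sizeof(int));
    add_kernel<<<1, 1>>>(a, b, d_result);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

On the host, add(2, 3) runs the CPU version directly, while add_on_device(2, 3) goes through the kernel and the GPU version.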

The short answer is that there is no solution to that issue.
Everything that normally runs on a CPU must be tailored for the CUDA environment, with no guarantee that this is even possible. "Host function" is just CUDA's name for an ordinary C function, that is, a function running on a conventional CPU-and-memory von Neumann architecture, as all C/C++ code on PCs has done up to this point. GPUs give you tremendous amounts of computing power, but the cost is that they are not nearly as flexible or compatible. Most importantly, device functions run without the ability to directly access main (host) memory, and the memory they can access is limited.
If what you are trying to get is a random number generator you are in luck considering that Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function, described in an earlier post of mine here. If anyone finds a better link describing this functionality please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...

This does not apply to 'rand()', but a few host functions such as "printf" are available when compiling for compute capability >= 2.0
e.g:
nvcc.exe -gencode=arch=compute_10,code=\sm_10,compute_10\...
error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed
Compiles and works with sm_20,compute_20

I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: Think of the host (CPU) code as a CD which you put into a CD player, and of the device code as, say, an SD card which you put into a miniature music player. OP's question is "how can I shove a disc into my miniature music player"? You can't, and it makes no sense to want to. It might essentially be the same music (code with the same functionality; although usually, host code and device code don't perform quite the same computational task) - but the media are not interchangeable.

Related

Why does printf() work within a kernel, but using std::cout doesn't?

I have been exploring the field of parallel programming and have written basic kernels in Cuda and SYCL. I have encountered a situation where I had to print inside the kernel and I noticed that std::cout inside the kernel does not work whereas printf works. For example, consider the following SYCL Codes -
This works -
void print(float* A, size_t N){
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler){
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1> idx){
            printf("%f", accessor[idx[0]]);
        });
    });
}
whereas if I replace the printf with std::cout<<accessor[idx[0]] it raises a compile time error saying - Accessing non-const global variable is not allowed within SYCL device code.
A similar thing happens with CUDA kernels.
This got me thinking that what may be the difference between printf and std::cout which causes such behavior.
Also, if I wanted to implement a custom print function to be called from the GPU, how should I do it?
TIA
This got me thinking that what may be the difference between printf and std::cout which causes such behavior.
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (whose code is closed, if it exists in CUDA C at all). That function uses a hardware mechanism on NVIDIA GPUs - a buffer for kernel threads to print into, which gets sent back over to the host side, and the CUDA driver then forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of a compiler-assisted replacement/hijacking - and its code is simply irrelevant on the GPU.
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But - I decided I don't really like it, and its compilation is rather expensive, so instead, I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals - the only way to do this is to use printf() calls instead of C standard library or system calls on the host side. You essentially need to modularize your entire stream over the low-level primitive I/O facilities. It is far from trivial.
In SYCL you cannot use std::cout for output on code not running on the host for similar reasons to those listed in the answer for CUDA code.
This means if you are running kernel code on the "device" (e.g. a GPU) then you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.
There is no __device__ version of std::cout, so only printf can be used in device code.
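For the CUDA side of the question, the simplest form of a "custom print function" is a __device__ helper layered on top of the device-side printf. The sketch below is only that: print_value, print_array and print_on_device are hypothetical names, the output format is arbitrary, and error checking is mostly omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device-side print helper: tags each value with its block and thread.
__device__ void print_value(int index, float value)
{
    printf("[block %d, thread %d] value[%d] = %f\n",
           (int)blockIdx.x, (int)threadIdx.x, index, value);
}

__global__ void print_array(const float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) print_value(i, data[i]);
}

// Hypothetical host driver: copies the data over, prints it from the GPU,
// and returns the status of the synchronisation.
cudaError_t print_on_device(const float* host_data, int n)
{
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    print_array<<<blocks, threads>>>(d_data, n);

    // Device printf output is buffered; synchronising flushes it to stdout.
    cudaError_t status = cudaDeviceSynchronize();
    cudaFree(d_data);
    return status;
}
```

The order in which lines from different threads appear is not guaranteed, so anything that needs ordered output has to arrange that itself.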

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // A literal '%' in inline PTX must be written as '%%'.
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}
and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
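A sketch of both accessors together with a kernel that reports their values back to the host. In CUDA inline PTX a literal percent sign has to be written as %%, so the special registers appear as %%dynamic_smem_size and %%total_smem_size inside the asm string. The names report_smem and query_smem are hypothetical, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}

__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}

// Writes the two sizes seen by the block to out[0] and out[1].
__global__ void report_smem(unsigned* out)
{
    if (threadIdx.x == 0) {
        out[0] = dynamic_smem_size();
        out[1] = total_smem_size();
    }
}

// Hypothetical host helper: launches one block with `dynamic_bytes` of
// dynamic shared memory and copies both sizes back into `sizes`.
void query_smem(unsigned dynamic_bytes, unsigned sizes[2])
{
    unsigned* d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(unsigned));
    report_smem<<<1, 32, dynamic_bytes>>>(d_out);
    cudaMemcpy(sizes, d_out, 2 * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

The dynamic size should match the third launch parameter; the total may come out larger than the sum you expect, since it can include static shared memory and allocation rounding.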

NVRTC and __device__ functions

I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I identified a specific __device__ function whose performance can be strongly improved by removing all global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__ function (not a __global__), in order to "override" an existing function?
I am pretty sure the really short answer is no.
Although CUDA has dynamic/JIT device linker support, it is important to remember that the linkage process itself is still static.
So you can't delay load a particular function in an existing compiled GPU payload at runtime as you can in a conventional dynamic link loading environment. And the linker still requires that a single instance of all code objects and symbols be present at link time, whether that is a priori or at runtime. So you would be free to JIT link together precompiled objects with different versions of the same code, as long as a single instance of everything is present when the session is finalised and the code is loaded into the context. But that is as far as you can go.
It looks like you have a "main" kernel with a part that is "switchable" at run time.
You can definitely do this using nvrtc. You'd need to go about doing something like this:
Instead of compiling the main kernel ahead of time, store it as a string to be compiled and linked at runtime.
Let's say the main kernel calls "myFunc", which is a device function that is chosen at runtime.
You can generate the appropriate "myFunc" kernel based on equations at run time.
Now you can create an nvrtc program using multiple sources using nvrtcCreateProgram.
That's about it. The key is to delay compiling the main kernel until you need it at run time. You may also want to cache your kernels somehow so you end up compiling only once.
There is one problem I foresee. nvrtc may not find the curand device calls which may cause some issues. One work around would be to look at the header the device function call is in and use nvcc to compile the appropriate device kernel to ptx. You can store the resulting ptx as text and use cuLinkAddData to link with your module. You can find more information in this section.
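The steps in this answer could be sketched roughly as follows. nvrtcCreateProgram takes one main source string plus a list of named in-memory headers, so this sketch supplies the run-time generated "myFunc" as a header that the main kernel includes; that arrangement is an assumption about how to realise "multiple sources", not something the answer spells out. The names kMainKernel, my_func.cuh, main_kernel and compile_with_my_func are all hypothetical.

```cuda
#include <nvrtc.h>
#include <stdexcept>
#include <string>

// Main kernel kept as a string. "my_func.cuh" does not exist on disk;
// it is provided in memory when the program is created.
static const char* kMainKernel = R"(
#include "my_func.cuh"
extern "C" __global__ void main_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = myFunc(data[i]);
}
)";

// Hypothetical helper: generates myFunc from an expression in x,
// compiles it together with the main kernel, and returns the PTX.
std::string compile_with_my_func(const std::string& expression)
{
    std::string header =
        "__device__ float myFunc(float x) { return " + expression + "; }\n";
    const char* headers[] = { header.c_str() };
    const char* include_names[] = { "my_func.cuh" };

    nvrtcProgram prog;
    if (nvrtcCreateProgram(&prog, kMainKernel, "main_kernel.cu",
                           1, headers, include_names) != NVRTC_SUCCESS)
        throw std::runtime_error("nvrtcCreateProgram failed");

    if (nvrtcCompileProgram(prog, 0, nullptr) != NVRTC_SUCCESS) {
        // Surface the compiler log so a bad expression is easy to diagnose.
        size_t log_size = 0;
        nvrtcGetProgramLogSize(prog, &log_size);
        std::string log(log_size, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        nvrtcDestroyProgram(&prog);
        throw std::runtime_error(log);
    }

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    return ptx;
}
```

The returned PTX can then be loaded through the driver API (for example with cuModuleLoadData and cuModuleGetFunction) and cached per expression so each variant is compiled only once. The program has to be linked against the NVRTC library.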

Is there a reason GPU/CPU pointers aren't more strongly typed?

Is there a reason the language designers didn't make pointers more strongly typed, so that the compiler could differentiate between a GPU-pointer and a CPU-pointer and eliminate the ridiculously common bug of mixing the two?
Is there ever a need to have a pointer refer to both a GPU-memory location and a CPU-memory location at once (is that even possible)?
Or is this just an incredibly glaring oversight in the design of the language?
[Edit] Example: C++/CLI has two different types of pointers, which cannot be mixed. They introduced separate notation so that this requirement could be enforced by the compiler:
int* a; //Normal pointer
int^ b; //Managed pointer
//pretend a is assigned here
b = a; //Compiler error!
Is there a reason (other than laziness/oversight) that CUDA does not do the same thing?
Nvidia's nvcc CUDA C "compiler" is not a full compiler, but a rather simple driver program that calls some other tools (cudafe and the C preprocessor) to separate host and device code, and feeds them to their respective compilers.
Only the device code compiler (cicc, or nvopencc in previous CUDA releases) is provided by Nvidia. The host portion of the code is just passed on to the host's native C compiler, which frees Nvidia from the burden of providing a competitive compiler itself.
Generating error messages on improper pointer use would require parsing the host C code. While that would certainly be possible (teaching e.g. sparse or clang about the CUDA peculiarities), to my knowledge nobody has invested the effort into this so far.
Nvidia has written up a document on the NVIDIA CUDA Compiler Driver NVCC that explains the compilation process and the tools involved in more detail.
All pointers you define are stored in RAM, no matter whether they are GPU pointers or CPU pointers; you then have to copy them to the GPU yourself. There is no GPU pointer or CPU pointer as such: a pointer is just a variable that holds the address of a location in memory. What matters is where you use it. If you use it on the GPU, the GPU will look for that address in the memory it can access, which can be a location in RAM if you have pinned it for your graphics memory.
The most important thing is that you don't have direct access to a location in RAM, because the address space on a CPU is virtual: your data might even be stored on a hard drive. This isn't the case on the GPU, where your memory address is a direct pass to the location. That makes it impossible to unify both address spaces.