Difference between global and device functions - cuda

Can anyone describe the differences between __global__ and __device__?
When should I use __device__, and when should I use __global__?

Global functions are also called "kernels". They are the functions that you can call from the host side using the CUDA kernel launch syntax (<<<...>>>).
Device functions can only be called from other device or global functions. __device__ functions cannot be called from host code.

Differences between __device__ and __global__ functions are:
__device__ functions can be called only from the device, and they are executed only on the device.
__global__ functions can be called from the host, and they are executed on the device.
Therefore, you call __device__ functions from kernel functions, and you don't have to specify any launch configuration for them. You can also "overload" a function, e.g. you can declare void foo(void) and __device__ void foo(void); one is executed on the host and can only be called from a host function, the other is executed on the device and can only be called from a device or kernel function.
You can also visit the following link: http://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialDeviceFunctions, it was useful for me.

__global__ - Runs on the GPU, called from the CPU or the GPU*. Executed with <<<dim3>>> arguments.
__device__ - Runs on the GPU, called from the GPU. Can be used with variables too.
__host__ - Runs on the CPU, called from the CPU.
*) __global__ functions can be called from other __global__ functions starting
with compute capability 3.5.

I will explain it with an example:
int main()
{
// Your main function. Executed by the CPU.
}
__global__ void calledFromCpuForGPU(...)
{
// This function is called by the CPU and is supposed to be executed on the GPU.
}
__device__ void calledFromGPUforGPU(...)
{
// This function is called by the GPU and is supposed to be executed on the GPU.
}
i.e. when we want a host (CPU) function to call a device (GPU) function, '__global__' is used. Read this: "https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialGlobalFunctions"
And when we want a kernel (or another device function) to call a helper function on the GPU, we use '__device__'. Read this: "https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialDeviceFunctions"
This should be enough to understand the difference.

__global__ is for CUDA kernels: functions that are callable from the host directly. __device__ functions can be called from __global__ and __device__ functions but not from the host.

A __global__ function is the definition of a kernel. Whenever it is called from the CPU, that kernel is launched on the GPU.
However, each thread executing that kernel might need to execute some code again and again, for example swapping two integers. Here we can write a helper function, just like we do in a C program. For threads executing on the GPU, such a helper function should be declared as __device__.
Thus, a device function is called from the threads of a kernel, one instance per thread, while a global function is called from the CPU thread.
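For example, a minimal sketch of that idea (the kernel, the swap helper, and the launch configuration below are hypothetical, chosen only for illustration):
// Hypothetical __device__ helper: runs on the GPU, one instance per calling thread.
__device__ void swapInts(int &a, int &b)
{
    int tmp = a;
    a = b;
    b = tmp;
}
// Hypothetical kernel: each thread swaps one pair of values via the helper.
__global__ void swapPairs(int *x, int *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        swapInts(x[i], y[i]);   // plain function call, no <<<...>>> configuration needed
}
// The kernel itself is launched from the host with a configuration, e.g.:
// swapPairs<<<(n + 255) / 256, 256>>>(d_x, d_y, n);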

I am recording some unfounded speculations here for the time being (I will substantiate these later when I come across some authoritative source)...
__device__ functions can have a return type other than void but __global__ functions must always return void.
__global__ functions can be called from within other kernels running on the GPU to launch additional GPU threads (as part of the CUDA dynamic parallelism model, aka CNP), while __device__ functions run on the same thread as the calling kernel.
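A small sketch of the return-type point (the names below are made up for illustration): a __device__ function may return a value, while the kernel that calls it must be declared void and write its results through memory:
// Hypothetical device helper with a non-void return type.
__device__ float square(float x)
{
    return x * x;
}
// Kernels must return void; results are written out through pointers instead.
__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}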

__global__ is a CUDA C keyword (declaration specifier) which says that the function:
Executes on the device (GPU)
Is called from host (CPU) code.
Global functions (kernels) are launched by the host code using <<<no_of_blocks, no_of_threads_per_block>>>.
Each thread executes the kernel and is identified by its unique thread ID.
However, __device__ functions cannot be called from host code. If you need to do that, use both __host__ __device__.

Global functions can only be launched from the host (aside from dynamic parallelism) and must not return a value, while device functions can only be called from a kernel function or another device function and hence don't require a launch configuration.

Related

How to redefine malloc/free in CUDA?

I want to redefine malloc() and free() in my code, but when I compile, two errors appear:
allowing all exceptions is incompatible with previous function "malloc";
allowing all exceptions is incompatible with previous function "free";
Then I searched for this error, and it seems CUDA doesn't allow us to redefine library functions. Is this true? If we can't redefine those functions, how can I resolve the error?
The very short answer is that you cannot.
malloc is fundamentally a C++ standard library function which the CUDA toolchain internally overloads with a device hook in device code. Attempting to define your own device version of malloc or free can and will break the toolchain's internals. Exactly how depends on platform and compiler.
In your previous question on this, you had code like this:
__device__ void* malloc(size_t t)
{ return theHeap.alloc(t); }
__device__ void free(void* p)
{ theHeap.dealloc(p); }
Because of existing standard library requirements, malloc and free must be defined as __device__ __host__ at global namespace scope. It is illegal in CUDA to have separate __device__ and __host__ definitions of the same function. You could probably get around this restriction by using a private namespace for the custom allocator, or by using different function names. But don't try to redefine anything from the standard library in device or host code. It will break things.
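As a rough sketch of that workaround (everything here is hypothetical: SimpleHeap, theHeap, and the custom:: function names stand in for whatever your allocator actually provides), the point is only that the custom allocator avoids the reserved names malloc and free:
#include <cstddef>

// Hypothetical custom heap; the allocation strategy is left out of this sketch.
struct SimpleHeap {
    __device__ void* alloc(size_t n)  { /* custom allocation goes here */ return nullptr; }
    __device__ void  dealloc(void* p) { /* matching deallocation goes here */ }
};

__device__ SimpleHeap theHeap;

namespace custom {
    __device__ void* allocate(size_t n)  { return theHeap.alloc(n); }
    __device__ void  deallocate(void* p) { theHeap.dealloc(p); }
}

__global__ void useCustomAllocator()
{
    void* p = custom::allocate(64);
    custom::deallocate(p);
}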

Ensure that thrust doesn't memcpy from host to device

I have used the following method, expecting to avoid a memcpy from host to device. Does the thrust library ensure that there won't be a memcpy from host to device in the process?
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

void EScanThrust(float * d_in, float * d_out)
{
    // 'size' (the number of elements) is assumed to be defined elsewhere
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
}
Here d_in and d_out are prepared using cudaMalloc, and d_in is filled with data using cudaMemcpy before calling this function.
Does the thrust library ensure that there won't be a memcpy from host to device in the process?
The code you've shown shouldn't involve any host->device copying. (How could it? There are no references anywhere to any host data in the code you have shown.)
For actual code, it's easy enough to verify the underlying CUDA activity using a profiler, for example:
nvprof --print-gpu-trace ./my_exe
If you keep your profiled code sequences short, it's pretty easy to line up the underlying CUDA activity with the thrust code that generated that activity. If you want to profile just a short segment of a longer sequence, then you can turn profiling on and off or else use NVTX markers to identify the desired range in the profiler output.
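For instance, a minimal sketch of the NVTX idea (the function name, the range label, and the size parameter below are made up; link against the NVTX library, e.g. with -lnvToolsExt):
#include <thrust/device_ptr.h>
#include <thrust/scan.h>
#include <nvToolsExt.h>

void EScanThrustProfiled(float *d_in, float *d_out, int size)
{
    // Mark the region of interest so it shows up by name in the profiler trace.
    nvtxRangePushA("exclusive_scan");
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
    nvtxRangePop();
}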

Set the number of blocks and threads when calling a device function in CUDA?

I have a basic question about calling a device function from a global CUDA kernel. Can we specify the number of blocks and threads when calling a device function?
I posted a question earlier about min reduction (here), and I want to call this function inside another global kernel. However, the reduction code needs a certain number of blocks and threads.
There are two types of functions that can be called on the device:
__device__ functions are like ordinary C or C++ functions: they operate in the context of a single (CUDA) thread. It's possible to call these from any number of threads in a block, but from the standpoint of the function itself, it does not automatically create a set of threads like a kernel launch does.
__global__ functions or "kernels" can only be called using a kernel launch method (e.g. my_kernel<<<...>>>(...); in the CUDA runtime API). When calling a __global__ function via a kernel launch, you specify the number of blocks and threads to launch as part of the kernel configuration (<<<...>>>). If your GPU is of compute capability 3.5 or higher, then you can also call a __global__ function from device code (using essentially the same kernel launch syntax, which allows you to specify blocks and threads for the "child" kernel). This employs CUDA Dynamic Parallelism which has a whole section of the programming guide dedicated to it.
There are many CUDA sample codes that demonstrate:
calling a __device__ function, such as simpleTemplates
calling a __global__ function from the device, such as cdpSimplePrint
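A minimal sketch combining both cases (all names and launch configurations below are made up; the child launch requires compute capability 3.5+ and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):
// Ordinary __device__ helper: runs in the context of the calling thread,
// so no blocks or threads are specified for it.
__device__ int scaleValue(int v)
{
    return 2 * v;
}

__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scaleValue(data[i]);
}

__global__ void parentKernel(int *data, int n)
{
    // A __global__ function launched from device code (CUDA Dynamic Parallelism):
    // here blocks and threads ARE specified, just as in a host-side launch.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}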

Complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB,
with several cuDoubleComplex pointers. I launched the kernel with complex double vectors (defined as gpuArray) and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I make MATLAB recognize this type?
The short answer: you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example,
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in Matlab as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
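If you take the first approach instead, a rough sketch of the kernel rewritten with double2 (cuDoubleComplex is a typedef of double2, with the real part in .x and the imaginary part in .y) might look like this; the kernel body is made up for illustration:
__global__ void mykernel(double2 *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Example operation: scale each complex element by 2.
    // (Bounds checking against the array length is omitted in this sketch,
    // so launch exactly as many threads as there are elements.)
    a[i].x *= 2.0;
    a[i].y *= 2.0;
}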

Overhead/drawback of defining a function with both __device__ and __host__ qualifiers?

Is there any drawback or overhead to defining a function with both
__host__ __device__
qualifiers instead of just
__device__
?
There won't be any drawbacks. If you count the binary code generated for the host version of your function as overhead, then yes, there will be overhead that increases your program size.
The nvcc compiler driver will build one device function callable from __global__ code and utilize the host compiler to generate one version of your function for host code. That's it.
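As a small illustrative sketch (the function and variable names below are made up), the same source is simply compiled twice, once per side:
#include <cstdio>

// Compiled once for the host and once for the device by nvcc.
__host__ __device__ float clampToUnit(float x)
{
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

__global__ void clampKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clampToUnit(data[i]);   // calls the device version
}

int main()
{
    printf("%f\n", clampToUnit(1.5f));    // calls the host version
    return 0;
}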