Is there any drawback or overhead to defining a function with both __host__ __device__ qualifiers instead of just __device__?
There won't be any drawbacks. If you count the generated binary code for the host version of your function as overhead, then yes, there is some overhead that increases your program size.
The nvcc compiler driver will build one device version of the function callable from __global__ code, and will use the host compiler to generate one version of the function for host code. That's it.
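A minimal sketch of what such a dual-qualified function looks like (the function and kernel names are made up for illustration):

__host__ __device__ float clampToUnit(float x)
{
    // The same source is compiled twice: once by the host compiler, once for the device.
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

__global__ void clampKernel(float *data)
{
    data[threadIdx.x] = clampToUnit(data[threadIdx.x]);   // device-side call
}

// The same clampToUnit() can also be called directly from ordinary host code.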
I want to redefine malloc() and free() in my code, but when I run, two errors appear:
allowing all exceptions is incompatible with previous function "malloc";
allowing all exceptions is incompatible with previous function "free";
Then I searched for this error, and it seems CUDA doesn't allow us to redefine library functions. Is this true? If we can't redefine those functions, how can I resolve the error?
The very short answer is that you cannot.
malloc is fundamentally a C++ standard library function which the CUDA toolchain internally overloads with a device hook in device code. Attempting to define your own device version of malloc or free can and will break the toolchain's internals. Exactly how depends on platform and compiler.
In your previous question on this, you had code like this:
__device__ void* malloc(size_t t)
{ return theHeap.alloc(t); }
__device__ void free(void* p)
{ theHeap.dealloc(p); }
Because of existing standard library requirements, malloc and free must be defined as __device__ __host__ at global namespace scope. It is illegal in CUDA to have separate __device__ and __host__ definitions of the same function. You could probably get around this restriction by using a private namespace for the custom allocator, or by using different function names, as sketched below. But don't try to redefine anything from the standard library in device or host code. It will break things.
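A minimal sketch of the namespace approach; the names are made up, and the bodies here simply forward to the built-in device-side malloc/free (available on compute capability 2.0 and later), but you could substitute your own allocator:

#include <cstddef>

namespace myheap {
    __device__ void* alloc(size_t n)  { return ::malloc(n); }   // forwards to the built-in device malloc
    __device__ void  dealloc(void* p) { ::free(p); }            // forwards to the built-in device free
}

__global__ void kernel()
{
    int *p = static_cast<int*>(myheap::alloc(32 * sizeof(int)));
    // ... use p ...
    myheap::dealloc(p);
}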
I wrote a CUDA kernel to run via MATLAB,
with several cuDoubleComplex pointers. I activated the kernel with complex double vectors (defined as gpuArray), and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I make MATLAB recognize this type?
The short answer is: you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain if it is to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply Matlab with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example,
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in Matlab as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
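For the first approach, a hedged sketch of what the kernel might look like (the body is illustrative); cuDoubleComplex is a typedef of double2, so .x holds the real part and .y the imaginary part:

__global__ void mykernel(double2 *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i].x *= 2.0;   // real part
    a[i].y *= 2.0;   // imaginary part
}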
In the host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it generates code for exactly the code path of the current device.
However, if __CUDA_ARCH__ is used within device code, it will generate different code paths for the different devices specified in the compilation options (/arch).
Can anyone confirm this is correct?
When used in device code, __CUDA_ARCH__ carries a numeric value that reflects the architecture currently being compiled for.
It is not intended to be used in host code. From the nvcc manual:
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
Usage of __CUDA_ARCH__ in host code is therefore undefined (at least by CUDA). As pointed out by @tera in the comments, since the macro is undefined in host code, it can be used to differentiate the host and device paths, for example in a __host__ __device__ function definition:
#ifndef __CUDA_ARCH__
//host code here
#else
//device code here
#endif
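For illustration, a minimal __host__ __device__ function using that pattern (the function name and return values are made up):

__host__ __device__ int onDevice()
{
#ifndef __CUDA_ARCH__
    return 0;   // host compilation trajectory: the macro is undefined
#else
    return 1;   // device compilation trajectory: __CUDA_ARCH__ is e.g. 350 when compiling for sm_35
#endif
}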
Can anyone describe the differences between __global__ and __device__ ?
When should I use __device__, and when should I use __global__?
Global functions are also called "kernels". They are the functions that you can call from the host side using the CUDA kernel launch syntax (<<<...>>>).
Device functions can only be called from other device or global functions. __device__ functions cannot be called from host code.
Differences between __device__ and __global__ functions are:
__device__ functions can be called only from the device, and are executed only on the device.
__global__ functions can be called from the host, and are executed on the device.
Therefore, you call __device__ functions from kernel functions, and you don't have to set up a kernel launch configuration. You can also "overload" a function, e.g. you can declare void foo(void) and __device__ foo(void); then one is executed on the host and can only be called from a host function, and the other is executed on the device and can only be called from a device or kernel function.
You can also visit the following link: http://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialDeviceFunctions, it was useful for me.
__global__ - Runs on the GPU, called from the CPU or the GPU*. Executed with <<<dim3>>> arguments.
__device__ - Runs on the GPU, called from the GPU. Can be used with variables too.
__host__ - Runs on the CPU, called from the CPU.
*) __global__ functions can be called from other __global__ functions starting with compute capability 3.5.
I will explain it with an example:
int main()
{
    // Your main function. Executed by the CPU.
}

__global__ void calledFromCpuForGPU(...)
{
    // This function is called from the CPU and is executed on the GPU.
}

__device__ void calledFromGPUforGPU(...)
{
    // This function is called from the GPU and is executed on the GPU.
}
i.e. when we want a host (CPU) function to call a device (GPU) function, '__global__' is used. Read this: "https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialGlobalFunctions"
And when we want a device (GPU) function (or rather a kernel) to call another function, we use '__device__'. Read this: "https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialDeviceFunctions"
This should be enough to understand the difference.
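To make the distinction concrete, here is a small self-contained sketch (all names are illustrative) in which the host launches a kernel and the kernel calls a __device__ helper:

__device__ int square(int x)             // callable only from device code
{
    return x * x;
}

__global__ void squareKernel(int *out)   // callable from host code via <<<...>>>
{
    out[threadIdx.x] = square(threadIdx.x);
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, 4 * sizeof(int));
    squareKernel<<<1, 4>>>(d_out);        // launch configuration: 1 block of 4 threads
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}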
__global__ is for CUDA kernels: functions that are callable from the host directly. __device__ functions can be called from __global__ and __device__ functions, but not from the host.
A __global__ function is the definition of a kernel. Whenever it is called from the CPU, that kernel is launched on the GPU.
However, each thread executing that kernel might need to execute some code again and again, for example swapping two integers. Thus, here we can write a helper function, just like we do in a C program. And for threads executing on the GPU, a helper function should be declared as __device__.
Thus, a device function is called from the threads of a kernel (one instance per thread), while a global function is called from the CPU thread.
I am recording some unfounded speculations here for the time being (I will substantiate these later when I come across some authoritative source)...
__device__ functions can have a return type other than void but __global__ functions must always return void.
__global__ functions can be called from within other kernels running on the GPU to launch additional GPU threads (as part of the CUDA dynamic parallelism model, aka CNP), while __device__ functions run on the same thread as the calling kernel.
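A hedged sketch of the dynamic parallelism point (names are illustrative); this requires compute capability 3.5 or later and compiling with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true:

__global__ void childKernel(int *data)
{
    data[threadIdx.x] += 1;
}

__global__ void parentKernel(int *data)
{
    if (threadIdx.x == 0)
        childKernel<<<1, 32>>>(data);   // a kernel launched from device code
}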
__global__ is a CUDA C keyword (declaration specifier) which says that the function:
Executes on the device (GPU), and
Is called from host (CPU) code.
Global functions (kernels) are launched by the host code using <<< no_of_blocks, no_of_threads_per_block >>>.
Each thread executes the kernel using its unique thread ID.
However, __device__ functions cannot be called from host code. If you need to do that, use both __host__ __device__.
Global functions can only be called from the host and they don't have a return type, while device functions can only be called from a kernel function or another device function, and hence don't require a kernel launch configuration.
Is there any difference, and what is the best way to define device constants in a CUDA program? In a C++ host/device program, if I want to define constants in device constant memory, I can do either
__device__ __constant__ float a = 5;
__constant__ float a = 5;
Question 1. On devices of compute capability 2.x with CUDA 4, is it the same as
__device__ const float a = 5;
Question 2. Why is it that in PyCUDA SourceModule("""..."""), which compiles only device code, even the following works?
const float a = 5;
In CUDA, __constant__ is a variable type qualifier that indicates that the variable being declared is to be stored in device constant memory. Quoting section B.2.2 of the CUDA programming guide:
The __constant__ qualifier, optionally used together with __device__, declares a variable that:
Resides in constant memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime API and cuModuleGetGlobal() for the driver API).
In CUDA, constant memory is a dedicated, static, global memory area accessed via a cache (there is a dedicated set of PTX load instructions for the purpose), which is uniform and read-only for all threads in a running kernel. But the contents of constant memory can be modified at runtime through the host-side APIs quoted above. This is different from declaring a variable to the compiler using the const qualifier, which only adds a read-only characteristic to the variable at the scope of the declaration. The two are not at all the same thing.
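A minimal sketch of setting a __constant__ variable from the host at runtime with cudaMemcpyToSymbol (the variable, value, and kernel are illustrative):

__constant__ float a;                                // resides in device constant memory

__global__ void scale(float *out)
{
    out[threadIdx.x] = a * threadIdx.x;              // every thread reads the same cached value
}

int main()
{
    float h_a = 5.0f;
    cudaMemcpyToSymbol(a, &h_a, sizeof(float));      // modify constant memory at runtime
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    scale<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}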