I have a __global__ function in CUDA. Can it call itself?
Here is my example:
__global__ void
force_create_empty_nodes (struct NODE *Nodes, int topnode, int bits, int no, int x, int y,
int z, struct topnode_data *TopNodes)
{
/* ... some code ... */
force_create_empty_nodes <<<1, 8>>>(Nodes, topnode+1, bits+1, no+1,
x+1, y+1, z+1, TopNodes);
}
And the error I receive is:
error: kernel launch from __device__ or __global__ functions requires separate compilation mode
Here is my make command:
nvcc -c -arch compute_35 cudaForceNodes.cu -o obj/cudaForceNodes.o
Calling a kernel from another kernel is called dynamic parallelism; it is documented in the CUDA C Programming Guide under CUDA Dynamic Parallelism.
It requires:
A compute capability 3.5 device. You can find the compute capability of your device by running the CUDA deviceQuery sample.
Various switches in the compile command: those specifying compilation for a cc 3.5 architecture, those enabling separate (device) compilation, and linking with the device runtime.
Since your GT550M is not a cc 3.5 device, you won't be able to use this feature. There is no other way to call a kernel from within a kernel.
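For reference, on a cc 3.5 device a build that enables dynamic parallelism would look something like this - a sketch only, with a hypothetical executable name, using -dc for separate device compilation and -lcudadevrt to link the device runtime:
nvcc -arch=sm_35 -dc cudaForceNodes.cu -o obj/cudaForceNodes.o
nvcc -arch=sm_35 obj/cudaForceNodes.o -lcudadevrt -o app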
I am trying to do some FP16 work that will have both CPU and GPU backends. I researched my options and decided to use CUDA's half precision converter and data types. The ones I intend to use are specified as both __device__ and __host__, which according to my understanding (and the official documentation) should mean that the functions are callable from both host and device code. I wrote a simple test program:
#include <iostream>
#include <cuda_fp16.h>
int main() {
const float a = 32.12314f;
__half2 test = __float2half2_rn(a);
__half test2 = __float2half(a);
return 0;
}
However when I try to compile it I get:
nvcc cuda_half2.cu
cuda_half2.cu(6): error: calling a __device__ function("__float2half2_rn") from a __host__ function("main") is not allowed
cuda_half2.cu(7): error: calling a __device__ function("__float2half") from a __host__ function("main") is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_000013b8_00000000-4_cuda_half2.cpp4.ii".
The only thing that comes to mind is that my CUDA is 9.1 and I'm reading the documentation for 9.2, but I can't find an older version of it, nor can I find anything in the changelog. Ideas?
Switch to CUDA 9.2
Your code compiles without error on CUDA 9.2, but throws the errors you indicate on CUDA 9.1. If you have CUDA 9.1 installed, then the documentation for it is already installed on your machine. On a typical Linux install, it will be located in /usr/local/cuda-9.1/doc. If you look at /usr/local/cuda-9.1/doc/pdf/CUDA_Math_API.pdf, you will see that the corresponding functions are only marked __device__, so this change was indeed made between CUDA 9.1 and CUDA 9.2.
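To illustrate what the __device__-only marking means on CUDA 9.1: the same conversion compiles fine once it is moved into device code. A minimal sketch (the kernel name is hypothetical), not part of the original answer:
#include <cuda_fp16.h>
// On CUDA 9.1, __float2half is __device__-only, so call it from a kernel.
__global__ void convertKernel(const float *in, __half *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __float2half(in[i]);
}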
I wrote a CUDA kernel to run via MATLAB,
with several cuDoubleComplex pointers. I launched the kernel with complex double vectors (defined as gpuArray) and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I make MATLAB recognize this type?
The short answer: you can't.
The list of supported types for kernels is given in the Parallel Computing Toolbox documentation, and that is all your kernel code can contain to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example,
__global__ void mykernel(cuDoubleComplex *a) { ... }
would be compiled to PTX using nvcc and then loaded in MATLAB as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
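As an illustration of the first option, here is a minimal sketch (the kernel body is hypothetical) that uses double2 directly, with the real part in .x and the imaginary part in .y:
__global__ void mykernel(double2 *a)
{
    int i = threadIdx.x;
    double re = a[i].x; // real part
    double im = a[i].y; // imaginary part
    a[i].x = re + im;   // example operation only
    a[i].y = re - im;
}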
Can anyone describe the differences between __global__ and __device__?
When should I use __device__, and when should I use __global__?
Global functions are also called "kernels". They are the functions that you may call from the host side using the CUDA kernel launch syntax (<<<...>>>).
Device functions can only be called from other device or global functions. __device__ functions cannot be called from host code.
The differences between __device__ and __global__ functions are:
__device__ functions can be called only from the device, and they execute only on the device.
__global__ functions can be called from the host, and they execute on the device.
Therefore, you call __device__ functions from kernel functions, and you don't have to specify an execution configuration. You can also "overload" a function, e.g. you can declare void foo(void) and __device__ void foo(void); one is executed on the host and can only be called from a host function, while the other is executed on the device and can only be called from a device or kernel function.
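A minimal sketch of that overloading idea (foo is a hypothetical name); the compiler picks the version that matches the calling context:
#include <cstdio>
void foo(void) { printf("foo on the host\n"); }              // host version
__device__ void foo(void) { printf("foo on the device\n"); } // device version
__global__ void kernel(void) { foo(); } // resolves to the __device__ overload
int main(void)
{
    foo();                   // resolves to the host overload
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize(); // flush the device-side printf
    return 0;
}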
You can also visit the following link: http://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialDeviceFunctions, it was useful for me.
__global__ - Runs on the GPU, called from the CPU or the GPU*. Executed with <<<dim3>>> arguments.
__device__ - Runs on the GPU, called from the GPU. Can be used with variables too.
__host__ - Runs on the CPU, called from the CPU.
*) __global__ functions can be called from other __global__ functions starting with compute capability 3.5.
I will explain it with an example:
int main()
{
    // Your main function. Executed by the CPU.
}
__global__ void calledFromCpuForGPU(...)
{
    // This function is called from the CPU and is executed on the GPU.
}
__device__ void calledFromGPUforGPU(...)
{
    // This function is called from the GPU and is executed on the GPU.
}
i.e. when we want a host (CPU) function to call a device (GPU) function, 'global' is used. Read this: "https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialGlobalFunctions"
And when we want a kernel (or another device function) to call a helper function that runs on the GPU, we use 'device'. Read this: "https://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialDeviceFunctions"
This should be enough to understand the difference.
__global__ is for CUDA kernels, functions that are callable from the host directly. __device__ functions can be called from __global__ and __device__ functions but not from the host.
A __global__ function is the definition of a kernel. Whenever it is called from the CPU, that kernel is launched on the GPU.
However, each thread executing that kernel might need to execute some code again and again, for example swapping two integers. Thus, here we can write a helper function, just like we do in a C program. For threads executing on the GPU, a helper function should be declared as __device__.
Thus, a device function is called from the threads of a kernel - one instance per thread - while a global function is called from a CPU thread.
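A minimal sketch of the swap helper described above (the names are hypothetical); the __device__ function executes within each calling thread:
__device__ void swapInts(int &a, int &b)
{
    int tmp = a;
    a = b;
    b = tmp;
}
__global__ void swapPairs(int *data)
{
    int i = 2 * threadIdx.x;
    if (data[i] > data[i + 1])
        swapInts(data[i], data[i + 1]); // no new launch; same thread
}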
I am recording some unfounded speculations here for the time being (I will substantiate these later when I come across some authoritative source)...
__device__ functions can have a return type other than void but __global__ functions must always return void.
__global__ functions can be called from within other kernels running on the GPU to launch additional GPU threads (as part of the CUDA dynamic parallelism model, aka CNP), while __device__ functions run on the same thread as the calling kernel.
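A sketch of the return-type point (hypothetical names): the __device__ helper returns a value, while the __global__ kernel must return void:
__device__ int square(int x) { return x * x; } // non-void return is allowed
__global__ void squareAll(int *data, int n)    // kernels must return void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]); // executes on the calling thread
}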
__global__ is a CUDA C keyword (declaration specifier) which says that the function:
Executes on the device (GPU)
Is called from host (CPU) code.
Global functions (kernels) are launched by the host code using <<<no_of_blocks, no_of_threads_per_block>>>.
Each thread executes the kernel, identified by its unique thread id.
However, __device__ functions cannot be called from host code. If you need to do that, use both __host__ __device__.
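A minimal sketch of the __host__ __device__ combination (the helper is hypothetical); a single definition is compiled for both the CPU and the GPU:
// Callable from both host and device code.
__host__ __device__ float lerp(float a, float b, float t)
{
    return a + t * (b - a);
}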
Global functions can only be called from the host, and they don't have a return type, while device functions can only be called from a kernel or another device function and hence don't require a launch configuration.
My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking is done through makefiles with a lot of flags. I think the problem is in the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu
with a program like this (to run on a Linux machine):
#include <stdio.h>
#include "cuPrintf.cu" // cuPrintf ships with the CUDA SDK sources
__global__ void myKernel()
{
cuPrintf("Hello, world from the device!\n");
}
int main()
{
cudaPrintfInit();
myKernel<<<1,10>>>();
cudaPrintfDisplay(stdout, true);
cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestion on why the kernel doesn't launch?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular, and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:
Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>
__global__ void myKernel()
{
printf("Hello, world from the device!\n");
}
int main()
{
myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>
__global__ void myKernel()
{
printf("Hello, world from the device!\n");
}
int main()
{
myKernel<<<1,10>>>();
cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which will be supported on all CUDA devices. If it still doesn't run, do a build clean and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
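A sketch of how the two switches are usually combined: compile for the compute_20 virtual architecture while embedding binary code for the sm_20 real architecture:
nvcc -gencode arch=compute_20,code=sm_20 test.cu -o test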
In Visual Studio:
Right click on your project > Properties > CUDA C/C++ > Device
and then add the following to the Code Generation field:
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
Generating code for all of these architectures makes the build slower and the binary larger, so eliminate them one by one to find which compute and sm values are required for your GPU.
But if you are shipping this to others, it is better to include all of them.
I'm using a GeForce 9800 GX2. I installed the drivers and the CUDA SDK, and wrote a simple program which looks like this:
#include <stdio.h>
#include <stdlib.h>
#include "cuPrintf.cu" // cuPrintf ships with the CUDA SDK sources
__global__ void myKernel(int *d_a)
{
int tx=threadIdx.x;
d_a[tx]+=1;
cuPrintf("Hello, world from the device!\n");
}
int main()
{
int *a=(int*)malloc(sizeof(int)*10);
int *d_a;
int i;
for(i=0;i<10;i++)
a[i]=i;
cudaPrintfInit();
cudaMalloc((void**)&d_a,10*sizeof(int));
cudaMemcpy(d_a,a,10*sizeof(int),cudaMemcpyHostToDevice);
myKernel<<<1,10>>>(d_a);
cudaPrintfDisplay(stdout, true);
cudaMemcpy(a,d_a,10*sizeof(int),cudaMemcpyDeviceToHost);
cudaPrintfEnd();
cudaFree(d_a);
}
The code compiles properly, but the kernel appears not to launch... no message is printed from the device side. What should I do to resolve this?
Given that in your comments you say you are getting "No CUDA-capable device", that implies that either you do not have a CUDA-capable GPU or that you do not have the correct driver installed. Given that you say you have both, I suggest you try reinstalling your driver to check.
Some other notes:
Are you trying to do this through Remote Desktop? That won't work: with RDP, Microsoft uses a dummy display device in order to forward the display remotely. Tesla GPUs support TCC, which allows RDP to work by making the GPU behave as a non-display device, but with display GPUs like GeForce this is not possible. Either run at the console, or log in at the console and use VNC.
Also try running the deviceQuery SDK code sample to check that it detects your GPU and driver/runtime version correctly.
You should check all CUDA API calls for errors (a sketch of a common pattern is shown after these notes).
Call cudaDeviceSynchronize() before cudaPrintfDisplay().
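For the error-checking note above, a minimal sketch of a common pattern (the macro name is hypothetical):
#include <stdio.h>
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
    } while (0)
// Usage: CUDA_CHECK(cudaMemcpy(d_a, a, 10 * sizeof(int), cudaMemcpyHostToDevice));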