CUDA kernel doesn't launch

My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is fine, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In those projects, compilation and linking are done through makefiles with a lot of flags. I think my problem is finding the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu
with a program like this (to run on a Linux machine):
#include <stdio.h>
#include "cuPrintf.cu"   // cuPrintf ships with the CUDA SDK samples

__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
    return 0;
}
The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestions on why the kernel doesn't launch?

The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA C Programming Guide (5.0) explains this:
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:
- Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
- Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
- Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
- Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
- Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    cudaDeviceSynchronize();
}

Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which is supported on all CUDA devices. If it still doesn't run, do a clean build and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
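For example, to spell out that distinction explicitly (a sketch; test.cu is just a placeholder file name), a single -gencode option names both the virtual architecture the PTX is built for and the real architecture the machine code is built for:
nvcc -gencode arch=compute_20,code=sm_20 test.cu -o test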

In Visual Studio:
Right click on your project > Properties > CUDA C/C++ > Device
and add the following to the Code Generation field:
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
Generating code for all of these architectures makes compilation slower and the binary larger. So eliminate them one by one to find which compute and sm code generation pair is required for your GPU.
But if you are shipping this to others, it is better to include all of these.
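A quick way to find the right pair for your own GPU is to query its compute capability at runtime; a minimal sketch using the standard CUDA runtime API:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    // prop.major and prop.minor give the compute capability,
    // e.g. 7.5 means compute_75,sm_75
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}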

Related

Why does printf() work within a kernel, but using std::cout doesn't?

I have been exploring the field of parallel programming and have written basic kernels in CUDA and SYCL. I have encountered a situation where I had to print inside the kernel, and I noticed that std::cout inside the kernel does not work whereas printf works. For example, consider the following SYCL code.
This works:
void print(float* A, size_t N){
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler){
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1> idx){
            printf("%f", accessor[idx[0]]);
        });
    });
}
whereas if I replace the printf with std::cout << accessor[idx[0]], it raises a compile-time error saying: "Accessing non-const global variable is not allowed within SYCL device code."
A similar thing happens with CUDA kernels.
This got me thinking about what the difference between printf and std::cout may be that causes such behavior.
Also, suppose I wanted to implement a custom print function to be called from the GPU; how should I do it?
TIA
This got me thinking about what the difference between printf and std::cout may be that causes such behavior.
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of which is closed, if it exists at all as CUDA C). That function uses a hardware mechanism on NVIDIA GPUs - a buffer for kernel threads to print into, which gets sent back over to the host side, and the CUDA driver then forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of compiler-assisted replacement/hijacking - its code is simply irrelevant on the GPU.
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But I decided I don't really like it, and its compilation is rather expensive, so instead I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals, the only way to do this is to build on printf() calls, rather than on the C standard library or system calls as you would on the host side. You essentially need to layer your entire stream abstraction over that one low-level primitive I/O facility. It is far from trivial.
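As a toy illustration of that idea (the helper name is mine, not from any library), a custom device-side print routine can only ever bottom out in the device printf():
#include <cstdio>

// hypothetical helper: a "custom print function" that is really
// a thin wrapper over the device-side printf() primitive
__device__ void print_value(const char* label, float v)
{
    printf("%s = %f\n", label, v);
}

__global__ void demo()
{
    print_value("threadIdx.x", (float)threadIdx.x);
}
A real stream-like interface would add formatting, buffering, and type dispatch on top, but the underlying call is the same.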
In SYCL you cannot use std::cout for output in code not running on the host, for similar reasons to those listed in the answer for CUDA code.
This means that if you are running kernel code on the "device" (e.g. a GPU), then you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.
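A minimal sketch of the stream class in use (the buffer sizes here are arbitrary choices):
Queue.submit([&](handler& cgh) {
    // stream(total buffer size, max size of one statement, handler)
    stream out(1024, 256, cgh);
    cgh.parallel_for(range<1>{N}, [=](id<1> idx) {
        out << "item " << idx[0] << endl;
    });
});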
There is no __device__ version of std::cout, so only printf can be used in device code.

complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB, with several cuDoubleComplex pointers. I activated the kernel with complex double vectors (defined as gpuArray) and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I make MATLAB recognize this type?
The short answer: you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in Matlab as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
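For illustration, the first option might look like this (a sketch; the kernel body is hypothetical):
// double2 maps element-wise onto cuDoubleComplex: .x is the real part, .y the imaginary part
__global__ void mykernel(double2 *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i].y = -a[i].y;   // e.g. complex conjugation
}
Compiled with nvcc -ptx mykernel.cu, the resulting PTX file is what parallel.gpu.CUDAKernel loads.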

Does it take very long time to load CUDA program on Tesla C1060?

I ran my CUDA code on a Linux server (RHEL 5.3 / Tesla C1060 / CUDA 2.3), but it is much slower than I expected.
However, the timings from the CUDA profiler are fast enough.
So it seems that it spends a very long time loading the program, and that time isn't profiled.
Am I right?
I used the following code to test whether I'm right:
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

#define B 1
#define T 1

__global__ void test()
{
}

int main()
{
    clock_t start = clock();
    cudaSetDevice(0);
    test<<<B,T>>>();
    clock_t end = clock();
    // clock() returns ticks; convert to seconds with CLOCKS_PER_SEC
    printf("time:%.0fs\n", (double)(end - start) / CLOCKS_PER_SEC);
}
and I used the time command, as well as the clock() function in the code, to measure the time:
nvcc -o test test.cu
time ./test
the result is
time:4s
real 0m3.311s
user 0m0.005s
sys 0m2.837s
On my own PC (Windows 8 / CUDA 5.5 / GT 720M), the same code runs much faster.
The Linux CUDA driver of that era (probably the 185 series, IIRC) had a "feature" whereby the driver would unload several internal driver components whenever there was no client connected to the driver. With display GPUs where X11 was active at all times, this was rarely a problem, but for compute GPUs it led to large latency on first application runs while the driver reinitialised itself, and loss of device settings such as compute exclusive mode, fan speed, etc.
The normal solution was to run the nvidia-smi utility in daemon mode - it acts as a client and stops the driver from deinitialising. Something like this, run as root, should solve the problem:
nvidia-smi --loop-continuously --interval=60 --filename=/var/log/nvidia-smi.log &

OpenCL application with 3 different kernels

I just started with OpenCL and I want to port an app I have in CUDA. The problem I'm facing now is the kernel stuff.
In CUDA I have all my kernel functions in the same file; by contrast, OpenCL asks you to read the file with the kernel source code and then do some other stuff.
My question is: can I have one single file with all my kernel functions and then build the program in OpenCL, or do I have to have one file for each of my kernel functions?
It would be nice if you could give a little example.
The only difference between OpenCL and CUDA (in this specific regard) is that CUDA allows you to mix device code with host code in the same source file, while OpenCL requires you to load the program source as an external string and compile it at runtime.
Nevertheless, it is absolutely no problem to put many kernel functions into a single OpenCL program, or even into a single OpenCL program source string. The kernels (that is, the C API kernel objects) are then extracted from the program object using their respective function names.
Pseudocode simplifying OpenCL's ugly C interface:
single OpenCL file:
__kernel void a(...) {}
__kernel void b(...) {}
C file:
source = read_cl_file();
program = clCreateProgramWithSource(source);
clBuildProgram(program);
kernel_a = clCreateKernel(program, "a");
kernel_b = clCreateKernel(program, "b");
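For reference, a slightly fuller sketch with the real C API signatures (error checking omitted; context, device, and source are assumed to have been obtained earlier):
#include <CL/cl.h>

cl_int err;
cl_program program = clCreateProgramWithSource(context, 1,
                                               (const char**)&source, NULL, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel_a = clCreateKernel(program, "a", &err);
cl_kernel kernel_b = clCreateKernel(program, "b", &err);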

Calling a kernel from a kernel

A follow-up question to: CUDA: Calling a __device__ function from a kernel
I'm trying to speed up a sort operation. A simplified pseudo-version follows:
// some costly swap operation
__device__ void swap(float* Adata, float* Bdata)
{
    float saveData;      // swap some
    saveData = *Adata;   // big complex
    *Adata = *Bdata;     // data chunk
    *Bdata = saveData;
}

// a rather simple sort operation
__global__ void sort(float data[])
{
    for (int i = 0; i < limit; i++) {
        // find left swap point
        // find right swap point
        swap<<<1,1>>>(left, right);
    }
}
(Note: This simple version doesn't show the reduction techniques used in the blocks.)
The idea is that it is easy (fast) to identify the swap points, while the swap operation itself is costly (slow). So use one block to find/identify the swap points, and use other blocks to do the swap operations, i.e. do the actual swapping in parallel.
This sounds like a decent plan. But if the compiler inlines the device calls, then there is no parallel swapping taking place.
Is there a way to tell the compiler NOT to inline a device call?
It has been a long time since this question was asked. When I googled the same problem, I got to this page. It seems I have the solution.
Solution:
I reached [here][1] somehow and saw the cool approach of launching a kernel from within another kernel.
__global__ void kernel_child(float *var1, int N)
{
    // do data operations here
}

__global__ void kernel_parent(float *var1, int N)
{
    kernel_child<<<1,2>>>(var1, N);
}
Dynamic parallelism, available on CUDA 5.0 and later, makes this possible. Also, while compiling, make sure you target the compute_35 architecture or above.
Terminal way
You can run the above parent kernel (which will eventually run the child kernel) from the terminal. Verified on a Linux machine.
$ nvcc -arch=sm_35 -rdc=true yourFile.cu
$ ./a.out
Hope it helps. Thank you!
[1]: http://developer.download.nvidia.com/assets/cuda/docs/TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf
Edit (2016):
Dynamic parallelism was introduced with the second generation of Kepler architecture GPUs. Launching kernels from within the device is supported on compute capability 3.5 and higher devices.
Original Answer:
You will have to wait until the end of the year, when the next generation of hardware is available. No current CUDA devices can launch kernels from other kernels - it is presently unsupported.