CUDA kernel not launching - cuda

I'm using a GeForce 9800 GX2. I installed the drivers and the CUDA SDK, and I wrote a simple program which looks like this:
__global__ void myKernel(int *d_a)
{
    int tx = threadIdx.x;
    d_a[tx] += 1;
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    int *a = (int*)malloc(sizeof(int)*10);
    int *d_a;
    int i;
    for (i = 0; i < 10; i++)
        a[i] = i;
    cudaPrintfInit();
    cudaMalloc((void**)&d_a, 10*sizeof(int));
    cudaMemcpy(d_a, a, 10*sizeof(int), cudaMemcpyHostToDevice);
    myKernel<<<1,10>>>(d_a);
    cudaPrintfDisplay(stdout, true);
    cudaMemcpy(a, d_a, 10*sizeof(int), cudaMemcpyDeviceToHost);
    cudaPrintfEnd();
    cudaFree(d_a);
}
The code compiles properly, but the kernel appears not to launch... No message is printed from the device side. What should I do to resolve this?

Given that in your comments you say you are getting "No CUDA-capable device", either you do not have a CUDA-capable GPU or you do not have the correct driver installed. Since you say you have both, I suggest you try reinstalling your driver to check.
Some other notes:
Are you trying to do this through Remote Desktop? That won't work: with RDP, Microsoft uses a dummy display device in order to forward the display remotely. Tesla GPUs support TCC, which allows RDP to work by making the GPU behave as a non-display device, but with display GPUs like GeForce this is not possible. Either run at the console, or log in at the console and use VNC.
Also try running the deviceQuery SDK code sample to check that it detects your GPU and driver/runtime version correctly.
You should check all CUDA API calls for errors.
Call cudaDeviceSynchronize() before cudaPrintfDisplay().
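For example, a minimal sketch of that pattern applied to the code above (the CHECK macro here is just illustrative, not part of the CUDA API):

#define CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

CHECK(cudaMalloc((void**)&d_a, 10*sizeof(int)));
CHECK(cudaMemcpy(d_a, a, 10*sizeof(int), cudaMemcpyHostToDevice));
myKernel<<<1,10>>>(d_a);
CHECK(cudaGetLastError());        // reports launch errors such as "no CUDA-capable device"
CHECK(cudaDeviceSynchronize());   // wait for the kernel to finish before flushing cuPrintf output
cudaPrintfDisplay(stdout, true);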

Related

Get device properties in CUDA 6.5

I am writing a program that gets and displays all the information (properties) about the GPU device in CUDA 6.5 (C++). But when I run it, it does not show the device name as I expect, and the maximum number of threads per block is reported as 1.
I am using an ASUS EN9400GT GPU.
The ASUS EN9400GT uses a GeForce 9400GT, whose compute capability is 1.0. CUDA 6.5 dropped support for cc 1.0, so your code won't work. You should use CUDA 6.0 for cc 1.0 devices (link).
You could have found this out yourself if you had used correct error checking for the CUDA APIs. When checking the return value of a CUDA API call, you should compare it with cudaSuccess, not with an arbitrary integer value. If you had compared GPUAvail with cudaSuccess like this:
if (GPUAvail != cudaSuccess)
    exit(EXIT_FAILURE);
then your program would have stopped. See this article for a proper error checking method.
Also, check out the deviceQuery CUDA sample code. That sample does what you are trying to do.
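For illustration, a rough sketch of that check, assuming GPUAvail holds the return value of a runtime call such as cudaGetDeviceProperties (your actual code may differ):

cudaDeviceProp prop;
cudaError_t GPUAvail = cudaGetDeviceProperties(&prop, 0);
if (GPUAvail != cudaSuccess)
{
    fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(GPUAvail));
    exit(EXIT_FAILURE);
}
printf("Device: %s, max threads per block: %d\n", prop.name, prop.maxThreadsPerBlock);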

CUDA - the launch timed out and was terminated - Ubuntu and no display

I am using a workstation containing 4 GeForce GTX Titan Black cards for CUDA development. I am working on Ubuntu 12.04.5, and none of these GPUs is used for display. I notice using cudaGetDeviceProperties that kernel execution timeout is enabled. Does this apply when I am not on Windows and not using a display?
I put the following code to test this in one of my kernels which normally runs fine:
__global__ void update1(double *alpha_out, const double *sDotZ, const double *rho, double *minusAlpha_out, clock_t *global_now)
{
    clock_t start = clock();
    clock_t now;
    for (;;) {
        now = clock();
        clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
        if (cycles >= 50000000000) {
            break;
        }
    }
    *global_now = now;
}
The kernel launch looks like:
update1<<<1, 1>>>(d_alpha + idx, d_tmp, d_rho + idx, d_tmp, global_now);
CudaCheckError();
cudaDeviceSynchronize();
For a large enough number of cycles waiting, I see the error:
CudaCheckError() with sync failed at /home/.../xxx.cu:295:
the launch timed out and was terminated
It runs fine for a small number of cycles. If I run the same code on a Tesla K20m GPU with kernel execution timeout disabled, I do not see this error and the program runs as normal. If I see this error, does it definitely mean I am hitting the kernel time limit that appears to be enabled, or could there be something else wrong with my code? All mentions of this problem seem to be from people using Windows or also using their card for display, so how is it possible that I am seeing this error?
Linux has a display watchdog as well. On Ubuntu, in my experience, it is active for display devices that are configured via xorg.conf (e.g. /etc/X11/xorg.conf, but the exact configuration method will vary by distro and version).
So yes, it is possible to see the kernel execution timeout error on Linux.
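For reference, the flag the questioner mentions can be read back like this (minimal sketch; device 0 assumed, error checking omitted):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);

A value of 1 means the watchdog applies to that device.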
In general, you can work around it in several ways, but since you have multiple GPUs, the best approach is to remove the GPUs that you want to do compute tasks on, from your display configuration (e.g. xorg.conf or whatever), and then run your compute tasks on those. Once X is not configured to use a particular GPU, that GPU won't have any watchdog associated with it.
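As an illustration only (the Identifier and BusID values below are hypothetical; use the PCI bus ID of your actual display card), an xorg.conf that lists just the display GPU leaves the remaining GPUs out of X's view and therefore without a watchdog:

Section "Device"
    Identifier "DisplayGPU"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "DisplayGPU"
EndSection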
Additional specific details are given here.
If you were to reinstall things, another approach that generally works to keep your compute GPUs out of the display path is to install the Linux OS with those GPUs not plugged into the system. After things are configured the way you want display-wise, add the compute GPUs to the system and then install the CUDA toolkit. You will want to install the display driver manually instead of letting the toolkit installer do it, and deselect the option to have the display driver installer modify xorg.conf. This will similarly get your GPUs configured for compute usage but keep them out of the display path.

Does it take a very long time to load a CUDA program on a Tesla C1060?

I run my CUDA code on a Linux server (RHEL 5.3 / Tesla C1060 / CUDA 2.3), but it is much slower than I expect. However, the timings from the CUDA profiler look fast enough. So it seems that a very long time is spent loading the program, and that time isn't profiled. Am I right?
I used the following code to test whether I'm right:
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

#define B 1
#define T 1

__global__ void test()
{
}

int main()
{
    clock_t start = clock();
    cudaSetDevice(0);
    test<<<B,T>>>();
    clock_t end = clock();
    printf("time:%dms\n", end - start);
}
and used the "time" command, as well as the clock() function in the code, to measure the time:
nvcc -o test test.cu
time ./test
The result is:
time:4s
real 0m3.311s
user 0m0.005s
sys 0m2.837s
On my own PC (Win 8 / CUDA 5.5 / GT 720M), the same code runs much faster.
The Linux CUDA driver of that era (probably the 185 series, IIRC) had a "feature" whereby the driver would unload several internal driver components whenever there was no client connected to the driver. With display GPUs, where X11 was active at all times, this was rarely a problem, but for compute GPUs it led to large latency on first application runs while the driver reinitialised itself, and to loss of device settings such as compute-exclusive mode, fan speed, etc.
The normal solution was to run the nvidia-smi utility in daemon mode; it acts as a client and stops the driver from deinitialising. Something like this:
nvidia-smi --loop-continuously --interval=60 --filename=/var/log/nvidia-smi.log &
Run as root, this should solve the problem.
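For what it's worth, on later driver releases the same effect is usually achieved by enabling persistence mode instead, e.g.
nvidia-smi -pm 1
run as root, or by running the nvidia-persistenced daemon.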

assert() in CUDA 5.5

I've just upgraded from CUDA 5.0 to 5.5, and all my VS2012 CUDA projects have stopped compiling due to a problem with assert(). To reproduce the problem, I created a new CUDA 5.5 project in VS 2012, added the code straight from the Programming Guide, and got the same error.
__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;

    // This will have no effect
    assert(is_one);

    // This will halt kernel execution
    assert(should_be_one);
}
This produces the following compiler error:
kernel.cu(22): error : calling a __host__ function("_wassert") from a __global__ function("testAssert") is not allowed
Is there something obvious that I'm missing?
Make sure you are including assert.h, and make sure you are targeting sm_20 or later. Also check that you're not including any Windows headers; if you are, try without them.
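As a quick sketch of a version that should compile (the file name and asserted expression are arbitrary):

#include <assert.h>

__global__ void testAssert(void)
{
    // Device-side assert requires compute capability 2.0 or newer
    assert(threadIdx.x < 32);
}

built with something like:
nvcc -arch=sm_20 kernel.cu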

CUDA kernel doesn't launch

My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking is done through makefiles with a lot of flags. I think the problem is in the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu
with a program such as this (to run on a Linux machine):
__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, it returns no error. Any suggestion on why the kernel doesn't launch ?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:
- Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
- Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
- Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
- Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
- Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which will be supported on all CUDA devices. If it still doesn't run, do a build clean and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
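For example, to target Fermi explicitly with matching virtual and real architectures you could use something like:
nvcc -gencode arch=compute_20,code=sm_20 -o test test.cu
while leaving the architecture options out entirely falls back to the toolkit's default (compute_10/sm_10 on toolkits of that era).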
In Visual Studio:
Right click on your project > Properties > CUDA C/C++ > Device
and add the following to the Code Generation field:
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
Generating code for all of these architectures makes compilation slower and the binary larger, so eliminate them one by one to find which compute/sm pair your GPU requires.
But if you are shipping this to others, it is better to include all of them.
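To find out which pair your own GPU actually needs, you can query its compute capability directly (small sketch; device 0 assumed, error checking omitted):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Compute capability: %d.%d\n", prop.major, prop.minor);   // e.g. 6.1 means compute_61,sm_61

or just run the deviceQuery sample, which prints the same information.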