How to synchronize between CUDA kernel functions?

I have two CUDA kernel functions like this:
a<<<BLK_SIZE,THR_SIZE>>>(params,...);
b<<<BLK_SIZE,THR_SIZE>>>(params,...);
After kernel a has started, I want to wait until a finishes and then start kernel b,
so I inserted cudaThreadSynchronize() between a and b like this:
a<<<BLK_SIZE,THR_SIZE>>>(params,...);
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("cudaThreadSynchronize error: %s\n", cudaGetErrorString(err));
b<<<BLK_SIZE,THR_SIZE>>>(params,...);
but cudaThreadSynchronize() returns the error "the launch timed out and was terminated".
How can I fix it?
A simple code explanation:
mmap(sequence file);
mmap(reference file);
cudaMemcpy(seq_cuda, sequence);
cudaMemcpy(ref_cuda,reference);
kernel<<<>>>(params); //find short sequence in reference
cudaThreadSynchronize();
kernel<<<>>>(params);
cudaMemcpy(result, result_cuda);
report result
In the kernel function there is a big for loop containing some if-else branches, which the pattern-matching algorithm uses to reduce the number of comparisons.

This launch error means that something went wrong when your first kernel was launched, or maybe even something before that. To work your way out of this, try checking the return value of every CUDA runtime call for errors. Also, do a cudaThreadSynchronize() followed by an error check after every kernel call. This should help you find the first place where the error occurs.
If it is indeed a launch failure, you will need to investigate the execution configuration and the code of the kernel to find the cause of your error.
Finally, note that it is highly unlikely that adding the cudaThreadSynchronize() call caused this error. I say this because the way you have worded the query points to cudaThreadSynchronize() as the culprit. All this call did was catch your existing error earlier.
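For reference, a minimal sketch of this error-checking pattern (using cudaDeviceSynchronize(), the modern equivalent of cudaThreadSynchronize(); the CHECK_CUDA macro and the placeholder kernels are only illustrative, not the asker's code) could look like this:

#include <cstdio>
#include <cstdlib>

// Illustrative helper: check the result of a CUDA runtime call.
#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err__ = (call);                                        \
        if (err__ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err__), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Placeholder kernels standing in for a and b.
__global__ void a(int *data) { data[threadIdx.x] += 1; }
__global__ void b(int *data) { data[threadIdx.x] *= 2; }

int main() {
    int *d_data;
    CHECK_CUDA(cudaMalloc(&d_data, 32 * sizeof(int)));
    CHECK_CUDA(cudaMemset(d_data, 0, 32 * sizeof(int)));

    a<<<1, 32>>>(d_data);
    CHECK_CUDA(cudaGetLastError());       // catches invalid launch configurations
    CHECK_CUDA(cudaDeviceSynchronize());  // catches errors during execution (e.g. the timeout)

    b<<<1, 32>>>(d_data);                 // b only runs after a has finished
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaDeviceSynchronize());

    CHECK_CUDA(cudaFree(d_data));
    return 0;
}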

Related

CUDA: invalid device ordinal

I have the following problem. I want to allow my users to choose which GPU to run on. So I was testing on my machine which has only one GPU (device 0) what would happen if they choose a device which doesn't exist.
If I do cudaSetDevice(0); it will work fine.
If I do: cudaSetDevice(1); it will crash with invalid device ordinal (I can handle this as the function returns an error).
If I do: cudaSetDevice(0); cudaSetDevice(1); it will crash with invalid device ordinal (I can handle this as the function returns an error).
However! If I do: cudaSetDevice(1); cudaSetDevice(0); the second command returns success, but the first calculation I try to run on my GPU crashes with invalid device ordinal. I cannot handle this because the second command does not return an error!
It seems to me like the first cudaSetDevice leaves something lying around which affects the second command?
Thanks very much!
Solution: (Thanks to Robert Crovella!).
I was handling the errors like:
error = cudaSetDevice(1);
if (error) { blabla }
But apparently you need to call cudaGetLastError() after the failed cudaSetDevice(1), because otherwise the error is not removed from the error state and the program just crashes later on, where I was calling cudaGetLastError() for another function even though there was no error at that point.
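A minimal sketch of that pattern (the device numbers here are just for illustration) might be:

#include <cstdio>

int main() {
    // Try a device that may not exist, and handle the failure locally.
    cudaError_t err = cudaSetDevice(1);
    if (err != cudaSuccess) {
        printf("cudaSetDevice(1) failed: %s\n", cudaGetErrorString(err));
        // Clear the error state so it is not reported again by a later
        // cudaGetLastError() call that has nothing to do with this failure.
        cudaGetLastError();
    }

    // Fall back to device 0 and continue as normal.
    err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        printf("cudaSetDevice(0) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}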
You first have to check how many GPUs are available in your system. You can do this with cudaGetDeviceCount:
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
Then check whether the user input is within the range of available devices:
if (userDeviceInput < deviceCount)
{
    cudaSetDevice(userDeviceInput);
}
else
{
    printf("error: invalid device chosen\n");
}
Remember that cudaSetDevice is 0-index-based! That is why I check userDeviceInput < deviceCount.

Running zero blocks in CUDA

I have a loop like this:
while ( ... ) {
    ...
    kernel<<<blocks, threads>>>( ... );
}
In some iterations, blocks or threads has the value 0. My code runs when I do this. My question is whether this is considered bad practice, and whether there are any other bad side effects.
It's bad practice because it will interfere with proper CUDA error checking.
If you do proper error checking, your kernel launches that have all-zero values for block or grid dimensions will throw an error.
It's preferable to write error free programs for a variety of reasons.
Instead, include a test for these cases and skip the kernel launch when your dimensions are zero. The small overhead of doing this check in host code will be more than offset by the reduced API overhead from not making the spurious kernel launch request.
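A minimal sketch of such a guard (the kernel and the per-iteration sizes are placeholders) could look like this:

#include <cstdio>

__global__ void kernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

int main() {
    int n = 0;                            // imagine this varies per loop iteration
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    int *d_data = nullptr;
    cudaMalloc(&d_data, (n > 0 ? n : 1) * sizeof(int));

    // Skip the launch entirely when there is no work to do.
    if (blocks > 0 && threads > 0) {
        kernel<<<blocks, threads>>>(d_data, n);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("kernel launch failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return 0;
}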
I tried a zero-block kernel invocation by simply writing the following empty kernel.
File:
#include <stdio.h>

__global__ void fg()
{
}

int main()
{
    fg<<<0,1>>>();
}
What I noticed was that the only side effect was the time required for execution.
Run time :
real 0m0.242s,
user 0m0.004s,
sys 0m0.148s.
When I ran the same file with the kernel invocation commented out, the time overhead decreased.
Run time:
real 0m0.003s,
user 0m0.000s,
sys 0m0.000s.
This side effect arises from the kernel invocation overhead, which is incurred even for zero blocks.

The design of the inner function _S_oom_malloc in the SGI-STL allocator

The code is as follows:
template <int __inst>
void*
__malloc_alloc_template<__inst>::_S_oom_malloc(size_t __n)
{
    void (* __my_malloc_handler)();
    void* __result;

    for (;;) {
        __my_malloc_handler = __malloc_alloc_oom_handler;
        if (0 == __my_malloc_handler) { __THROW_BAD_ALLOC; }
        (*__my_malloc_handler)();
        __result = malloc(__n);
        if (__result) return(__result);
    }
}
I have two questions.
1. Why does _S_oom_malloc use an infinite loop?
2. As we know, _S_oom_malloc is called when malloc fails in the __malloc_alloc_template::allocate function. Why does it then use malloc again to allocate space?
Can anyone help me? Thanks a lot.
First, the loop is not truly infinite. There are two exits: by throwing a BAD_ALLOC exception and by allocating the requested amount of memory. The exception will be thrown when the current new-handler is a null pointer.
To understand how that can happen, consult e.g. Item 49 from Effective C++. Essentially, any new-handler can either:
Make more memory available
Install a different new-handler
Deinstall the new-handler (i.e. pass a null pointer to set_new_handler)
Throw an exception
Abort or exit
Second, the reason that it uses the C library's malloc to allocate space is that malloc on most systems is a well-tested and efficiently implemented function. The Standard Library's new functions are "simply" exception-safe and type-safe wrappers around it (which you as a user can also override, should you want that).
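To make that concrete, here is a small standalone sketch (the reserve-pool handler is purely illustrative and is not part of the SGI allocator) showing how a new-handler can either free memory so the retried allocation may succeed, or deinstall itself so the retry loop ends in a std::bad_alloc:

#include <cstdlib>
#include <iostream>
#include <limits>
#include <new>

// Illustrative reserve pool that the handler can release under memory pressure.
static char *reserve = static_cast<char *>(std::malloc(64 * 1024 * 1024));

void out_of_memory_handler() {
    if (reserve) {
        // Make more memory available so the retried allocation may succeed.
        std::free(reserve);
        reserve = nullptr;
    } else {
        // Deinstall the handler; with a null handler the next failed
        // attempt throws std::bad_alloc instead of looping forever.
        std::set_new_handler(nullptr);
    }
}

int main() {
    std::set_new_handler(out_of_memory_handler);
    try {
        // Like _S_oom_malloc, operator new calls the handler and retries in a loop.
        std::size_t huge = std::numeric_limits<std::size_t>::max() / 2;
        void *p = ::operator new(huge);  // deliberately impossible request
        ::operator delete(p);
    } catch (const std::bad_alloc &) {
        std::cout << "allocation failed after the handler gave up\n";
    }
    return 0;
}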

CUDA Kernels Randomly Fail, but only when I use certain transcendental functions

I've been working on a CUDA program that randomly crashes with an unspecified launch failure, fairly frequently. Through careful debugging, I localized which kernel was failing, and furthermore that the failure occurred only if certain transcendental functions were called from within the CUDA kernel (e.g. sinf() or atanhf()).
This led me to write a much simpler program (see below) to confirm that these transcendental functions really were causing an issue, and it looks like that is indeed the case. When I compile and run the code below, which just makes repeated calls to a kernel that uses tanh and atanh, sometimes the program works and sometimes it prints Error with Kernel along with a message from the driver that says:
NVRM: XiD (0000:01:00): 13, 0002 000000 000050c0 00000368 00000000 0000080
With regards to frequency, it probably crashes 50% of the time that I run the executable.
From what I've read online, it sounds like XiD 13 is analogous to a host-based seg fault. However, given the array indexing, I can't see how that could be the case. Furthermore the program doesn't crash if I replace the transcendental functions in the kernel with other functions (e.g. repeated floating point subtraction and addition). That is, I don't get the XiD error message, and the program ultimately returns the correct value of atanh(0.7).
I'm running cuda-5.0 on Ubuntu 11.10 x64 Desktop. Driver version is 304.54, and I'm using a GeForce 9800 GTX.
I'm inclined to say that this is a hardware issue or a driver bug. What's strange is that the example applications from nvidia work fine, perhaps because they do not use the affected transcendental functions.
The final bit of potentially important information is that if I run either my main project, or this test program under cuda-memcheck, it reports no errors, and never crashes. Honestly, I'd just run my project under cuda-memcheck, but the performance hit makes it impractical.
Thanks in advance for any help/insight here. If any one has a 9800 GTX and would be willing to run this code to see if it works, it would be greatly appreciated.
#include <iostream>
#include <stdlib.h>
using namespace std;

__global__ void test_trans (float *a, int length) {
    if ((threadIdx.x + blockDim.x*blockIdx.x) < length) {
        float temp=0.7;
        for (int i=0;i<100;i++) {
            temp=atanh(temp);
            temp=tanh(temp);
        }
        a[threadIdx.x + blockDim.x*blockIdx.x] = atanh(temp);
    }
}

int main () {
    float *array_dev;
    float *array_host;
    unsigned int size=10000000;
    if (cudaSuccess != cudaMalloc ((void**)&array_dev, size*sizeof(float)) ) {
        cerr << "Error with memory Allocation\n"; exit (-1);
    }
    array_host = new float [size];
    for (int i=0;i<10;i++) {
        test_trans <<< size/512+1, 512 >>> (array_dev, size);
        if (cudaSuccess != cudaDeviceSynchronize()) {
            cerr << "Error with kernel\n"; exit (-1);
        }
    }
    cudaMemcpy (array_host, array_dev, sizeof(float)*size, cudaMemcpyDeviceToHost);
    cout << array_host[size-1] << "\n";
}
Edit: I dropped this project for a few months, but yesterday upon updating to driver version 319.23, I'm no longer having this problem. I think the issue I described must have been a bug that was fixed. Hope this helps.
The asker determined that this was a temporary issue fixed by a newer CUDA release. See the edit to the original question.

Error recording time with CudaEvent

I am using cudaEvent methods to find the time my kernel takes to execute. Here is the code as given in the manual:
cudaEvent_t start,stop;
float time=0;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
subsampler<<<gridSize,blockSize>>>(img_redd,img_greend,img_blued,img_height,img_width,final_device_r,final_device_g,final_device_b);
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop);
Now when I run this and look at the time, it comes out to something like 52428800.0000 (values differ but are of this order). I know it is in milliseconds, but this is still a huge number, especially since the program execution doesn't take more than a minute. Can someone point out why this is happening? I really need to find out how much time the kernel takes to execute.
You should check the return values of each of those CUDA calls. At the very least call cudaGetLastError() at the end to check everything was successful.
If you get an error during the kernel execution then try running your app with cuda-memcheck, especially if you have an Unspecified Launch Failure, to check for illegal memory accesses.
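For reference, a minimal sketch of event timing with the return values checked (the dummy kernel, the check helper, and the sizes are placeholders, not the original subsampler code) might look like this:

#include <cstdio>
#include <cstdlib>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

// Illustrative helper: report and abort on any CUDA runtime error.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    check(cudaMalloc(&d_data, n * sizeof(float)), "cudaMalloc");

    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "cudaEventCreate(start)");
    check(cudaEventCreate(&stop), "cudaEventCreate(stop)");

    check(cudaEventRecord(start, 0), "cudaEventRecord(start)");
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaEventRecord(stop, 0), "cudaEventRecord(stop)");
    check(cudaEventSynchronize(stop), "cudaEventSynchronize");

    float ms = 0.0f;
    check(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime");
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}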