Get rid of busy waiting during asynchronous cuda stream executions - cuda

I looking for a way how to get rid of busy waiting in host thread in fallowing code (do not copy that code, it only shows an idea of my problem, it has many basic bugs):
cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
while (true) {
if (cudaStreamQuery(streams[sid])) == cudaSuccess) { //BUSY WAITING !!!!
cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
break;
}
sid = ++sid % S_N;
}
}
Is there a way to idle host thread and wait somehow to some stream to finish, and then prepare and run another stream?
EDIT: I added while(true) into the code, to emphasize busy waiting. Now I execute all the streams, and check which of them finished to run another new one. cudaStreamSynchronize waits for particular stream to finish, but I want to wait for any of the streams which as a first finished the job.
EDIT2: I got rid of busy-waiting in fallowing way:
cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
sid = ++sid % S_N;
}
for (int i = 0; i < S_N; i++) {
cudaStreamSynchronize(streams[i]);
cudaStreamDestroy(streams[i]);
}
But it appears to be a little bit slower than the version with busy-waiting on host thread. I think it is because, now I statically distribute the jobs on streams, so when the one stream finishes work it is idle till each of the stream finishes the work. The previous version dynamically distributed the work to the first idle stream, so it was more efficient, but there was busy-waiting on the host thread.

The real answer is to use cudaThreadSynchronize to wait for all previous launches to complete, cudaStreamSynchronize to wait for all launches in a certain stream to complete, and cudaEventSynchronize to wait for only a certain event on a certain stream to be recorded.
However, you need to understand how streams and sychronization work before you will be able to use them in your code.
What happens if you do not use streams at all? Consider the following code:
kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();
The kernel is launched and the host moves on to execute host_func1 and kernel concurrently. Then, the host and the device are synchronized, ie the host waits for kernel to finish before moving on to host_func2().
Now, what if you have two different kernels?
kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);
kernel1 is launched asychronously! the host moves on, and kernel2 is launched before kernel1 finishes! however, kernel2 will not execute until after kernel1 finishes, because they have both been launched on stream 0 (the default stream). Consider the following alternative:
kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);
There is absolutely no need to do this because the device already synchronizes kernels launched on the same stream.
So, I think that the functionality that you are looking for already exists... because a kernel always waits for previous launches in the same stream to finish before starting (even though the host passes by). That is, if you want to wait for any previous launch to finish, then simply don't use streams. This code will work fine:
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
}
Now, on to streams. you can use streams to manage concurrent device execution.
Think of a stream as a queue. You can put different memcpy calls and kernel launches into different queues. Then, kernels in stream 1 and launches in stream 2 are asynchronous! They may be executed at the same time, or in any order. If you want to be sure that only one memcpy/kernel is being executed on the device at a time, then don't use streams. Similarly, if you want kernels to be executed in a specific order, then don't use streams.
That said, keep in mind that anything put into a stream 1, is executed in order, so don't bother synchronizing. Synchronization is for synchronizing host and device calls, not two different device calls. So, if you want to execute several of your kernels at the same time because they use different device memory and have no effect on each other, then use streams. Something like...
cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
sid = ++sid % S_N;
}
No explicit device synchronization necessary.

My idea to solve that problem is to have one host thread per one stream. That host thread would invoke cudaStreamSynchronize to wait till the stream commands are completed.
Unfortunately it is not possible in CUDA 3.2 since it allows only one host thread deal with one CUDA context, it means one host thread per one CUDA enabled GPU.
Hopefully, in CUDA 4.0 it will be possible: CUDA 4.0 RC news
EDIT: I have tested in CUDA 4.0 RC, using open mp. I created one host thread per cuda stream. And it started to work.

There is: cudaEventRecord(event, stream) and cudaEventSynchronize(event). The reference manual http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_Toolkit_Reference_Manual.pdf has all the details.
Edit: BTW streams are handy for concurrent execution of kernels and memory transfers. Why do you want to serialize the execution by waiting on the current stream to finish?

Instead of cudaStreamQuery, you want cudaStreamSynchronize
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
cudaStreamSynchronize(streams[sid]);
cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
sid = ++sid % S_N;
}
(You can also use cudaThreadSynchronize to wait for launches across all streams, and events with cudaEventSynchronize for more advanced host/device synchronization.)
You can further control the type of waiting that occurs with these synchronization functions. Look at the reference manual for the cudaDeviceBlockingSync flag and others. The default is probably what you want, though.

You need to copy the data-chunk and execute kernel on that data-chunk in different for loops. That'll be more efficient.
like this:
size = N*sizeof(float)/nStreams;
for (i=0; i<nStreams; i++){
offset = i*N/nStreams;
cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
}
for (i=0; i<nStreams; i++){
offset = i*N/nStreams;
kernel<<<N(nThreads*nStreams), nThreads, 0, stream[i]>>> (a_d+offset);
}
In this way the memory copy doesn't have to wait for kernel execution of previous stream and vice versa.

Related

CUDA stream is blocked when launching many kernels (>1000)

I found that CUDA stream will block when I launch lots of kernels (more than 1000). I am wondering is there any configuration that I can change?
In my experiments, I launch a small kernel 10000 times. This kernel ran shortly (about 190us). The kernel launched very fast when launching the first 1000 kernels. It takes 4~5us to launch a kernel. But after that, The launch process becomes slow. It takes about 190us to launch a new kernel. The CUDA stream seems to wait for the previous kernel complete and the buffer size is about 1000 kernel.
When I created 3 streams, each stream can launch 1000 kernel asynchrony.
I want to make this buffer bigger. I try to set cudaLimitDevRuntimePendingLaunchCount, but it does not work. Is there any way?
#include <stdio.h>
#include "cuda_runtime.h"
#define CUDACHECK(cmd) do { \
cudaError_t e = cmd; \
if (e != cudaSuccess) { \
printf("Failed: Cuda error %s:%d '%s'\n", \
__FILE__,__LINE__,cudaGetErrorString(e)); \
exit(EXIT_FAILURE); \
} \
} while (0)
// a dummy kernel for test
__global__ void add(float *a, int n) {
int id = threadIdx.x + blockIdx.x * blockDim.x;
for (int i = 0; i < n; i++) {
a[id] = sqrt(a[id] + 1);
}
}
int main(int argc, char* argv[])
{
// managing 1 devices
int nDev = 1;
int nStream = 1;
int size = 32*1024*1024;
// allocating and initializing device buffers
float** buffer = (float**)malloc(nDev * sizeof(float*));
cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev*nStream);
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
// CUDACHECK(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 10000));
CUDACHECK(cudaMalloc(buffer + i, size * sizeof(float)));
CUDACHECK(cudaMemset(buffer[i], 1, size * sizeof(float)));
for (int j = 0; j < nStream; j++) {
CUDACHECK(cudaStreamCreate(s+i*nStream+j));
}
}
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
for (int j=0; j < 10000; j++) {
for (int k=0; k < nStream; k++) {
add<<<32, 1024, 0, s[i*nStream+k]>>>(buffer[i], 1000);
}
}
}
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
cudaDeviceSynchronize();
}
// free device buffers
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
CUDACHECK(cudaFree(buffer[i]));
}
printf("Success \n");
return 0;
}
Here is the nvprof results:
When I create 3 streams, the first 3000 kernel launched quickly and then become slow
When I create 1 streams, the first 1000 kernel launched quickly and then become slow
The behavior you are witnessing is expected behavior. If you search on the cuda tag for "queue" or "launch queue" you will find many other questions that refer to it. CUDA has a queue (apparently per-stream) that kernel launches go into. As long as the outstanding launch count is less than the queue depth, the launch process will be asynchronous.
However when the outstanding (i.e. uncompleted) launches exceed the queue depth, the launch process changes to a kind of synchronous behavior (although not synchronous in the usual sense). Specifically, when the outstanding number of kernel launches exceeds the queue depth, the launch process will block the CPU thread that is performing the next launch, until a launch slot opens in the queue (effectively means a kernel has retired at the other end of the queue).
You have no visibility into this (no way to query the number of slots open in the queue) nor any way to view or control the queue depth. Most of the information I'm reciting here is obtained by inspection; it is not formally published in CUDA documentation that I am aware of.
As already discussed in the comments, one possible approach to alleviate your concern around launches in a multi-device scenario is to launch breadth-first rather than depth-first. By this I mean that you should modify your launch loops so that you launch a kernel to device 0, then device 1, then device 2, etc. before launching the next kernel on device 0. This will give you the optimum performance in the sense that all GPUs will be engaged with processing, as early as possible in the launch sequence.
If you'd like to see changes in CUDA behavior or documentation, the general suggestion is to become a registered developer at developer.nvidia.com, then log into your account there and file a bug, using the bug filing process accessible by clicking on your account name in the upper right hand corner.

CUDA streams performance

I am currently learning CUDA streams through the computation of a dot product between two vectors. The ingredients are a kernel function that takes in vectors x and y and returns a vector result of size equal to the number of blocks, where each block contributes its own reduced sum.
I also have a host function dot_gpu that calls the kernel and reduces the vector result to the final dot product value.
The synchronous version does just this:
// copy to device
copy_to_device<double>(x_h, x_d, n);
copy_to_device<double>(y_h, y_d, n);
// kernel
double result = dot_gpu(x_d, y_d, n, blockNum, blockSize);
while the async one goes like:
double result[numChunks];
for (int i = 0; i < numChunks; i++) {
int offset = i * chunkSize;
// copy to device
copy_to_device_async<double>(x_h+offset, x_d+offset, chunkSize, stream[i]);
copy_to_device_async<double>(y_h+offset, y_d+offset, chunkSize, stream[i]);
// kernel
result[i] = dot_gpu(x_d+offset, y_d+offset, chunkSize, blockNum, blockSize, stream[i]);
}
for (int i = 0; i < numChunks; i++) {
finalResult += result[i];
cudaStreamDestroy(stream[i]);
}
I am getting worse performance when using streams and was trying to investigate the reasons. I tried to pipeline the downloads, kernel calls and uploads, but with no results.
// accumulate the result of each block into a single value
double dot_gpu(const double *x, const double* y, int n, int blockNum, int blockSize, cudaStream_t stream=NULL)
{
double* result = malloc_device<double>(blockNum);
dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream>>>(x, y, result, n);
#if ASYNC
double* r = malloc_host_pinned<double>(blockNum);
copy_to_host_async<double>(result, r, blockNum, stream);
CudaEvent copyResult;
copyResult.record(stream);
copyResult.wait();
#else
double* r = malloc_host<double>(blockNum);
copy_to_host<double>(result, r, blockNum);
#endif
double dotProduct = 0.0;
for (int i = 0; i < blockNum; i ++) {
dotProduct += r[i];
}
cudaFree(result);
#if ASYNC
cudaFreeHost(r);
#else
free(r);
#endif
return dotProduct;
}
My guess is that the problem is inside the dot_gpu() functions that doesn't only call the kernel. Tell me if I understand correctly the following stream executions
foreach stream {
cudaMemcpyAsync( device[stream], host[stream], ... stream );
LaunchKernel<<<...stream>>>( ... );
cudaMemcpyAsync( host[stream], device[stream], ... stream );
}
The host executes all the three instructions without being blocked, since cudaMemcpyAsync and kernel return immediately (however on the GPU they will execute sequentially as they are assigned to the same stream). So host goes on to the next stream (even if stream1 who knows what stage it is at, but who cares.. it's doing his job on the GPU, right?) and executes the three instructions again without being blocked.. and so on and so forth. However, my code blocks the host before it can process the next stream, somewhere inside the dot_gpu() function. Is it because I am allocating & freeing stuff, as well as reducing the array returned by the kernel to a single value?
Assuming your objectified CUDA interface does what the function and method names suggest, there are three reasons why work from subsequent calls to dot_gpu() might not overlap:
Your code explicitly blocks by recording an event and waiting for it.
If it weren't blocking for 1. already, your code would block on the pinned host side allocation and deallocation, as you suspected.
If your code weren't blocking for 2. already, work from subsequent calls to dot_gpu() might still not overlap depending on compute capbility. Devices of compute capability 3.0 or lower do not reorder operations even if they are enqueued to different streams.
Even for devices of compute capability 3.5 and higher the number of streams whose operations can be reordered is limited by the CUDA_​DEVICE_​MAX_​CONNECTIONS environment variable, which defaults to 8 and can be set to values as large as 32.

Kernel Launch Failure

I'm operating on a Linux system and a Tesla C2075 machine. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step by step averaged version(time_avg) of a large data set (result). See code below.
Size of "result" and "time_avg" is same and equal to "nsamps". "time_avg" contains successive averaged sets of the array result. So, first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of 8 samples and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
__shared__ float temp[1024];
int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
int start = 0, index;
unsigned int npts = *nsamps;
printf("here here\n");
// Store chunk of memory=2*blockDim.x (which is to be reduced) into shared memory
if ( (2*gtid) < npts ){
temp[2*ltid] = result[2*gtid];
temp[2*ltid+1] = result[2*gtid + 1];
}
for (stride=1; stride<blockDim.x; stride>>=1) {
__syncthreads();
if (ltid % (stride*2) == 0){
if ( (2*gtid) < npts ){
temp[2*ltid] += temp[2*ltid + stride];
index = (int)(start + gtid/stride);
time_avg[index] = (float)( temp[2*ltid]/(2.0*stride) );
}
}
start += npts/(2*stride);
}
__syncthreads();
if (ltid == 0)
{
atomicAdd(mean, temp[0]);
}
__syncthreads();
printf("%f\n", *mean);
}
Launch configuration is 40 blocks, 512 threads. Data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I write cudaDeviceSynchronize() (or a cudaMemcpy to check for the value of mean) after the kernel call, the program hangs completely after the kernel call. If I remove it, program runs and exits. In neither case, do I get the outputs "here here" or the mean value printed. I understand that unless the kernel executes successfully, the printf's won't print.
Has this got to do with __syncthreads() in a recursion? All threads will go till the same depth so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous, if the kernel starts successfully your host code will continue to run and you will see no error. Errors that happen during the kernel run appear only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization than your kernel probably didn't finished running - it is either running some infinite loop or it is waiting on some __synchthreads() or some other synchronization primitive.
Your code seems to contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left not right: stride<<=1.
You mentioned recursion but your code contains only one __global__ function, there are no recursive calls.
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {

Are cudaMalloc and cudaFree synchronous or asynchronous calls?

I want to test whether cudaMalloc and cudaFree are synchronous calls, so I did some modification to the "simpleMultiGPU.cu" sample code in CUDA SDK. Following is the part I changed (the added lines are not indented):
float *dd[GPU_N];;
for (i = 0; i < GPU_N; i++){cudaSetDevice(i); cudaMalloc((void**)&dd[i], sizeof(float));}
//Start timing and compute on GPU(s)
printf("Computing with %d GPUs...\n", GPU_N);
StartTimer();
//Copy data to GPU, launch the kernel and copy data back. All asynchronously
for (i = 0; i < GPU_N; i++)
{
//Set device
checkCudaErrors(cudaSetDevice(i));
//Copy input data from CPU
checkCudaErrors(cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data, plan[i].dataN * sizeof(float), cudaMemcpyHostToDevice, plan[i].stream));
//Perform GPU computations
reduceKernel<<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);
getLastCudaError("reduceKernel() execution failed.\n");
//Read back GPU results
checkCudaErrors(cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum, ACCUM_N *sizeof(float), cudaMemcpyDeviceToHost, plan[i].stream));
cudaMalloc((void**)&dd[i],sizeof(float));
cudaFree(dd[i]);
//cudaStreamSynchronize(plan[i].stream);
}
By commenting out the cudaMalloc line and cudaFree line respectively in the large loop, I found that for a 2-GPU system, the GPU processing time are 30 milliseconds and 20 milliseconds respectively, so I concluded that cudaMalloc is an asynchronous call and cudaFree is a synchronous call. Not sure this is true or not, and not sure what is the designing concern of the CUDA architecture.
My computation capability is 2.0, and I tried both cuda4.0 and cuda5.0.
Both functions are synchronous.

synchronizing device memory access with host thread

Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, it doesn't seem that the kernel waits for the writes to device-mapped memory to complete before terminating because examining the page-locked host memory immediately after the kernel launch does not show any modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):
#include <stdio.h>
#include <cuda.h>
__global__ void func(int *a, int N) {
int idx = threadIdx.x;
if (idx < N) {
a[idx] *= -1;
__threadfence_system();
}
}
int main(void) {
int *a, *a_gpu;
const int N = 8;
size_t size = N*sizeof(int);
cudaSetDeviceFlags(cudaDeviceMapHost);
cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);
for (int i = 0; i < N; i++) {
a[i] = i;
}
for (int i = 0; i < N; i++) {
printf("%i ", a[i]);
}
printf("\n");
func<<<1, N>>>(a_gpu, N);
// cudaDeviceSynchronize();
for (int i = 0; i < N; i++) {
printf("%i ", a[i]);
}
printf("\n");
cudaFreeHost(a);
}
I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).
A kernel launch will immediately return to the host code before any kernel activity has occurred. Kernel execution is in this way asynchronous to host execution and does not block host execution. So it's no surprise that you have to wait a bit or else use a barrier (like cudaDeviceSynchronize()) to see the results of the kernel.
As described here:
In order to facilitate concurrent execution between host and device,
some function calls are asynchronous: Control is returned to the host
thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
This is all intentional of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution as you've already discovered is to insert a barrier. If your kernel is producing data which you will immediately copy back to the host, you don't need a separate barrier. The cudaMemcpy call after the kernel will wait until the kernel is completed before it begins it's copy operation.
I guess to answer your question, you are wanting kernel launches to be synchronous without you having even to use a barrier (why do you want to do this? Is adding the cudaDeviceSynchronize() call a problem?) It's possible to do this:
"Programmers can globally disable asynchronous kernel launches for all
CUDA applications running on a system by setting the
CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is
provided for debugging purposes only and should never be used as a way
to make production software run reliably. "
If you want this synchronous behavior, it's better just to use the barriers (or depend on another subsequent cuda call, like cudaMemcpy). If you use the above method and depend on it, your code will break as soon as somebody else tries to run it without the environment variable set. So it's really not a good idea.