Overlapping transfers and kernel executions in CUDA with two loops - cuda

I want to overlap data transfers and kernel executions in a form like this:
int numStreams = 3;
int size = 10;
for (int i = 0; i < size; i++) {
    cuMemcpyHtoDAsync(_bufferIn1,
                      _host_memoryIn1,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);
    cuMemcpyHtoDAsync(_bufferIn2,
                      _host_memoryIn2,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);
    cuLaunchKernel(_kernel,
                   gs.x(), gs.y(), gs.z(),
                   bs.x(), bs.y(), bs.z(),
                   _memory_size,
                   cuda_stream[i % numStreams],
                   _kernel_arguments,
                   0);
    cuEventRecord(events[i], cuda_stream[i % numStreams]);
}
for (int i = 0; i < size; i++) {
    cuEventSynchronize(events[i]);
    cuMemcpyDtoHAsync(_host_memoryOut,
                      _bufferOut,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);
}
Is overlapping possible in this case? Currently only the HtoD-transfers overlap with the kernel executions. The first DtoH-transfer is executed after the last kernel execution.

Overlapping is possible only when the operations are executed on different streams. CUDA operations within the same stream are executed sequentially, in the order the host issues them, so the device-to-host copy at the end is executed only after all preceding operations on the corresponding stream have completed. The overlap does not happen because both the last kernel and the first copy are issued to stream 0, so the copy has to wait for that kernel to finish. And since you synchronize on an event at each iteration of the second loop, the copies on the other streams (streams 1 and 2) have not even been issued yet.
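For illustration, here is a minimal sketch (reusing the question's buffers, kernel handle, and cuda_stream array) of issuing the DtoH copy in the same loop iteration, right after the kernel, so that a copy on one stream can overlap with kernels still running on other streams; only one synchronization point remains at the end:

// Sketch: issue HtoD copies, kernel, and DtoH copy back-to-back on the same
// stream, cycling through the streams; synchronize once at the very end.
for (int i = 0; i < size; i++) {
    CUstream s = cuda_stream[i % numStreams];
    cuMemcpyHtoDAsync(_bufferIn1, _host_memoryIn1, _size * sizeof(T), s);
    cuMemcpyHtoDAsync(_bufferIn2, _host_memoryIn2, _size * sizeof(T), s);
    cuLaunchKernel(_kernel,
                   gs.x(), gs.y(), gs.z(),
                   bs.x(), bs.y(), bs.z(),
                   _memory_size, s, _kernel_arguments, 0);
    cuMemcpyDtoHAsync(_host_memoryOut, _bufferOut, _size * sizeof(T), s);
}
for (int i = 0; i < numStreams; i++) {
    cuStreamSynchronize(cuda_stream[i]);   // wait for everything queued on each stream
}

Note that for real overlap the host memory must be page-locked and each in-flight iteration needs its own device buffers; with the single _bufferIn1/_bufferOut shown here the iterations would otherwise overwrite each other.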

Related

CUDA stream is blocked when launching many kernels (>1000)

I found that a CUDA stream will block when I launch lots of kernels (more than 1000). I am wondering whether there is any configuration I can change.
In my experiments, I launch a small kernel 10000 times. The kernel runs for only about 190 us. The first 1000 kernels launch very quickly: it takes 4-5 us to launch one. After that, the launch process becomes slow and takes about 190 us per kernel. The CUDA stream seems to wait for the previous kernels to complete, and the buffer size seems to be about 1000 kernels.
When I create 3 streams, each stream can launch 1000 kernels asynchronously.
I want to make this buffer bigger. I tried to set cudaLimitDevRuntimePendingLaunchCount, but it does not work. Is there any way?
#include <stdio.h>
#include "cuda_runtime.h"

#define CUDACHECK(cmd) do {                                   \
    cudaError_t e = cmd;                                      \
    if (e != cudaSuccess) {                                   \
        printf("Failed: Cuda error %s:%d '%s'\n",             \
               __FILE__, __LINE__, cudaGetErrorString(e));    \
        exit(EXIT_FAILURE);                                   \
    }                                                         \
} while (0)

// a dummy kernel for test
__global__ void add(float *a, int n) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = 0; i < n; i++) {
        a[id] = sqrt(a[id] + 1);
    }
}

int main(int argc, char* argv[])
{
    // managing 1 device
    int nDev = 1;
    int nStream = 1;
    int size = 32*1024*1024;

    // allocating and initializing device buffers
    float** buffer = (float**)malloc(nDev * sizeof(float*));
    cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev*nStream);
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        // CUDACHECK(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 10000));
        CUDACHECK(cudaMalloc(buffer + i, size * sizeof(float)));
        CUDACHECK(cudaMemset(buffer[i], 1, size * sizeof(float)));
        for (int j = 0; j < nStream; j++) {
            CUDACHECK(cudaStreamCreate(s+i*nStream+j));
        }
    }

    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        for (int j = 0; j < 10000; j++) {
            for (int k = 0; k < nStream; k++) {
                add<<<32, 1024, 0, s[i*nStream+k]>>>(buffer[i], 1000);
            }
        }
    }

    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        cudaDeviceSynchronize();
    }

    // free device buffers
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        CUDACHECK(cudaFree(buffer[i]));
    }

    printf("Success \n");
    return 0;
}
Here are the nvprof results:
When I create 3 streams, the first 3000 kernels launch quickly and then launching becomes slow.
When I create 1 stream, the first 1000 kernels launch quickly and then launching becomes slow.
The behavior you are witnessing is expected behavior. If you search on the cuda tag for "queue" or "launch queue" you will find many other questions that refer to it. CUDA has a queue (apparently per-stream) that kernel launches go into. As long as the outstanding launch count is less than the queue depth, the launch process will be asynchronous.
However, when the outstanding (i.e. uncompleted) launches exceed the queue depth, the launch process changes to a kind of synchronous behavior (although not synchronous in the usual sense). Specifically, when the number of outstanding kernel launches exceeds the queue depth, the launch process blocks the CPU thread that is performing the next launch until a launch slot opens in the queue (which effectively means that a kernel has retired at the other end of the queue).
You have no visibility into this (no way to query the number of slots open in the queue) nor any way to view or control the queue depth. Most of the information I'm reciting here is obtained by inspection; it is not formally published in CUDA documentation that I am aware of.
As already discussed in the comments, one possible approach to alleviate your concern around launches in a multi-device scenario is to launch breadth-first rather than depth-first. By this I mean that you should modify your launch loops so that you launch a kernel to device 0, then device 1, then device 2, etc. before launching the next kernel on device 0. This will give you the optimum performance in the sense that all GPUs will be engaged with processing, as early as possible in the launch sequence.
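As a rough sketch only (assuming the multi-device case and reusing the buffer and s arrays from the code above), a breadth-first launch order would look something like this:

// Sketch: breadth-first launch order. The outer loop walks over launch slots,
// the inner loop touches every device once before issuing the next round,
// so all GPUs start working as early as possible.
for (int j = 0; j < 10000; j++) {
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        add<<<32, 1024, 0, s[i*nStream]>>>(buffer[i], 1000);
    }
}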
If you'd like to see changes in CUDA behavior or documentation, the general suggestion is to become a registered developer at developer.nvidia.com, then log into your account there and file a bug, using the bug filing process accessible by clicking on your account name in the upper right hand corner.

CUDA streams performance

I am currently learning CUDA streams through the computation of a dot product between two vectors. The ingredients are a kernel function that takes in vectors x and y and returns a vector result of size equal to the number of blocks, where each block contributes its own reduced sum.
I also have a host function dot_gpu that calls the kernel and reduces the vector result to the final dot product value.
The synchronous version does just this:
// copy to device
copy_to_device<double>(x_h, x_d, n);
copy_to_device<double>(y_h, y_d, n);
// kernel
double result = dot_gpu(x_d, y_d, n, blockNum, blockSize);
while the async one goes like:
double result[numChunks];
for (int i = 0; i < numChunks; i++) {
    int offset = i * chunkSize;
    // copy to device
    copy_to_device_async<double>(x_h+offset, x_d+offset, chunkSize, stream[i]);
    copy_to_device_async<double>(y_h+offset, y_d+offset, chunkSize, stream[i]);
    // kernel
    result[i] = dot_gpu(x_d+offset, y_d+offset, chunkSize, blockNum, blockSize, stream[i]);
}
for (int i = 0; i < numChunks; i++) {
    finalResult += result[i];
    cudaStreamDestroy(stream[i]);
}
I am getting worse performance when using streams and was trying to investigate the reasons. I tried to pipeline the downloads, kernel calls and uploads, but with no results.
// accumulate the result of each block into a single value
double dot_gpu(const double *x, const double* y, int n, int blockNum, int blockSize, cudaStream_t stream=NULL)
{
    double* result = malloc_device<double>(blockNum);
    dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream>>>(x, y, result, n);

#if ASYNC
    double* r = malloc_host_pinned<double>(blockNum);
    copy_to_host_async<double>(result, r, blockNum, stream);

    CudaEvent copyResult;
    copyResult.record(stream);
    copyResult.wait();
#else
    double* r = malloc_host<double>(blockNum);
    copy_to_host<double>(result, r, blockNum);
#endif

    double dotProduct = 0.0;
    for (int i = 0; i < blockNum; i++) {
        dotProduct += r[i];
    }

    cudaFree(result);
#if ASYNC
    cudaFreeHost(r);
#else
    free(r);
#endif

    return dotProduct;
}
My guess is that the problem is inside the dot_gpu() function, which does more than just call the kernel. Tell me if I understand the following stream execution correctly:
foreach stream {
    cudaMemcpyAsync( device[stream], host[stream], ... stream );
    LaunchKernel<<<...stream>>>( ... );
    cudaMemcpyAsync( host[stream], device[stream], ... stream );
}
The host executes all three instructions without being blocked, since cudaMemcpyAsync and the kernel launch return immediately (on the GPU, however, they execute sequentially because they are assigned to the same stream). The host then moves on to the next stream (whatever stage stream 1 is at, it is doing its job on the GPU) and issues the same three instructions again without blocking, and so on. However, my code blocks the host before it can process the next stream, somewhere inside the dot_gpu() function. Is it because I am allocating and freeing memory, as well as reducing the array returned by the kernel to a single value?
Assuming your objectified CUDA interface does what the function and method names suggest, there are three reasons why work from subsequent calls to dot_gpu() might not overlap:
Your code explicitly blocks by recording an event and waiting for it.
If it weren't blocking for 1. already, your code would block on the pinned host side allocation and deallocation, as you suspected.
If your code weren't blocking for 2. already, work from subsequent calls to dot_gpu() might still not overlap depending on compute capability. Devices of compute capability 3.0 or lower do not reorder operations even if they are enqueued to different streams.
Even for devices of compute capability 3.5 and higher, the number of streams whose operations can be reordered is limited by the CUDA_DEVICE_MAX_CONNECTIONS environment variable, which defaults to 8 and can be set to values as large as 32.
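For what it's worth, one way to avoid all three blockers is to hoist the allocations out of the per-chunk path, skip the per-call event wait, and defer the reduction until all streams have been fed. A minimal sketch, reusing the question's helper names and variables and assuming they behave as their names suggest:

// Sketch only. Allocate the per-block partial sums for every chunk up front,
// launch each chunk's copies, kernel, and async copy-back on its stream
// without waiting, and reduce on the host after one synchronization pass.
double* partial_d = malloc_device<double>(numChunks * blockNum);
double* partial_h = malloc_host_pinned<double>(numChunks * blockNum);

for (int i = 0; i < numChunks; i++) {
    int offset = i * chunkSize;
    copy_to_device_async<double>(x_h + offset, x_d + offset, chunkSize, stream[i]);
    copy_to_device_async<double>(y_h + offset, y_d + offset, chunkSize, stream[i]);
    dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream[i]>>>(
        x_d + offset, y_d + offset, partial_d + i * blockNum, chunkSize);
    copy_to_host_async<double>(partial_d + i * blockNum, partial_h + i * blockNum,
                               blockNum, stream[i]);
}

double finalResult = 0.0;
for (int i = 0; i < numChunks; i++) {
    cudaStreamSynchronize(stream[i]);        // only block here, at the very end
    for (int b = 0; b < blockNum; b++)
        finalResult += partial_h[i * blockNum + b];
}

cudaFreeHost(partial_h);
cudaFree(partial_d);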

Issues with CUDA streams

I am running CUBLAS v2.0 on different streams on a single GPU (Tesla C2050) by subdividing the input matrices (A[x/num_of_streams*y]B[xy] = C[x/num_of_streams*y]), but somehow it is taking more time when I use CUDA streams. Here is the code snippet:
//plan is a struct containing the matrix dimensions and stream numbers
//parallel in nstreams - should be! MAX 16 streams could run concurrently

//Copy A - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyA_in_streams (&plan[i]);

//Copy B - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyB_in_streams (&plan[i]);

//Create handles - serial
for(i = 0; i < nstreams; i++)
    handle[i] = create_handle();

//Run kernels - first doing a cublasSetStream(handle, plan->stream) before running cublasDgemm...
for(i = 0; i < nstreams; i++)
    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);

//Destroy handles - serial
for(i = 0; i < nstreams; i++)
    destroy_handle (handle[i]);

//Copy C - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyC_in_streams (&plan[i]);
//EDIT: Function body
//The other two copy functions are exactly the same as this
void cudgemm_copyA_in_streams(TGPUplan *plan)
{
    cudasafe(cudaMemcpyAsync(plan->Ad_Data, plan->Ah_Data, (plan->Acols * plan->Arows * sizeof(double)), cudaMemcpyHostToDevice, plan->stream) );
}

//Create handle
cublasHandle_t create_handle ()
{
    cublasHandle_t handle;
    checkError(cublasCreate(&handle), "cublasCreate() error!\n");
    return handle;
}

//Destroy handle
void destroy_handle (cublasHandle_t handle)
{
    checkError(cublasDestroy(handle), "cublasDestroy() error!\n");
}

//Kernel
void cudgemm_kernel_in_streams(TGPUplan *plan, cublasHandle_t handle, const double alpha, const double beta)
{
    cublasStatus_t ret;
    cublasSetStream(handle, plan->stream);
    ret = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, plan->Arows, plan->Ccols, plan->Acols, &alpha, plan->Ad_Data, plan->Arows, plan->Bd_Data, plan->Brows, &beta, plan->Cd_Data, plan->Crows);
    checkError(ret, "cublas Dgemm returned an error!\n");
}
So I am bouncing back and forth between streams and assigning work, expecting to get a better execution time, but I notice that the more streams I use, the more time the program takes compared to the version that does not use streams. Where am I going wrong?
Cross post to Nvidia forums - http://forums.nvidia.com/index.php?showtopic=209420
EDIT:
I modified my program as follows:
//copy data
for(i = 0; i < nstreams; i++)
{
    cudgemm_copyA_in_streams (&plan[i]);
    cudgemm_copyB_in_streams (&plan[i]);
}

//Run kernel and copy back
for(i = 0; i < nstreams; i++)
{
    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);
    cudgemm_copyC_in_streams (&plan[i]);
}
When I profile my program for a matrix order of 6144, I get the following info:
Kernel time = 42.75 % of total GPU time
Memory copy time = 28.9 % of total GPU time
Kernel taking maximum time = fermiDgemm_v2_kernel_val (42.8% of total GPU time)
Memory copy taking maximum time = memcpyHtoDasync (21.7% of total GPU time)
Total overlap time in GPU = 65268.3 micro sec. (3.6% of total GPU time)
When I time the above loop, I get a time of 0.000284 s, vs 1.703289 s for the version that does not use streams (in that version I also time the two sequential memory copies, the kernel invocation, and the remaining memcpy).
I think that since I am not using any synchronization constructs, maybe I am printing the time before the computation actually finishes (I find it hard to believe that there is a 100% improvement).
I suggest two changes:
1) move your cuBLAS handle creation/destruction to outside the copies and kernel invocations. It's possible it is breaking concurrency by doing an unneeded context synchronize.
2) do the memcpy's together in one loop over the streams. That way, the B copy of stream 0 does not do any extra synchronization to wait until the A memcpy has been completed. i.e. do this:
for(i = 0; i < nstreams; i++) {
    cudgemm_copyA_in_streams (&plan[i]);
    cudgemm_copyB_in_streams (&plan[i]);
}
not this:
for(i = 0; i < nstreams; i++)
    cudgemm_copyA_in_streams (&plan[i]);
for(i = 0; i < nstreams; i++)
    cudgemm_copyB_in_streams (&plan[i]);
Don't be surprised if you are unable to get a speedup of more than 40% or so from overlapping transfers and computation. Streams deliver the biggest benefits on workloads that spend equal time transferring and processing data, and very few workloads fall into that category.
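To make suggestion 1 concrete, here is a rough sketch of the restructured flow (using the question's own wrapper functions; the ordering, not the exact calls, is the point), with a final synchronization before any timing is taken:

// Sketch: create handles once, up front, so no context-synchronizing calls
// land between the async copies and the kernels.
for (i = 0; i < nstreams; i++)
    handle[i] = create_handle();

// Both input copies for a stream are issued back-to-back...
for (i = 0; i < nstreams; i++) {
    cudgemm_copyA_in_streams(&plan[i]);
    cudgemm_copyB_in_streams(&plan[i]);
}

// ...then the DGEMM and the result copy for each stream.
for (i = 0; i < nstreams; i++) {
    cudgemm_kernel_in_streams(&plan[i], handle[i], 1.0, 1.0);
    cudgemm_copyC_in_streams(&plan[i]);
}

// Wait for all streams before stopping the timer or using the results.
cudaDeviceSynchronize();

for (i = 0; i < nstreams; i++)
    destroy_handle(handle[i]);

The final cudaDeviceSynchronize also addresses the timing concern above: without it, the measured 0.000284 s is only the time to enqueue the asynchronous work, not to complete it.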
I would also suggest checking the SIZE of the copies; you should start using different streams only
when the time to transfer one block of memory is comparable to the time needed to compute on it.
If the transfer time is small compared to the computation time, then adding streams just adds more overhead from their management.
Use the Visual Profiler to see how long the various steps take. Make a graph with different memory inputs.

How to synchronize global memory between multiple kernel launches?

I want to launch the following kernel multiple times in a FOR LOOP (pseudo):
__global__ void kernel(t_dev is input array in global mem) {
    __shared__ PREC tt[BLOCK_DIM];

    if (thid < m) {
        tt[thid] = t_dev.data[ii]; // MEM READ!
    }

    ... // MODIFY
    __syncthreads();

    if (thid < m) {
        t_dev.data[thid] = tt[thid]; // MEM WRITE!
    }

    __threadfence(); // or __syncthreads(); //// NECESSARY!! but why?
}
What I do conceptually: I read in values from t_dev, modify them, write them out to global memory again, and then launch the same kernel again.
Why do I apparently need the __threadfence or __syncthreads?
Without it the results come out wrong, because the memory writes seem not to be finished when the same kernel starts again. That is what appears to happen here; my GTX 580 has device overlap enabled.
But why are global memory writes not finished when the next kernel starts? Is this because of device overlap, or is it always like that? I thought that when we launch kernel after kernel, memory writes/reads are finished after each kernel. :-)
Thanks for your answers!
SOME CODE :
for(int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++){

    proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
        mu_dev,x_new_dev,T_dev,x_old_dev,d_dev,
        t_dev,
        kernelAIdx,
        pConvergedFlag_dev,
        m_absTOL,m_relTOL);

    proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
        t_dev,
        T_dev,
        x_new_dev,
        kernelAIdx
    );

}
These are the two kernels which are in the loop; t_dev and x_new_dev are passed from Step A to Step B.
Kernel A looks as follows:
template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
__global__ void sorProxContactOrdered_1threads_StepA_kernel(
utilCuda::Matrix<PREC> mu_dev,
utilCuda::Matrix<PREC> y_dev,
utilCuda::Matrix<PREC> T_dev,
utilCuda::Matrix<PREC> x_old_dev,
utilCuda::Matrix<PREC> d_dev,
utilCuda::Matrix<PREC> t_dev,
int kernelAIdx,
int maxNContacts,
bool * convergedFlag_dev,
PREC _absTOL, PREC _relTOL){
//__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY
// Assumend 1 Block, with THREADS_PER_BLOCK Threads and Column Major Matrix T_dev
int thid = threadIdx.x;
int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
int i = kernelAIdx * BLOCK_DIM;
int ii = i + thid;
//First copy x_old_dev in shared
__shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if its in the limit!!
__shared__ PREC tt[BLOCK_DIM];
if(thid < m){
xx[thid] = x_old_dev.data[ii];
tt[thid] = t_dev.data[ii];
}
__syncthreads();
PREC absTOL = _absTOL;
PREC relTOL = _relTOL;
int jj;
//PREC T_iijj;
//Offset the T_dev_ptr to the start of the Block
PREC * T_dev_ptr = PtrElem_ColM(T_dev,i,i);
PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];
__syncthreads();
for(int j_t = 0; j_t < m ; j_t+=PROX_PACKAGE_SIZE){
//Select the number of threads we need!
// Here we process one [m x PROX_PACKAGE_SIZE] Block
// First Normal Direction ==========================================================
jj = i + j_t;
__syncthreads();
if( ii == jj ){ // select thread on the diagonal ...
PREC x_new_n = (d_dev.data[ii] + tt[thid]);
//Prox Normal!
if(x_new_n <= 0.0){
x_new_n = 0.0;
}
/* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
xx[thid] = x_new_n;
tt[thid] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
// Select only m threads!
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t];
}
// ====================================================================================
// we need to synchronize here because one thread finished lambda_t2 with shared mem tt, which is updated from another thread!
__syncthreads();
// Second Tangential Direction ==========================================================
jj++;
__syncthreads();
if( ii == jj ){ // select thread on diagonal, one thread finishes T1 and T2 directions.
// Prox tangential
PREC lambda_T1 = (d_dev.data[ii] + tt[thid]);
PREC lambda_T2 = (d_dev.data[ii+1] + tt[thid+1]);
PREC radius = (*mu_dev_ptr) * xx[thid-1];
PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);
if(absvalue > radius){
lambda_T1 = (lambda_T1 * radius ) / absvalue;
lambda_T2 = (lambda_T2 * radius ) / absvalue;
}
/*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}
if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
//Write the two values back!
xx[thid] = lambda_T1;
tt[thid] = 0.0;
xx[thid+1] = lambda_T2;
tt[thid+1] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
}
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
}
// ====================================================================================
__syncthreads();
// move T_dev_ptr 1 column
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
// move mu_ptr to nex contact
__syncthreads();
mu_dev_ptr = &mu_dev_ptr[1];
__syncthreads();
}
__syncthreads();
// Write back the results, don't need to synchronize because
// we do it anyway to be safe for testing first!
if(thid < m){
y_dev.data[ii] = xx[thid]; // THIS IS UPDATED IN KERNEL B
t_dev.data[ii] = tt[thid]; // THIS IS UPDATED IN KERNEL B
}
//__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORK!
}
I compare the solution at the end with the CPU, and HERE I put a __syncthreads() everywhere I can, just to be safe for a start (this code does Gauss-Seidel stuff), but it does not work at all without the __threadfence() at the END or at the BEGINNING, where it does not make sense.
Sorry for so much code, but you can probably guess where the problem comes from; I am a bit at my wits' end trying to explain why this happens.
We checked the algorithm several times; there is no memory error (reported by Nsight) or
anything else, everything works fine... Kernel A is launched with ONE block only!
If you launch successive instances of the kernel into the same stream, each kernel launch is serialized with respect to the kernel instances launched before and after it. The programming model guarantees it. CUDA only permits simultaneous kernel execution for kernels launched into different streams of the same context, and even then overlapping kernel execution only happens if the scheduler determines that sufficient resources are available to do so.
Neither __threadfence nor __syncthreads will have the effect you seem to be thinking of: __threadfence works only at the scope of all active threads, and __syncthreads is an intra-block barrier operation. If you really want kernel-to-kernel synchronization, you need to use one of the host-side synchronization calls, like cudaThreadSynchronize (pre CUDA 4.0) or cudaDeviceSynchronize (CUDA 4.0 and later), or the per-stream equivalent if you are using streams.
While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.
Writes to global memory, with the exception of atomic functions, are not guaranteed to be immediately visible to other threads (from the same block or from different blocks). By putting __threadfence() there you halt the current thread until the writes are in fact visible. This might be important in particular when you are using global memory with a cache (the Fermi series).
One thing to note: kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with your current one, but will launch as soon as the current one finishes, essentially hiding the latency caused by the CPU->GPU communication.
Using cudaThreadSynchronize halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Do note that using synchronous memory access (e.g. cudaMemcpy, without the "Async" suffix) essentially behaves like cudaThreadSynchronize too.
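For illustration, a minimal sketch of host-side synchronization applied to the question's loop (the wrapper calls are the question's own; as the first answer notes, same-stream launches are already serialized, so this mainly helps rule out launch-ordering issues while debugging):

// Sketch: force a host-side synchronization point between successive kernel
// launches (CUDA 4.0+ API; use cudaThreadSynchronize on older toolkits).
for (int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++) {
    proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
        mu_dev, x_new_dev, T_dev, x_old_dev, d_dev,
        t_dev, kernelAIdx, pConvergedFlag_dev,
        m_absTOL, m_relTOL);
    cudaDeviceSynchronize();   // wait for Step A's global memory writes to complete

    proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
        t_dev, T_dev, x_new_dev, kernelAIdx);
    cudaDeviceSynchronize();   // wait for Step B before the next iteration
}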

Get rid of busy waiting during asynchronous cuda stream executions

I am looking for a way to get rid of busy waiting in the host thread in the following code (do not copy this code, it only shows the idea of my problem, it has many basic bugs):
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    while (true) {
        if (cudaStreamQuery(streams[sid]) == cudaSuccess) { // BUSY WAITING !!!!
            cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
            kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
            break;
        }
        sid = ++sid % S_N;
    }
}
Is there a way to idle the host thread and somehow wait for some stream to finish, and then prepare and run another stream?
EDIT: I added while(true) into the code to emphasize the busy waiting. Now I execute all the streams and check which of them has finished in order to run another new one. cudaStreamSynchronize waits for a particular stream to finish, but I want to wait for whichever of the streams finishes its job first.
EDIT2: I got rid of the busy waiting in the following way:
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = ++sid % S_N;
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
But it appears to be a little slower than the version with busy waiting on the host thread. I think that is because I now distribute the jobs statically across the streams, so when one stream finishes its work it sits idle until every stream has finished. The previous version dynamically distributed the work to the first idle stream, so it was more efficient, but there was busy waiting on the host thread.
The real answer is to use cudaThreadSynchronize to wait for all previous launches to complete, cudaStreamSynchronize to wait for all launches in a certain stream to complete, and cudaEventSynchronize to wait for only a certain event on a certain stream to be recorded.
However, you need to understand how streams and synchronization work before you will be able to use them in your code.
What happens if you do not use streams at all? Consider the following code:
kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();
The kernel is launched and the host moves on to execute host_func1, concurrently with the kernel. Then the host and the device are synchronized, i.e. the host waits for the kernel to finish before moving on to host_func2().
Now, what if you have two different kernels?
kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);
kernel1 is launched asynchronously! The host moves on, and kernel2 is issued before kernel1 finishes! However, kernel2 will not execute until after kernel1 finishes, because they have both been launched on stream 0 (the default stream). Consider the following alternative:
kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);
There is absolutely no need to do this because the device already synchronizes kernels launched on the same stream.
So, I think that the functionality that you are looking for already exists... because a kernel always waits for previous launches in the same stream to finish before starting (even though the host passes by). That is, if you want to wait for any previous launch to finish, then simply don't use streams. This code will work fine:
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
}
Now, on to streams. You can use streams to manage concurrent device execution.
Think of a stream as a queue. You can put different memcpy calls and kernel launches into different queues. Then, work in stream 1 and work in stream 2 are asynchronous with respect to each other! They may be executed at the same time, or in any order. If you want to be sure that only one memcpy/kernel is being executed on the device at a time, then don't use streams. Similarly, if you want kernels to be executed in a specific order, then don't use streams.
That said, keep in mind that anything put into stream 1 is executed in order, so don't bother synchronizing. Synchronization is for synchronizing host and device calls, not two different device calls. So, if you want to execute several of your kernels at the same time because they use different device memory and have no effect on each other, then use streams. Something like...
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = ++sid % S_N;
}
No explicit device synchronization necessary.
My idea for solving that problem is to have one host thread per stream. That host thread would invoke cudaStreamSynchronize to wait until the stream's commands are completed.
Unfortunately this is not possible in CUDA 3.2, since it allows only one host thread to deal with one CUDA context, which means one host thread per CUDA-enabled GPU.
Hopefully it will be possible in CUDA 4.0: CUDA 4.0 RC news
EDIT: I have tested this in CUDA 4.0 RC, using OpenMP. I created one host thread per CUDA stream, and it started to work.
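A minimal sketch of that idea (assumes CUDA 4.0 or later so one context can be shared by several host threads, plus a hypothetical per-stream device buffer d_data[sid]; needs <omp.h>):

// Sketch: one OpenMP host thread per stream; each thread queues its chunk's
// copy and kernel, then blocks in cudaStreamSynchronize without busy waiting
// on the other streams.
#pragma omp parallel num_threads(S_N)
{
    int sid = omp_get_thread_num();
    for (int d = sid * DATA_STEP; d < DATA_SIZE; d += S_N * DATA_STEP) {
        cudaMemcpyAsync(d_data[sid], h_data + d, DATA_STEP,
                        cudaMemcpyHostToDevice, streams[sid]);
        kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data[sid], DATA_STEP);
        cudaStreamSynchronize(streams[sid]);   // blocks only this host thread
    }
}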
There is: cudaEventRecord(event, stream) and cudaEventSynchronize(event). The reference manual http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_Toolkit_Reference_Manual.pdf has all the details.
Edit: BTW streams are handy for concurrent execution of kernels and memory transfers. Why do you want to serialize the execution by waiting on the current stream to finish?
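A rough sketch of that event-based waiting (the per-stream done array is the sketch's own assumption, not something prescribed by the answer):

// Sketch: record an event after each stream's work and wait on the event,
// rather than polling the stream with cudaStreamQuery.
cudaEvent_t done[S_N];
for (int i = 0; i < S_N; i++)
    cudaEventCreate(&done[i]);

int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaEventSynchronize(done[sid]);   // wait until this stream's previous batch finished
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    cudaEventRecord(done[sid], streams[sid]);
    sid = ++sid % S_N;
}

cudaEventSynchronize on an event that has never been recorded returns immediately, so the first pass through each stream does not block.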
Instead of cudaStreamQuery, you want cudaStreamSynchronize
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaStreamSynchronize(streams[sid]);
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = ++sid % S_N;
}
(You can also use cudaThreadSynchronize to wait for launches across all streams, and events with cudaEventSynchronize for more advanced host/device synchronization.)
You can further control the type of waiting that occurs with these synchronization functions. Look at the reference manual for the cudaDeviceBlockingSync flag and others. The default is probably what you want, though.
You need to copy the data chunks and execute the kernel on those data chunks in different for loops. That will be more efficient,
like this:
size = N*sizeof(float)/nStreams;
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
}
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}
In this way the memory copy does not have to wait for the kernel execution of the previous stream, and vice versa.