CUDA stream is blocked when launching many kernels (>1000)

I have found that a CUDA stream blocks when I launch a lot of kernels (more than 1000). I am wondering whether there is any configuration I can change.
In my experiments, I launch a small kernel 10000 times. The kernel runs for a short time (about 190 us). The first 1000 kernels launch very fast; it takes only 4~5 us to launch one. After that, the launch process becomes slow: it takes about 190 us to launch a new kernel. The CUDA stream seems to wait for a previous kernel to complete, and the underlying buffer appears to hold about 1000 kernels.
When I create 3 streams, each stream can accept 1000 kernel launches asynchronously.
I want to make this buffer bigger. I tried setting cudaLimitDevRuntimePendingLaunchCount, but it has no effect. Is there any way to do this?
#include <stdio.h>
#include "cuda_runtime.h"

#define CUDACHECK(cmd) do {                                \
    cudaError_t e = cmd;                                   \
    if (e != cudaSuccess) {                                \
        printf("Failed: Cuda error %s:%d '%s'\n",          \
               __FILE__, __LINE__, cudaGetErrorString(e)); \
        exit(EXIT_FAILURE);                                \
    }                                                      \
} while (0)

// a dummy kernel for test
__global__ void add(float *a, int n) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = 0; i < n; i++) {
        a[id] = sqrt(a[id] + 1);
    }
}

int main(int argc, char* argv[])
{
    // managing 1 devices
    int nDev = 1;
    int nStream = 1;
    int size = 32*1024*1024;

    // allocating and initializing device buffers
    float** buffer = (float**)malloc(nDev * sizeof(float*));
    cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev*nStream);
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        // CUDACHECK(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 10000));
        CUDACHECK(cudaMalloc(buffer + i, size * sizeof(float)));
        CUDACHECK(cudaMemset(buffer[i], 1, size * sizeof(float)));
        for (int j = 0; j < nStream; j++) {
            CUDACHECK(cudaStreamCreate(s+i*nStream+j));
        }
    }

    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        for (int j = 0; j < 10000; j++) {
            for (int k = 0; k < nStream; k++) {
                add<<<32, 1024, 0, s[i*nStream+k]>>>(buffer[i], 1000);
            }
        }
    }

    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        cudaDeviceSynchronize();
    }

    // free device buffers
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        CUDACHECK(cudaFree(buffer[i]));
    }

    printf("Success \n");
    return 0;
}
Here are the nvprof results:
When I create 3 streams, the first 3000 kernels launch quickly and then launching becomes slow.
When I create 1 stream, the first 1000 kernels launch quickly and then launching becomes slow.

The behavior you are witnessing is expected behavior. If you search on the cuda tag for "queue" or "launch queue" you will find many other questions that refer to it. CUDA has a queue (apparently per-stream) that kernel launches go into. As long as the outstanding launch count is less than the queue depth, the launch process will be asynchronous.
However, when the outstanding (i.e. uncompleted) launches exceed the queue depth, the launch process changes to a kind of synchronous behavior (although not synchronous in the usual sense). Specifically, when the outstanding number of kernel launches exceeds the queue depth, the launch process will block the CPU thread that is performing the next launch until a launch slot opens in the queue (which effectively means a kernel has retired at the other end of the queue).
You have no visibility into this (no way to query the number of slots open in the queue) nor any way to view or control the queue depth. Most of the information I'm reciting here is obtained by inspection; it is not formally published in CUDA documentation that I am aware of.
As already discussed in the comments, one possible approach to alleviate your concern around launches in a multi-device scenario is to launch breadth-first rather than depth-first. By this I mean that you should modify your launch loops so that you launch a kernel to device 0, then device 1, then device 2, etc. before launching the next kernel on device 0. This will give you the optimum performance in the sense that all GPUs will be engaged with processing, as early as possible in the launch sequence.
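As a minimal sketch of that idea, reusing the variable names from the code in the question (this is my illustration, not tested code), the launch loops would be reordered so the device loop sits inside the iteration loop:

// Breadth-first launch order: one kernel per device per iteration,
// so every GPU starts working as early as possible in the launch sequence.
for (int j = 0; j < 10000; j++) {
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        for (int k = 0; k < nStream; k++) {
            add<<<32, 1024, 0, s[i*nStream+k]>>>(buffer[i], 1000);
        }
    }
}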
If you'd like to see changes in CUDA behavior or documentation, the general suggestion is to become a registered developer at developer.nvidia.com, then log into your account there and file a bug, using the bug filing process accessible by clicking on your account name in the upper right hand corner.

Related

Free memory occupied by cudaMemGetInfo

I have the following simple code to find available GPUs
int * getFreeGpuList(int *numFree) {
    int * gpuList;
    int nDevices;
    int i, j = 0, count = 0;
    cudaGetDeviceCount(&nDevices);
    gpuList = (int *) malloc(nDevices * sizeof(int));
    for (i = 0; i < nDevices; ++i) {
        cudaSetDevice(i);
        size_t freeMem;
        size_t totalMem;
        cudaMemGetInfo(&freeMem, &totalMem);
        if (freeMem > .9 * totalMem) {
            gpuList[j] = i;
            count++;
            j++;
        }
    }
    *numFree = count;
    return gpuList;
}
The problem is that cudaMemGetInfo occupies some memory (~150 MB in my case) on each GPU. This code is part of a bigger program that runs for a long time, and I often run several processes at the same time, so in the end the memory occupied by this function is significant. Could you please let me know how I can free the GPU memory occupied by cudaMemGetInfo? Thanks!
Based on the insight from talonmies above that cudaSetDevice creates a context and occupies some memory on the device, I found that cudaDeviceReset can "explicitly destroy and clean up all resources associated with the current device in the current process" without affecting other processes on the same device.
Update Nov 26: If you want to query GPU info, it's better to use the NVML library. In my experience, it is much faster and does not take up device memory for simple memory and name queries.
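As a hedged sketch of the NVML approach (assuming the NVML header and the nvidia-ml library are available; link with -lnvidia-ml), a context-free variant of the query function might look like this. The helper name getFreeGpuListNvml is mine, not part of any library:

#include <nvml.h>
#include <stdio.h>
#include <stdlib.h>

// Query free memory on each GPU via NVML; unlike cudaMemGetInfo, this does not
// create a CUDA context and therefore does not consume device memory.
int *getFreeGpuListNvml(int *numFree) {
    unsigned int nDevices = 0;
    int *gpuList, count = 0;

    if (nvmlInit() != NVML_SUCCESS) { *numFree = 0; return NULL; }
    nvmlDeviceGetCount(&nDevices);
    gpuList = (int *) malloc(nDevices * sizeof(int));

    for (unsigned int i = 0; i < nDevices; ++i) {
        nvmlDevice_t dev;
        nvmlMemory_t mem;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;
        if (nvmlDeviceGetMemoryInfo(dev, &mem) != NVML_SUCCESS) continue;
        if (mem.free > 0.9 * mem.total)
            gpuList[count++] = (int) i;
    }
    nvmlShutdown();
    *numFree = count;
    return gpuList;
}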

Are atomic operations in CUDA guaranteed to be scheduled per warp?

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each block writes either all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array.
To write the actual values, atomicExch is used, exchanging the current value in the array with the value that the block attempts to write. Because of SIMD, the atomic operation, and the fact that a warp executes in lockstep, I would expect the array to contain only 1's or only 0's at any point in time, never a mix of the two.
However, while running code like this there are several cases where, at some point in time, the array contains a mix of 0's and 1's, which appears to indicate that atomic operations are not executed per warp but are instead scheduled by some other scheme.
From other sources I have not found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. I need to write many small vectors, each consisting of several 32-bit integers, atomically to global memory, so an operation that is guaranteed to write a single vector atomically is very important to me.
For those wondering, the code I wrote was executed on a GTX 970, compiled on compute capability 5.2, using CUDA 8.0.
The atomic instructions, like all instructions, are scheduled per warp. However, there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to execute in lockstep, for every thread, for every stage of the pipeline. This gives rise to the possibility of your observations.
I believe a simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where we have multiple issues per clock within an SM, and even across SMs, as additional examples.
If the vector length is short enough (16 bytes or less), then it should be possible to accomplish this ("atomic update") simply by having a single thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) attempt to update only naturally aligned locations with such vector writes, a write should not be corrupted by other writes.
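A minimal sketch of that idea (my own illustration, not the OP's code): one thread per slot writes its entire 16-byte vector with a single int4 store to a naturally aligned address, so readers never observe a partially written vector:

// Hypothetical illustration: each block "owns" one slot, and thread 0 of the
// block writes the whole 16-byte vector with one int4 store. cudaMalloc
// returns suitably aligned pointers, so slots[slot] is naturally aligned and
// the 128-bit store cannot be torn by other writes.
__global__ void write_vec(int4 *slots, int value) {
    int slot = blockIdx.x;
    if (threadIdx.x == 0) {
        slots[slot] = make_int4(value, value, value, value);
    }
}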
However, after discussion in the comments, it seems that the OP's goal is to have a warp or threadblock update a vector of some length without interference from other warps or threadblocks. What is really desired is access control (so that only one warp or threadblock updates a particular vector at a time), and the OP had some code that wasn't working as desired.
This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.
What follows is an example producer-consumer code, where multiple threadblocks update a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:
vector is empty
vector is being filled
vector is filled, ready for "consumption"
With this 3-level scheme, we can allow ordinary access to the vector by both the consumer and multiple producer workers, with a single ordinary atomic variable as the access-control mechanism. Here is an example code:
#include <assert.h>
#include <iostream>
#include <stdio.h>

const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64; // needs to be greater than the maximum number of SMs for any GPU that it will be run on

__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0};      // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;

static __device__ __inline__ int __mysmid(){
    int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// this code won't work on a GPU with a single SM!
__global__ void kernel(){
    __shared__ volatile int done, update, next_slot;
    int my_block_id = atomicAdd(&block_id, 1);
    int my_sm = __mysmid();
    if (my_block_id == 0){
        if (!threadIdx.x){
            restricted_sm = my_sm;
            __threadfence();
            // I am "block 0" and process the vectors, checking for coherency
            // "consumer"
            next_slot = 0;
            volatile int *vslot_control = slot_control;
            volatile int *vslots = slots;
            int scount = 0;
            while (scount < max_act){
                if (vslot_control[next_slot] == slot_full){
                    scount++;
                    int slot_val = vslots[next_slot*slot_length];
                    for (int i = 1; i < slot_length; i++)
                        if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */ }
                    observations[slot_val]++;
                    vslot_control[next_slot] = slot_empty;
                    correct++;
                    __threadfence();
                }
                next_slot++;
                if (next_slot >= num_slots) next_slot = 0;
            }
        }
    }
    else {
        // "producer"
        while (restricted_sm < 0); // wait for signaling
        if (my_sm == restricted_sm) return;
        next_slot = 0;
        done = 0;
        __syncthreads();
        while (!done) {
            if (!threadIdx.x){
                while (atomicCAS(slot_control+next_slot, slot_empty, slot_filling) > slot_empty) {
                    next_slot++;
                    if (next_slot >= num_slots) next_slot = 0;
                }
                // we grabbed an empty slot, fill it with my_sm
                if (atomicAdd(&num_act, 1) < max_act) update = 1;
                else { done = 1; update = 0; }
            }
            __syncthreads();
            if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
            __threadfence(); // enforce ordering
            if ((update) && (!threadIdx.x)){
                slot_control[next_slot] = 2; // mark slot full
                atomicAdd(actives+my_sm, 1);
            }
            __syncthreads();
        }
    }
}

int main(){
    kernel<<<256, slot_length>>>();
    cudaDeviceSynchronize();
    cudaError_t res = cudaGetLastError();
    if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
    int *h_obs = new int[max_sm];
    int *h_act = new int[max_sm];
    int h_correct;
    cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
    cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
    cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
    int h_total_act = 0;
    int h_total_obs = 0;
    for (int i = 0; i < max_sm; i++){
        std::cout << h_act[i] << "," << h_obs[i] << " ";
        h_total_act += h_act[i];
        h_total_obs += h_obs[i];
    }
    std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}
I don't claim this code to be defect-free for any use case. It is offered to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on Linux, on a couple of different systems I tested it on. It should not be run on GPUs that have only a single SM, as one SM is reserved for the consumer and the remaining SMs are used by the producers.

synchronizing device memory access with host thread

Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, it doesn't seem that the kernel waits for the writes to device-mapped memory to complete before terminating because examining the page-locked host memory immediately after the kernel launch does not show any modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):
#include <stdio.h>
#include <cuda.h>

__global__ void func(int *a, int N) {
    int idx = threadIdx.x;
    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N*sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();

    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    cudaFreeHost(a);
}
I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).
A kernel launch will immediately return to the host code before any kernel activity has occurred. Kernel execution is in this way asynchronous to host execution and does not block host execution. So it's no surprise that you have to wait a bit or else use a barrier (like cudaDeviceSynchronize()) to see the results of the kernel.
As described here:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
This is all intentional of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution, as you've already discovered, is to insert a barrier. If your kernel is producing data which you will immediately copy back to the host, you don't need a separate barrier: the cudaMemcpy call after the kernel will wait until the kernel has completed before it begins its copy operation.
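As a minimal sketch of that point (d_out and h_out are hypothetical device/host buffers of N ints, not the mapped-memory setup in the question): the blocking cudaMemcpy acts as an implicit barrier, so no separate cudaDeviceSynchronize() is needed:

// The blocking cudaMemcpy will not begin copying until func has finished,
// so h_out is guaranteed to hold the kernel's results afterwards.
func<<<1, N>>>(d_out, N);
cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
// no explicit cudaDeviceSynchronize() required here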
To answer your question: you want kernel launches to be synchronous without even having to use a barrier (why do you want to do this? Is adding the cudaDeviceSynchronize() call a problem?). It is possible to do this:
"Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should never be used as a way to make production software run reliably."
If you want this synchronous behavior, it's better just to use a barrier (or depend on another subsequent CUDA call, like cudaMemcpy). If you use the above method and depend on it, your code will break as soon as somebody else tries to run it without the environment variable set. So it's really not a good idea.

Issues with CUDA streams

I am running CUBLAS v2.0 in different streams on a single GPU (Tesla C2050) by subdividing the input matrices (A[x/num_of_streams*y]B[xy] = C[x/num_of_streams*y]), but somehow it takes more time when I use CUDA streams. Here is the code snippet:
//plan is a struct containing the matrix dimensions and stream numbers
//parallel in nstreams - should be! MAX 16 streams could run concurrently

//Copy A - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyA_in_streams (&plan[i]);

//Copy B - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyB_in_streams (&plan[i]);

//Create handles - serial
for(i = 0; i < nstreams; i++)
    handle[i] = create_handle();

//Run kernels - first doing a cublasSetStream(handle, plan->stream) before running cublasDgemm...
for(i = 0; i < nstreams; i++)
    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);

//Destroy handles - serial
for(i = 0; i < nstreams; i++)
    destroy_handle (handle[i]);

//Copy C - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyC_in_streams (&plan[i]);

//EDIT: Function body
//The other two copy functions are exactly the same as this
void cudgemm_copyA_in_streams(TGPUplan *plan)
{
    cudasafe(cudaMemcpyAsync(plan->Ad_Data, plan->Ah_Data, (plan->Acols * plan->Arows * sizeof(double)), cudaMemcpyHostToDevice, plan->stream));
}

//Create handle
cublasHandle_t create_handle ()
{
    cublasHandle_t handle;
    checkError(cublasCreate(&handle), "cublasCreate() error!\n");
    return handle;
}

//Destroy handle
void destroy_handle (cublasHandle_t handle)
{
    checkError(cublasDestroy(handle), "cublasDestroy() error!\n");
}

//Kernel
void cudgemm_kernel_in_streams(TGPUplan *plan, cublasHandle_t handle, const double alpha, const double beta)
{
    cublasStatus_t ret;
    cublasSetStream(handle, plan->stream);
    ret = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, plan->Arows, plan->Ccols, plan->Acols, &alpha, plan->Ad_Data, plan->Arows, plan->Bd_Data, plan->Brows, &beta, plan->Cd_Data, plan->Crows);
    checkError(ret, "cublas Dgemm returned an error!\n");
}
So I am bouncing back and forth between streams and assigning work, expecting to get a better execution time, but I notice that the more streams I use, the more time the program takes compared to the version that does not use streams. Where am I going wrong?
Cross post to Nvidia forums - http://forums.nvidia.com/index.php?showtopic=209420
EDIT:
I modified my program as follows:
//copy data
for(i = 0; i < nstreams; i++)
{
    cudgemm_copyA_in_streams (&plan[i]);
    cudgemm_copyB_in_streams (&plan[i]);
}

//Run kernel and copy back
for(i = 0; i < nstreams; i++)
{
    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);
    cudgemm_copyC_in_streams (&plan[i]);
}
When I profile my program for a matrix order of 6144, I get the following info:
Kernel time = 42.75 % of total GPU time
Memory copy time = 28.9 % of total GPU time
Kernel taking maximum time = fermiDgemm_v2_kernel_val (42.8% of total GPU time)
Memory copy taking maximum time = memcpyHtoDasync (21.7% of total GPU time)
Total overlap time in GPU = 65268.3 micro sec. (3.6% of total GPU time)
When I time the above loop, I get a time of 0.000284 s, vs 1.703289 s for the version that does not use streams (in that version I also time the two sequential memory copies, the kernel invocation and the remaining memcpy).
Since I am not using any synchronization constructs, I think I may be printing the time before the computation actually finishes (I find it hard to believe that there is a 100% improvement).
I suggest two changes:
1) Move your cuBLAS handle creation/destruction outside of the copies and kernel invocations (a combined sketch follows after the loops below). It is possible that it is breaking concurrency by doing an unneeded context synchronize.
2) Do the memcpys together in one loop over the streams. That way, the B copy of stream 0 does not do any extra synchronization waiting until the A memcpy has been completed. i.e. do this:
for(i = 0; i < nstreams; i++) {
    cudgemm_copyA_in_streams (&plan[i]);
    cudgemm_copyB_in_streams (&plan[i]);
}
not this:
for(i = 0; i < nstreams; i++)
    cudgemm_copyA_in_streams (&plan[i]);
for(i = 0; i < nstreams; i++)
    cudgemm_copyB_in_streams (&plan[i]);
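Putting both suggestions together, a hedged sketch of the reordered host code (keeping the question's function names; untested) might look like this:

// Create handles once, up front (suggestion 1) ...
for(i = 0; i < nstreams; i++)
    handle[i] = create_handle();

// ... then issue all async copies and kernels per stream (suggestion 2) ...
for(i = 0; i < nstreams; i++) {
    cudgemm_copyA_in_streams (&plan[i]);
    cudgemm_copyB_in_streams (&plan[i]);
}
for(i = 0; i < nstreams; i++) {
    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0, 1.0);
    cudgemm_copyC_in_streams (&plan[i]);
}

// ... and destroy the handles only after all work has been issued.
for(i = 0; i < nstreams; i++)
    destroy_handle (handle[i]);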
Don't be surprised if you are unable to get a speedup of more than 40% or so from overlapping transfers and computation. Streams deliver the biggest benefits on workloads that spend equal time transferring and processing data, and very few workloads fall into that category.
I would also suggest checking the SIZE of the copies; you should start using different streams only when the time to transfer one block of memory is comparable to the time needed to compute on it.
If the transfer time is small compared to the computation time, then adding streams just adds more management overhead.
Use the Visual Profiler to see how long the various steps take. Make a graph with different memory inputs.

Get rid of busy waiting during asynchronous cuda stream executions

I am looking for a way to get rid of busy waiting in the host thread in the following code (do not copy this code, it only shows the idea of my problem and has many basic bugs):
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    while (true) {
        if (cudaStreamQuery(streams[sid]) == cudaSuccess) { // BUSY WAITING !!!!
            cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
            kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
            break;
        }
        sid = (sid + 1) % S_N;
    }
}
Is there a way to idle the host thread and somehow wait for a stream to finish, and then prepare and run another stream?
EDIT: I added while(true) into the code to emphasize the busy waiting. Now I execute all the streams and check which of them has finished in order to run another new one. cudaStreamSynchronize waits for a particular stream to finish, but I want to wait for whichever stream finishes its job first.
EDIT2: I got rid of the busy waiting in the following way:
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
But it appears to be a little slower than the version with busy waiting on the host thread. I think that is because I now distribute the jobs statically across the streams, so when one stream finishes its work it sits idle until every stream has finished. The previous version dynamically distributed the work to the first idle stream, so it was more efficient, but there was busy waiting on the host thread.
The real answer is to use cudaThreadSynchronize to wait for all previous launches to complete, cudaStreamSynchronize to wait for all launches in a certain stream to complete, and cudaEventSynchronize to wait for only a certain event on a certain stream to be recorded.
However, you need to understand how streams and synchronization work before you will be able to use them in your code.
What happens if you do not use streams at all? Consider the following code:
kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();
The kernel is launched and the host moves on to execute host_func1 and the kernel concurrently. Then, the host and the device are synchronized, i.e. the host waits for the kernel to finish before moving on to host_func2().
Now, what if you have two different kernels?
kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);
kernel1 is launched asynchronously! The host moves on, and kernel2 is launched before kernel1 finishes! However, kernel2 will not execute until kernel1 finishes, because they have both been launched on stream 0 (the default stream). Consider the following alternative:
kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);
There is absolutely no need to do this because the device already synchronizes kernels launched on the same stream.
So, I think the functionality that you are looking for already exists... because a kernel always waits for previous launches in the same stream to finish before starting (even though the host moves past the launch). That is, if you want to wait for any previous launch to finish, then simply don't use streams. This code will work fine:
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
    kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
}
Now, on to streams. You can use streams to manage concurrent device execution.
Think of a stream as a queue. You can put different memcpy calls and kernel launches into different queues. Then, kernels and launches in stream 1 and in stream 2 are asynchronous with respect to each other! They may be executed at the same time, or in any order. If you want to be sure that only one memcpy/kernel is being executed on the device at a time, then don't use streams. Similarly, if you want kernels to be executed in a specific order, then don't use streams.
That said, keep in mind that anything put into stream 1 is executed in order, so don't bother synchronizing within a stream. Synchronization is for synchronizing host and device calls, not two different device calls. So, if you want to execute several of your kernels at the same time because they use different device memory and have no effect on each other, then use streams. Something like...
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}
No explicit device synchronization necessary.
My idea to solve this problem is to have one host thread per stream. That host thread would invoke cudaStreamSynchronize to wait until the stream's commands are completed.
Unfortunately, that is not possible in CUDA 3.2, since it allows only one host thread to deal with one CUDA context; that means one host thread per CUDA-enabled GPU.
Hopefully, in CUDA 4.0 it will be possible: CUDA 4.0 RC news
EDIT: I have tested this in CUDA 4.0 RC, using OpenMP. I created one host thread per CUDA stream, and it started to work.
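A minimal sketch of that approach (my own illustration, assuming CUDA 4.0+ where several host threads can share one context, a per-stream device buffer d_data[sid], and the placeholder names from the question; compile with -fopenmp):

#include <omp.h>
// One OpenMP thread per stream: each worker thread issues its own copies and
// launches, then blocks in cudaStreamSynchronize, so the main thread never
// has to poll cudaStreamQuery in a busy loop.
#pragma omp parallel num_threads(S_N)
{
    int sid = omp_get_thread_num();
    for (int d = sid * DATA_STEP; d < DATA_SIZE; d += S_N * DATA_STEP) {
        cudaMemcpyAsync(d_data[sid], h_data + d, DATA_STEP,
                        cudaMemcpyHostToDevice, streams[sid]);
        kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data[sid], DATA_STEP);
        cudaStreamSynchronize(streams[sid]);   // blocks this worker thread only
    }
}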
There is: cudaEventRecord(event, stream) and cudaEventSynchronize(event). The reference manual http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_Toolkit_Reference_Manual.pdf has all the details.
Edit: BTW streams are handy for concurrent execution of kernels and memory transfers. Why do you want to serialize the execution by waiting on the current stream to finish?
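As a hedged sketch of the event-based variant (reusing the names from the question; cudaEventBlockingSync makes cudaEventSynchronize sleep the calling thread instead of spinning, and cudaEventDisableTiming avoids timing overhead):

// Record an event after each stream's work; before reusing a stream slot,
// block the host on that stream's last event. cudaEventSynchronize on an
// event that has never been recorded returns immediately.
cudaEvent_t events[S_N];
for (int i = 0; i < S_N; i++)
    cudaEventCreateWithFlags(&events[i], cudaEventBlockingSync | cudaEventDisableTiming);

int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaEventSynchronize(events[sid]);   // waits only if this stream is still busy
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    cudaEventRecord(events[sid], streams[sid]);
    sid = (sid + 1) % S_N;
}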
Instead of cudaStreamQuery, you want cudaStreamSynchronize
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaStreamSynchronize(streams[sid]);
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}
(You can also use cudaThreadSynchronize to wait for launches across all streams, and events with cudaEventSynchronize for more advanced host/device synchronization.)
You can further control the type of waiting that occurs with these synchronization functions. Look at the reference manual for the cudaDeviceBlockingSync flag and others. The default is probably what you want, though.
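A minimal sketch of using that flag (spelled cudaDeviceScheduleBlockingSync in recent toolkits, cudaDeviceBlockingSync in older ones; it must be set before the context is created, and this is my illustration rather than code from the answer):

// Make host-side synchronization calls put the calling thread to sleep
// instead of spin-waiting on the device.
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); // before any other runtime call on this device
// ... create streams and launch work as before ...
cudaStreamSynchronize(streams[sid]);                // now blocks rather than busy-waits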
You need to copy the data chunks and execute the kernels on those chunks in separate for loops. That will be more efficient.
like this:
size = N*sizeof(float)/nStreams;
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
}
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}
In this way the memory copies don't have to wait for the kernel execution of the previous stream, and vice versa.