Cuda Async MemCpy behavior when 'lapping' itself

Say you have a fairly typical sequence of async compute:
1. Generate data on host, do other CPU stuff
2. Put data in async stream to device (via pinned memory)
3. Compute data in (same) async stream
Now if you're running in a loop, the host will loop back to step 1, since steps 2 and 3 are non-blocking for the host.
The question is:
"What happens for the host if it gets to step 2 again and the system isn't yet done transferring data to the device?"
Does the host MemCpyAsync block until the previous copy is complete?
Does it get launched like normal, with the outbound data being put in a buffer?
If the latter, presumably this buffer can run out of space if your host is running too fast wrt the device operations?
I'm aware that modern devices have multiple copy engines, but I'm not sure if those would be useful for multiple copies on the same stream and to the same place.
I get that a system that ran into this wouldn't be a well designed one - asking as a point of knowledge.
This isn't something I have encountered in code yet; I'm looking for any pointers to documentation on how this behavior is supposed to work. I've already looked at the API page for the copy function and its async behavior and didn't see anything I could recognize as relevant.

Does the host MemCpyAsync block until the previous copy is complete?
No, generally not, assuming by "block" you mean block the CPU thread. Asynchronous work items issued to the GPU go into a queue, and control is immediately returned to the CPU thread, before the work has begun. The queue can hold "many" items. Issuance of work can proceed until the queue is full without any other hindrances or dependencies.
It's important to keep in mind one of the two rules of stream semantics:
Items issued into the same stream execute in issue order. Item B, issued after item A, will not begin until A has finished.
So let's say we had a case like this (and assume h_ibuff and h_obuff point to pinned host memory):
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
}
on the second pass of the loop, the cudaMemcpyAsync operations will be inserted into a queue, but will not begin to execute (or do anything, really) until stream semantics say they can begin. This is really true for each and every op issued by this loop.
A reasonable question might be "what if on each pass of the loop, I wanted different contents in h_ibuff ?" (quite sensible). Then you would need to address that specifically. Inserting a simple memcpy operation for example, by itself, to "reload" h_ibuff isn't going to work. You'd need some sort of synchronization. For example you might decide that you wanted to "refill" h_ibuff while the kernel and subsequent cudaMemcpyAsync D->H operation are happening. You could do something like this:
cudaStream_t stream;
cudaEvent_t event;
cudaEventCreate(&event);
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(event, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
  cudaEventSynchronize(event);
  memcpy(h_ibuff, databuff+i*chunksize, chunksize); // "refill" h_ibuff for the next iteration
}
This refactoring would allow the asynchronous work to be issued to keep the GPU busy, and "overlap" the copy to "refill" the h_ibuff. It will also prevent the "refill" operation from beginning until the previous buffer contents are safely transferred to the device, and will also prevent the next buffer copy from beginning until the new contents are "reloaded".
This isn't the only way to do this; it's one possible approach.
For this last question asked/answered above, you might ask a similar question: "how about handling the output buffer side?" The mechanism could be very similar, left to the reader.
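Here is one hedged sketch of the output side (not part of the original answer; resultbuff is an illustrative destination, and the input-side "refill" from above is omitted for clarity). The idea mirrors the input pattern: record a second event after the D->H copy, then drain the previous frame's h_obuff on the host while the next frame's H->D copy and kernel are running:
cudaEvent_t out_event;
cudaEventCreate(&out_event);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  kernel<<<...,stream>>>(...);
  if (i > 0){
    cudaEventSynchronize(out_event); // wait until frame i-1's results have landed in h_obuff
    memcpy(resultbuff + (i-1)*chunksize, h_obuff, chunksize); // "drain" the previous results
  }
  cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(out_event, stream); // marks "frame i's results are in h_obuff"
}
cudaEventSynchronize(out_event);
memcpy(resultbuff + (frames-1)*chunksize, h_obuff, chunksize); // drain the last frame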
For structured learning on this topic, you might wish to study the CUDA concurrency section of this lecture series.

Related

several cuda_stream for one cuda kernel

I have a simple question. I am allocating n blocks of memory, each associated with a unique cuda_stream, like this (simplified) [it may be a very bad idea -_-]:
void *ptr = NULL;
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocManaged(&ptr, size);
cudaStreamAttachMemAsync(stream, ptr);
Later in my code I am calling my kernel with 6 of these blocks of memory (chosen by a random process). The CUDA launch syntax takes only one stream argument:
update_gpu<<<256, 256,0,???>>>(block1,block2,block3,block4,block5,block6);
??? should be a stream, but which one should I pass? I could synchronize with
cudaDeviceSynchronize()
but that may be too much, as I have a lot of blocks.
cudaStreamSynchronize(...)
looks like a solution; should I do it for five of my streams?
Any suggestions?
Attaching memory to a specific stream is an optimization that tells the runtime that this memory does not need to be visible to any stream other than the specified one.
If in doubt, just don't attach the memory to any stream, and it will be visible to all kernels.
This is particularly the right approach if you don't use streams at all (which means all kernels are launched to the default stream).
If however you want to take advantage of this optimization, by the time your kernel runs all managed memory must be either not attached at all, or attached to the specific stream that the kernel is launched to.
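As a hedged sketch of the two options above (illustrative only, not the asker's actual code; update_gpu and size come from the question, and the assumption that the blocks are int arrays is mine):
cudaStream_t s;
cudaStreamCreate(&s);

int *blocks[6];
for (int i = 0; i < 6; i++){
  cudaMallocManaged(&blocks[i], size);
  // Option A: omit the next line entirely -> the memory stays visible to all streams.
  // Option B: restrict visibility to stream 's', the stream the kernel will be launched to.
  cudaStreamAttachMemAsync(s, blocks[i], 0, cudaMemAttachSingle);
}
cudaStreamSynchronize(s); // the attachment takes effect in stream order

update_gpu<<<256, 256, 0, s>>>(blocks[0], blocks[1], blocks[2],
                               blocks[3], blocks[4], blocks[5]);
cudaStreamSynchronize(s); // waits for this stream only, rather than the whole device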

Eliminate cudaMemcpy between kernel calls

I've got a CUDA kernel that is called many times (1 million is not the limit). Whether we launch the kernel again or not depends on a flag (result_found) that our kernel returns.
for (int i = 0; i < 1000000 /* for example */; ++i) {
    kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
    cudaMemcpy(&result_found, dev_result_found, sizeof(bool), cudaMemcpyDeviceToHost);
    if (result_found) {
        break;
    }
}
The profiler says that cudaMemcpy takes much more time to execute, than actual kernel call (cudaMemcpy: ~88us, cudaLaunch: ~17us).
So, the questions are:
1) Is there any way to avoid calling cudaMemcpy here?
2) Why is it so slow after all? Passing parameters to the kernel (cudaSetupArgument) seems very fast (~0.8 us), while getting the result back is slow. If I remove cudaMemcpy, my program finishes a lot faster, so I think that it's not because of synchronization issues.
1) Is there any way to avoid calling cudaMemcpy here?
Yes. This is a case where dynamic parallelism may help. If your device supports it you can move the entire loop over i on to the GPU and launch further kernels from the GPU. The launching thread can then directly read dev_result_found and return if it has finished. This completely removes cudaMemcpy.
An alternative would be to greatly reduce the number of cudaMemcpy calls. At the start of each kernel launch check against dev_result_found. If it is true, return. This way you only need to perform the memcpy every x iterations. While you will launch more kernels than you need to, these will be very cheap as the excess will return immediately.
I suspect a combination of the two methods will give best performance.
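A rough sketch of the second approach (the kernel body and check_every are assumptions, not the asker's code): the kernel exits immediately if the flag is already set, so the excess launches are cheap, and the host copies the flag back only every check_every iterations.
__global__ void kernel(/* ... , */ bool *dev_result_found)
{
    if (*dev_result_found) return;  // excess launches return immediately
    // ... real work; set *dev_result_found = true when the result is found ...
}

const int check_every = 1000;       // assumption: tune to taste
bool result_found = false;
for (int i = 0; i < 1000000 && !result_found; ++i) {
    kernel<<<blocks, threads>>>(/* ... , */ dev_result_found);
    if (i % check_every == check_every - 1) {
        cudaMemcpy(&result_found, dev_result_found, sizeof(bool), cudaMemcpyDeviceToHost);
    }
}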
2) Why is it so slow after all?
Hard to say. I'd suggest your numbers may be a bit suspicious - I guess you're using the API trace from the profiler. This measures time as seen by the CPU, so if you launch an asynchronous call (kernel launch) followed by a synchronous call (cudaMemcpy), the cost of synchronisation will be measured with the memcpy.
Still, if your kernel is relatively quick-running the overhead of the copy may be significant. You are also unable to hide any launch overheads, as you cannot schedule the next launch asynchronously.

Is sort_by_key in thrust a blocking call?

I repeatedly enqueue a sequence of kernels:
for 1..100:
    for 1..10000:
        // Enqueue GPU kernels
        Kernel 1 - update each element of array
        Kernel 2 - sort array
        Kernel 3 - operate on array
    end
    // run some CPU code
    output "Waiting for GPU to finish"
    // copy from device to host
    cudaMemcpy ... D2H(array)
end
Kernel 3 is of order O(N^2) so is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:
thrust::device_ptr<unsigned int> key(dKey);
thrust::device_ptr<unsigned int> value(dValue);
thrust::sort_by_key(key,key+N,value);
It seems that this call to thrust is blocking, as the CPU code only gets executed once the inner loop has finished. I see this because if I remove the call to sort_by_key, the host code (correctly) outputs the "Waiting" string before the inner loop finishes, while it does not if I run the sort.
Is there a way to call thrust::sort_by_key asynchronously?
1. First of all, consider that there is a kernel launch queue, which can hold only so many pending launches. Once the launch queue is full, additional kernel launches, of any kind, are blocking. The host thread will not proceed (beyond those launch requests) until empty queue slots become available. I'm pretty sure 10000 iterations of 3 kernel launches will fill this queue before it has reached 10000 iterations. So there will be some latency (I think) with any sort of non-trivial kernel launches if you are launching 30000 of them in sequence. (Eventually, however, once enough kernels have completed that the remaining launches all fit in the queue, you would see the "Waiting..." message before all kernels have actually completed, if there were no other blocking behavior.)
2. thrust::sort_by_key requires temporary storage (of a size approximately equal to your data set size). This temporary storage is allocated, each time you use it, via a cudaMalloc operation, under the hood. This cudaMalloc operation is blocking. When cudaMalloc is launched from a host thread, it waits for a gap in kernel activity before it can proceed.
To work around item 2, it seems there might be at least 2 possible approaches:
1. Provide a thrust custom allocator. Depending on the characteristics of this allocator, you might be able to eliminate the blocking cudaMalloc behavior. (but see discussion below)
2. Use cub SortPairs. The advantage here (as I see it - your example is incomplete) is that you can do the allocation once (assuming you know the worst-case temp storage size throughout the loop iterations) and eliminate the need to do a temporary memory allocation within your loop.
The thrust method (1, above) as far as I know, will still effectively do some kind of temporary allocation/free step at each iteration, even if you supply a custom allocator. If you have a well-designed custom allocator, it might be that this is almost a "no-op" however. The cub method appears to have the drawback of needing to know the max size (in order to completely eliminate the need for an allocation/free step), but I contend the same requirement would be in place for a thrust custom allocator. Otherwise, if you needed to allocate more memory at some point, the custom allocator is effectively going to have to do something like a cudaMalloc, which will throw a wrench in the works.
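For reference, here is a hedged sketch of the cub route (approach 2), assuming unsigned int keys/values as in the question and a known worst-case N. Note that, unlike thrust::sort_by_key, cub's SortPairs sorts out of place, so separate output arrays (illustrative names) are needed:
#include <cub/cub.cuh>

void sort_loop(unsigned int *dKeyIn, unsigned int *dKeyOut,
               unsigned int *dValIn, unsigned int *dValOut,
               int N, int iterations, cudaStream_t stream)
{
    void  *d_temp = NULL;
    size_t temp_bytes = 0;
    // First call with a null temp pointer only computes the required size.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    dKeyIn, dKeyOut, dValIn, dValOut, N);
    cudaMalloc(&d_temp, temp_bytes);    // one allocation, outside the loop

    for (int i = 0; i < iterations; ++i) {
        // ... kernel 1 updates dKeyIn/dValIn ...
        cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                        dKeyIn, dKeyOut, dValIn, dValOut, N,
                                        0, sizeof(unsigned int) * 8, stream);
        // ... kernel 3 operates on the sorted data ...
    }
    cudaFree(d_temp);
}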

About cudaMemcpyAsync Function

I have some questions.
Recently I have been writing a program using CUDA.
In my program, there is one big piece of data on the host, stored in a std::map<string, vector<int>>.
Using this data, some vector<int>s are copied to the GPU's global memory and processed on the GPU.
After processing, some results are generated on the GPU, and these results are copied back to the CPU.
This is my program's schedule:
1. cudaMemcpy( ... , cudaMemcpyHostToDevice)
2. kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
3. cudaMemcpy( ... , cudaMemcpyDeviceToHost)
4. repeat steps 1~3 1000 times (for another data vector)
But I want to reduce processing time.
So I decided to use the cudaMemcpyAsync function in my program.
After searching some documents and web pages, I realized that to use the cudaMemcpyAsync function, the host memory holding the data to be copied to the GPU's global memory must be allocated as pinned memory.
But my program uses std::map, so I couldn't make the std::map data pinned memory.
So instead, I made a buffer array of pinned memory, and this buffer can always handle every case of copying a vector.
Finally, my program worked like this.
1. memcpy (copy data from the std::map to the buffer, using a loop, until all the data is in the buffer)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. kernel (the kernel can only execute once all the data has been copied to GPU global memory)
4. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
5. repeat steps 1~4 1000 times (for another data vector)
And my program became much faster than the previous case.
But my problem (my curiosity) is at this point.
I tried to make another program in a similar way.
1. memcpy (copy data from the std::map to the buffer, for one vector only)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. loop steps 1~2 until all the data has been copied to GPU global memory
4. kernel (the kernel can only execute once the necessary data has been copied to GPU global memory)
5. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
6. repeat steps 1~5 1000 times (for another data vector)
This method came out to be about 10% faster than the method discussed above.
But I don't know why.
I thought cudaMemcpyAsync could only be overlapped with a kernel function.
But in my case, it seems it is not. Rather, it looks like cudaMemcpyAsync calls can be overlapped with each other.
Sorry for my long question but I really want to know why.
Can someone explain to me what exactly "cudaMemcpyAsync" does and which operations can be overlapped with "cudaMemcpyAsync"?
The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of cuda streams to separate various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
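To make that concrete, here is a hedged sketch of the chunked pattern (illustrative names and a toy kernel, not the asker's program): pinned host buffers, one cudaMemcpyAsync/kernel/cudaMemcpyAsync sequence per chunk, alternating between two streams so one chunk's copies can overlap another chunk's kernel.
__global__ void process(const int *in, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] * 2;  // stand-in for the real per-chunk work
}

void pipeline(int nChunks, int chunkElems)
{
    size_t bytes = chunkElems * sizeof(int);
    int *h_in, *h_out, *d_in, *d_out;
    cudaHostAlloc((void **)&h_in,  nChunks * bytes, cudaHostAllocDefault); // pinned host buffers
    cudaHostAlloc((void **)&h_out, nChunks * bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_in,  nChunks * bytes);
    cudaMalloc((void **)&d_out, nChunks * bytes);

    const int nStreams = 2;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Chunk c+1's H->D copy can overlap chunk c's kernel, since they sit in different streams.
    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t st = streams[c % nStreams];
        int off = c * chunkElems;
        cudaMemcpyAsync(d_in + off, h_in + off, bytes, cudaMemcpyHostToDevice, st);
        process<<<(chunkElems + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, bytes, cudaMemcpyDeviceToHost, st);
    }
    for (int s = 0; s < nStreams; ++s) cudaStreamSynchronize(streams[s]);
}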
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
1. It can be issued in any stream (it takes a stream parameter)
2. Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a compute capability 2.0 or greater device (e.g. for concurrent kernels). Also, different devices may have one or two copy engines, meaning simultaneous copy to the device and copy from the device is only possible on certain devices.
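If you want to check what your particular GPU supports, the runtime exposes this through the device properties (a small standalone check, not part of the original answer):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// asyncEngineCount: 1 = copies in one direction can overlap kernels,
// 2 = copies in both directions can overlap kernels simultaneously.
printf("concurrentKernels = %d, asyncEngineCount = %d\n",
       prop.concurrentKernels, prop.asyncEngineCount);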

When to call cudaDeviceSynchronize?

When is calling the cudaDeviceSynchronize function really needed?
As far as I understand from the CUDA documentation, CUDA kernels are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize, except one before the time measurement. I have found that I get the same result but with a speed up between 7-12x (depending on the matrix sizes).
So, the question is if there are any reasons to use cudaDeviceSynchronize apart of time measurement.
For example:
Is it needed before copying data from the GPU back to the host with cudaMemcpy?
If I do matrix multiplications like
C = A * B
D = C * F
should I put cudaDeviceSynchronize between both?
From my experiments it seems that I don't.
Why does cudaDeviceSynchronize slow the program so much?
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behavior) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel start execution, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes
So in your example, there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any).
cudaDeviceSynchronize may cause some slowdown, but 7-12x seems too much. Maybe there is some problem with the time measurement, or maybe the kernels are really fast and the overhead of explicit synchronization is huge relative to the actual computation time.
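As a hedged illustration of the debugging point above (the kernels here are trivial stand-ins, not from the question): synchronizing after each launch attributes an asynchronous error to the kernel that produced it.
#include <cstdio>

__global__ void kernel1(int *d) { d[threadIdx.x] = threadIdx.x; } // stand-in kernel
__global__ void kernel2(int *d) { d[threadIdx.x] *= 2; }          // stand-in kernel

void launch_and_check(int *d_data)
{
    kernel1<<<1, 32>>>(d_data);
    cudaError_t err = cudaDeviceSynchronize();  // an error reported here came from kernel1
    if (err != cudaSuccess) printf("kernel1 failed: %s\n", cudaGetErrorString(err));

    kernel2<<<1, 32>>>(d_data);
    err = cudaDeviceSynchronize();              // an error reported here came from kernel2
    if (err != cudaSuccess) printf("kernel2 failed: %s\n", cudaGetErrorString(err));
}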
One situation where using cudaDeviceSynchronize() is appropriate is when you have several cudaStreams running and you would like to have them exchange some information. A real-life case of this is parallel tempering in quantum Monte Carlo simulations. In this case, we would want to ensure that every stream has finished running some set of instructions and gotten some results before they start passing messages to each other, or we would end up passing garbage information.
The reason using this command slows the program so much is that cudaDeviceSynchronize() forces the program to wait for all previously issued commands in all streams on the device to finish before continuing (from the CUDA C Programming Guide). As you said, kernel execution is normally asynchronous, so while the GPU device is executing your kernel the CPU can continue to work on some other commands, issue more instructions to the device, etc., instead of waiting. However, when you use this synchronization command, the CPU is instead forced to idle until all the GPU work has completed before doing anything else.
This behaviour is useful when debugging, since you may have a segfault occurring at seemingly "random" times because of the asynchronous execution of device code (whether in one stream or many). cudaDeviceSynchronize() will force the program to ensure the stream(s)' kernels/memcpys are complete before continuing, which can make it easier to find out where the illegal accesses are occurring (since the failure will show up during the sync).
When you want your GPU to start processing some data, you typically do a kernel invocation.
When you do so, your device (the GPU) will start doing whatever it is you told it to do. However, unlike in a normal sequential program, your host (the CPU) will continue to execute the next lines of code in your program. cudaDeviceSynchronize makes the host (the CPU) wait until the device (the GPU) has finished executing ALL the threads you have started, and thus your program will continue as if it were a normal sequential program.
In small, simple programs you would typically use cudaDeviceSynchronize when you use the GPU to make computations, to avoid timing mismatches between the CPU requesting the result and the GPU finishing the computation. Using cudaDeviceSynchronize makes it a lot easier to code your program, but there is one major drawback: your CPU is idle the whole time while the GPU does the computation. Therefore, in high-performance computing you often strive to have your CPU doing computations while it waits for the GPU to finish.
You might also need to call cudaDeviceSynchronize() after launching kernels from kernels (Dynamic Parallelism).
From this post CUDA Dynamic Parallelism API and Principles:
If the parent kernel needs results computed by the child kernel to do its own work, it must ensure that the child grid has finished execution before continuing by explicitly synchronizing using cudaDeviceSynchronize(void). This function waits for completion of all grids previously launched by the thread block from which it has been called. Because of nesting, it also ensures that any descendants of grids launched by the thread block have completed.
...
Note that the view of global memory is not consistent when the kernel launch construct is executed. That means that in the following code example, it is not defined whether the child kernel reads and prints the value 1 or 2. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.
__device__ int v = 0;

__global__ void child_k(void) {
  printf("v = %d\n", v);
}

__global__ void parent_k(void) {
  v = 1;
  child_k<<<1, 1>>>();
  v = 2; // RACE CONDITION
  cudaDeviceSynchronize();
}