About cudaMemcpyAsync Function - cuda

I have some questions.
I have recently been writing a program using CUDA.
In my program, there is one large data set on the host, stored as a std::map<string, vector<int>>.
Some of these vector<int> values are copied to the GPU's global memory and processed on the GPU.
After processing, results are generated on the GPU and copied back to the CPU.
This is the overall flow of my program:
1. cudaMemcpy( ... , cudaMemcpyHostToDevice)
2. kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
3. cudaMemcpy( ... , cudaMemcpyDeviceToHost)
4. repeat steps 1~3 1000 times (for other data (vectors))
But I want to reduce the processing time.
So I decided to use the cudaMemcpyAsync function in my program.
After searching some documents and web pages, I realized that to use cudaMemcpyAsync, the host memory holding the data to be copied to GPU global memory must be allocated as pinned memory.
But my program uses std::map, so I couldn't make the std::map data itself pinned memory.
So instead, I made a buffer array in pinned memory, and this buffer is large enough to handle every case of copying a vector.
Finally, my program worked like this:
1. Memcpy (copy data from std::map to the buffer in a loop until the whole data is copied to the buffer)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. kernel (the kernel can only be executed once the whole data has been copied to GPU global memory)
4. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
5. repeat steps 1~4 1000 times (for other data (vectors))
And my program became much faster than in the previous case.
But my problem (or rather, my curiosity) starts at this point.
I tried to write another program in a similar way:
1. Memcpy (copy data from std::map to the buffer for one vector only)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. loop over steps 1~2 until the whole data is copied to GPU global memory
4. kernel (the kernel can only be executed once the necessary data has been copied to GPU global memory)
5. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
6. repeat steps 1~5 1000 times (for other data (vectors))
This method turned out to be about 10% faster than the method described above.
But I don't know why.
I thought cudaMemcpyAsync could only be overlapped with kernel execution.
But in my case that does not seem to be what is happening. Rather, it looks as if the cudaMemcpyAsync calls are overlapping with each other.
Sorry for the long question, but I really want to know why.
Can someone explain to me exactly what cudaMemcpyAsync does, and which operations can be overlapped with cudaMemcpyAsync?

The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of cuda streams to separate various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
1. It can be issued in any stream (it takes a stream parameter)
2. Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a compute capability 2.0 or greater device (e.g. concurrent kernels). Also, different devices may have one or 2 copy engines, meaning simultaneous copy to the device and copy from the device is only possible on certain devices.
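The three requirements above (pinned host buffer, streams, cudaMemcpyAsync) can be put together in a minimal sketch. The kernel scaleChunk, the function name runOverlapDemo, and the chunk and stream counts are assumptions made up for illustration, not taken from the question. How much overlap you actually see depends on the device and its number of copy engines.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element of one chunk.
__global__ void scaleChunk(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Returns true if every element came back doubled.
bool runOverlapDemo() {
    const int kStreams = 4;                 // assumed number of chunks/streams
    const int kChunk = 1 << 20;             // assumed elements per chunk
    const int kTotal = kStreams * kChunk;
    const size_t chunkBytes = kChunk * sizeof(int);

    int *hBuf = nullptr, *dBuf = nullptr;
    // Pinned host buffer, needed for the async copies to overlap with other work.
    if (cudaHostAlloc((void **)&hBuf, kStreams * chunkBytes,
                      cudaHostAllocDefault) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dBuf, kStreams * chunkBytes) != cudaSuccess) {
        cudaFreeHost(hBuf);
        return false;
    }
    for (int i = 0; i < kTotal; ++i) hBuf[i] = i;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk gets its own stream. Within a stream the three operations run
    // in order; across streams the copy of one chunk may overlap the kernel
    // working on another chunk.
    for (int s = 0; s < kStreams; ++s) {
        int off = s * kChunk;
        cudaMemcpyAsync(dBuf + off, hBuf + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scaleChunk<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(dBuf + off, kChunk);
        cudaMemcpyAsync(hBuf + off, dBuf + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // All calls above return immediately; wait here before reading results.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < kTotal; ++i) ok = (hBuf[i] == 2 * i);

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return ok;
}
```

Calling runOverlapDemo() from main and viewing the run in a profiler timeline is one way to see whether the copies and kernels of different chunks overlap on a given GPU.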

Related

Is sort_by_key in thrust a blocking call?

I repeatedly enqueue a sequence of kernels:
for 1..100:
    for 1..10000:
        // Enqueue GPU kernels
        Kernel 1 - update each element of array
        Kernel 2 - sort array
        Kernel 3 - operate on array
    end
    // run some CPU code
    output "Waiting for GPU to finish"
    // copy from device to host
    cudaMemcpy ... D2H(array)
end
Kernel 3 is of order O(N^2) so is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:
thrust::device_ptr<unsigned int> key(dKey);
thrust::device_ptr<unsigned int> value(dValue);
thrust::sort_by_key(key,key+N,value);
It seems that this call to thrust is blocking, as the CPU code only gets executed once the inner loop has finished. I see this because if I remove the call to sort_by_key, the host code (correctly) outputs the "Waiting" string before the inner loop finishes, while it does not if I run the sort.
Is there a way to call thrust::sort_by_key asynchronously?
1. First of all, consider that there is a kernel launch queue, which can hold only so many pending launches. Once the launch queue is full, additional kernel launches, of any kind, are blocking. The host thread will not proceed (beyond those launch requests) until empty queue slots become available. I'm pretty sure 10000 iterations of 3 kernel launches will fill this queue before it has reached 10000 iterations. So there will be some latency (I think) with any sort of non-trivial kernel launches if you are launching 30000 of them in sequence. (Eventually, however, when all kernels are added to the queue because some have already completed, then you would see the "waiting..." message, before all kernels have actually completed, if there were no other blocking behavior.)
2. thrust::sort_by_key requires temporary storage (of a size approximately equal to your data set size). This temporary storage is allocated, each time you use it, via a cudaMalloc operation, under the hood. This cudaMalloc operation is blocking. When cudaMalloc is launched from a host thread, it waits for a gap in kernel activity before it can proceed.
To work around item 2, it seems there might be at least 2 possible approaches:
1. Provide a thrust custom allocator. Depending on the characteristics of this allocator, you might be able to eliminate the blocking cudaMalloc behavior. (But see discussion below.)
2. Use cub SortPairs. The advantage here (as I see it - your example is incomplete) is that you can do the allocation once (assuming you know the worst-case temp storage size throughout the loop iterations) and eliminate the need to do a temporary memory allocation within your loop.
The thrust method (1, above) as far as I know, will still effectively do some kind of temporary allocation/free step at each iteration, even if you supply a custom allocator. If you have a well-designed custom allocator, it might be that this is almost a "no-op" however. The cub method appears to have the drawback of needing to know the max size (in order to completely eliminate the need for an allocation/free step), but I contend the same requirement would be in place for a thrust custom allocator. Otherwise, if you needed to allocate more memory at some point, the custom allocator is effectively going to have to do something like a cudaMalloc, which will throw a wrench in the works.
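The cub approach can be sketched as follows. This is a minimal example under assumptions: the function name runCubSortDemo, the tiny 8-element data set and the 3-iteration loop are made up for illustration, and the input size is fixed so that one size query covers every iteration. It uses cub::DeviceRadixSort::SortPairs, which is called once with a null temp-storage pointer to query the required size and then reused with the same buffer inside the loop.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Returns true if the pairs come back sorted by key.
bool runCubSortDemo() {
    const int n = 8;  // assumed (tiny) problem size, fixed across iterations
    unsigned int hKeys[n] = {5, 3, 7, 1, 6, 2, 8, 4};
    unsigned int hVals[n] = {50, 30, 70, 10, 60, 20, 80, 40};

    unsigned int *dKeyIn, *dKeyOut, *dValIn, *dValOut;
    cudaMalloc((void **)&dKeyIn, n * sizeof(unsigned int));
    cudaMalloc((void **)&dKeyOut, n * sizeof(unsigned int));
    cudaMalloc((void **)&dValIn, n * sizeof(unsigned int));
    cudaMalloc((void **)&dValOut, n * sizeof(unsigned int));
    cudaMemcpy(dKeyIn, hKeys, sizeof(hKeys), cudaMemcpyHostToDevice);
    cudaMemcpy(dValIn, hVals, sizeof(hVals), cudaMemcpyHostToDevice);

    // First call with a null temp pointer only reports the required size.
    void *dTemp = nullptr;
    size_t tempBytes = 0;
    cub::DeviceRadixSort::SortPairs(dTemp, tempBytes, dKeyIn, dKeyOut,
                                    dValIn, dValOut, n);
    // Allocate the temporary storage once, outside the loop.
    cudaMalloc(&dTemp, tempBytes);

    // Inside the loop no allocation happens, so nothing here should force
    // the host to wait for a gap in kernel activity.
    for (int iter = 0; iter < 3; ++iter) {
        cub::DeviceRadixSort::SortPairs(dTemp, tempBytes, dKeyIn, dKeyOut,
                                        dValIn, dValOut, n);
    }

    unsigned int outKeys[n], outVals[n];
    cudaMemcpy(outKeys, dKeyOut, sizeof(outKeys), cudaMemcpyDeviceToHost);
    cudaMemcpy(outVals, dValOut, sizeof(outVals), cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        // Keys were 1..8 and each value was 10 * key.
        ok = (outKeys[i] == (unsigned int)(i + 1)) && (outVals[i] == 10u * outKeys[i]);
    }

    cudaFree(dTemp);
    cudaFree(dKeyIn); cudaFree(dKeyOut);
    cudaFree(dValIn); cudaFree(dValOut);
    return ok;
}
```

Note that this variant sorts out of place, so it needs separate output buffers for keys and values in addition to the temporary storage.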

Running several streams (instead of threads/blocks) in parallel

I have a kernel which I want to start with the configuration "1 block x 32 threads". To increase parallelism I want to start several streams instead of running a bigger "work package" than "1 block x 32 threads". I want to use the GPU in a program where data comes from the network. I don't want to wait until a bigger "work package" is available.
The code is like:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
The real code is much more complex but I want to keep it simple (15 CPU threads use the GPU).
The code works, but the streams don't run concurrently as expected. The GTX 480 has 15 SMs, where each SM has 32 shader processors. I expected that if I start the kernel 15 times, all 15 streams would run in parallel, but this is not the case. I have used the Nvidia Visual Profiler, and at most 5 streams run in parallel. Often only one stream runs. The performance is really bad.
I get the best results with a "64 block x 1024 threads" configuration. If I use instead a "32 block x 1024 threads" configuration but two streams the streams are executed one after each other and performance drops. I am using Cuda Toolkit 5.5 and Ubuntu 12.04.
Can somebody explain why this is the case and give me some background information? Should it work better on newer GPUs? What is the best way to use the GPU in time-critical applications where you don't want to buffer data? Probably this is not possible, but I am searching for techniques which bring me closer to a solution.
News:
I did some further research. The problem is the last cudaMemcpyAsync(..) (GPU->Host copy) call. If I remove it, all streams run concurrently. I think the problem is illustrated in http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf on slide 21. They say that on Fermi there are two copy queues, but this is only true for Tesla and Quadro cards, right? I think the problem is that the GTX 480 has only one copy queue and all copy commands (host->GPU AND GPU->host) are put in this one queue. Everything is non-blocking, and the GPU->host memcopy of the first thread blocks the host->GPU memcopy calls of other threads.
Here are some observations:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- sleep(10)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- cudaStreamSynchronize(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> doesn't work!!! Maybe cudaStreamSynchronize is put in the copy-queue?
Does someone know a solution for this problem? Something like a blocking kernel call would be cool. The last cudaMemcpyAsync() (GPU->host) should be called only once the kernel has finished.
Edit2:
Here is an example to clarify my problem:
To keep it simple we have 2 streams:
Stream1:
------------
HostToGPU1
kernel1
GPUToHost1
Stream2:
------------
HostToGPU2
kernel2
GPUToHost2
The first stream is started. HostToGPU1 is executed, kernel1 is launched and GPUToHost1 is called. GPUToHost1 blocks because kernel1 is running. In the meantime Stream2 is started. HostToGPU2 is called, and CUDA puts it in the queue, but it can't be executed because GPUToHost1 blocks until kernel1 has finished. There are no data transfers at that moment. CUDA just waits for GPUToHost1. So my idea was to call GPUToHost1 when kernel1 is finished. This seems to be the reason why it works with sleep(..): GPUToHost1 is called when the kernel has already finished. A kernel launch which automatically blocks the CPU thread would be cool.
GPUToHost1 is not blocking in the queue (if there are no other data transfers at the time but in my case, data transfer are not time-consuming).
Concurrent kernel execution can be most easily witnessed on linux.
For a good example and an easy test, refer to the concurrent kernels sample.
Good concurrency among kernels generally requires several things:
a device which supports concurrent kernels, so a cc 2.0 or newer device
kernels that are small enough in terms of number of blocks and other resource usage (registers, shared memory) so that multiple kernels can actually execute. Kernels with larger resource requirements will typically be observed to be running serially. This is expected behavior.
proper usage of streams to enable concurrency
In addition, concurrent kernels often implies copy/compute overlap. In order for copy/compute overlap to work, you must:
be using a GPU with enough copy engines. Some GPUs have one engine, some have 2. If your GPU has one engine, you can overlap one copy operation (i.e. one direction) with kernel execution. If you have 2 copy engines (your GeForce GPU has 1) you can overlap both directions of copying with kernel execution.
use pinned (host) memory for any data that will be copied to or from the GPU global memory, that will be the target (to or from) for any of the copy operations you intend to overlap
use streams properly and the necessary async versions of the relevant API calls (e.g. cudaMemcpyAsync)
Regarding your observation that the smaller 32x1024 kernels do not execute concurrently, this is likely a resource issue (blocks, registers, shared memory) preventing much overlap. If you have enough blocks in the first kernel to occupy the GPU execution resources, it's not sensible to expect additional kernels to begin executing until the first kernel is finished or mostly finished.
EDIT: Responding to question edits and additional comments below.
Yes, GTX480 has only one copy "queue" (I mentioned this explicitly in my answer, but I called it a copy "engine"). You will only be able to get one cudaMemcpy... operation to run at any given time, and therefore only one direction (H2D or D2H) can actually be moving data at any given time, and you will only see one cudaMemcpy... operation overlap with any given kernel. And cudaStreamSynchronize causes the host thread to wait until ALL CUDA operations previously issued to that stream are completed.
Note that the cudaStreamSynchronize you have in your last example should not be necessary, I don't think. Streams have 2 execution characteristics:
1. cuda operations (API calls, kernel calls, everything) issued to the same stream will always execute sequentially, regardless of your use of the Async API or any other considerations.
2. cuda operations issued to separate streams, assuming all the necessary requirements have been met, will execute asynchronously to each other.
Due to item 1, in your last case, your final "copy Data GPU->Host" operation will not begin until the previous kernel call issued to that stream is complete, even without the cudaStreamSynchronize call. So I think you can get rid of that call, i.e. the 2nd case you have listed should be no different than the final case, and in the 2nd case you should not need the sleep operation either. The cudaMemcpy... issued to the same stream will not begin until all previous cuda activity in that stream is finished. This is a characteristic of streams.
EDIT2: I'm not sure we're making any progress here. The issue you pointed out in the GTC preso here (slide 21) is a valid issue, but you can't work around it by inserting additional synchronization operations, nor would a "blocking kernel" help you with that, nor is it a function of having one copy engine or 2. If you want to issue kernels in separate streams but issued in sequence with no other intervening cuda operations, then that hazard exists. The solution for this, as pointed out on the next slide, is to not issue the kernels sequentially, which is roughly comparable to your 2nd case. I'll state this again:
you have identified that your case 2 gives good concurrency
the sleep operation in that case is not needed for data integrity
If you want to provide a short sample code that demonstrates the issue, perhaps other discoveries can be made.

Concurrent: Short copy, Long kernel

When running concurrent copy & kernel operations:
If I have a kernel runTime that is twice as long as a dataCopy operation, will I get 2 copies per kernel run?
The stream examples I'm seeing show a 1:1 relationship. (Time of copy = time of kernel run.) I'm wondering what happens when there is something different. Is there always one copy operation (max) for every kernel launch? Or does the copy operation run independent of the kernel launch? i.e. I could possibly complete 5 copy operations for every kernel launch, if the run & copy time work out that way.
(I'm trying to figure out how many copy operations to queue up before a kernel launch.)
One to one: (time to copy = kernel run time)
<--stream1Copy--><--stream2Copy-->
..............................<-stream1Kernel->
Two to one: (time to copy = 1/2 kernel run time)
<-stream1Copy-><-stream2Copy-><-stream3Copy->
............................<----------stream1Kernel------------>
You can have more than one copy per kernel launch. Only one copy (per direction on devices with dual copy engines) can be running at a particular time to a particular GPU, but once that one is complete, another can be started immediately. Asynchronous copies issued in streams other than the kernel launch stream in question will run completely asynchronously to that kernel launch, assuming neither stream is stream 0. (This also assumes you are using pinned memory, i.e. cudaHostAlloc, to create the relevant host-side buffers.)
You may want to read the relevant section in the best practices guide.
The reason you frequently see a 1:1 analysis of compute and copy is that it is assumed the copied data will be consumed by (or is produced by) the kernel call, and so logically we can think of the block of data this way. But if it's easier to structure your code as a sequence of copies, there should be no problem with that. Naturally, if you can batch up all your data into a single cudaMemcpy call, that will be slightly more efficient than a sequence of copies that are transferring the same data.
The visual profiler will help you see exactly what is going on comparing data copy operations to kernel operations, in a timeline fashion.
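As a minimal sketch of the "several copies per kernel" case: one long-running kernel is launched in one stream, and three copies are queued back to back in a second stream. The kernel busyKernel, the function name runMultiCopyDemo, and all sizes and repetition counts are assumptions chosen only for illustration. Whether the copies really run while the kernel is executing depends on the device, and the profiler timeline is the way to confirm it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical long-running kernel: spins on each element for `reps` rounds.
__global__ void busyKernel(float *x, int n, int reps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int r = 0; r < reps; ++r) v = v * 1.0001f + 0.5f;
        x[i] = v;
    }
}

// Returns true if all queued copies arrived intact on the device.
bool runMultiCopyDemo() {
    const int kCopies = 3;        // assumed: three copies per kernel launch
    const int kChunk = 1 << 20;   // assumed elements per copy
    const int kWork = 1 << 16;    // assumed kernel problem size
    const size_t chunkBytes = kChunk * sizeof(int);

    int *hSrc = nullptr, *dDst = nullptr;
    float *dWork = nullptr;
    if (cudaHostAlloc((void **)&hSrc, kCopies * chunkBytes,
                      cudaHostAllocDefault) != cudaSuccess) return false;
    cudaMalloc((void **)&dDst, kCopies * chunkBytes);
    cudaMalloc((void **)&dWork, kWork * sizeof(float));
    cudaMemset(dWork, 0, kWork * sizeof(float));
    for (int i = 0; i < kCopies * kChunk; ++i) hSrc[i] = i;

    // Neither stream is stream 0, so work in one does not serialize the other.
    cudaStream_t kernelStream, copyStream;
    cudaStreamCreate(&kernelStream);
    cudaStreamCreate(&copyStream);

    busyKernel<<<(kWork + 255) / 256, 256, 0, kernelStream>>>(dWork, kWork, 10000);

    // The copies run one after another in copyStream, independent of the kernel.
    for (int c = 0; c < kCopies; ++c) {
        cudaMemcpyAsync(dDst + c * kChunk, hSrc + c * kChunk, chunkBytes,
                        cudaMemcpyHostToDevice, copyStream);
    }

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Read the device buffer back and compare against the source data.
    std::vector<int> check(kCopies * kChunk);
    cudaMemcpy(check.data(), dDst, kCopies * chunkBytes, cudaMemcpyDeviceToHost);
    for (int i = 0; ok && i < kCopies * kChunk; ++i) ok = (check[i] == i);

    cudaStreamDestroy(kernelStream);
    cudaStreamDestroy(copyStream);
    cudaFree(dDst);
    cudaFree(dWork);
    cudaFreeHost(hSrc);
    return ok;
}
```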

Accessing cuda device memory when the cuda kernel is running

I have allocated memory on device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from host before the kernel finishes its execution?
The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize a kernel's execution with a stream, that code would be subject to a race condition. i.e. the copy engine might pull from memory that hasn't yet been written by the kernel.
The person who alluded to mapped pinned memory is onto something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams, and queue a kernel execution and a device-to-host cudaMemcpy in each stream. For example, put the kernel that fills the location "0" and cudaMemcpy from that location back to host into stream 0, kernel that fills the location "1" and cudaMemcpy from "1" into stream 1, etc. What will happen then is that the GPU will overlap copying from "0" and executing "1".
Check CUDA documentation, it's documented somewhere (in the best practices guide, I think).
You can't access GPU memory directly from the host, regardless of whether a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel is finished writing to it, then the answer depends on the compute capability of your device. But all but the very oldest chips can perform data transfers while the kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel though. You would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device. You can copy one of the buffers while the GPU is working on the other.
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
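For reference, a minimal sketch of the mapped (zero-copy) memory setup mentioned above. The kernel writeSquares and the function name runZeroCopyDemo are hypothetical names made up for illustration. To stay free of races, this sketch waits for the kernel to finish before reading the buffer on the host; it does not attempt the "scan while the kernel is running" variant.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: writes its results straight into mapped host memory.
__global__ void writeSquares(int *mapped, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mapped[i] = i * i;
}

// Returns true if the host sees the values the kernel wrote.
bool runZeroCopyDemo() {
    const int n = 1024;  // assumed problem size

    // Ask for mapped pinned memory support. The return value is ignored here
    // because some CUDA versions reject this call once a context is active.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *hPtr = nullptr, *dPtr = nullptr;
    // Page-locked host allocation that is also mapped into the device address space.
    if (cudaHostAlloc((void **)&hPtr, n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return false;
    // Device-side alias of the same memory.
    if (cudaHostGetDevicePointer((void **)&dPtr, hPtr, 0) != cudaSuccess) {
        cudaFreeHost(hPtr);
        return false;
    }

    writeSquares<<<(n + 255) / 256, 256>>>(dPtr, n);

    // No cudaMemcpy is needed: once the kernel is done, the data is in hPtr.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (hPtr[i] == i * i);

    cudaFreeHost(hPtr);
    return ok;
}
```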
When you launch the Kernel it is an asynchronous (non blocking) call. Calling cudaMemcpy next will block until the Kernel has finished.
If you want to have the result for Debug purposes maybe it is possible for you to use cudaDebugging where you can step through the kernel and inspect the memory.
For small result checks you could also use printf() in the Kernel code.
Or run only a threadblock of size (1,1) if you are interested in that specific result.

Copying an integer from GPU to CPU

I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?
Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do this? Would copying just 1 integer after every kernel launch slow down my program?
Let me first answer your last question:
Would copying just 1 integer after every kernel launch slow down my program?
A bit - yes. Issuing the command, waiting for the GPU to respond, etc. The amount of data (1 int vs 100 ints) probably doesn't really matter in this case. However, you can still achieve speeds of thousands of memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would probably be better to do the whole task on a CPU).
what is the best way to do this?
Well, I would suggest simply trying it yourself. As you said: you can either use mapped-pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first one might be better if your kernels still have some work to do after sending the integer back. In that case, the latency of sending it to host could be hidden by the execution of the kernel.
If you use the first method, you will have to call cudaThreadSynchronize() (deprecated in later CUDA versions in favor of cudaDeviceSynchronize()) to make sure the kernel has finished executing. Kernel calls are asynchronous.
You can use cudaMemcpyAsync, which is also asynchronous, but the GPU cannot run a kernel and execute a cudaMemcpyAsync in parallel unless you use streams.
I never actually tried that, but if your program won't crash if the loop executes too many times, you might try to ignore synchronisation and let it iterate until the special value is seen in RAM. In that solution, the memory transfer might be completely hidden and you would pay an overhead only at the end. You will need however to somehow prevent the loop from iterating too many times, CUDA events may be helpful.
Why not use pinned memory? If your system supports it -- see CUDA C Programming Guide's section on pinned memory.
Copying data to and from the GPU will be much slower than accessing the data from the CPU. If you are not running a significant number of threads for this value then this will result in very slow performance, don't do it.
What you are describing sounds like a serial algorithm, your algorithm needs to be parallelised in order to make it worth doing using CUDA. If you can't rewrite your algorithm to become a single write of multiple data to the GPU, multiple threads, single write of multiple data back to CPU; then your algorithm should be done on CPU.
If you need the value computed in the previous kernel call in order to launch the next one, then the work is serialized and your choice is cudaMemcpy(dst, src, size = 1, ...);
If all the kernel launch parameters do not depend on the previous launch then you can store all the result of each kernel invocation in GPU memory and then download all the results at once.
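The second option (keep each launch's result on the GPU and download everything at once) can be sketched like this. The kernel stepKernel, the function name runBatchedResultsDemo, the iteration count, and the dummy result value are assumptions made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one thread records this launch's result in slot `iter`.
__global__ void stepKernel(int iter, int *results) {
    if (blockIdx.x == 0 && threadIdx.x == 0) results[iter] = iter * 3;
}

// Returns true if all per-launch results arrive in a single download.
bool runBatchedResultsDemo() {
    const int kIters = 100;  // assumed number of kernel launches

    int *dResults = nullptr;
    if (cudaMalloc((void **)&dResults, kIters * sizeof(int)) != cudaSuccess) return false;

    // The launches are queued back to back; no copy or sync in between.
    for (int iter = 0; iter < kIters; ++iter) {
        stepKernel<<<1, 32>>>(iter, dResults);
    }

    // A single blocking copy at the end. It runs in the same (default) stream,
    // so it starts only after all the kernels above have finished.
    std::vector<int> hResults(kIters);
    bool ok = (cudaMemcpy(hResults.data(), dResults, kIters * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess);
    for (int i = 0; ok && i < kIters; ++i) ok = (hResults[i] == i * 3);

    cudaFree(dResults);
    return ok;
}
```

This pays the transfer latency once instead of once per launch, at the cost of not seeing any result until the whole loop has been queued and executed.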