Concurrent: Short copy, Long kernel - cuda

When running concurrent copy & kernel operations:
If I have a kernel runTime that is twice as long as a dataCopy operation, will I get 2 copies per kernel run?
The stream examples I'm seeing show a 1:1 relationship. (Time of copy = time of kernel run.) I'm wondering what happens when there is something different. Is there always one copy operation (max) for every kernel launch? Or does the copy operation run independent of the kernel launch? i.e. I could possibly complete 5 copy operations for every kernel launch, if the run & copy time work out that way.
(I'm trying to figure out how many copy operations to queue up before a kernel launch.)
One to one: (time to copy = kernel run time)
<--stream1Copy--><--stream2Copy-->
..............................<-stream1Kernel->
Two to one: (time to copy = 1/2 kernel run time)
<-stream1Copy-><-stream2Copy-><-stream3Copy->
............................<----------stream1Kernel------------>

You can have more than one copy per kernel launch. Only one copy (per direction on devices with dual copy engines) can be running at a particular time to a particular GPU, but once that one is complete, another can be started immediately. Asynchronous copies issued in streams other than the kernel launch stream in question will run completely asynchronously to that kernel launch, assuming niether stream is stream 0. (This also assumes you are using pinned memory i.e. cudaHostAlloc to create the relevant host-side buffers.)
You may want to read the relevant section in the best practices guide.
The reason you frequently see a 1:1 analysis of compute and copy is that it is assumed the copied data will be consumed by (or is produced by) the kernel call, and so logically we can think of the block of data this way. But if it's easier to structure your code as a sequence of copies, there should be no problem with that. Naturally if you can batch up all your data into a single cudaMemcpy call, that will be slightly more efficient that a sequence of copies that are transferring the same data.
The visual profiler will help you see exactly what is going on comparing data copy operations to kernel operations, in a timeline fashion.

Related

How do you keep data in fast GPU memory (l1/shared) across kernel invocations?

How do you keep data in fast GPU memory across kernel invocations?
Let's suppose, I need to answer 1 million queries, each of which has about 1.5MB of data that's reusable across invocations and about 8KB of data that's unique to each query.
One approach is to launch a kernel for each query, copying the 1.5MB + 8KB of data to shared memory each time. However, then I spend a lot of time just copying 1.5MB of data that really could persist across queries.
Another approach is to "recycle" the GPU threads (see https://stackoverflow.com/a/49957384/3738356). That involves launching one kernel that immediately copies the 1.5MB of data to shared memory. And then the kernel waits for requests to come in, waiting for the 8KB of data to show up before proceeding with each iteration. It really seems like CUDA wasn't meant to be used this way. If one just uses managed memory, and volatile+monotonically increasing counters to synchronize, there's still no guarantee that the data necessary to compute the answer will be on the GPU when you go to read it. You can seed the values in the memory with dummy values like -42 that indicate that the value hasn't yet made its way to the GPU (via the caching/managed memory mechanisms), and then busy wait until the values become valid. Theoretically, that should work. However, I had enough memory errors that I've given up on it for now, and I've pursued....
Another approach still uses recycled threads but instead synchronizes data via cudaMemcpyAsync, cuda streams, cuda events, and still a couple of volatile+monotonically increasing counters. I hear I need to pin the 8KB of data that's fresh with each query in order for the cudaMemcpyAsync to work correctly. But, the async copy isn't blocked -- its effects just aren't observable. I suspect with enough grit, I can make this work too.
However, all of the above makes me think "I'm doing it wrong." How do you keep extremely re-usable data in the GPU caches so it can be accessed from one query to the next?
First of all to observe the effects of the streams and Async copying
you definitely need to pin the host memory. Then you can observe
concurrent kernel invocations "almost" happening at the same time.
I'd rather used Async copying since it makes me feel in control of
the situation.
Secondly you could just hold on to the data in global memory and load
it in the shared memory whenever you need it. To my knowledge shared
memory is only known to the kernel itself and disposed after
termination. Try using Async copies while the kernel is running and
synchronize the streams accordingly. Don't forget to __syncthreads()
after loading to the shared memory. I hope it helps.

Running several streams (instead of threads/blocks) in parallel

I have a kernel which I want to start with the configuration "1 block x 32 threads". To increase parallelism I want to start several streams instead of running a bigger "work package" than "1 block x 32 threads". I want to use the GPU in a program where data comes from the network. I don't want to wait until a bigger "work package" is available.
The code is like:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
The real code is much more complex but I want to keep it simple (15 CPU threads use the GPU).
The code works but streams doesn't run concurrently as expected. The GTX 480 has 15 SMs where each SM has 32 shader processors. I expect that if I start the kernel 15 times, all 15 streams run in parallel, but this is not the case. I have used the Nvidia Visual Profiler and there is a maximum of 5 streams which run in parallel. Often only one stream runs. The performance is really bad.
I get the best results with a "64 block x 1024 threads" configuration. If I use instead a "32 block x 1024 threads" configuration but two streams the streams are executed one after each other and performance drops. I am using Cuda Toolkit 5.5 and Ubuntu 12.04.
Can somebody explain why this is the case and can give me some background information? Should it work better on newer GPUs? What is the best way to use the GPU in time critically applications where you don't want to buffer data? Probably this is not possible, but I am searching for techniques which bring me closer to a solution.
News:
I did some further research. The problem is the last cudaMemcpyAsync(..) (GPU->Host copy) call. If I remove it, all streams run concurrent. I think the problem is illustrated in http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf on slide 21. They say that on Fermi there are two copy queues but this is only true for tesla and quadro cards, right? I think the problem is that the GTX 480 has only one copy queue and all copy commands (host->GPU AND GPU->host) are put in this one queue. Everything is non-blocking and the GPU->host memcopy of the first thread blocks the host->GPU memcopy calls of other threads.
Here some observations:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- sleep(10)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- cudaStreamSynchronize(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> doesn't work!!! Maybe cudaStreamSynchronize is put in the copy-queue?
Does someone knows a solution for this problem. Something like a blocking-kernel call would be cool. The last cudaMemcpyAsync() (GPU->device) should be called if the kernel has been finished.
Edit2:
Here an example to clarify my problem:
To keep it simple we have 2 streams:
Stream1:
------------
HostToGPU1
kernel1
GPUToHost1
Stream2:
------------
HostToGPU2
kernel2
GPUToHost2
The first stream is started. HostToGPU1 is executed, kernel1 is launched and GPUToHost1 is called. GPUToHost1 blocks because kernel1 is running. In the meantime Stream2 is started. HostToGPU2 is called, Cuda puts it in the queue but it can't be executed because GPUToHost1 blocks until kernel 1 has been finished. There are no data transfers in the moment. Cuda just waits for GPUToHost1. So my idea was to call GPUToHost1 when kernel1 is finished. This seams to be the reason why it works with sleep(..) because GPUToHost1 is called when the kernel has been finished. A kernel-launch which automatically blocks the CPU-thread would be cool.
GPUToHost1 is not blocking in the queue (if there are no other data transfers at the time but in my case, data transfer are not time-consuming).
Concurrent kernel execution can be most easily witnessed on linux.
For a good example and an easy test, refer to the concurrent kernels sample.
Good concurrency among kernels generally requires several things:
a device which supports concurrent kernels, so a cc 2.0 or newer device
kernels that are small enough in terms of number of blocks and other resource usage (registers, shared memory) so that multiple kernels can actually execute. Kernels with larger resource requirements will typically be observed to be running serially. This is expected behavior.
proper usage of streams to enable concurrency
In addition, concurrent kernels often implies copy/compute overlap. In order for copy/compute overlap to work, you must:
be using a GPU with enough copy engines. Some GPUs have one engine, some have 2. If your GPU has one engine, you can overlap one copy operation (ie. one direction) with kernel execution. if you have 2 copy engines (your GeForce GPU has 1) you can overlap both directions of copying with kernel execution.
use pinned (host) memory for any data that will be copied to or from the GPU global memory, that will be the target (to or from) for any of the copy operations you intend to overlap
Use streams properly and the necessary async versions of the relevant api calls (e.g. cudaMemcpyAsync
Regarding your observation that the smaller 32x1024 kernels do not execute concurrently, this is likely a resource issue (blocks, registers, shared memory) preventing much overlap. If you have enough blocks in the first kernel to occupy the GPU execution resources, it's not sensible to expect additional kernels to begin executing until the first kernel is finished or mostly finished.
EDIT: Responding to question edits and additional comments below.
Yes, GTX480 has only one copy "queue" (I mentioned this explicitly in my answer, but I called it a a copy "engine"). You will only be able to get one cudaMemcpy... operation to run at any given time, and therefore only one direction (H2D or D2H) can actually be moving data at any given time, and you will only see one cudaMemcpy... operation overlap with any given kernel. And cudaStreamSynchronize causes the stream to wait until ALL CUDA operations previously issued to that stream are completed.
Note that the cudaStreamSynchronize you have in your last example should not be necessary, I don't think. Streams have 2 execution characteristics:
cuda operations (API calls, kernel calls, everything) issued to the same stream will always execute sequentially, regardless of your use of the Async API or any other considerations.
cuda operations issued to separate streams, assuming all the necessary requirements have been met, will execute asynchronously to each other.
Due to item 1, in your last case, your final "copy Data GPU->Host" operation will not begin until the previous kernel call issued to that stream is complete, even without the cudaStreamSynchronize call. So I think you can get rid of that call, i.e the 2nd case you have listed should be no different than the final case, and in the 2nd case you should not need the sleep operation either. The cudaMemcpy... issued to the same stream will not begin until all previous cuda activity in that stream is finished. This is a characteristic of streams.
EDIT2: I'm not sure we're making any progress here. The issue you pointed out in the GTC preso here (slide 21) is a valid issue, but you can't work around it by inserting additional synchronization operations, nor would a "blocking kernel" help you with that, nor is it a function of having one copy engine or 2. If you want to issue kernels in separate streams but issued in sequence with no other intervening cuda operations, then that hazard exists. The solution for this, as pointed out on the next slide, is to not issue the kernels sequentially, which is roughly comparable to your 2nd case. I'll state this again:
you have identified that your case 2 gives good concurrency
the sleep operation in that case is not needed for data integrity
If you want to provide a short sample code that demonstrates the issue, perhaps other discoveries can be made.

About cudaMemcpyAsync Function

I have some questions.
Recently I'm making a program by using CUDA.
In my program, there is one big data on Host programmed with std::map(string, vector(int)).
By using these datas some vector(int) are copied to GPUs global memory and processed on GPU
After processing, some results are generated on GPU and these results are copied to CPU.
These are all my program schedule.
cudaMemcpy( ... , cudaMemcpyHostToDevice)
kernel function(kernel function only can be done when necessary data is copied to GPU global memory)
cudaMemcpy( ... , cudaMemcpyDeviceToHost)
repeat 1~3steps 1000times (for another data(vector) )
But I want to reduce processing time.
So I decided to use cudaMemcpyAsync function in my program.
After searching some documents and web pages, I realize that to use cudaMemcpyAsync function host memory which has data to be copied to GPUs global memory must be allocated as pinned memory.
But my programs are using std::map, so I couldn't make this std::map data to pinned memory.
So instead of using this, I made a buffer array typed pinned memory and this buffer can always handle all the case of copying vector.
Finally, my program worked like this.
Memcpy (copy data from std::map to buffer using loop until whole data is copied to buffer)
cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
kernel(kernel function only can be executed when whole data is copied to GPU global memory)
cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
repeat 1~4steps 1000times (for another data(vector) )
And my program became much faster than the previous case.
But problem(my curiosity) is at this point.
I tried to make another program in a similar way.
Memcpy (copy data from std::map to buffer only for one vector)
cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
loop 1~2 until whole data is copied to GPU global memory
kernel(kernel function only can be executed when necessary data is copied to GPU global memory)
cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
repeat 1~5steps 1000times (for another data(vector) )
This method came out to be about 10% faster than the method discussed above.
But I don't know why.
I think cudaMemcpyAsync only can be overlapped with kernel function.
But my case I think it is not. Rather than it looks like can be overlapped between cudaMemcpyAsync functions.
Sorry for my long question but I really want to know why.
Can Someone teach or explain to me what is the exact facility "cudaMemcpyAsync" and what functions can be overlapped with "cudaMemcpyAsync" ?
The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of cuda streams to separate various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
It can be issued in any stream (it takes a stream parameter)
Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a compute capability 2.0 or greater device (e.g. concurrent kernels). Also, different devices may have one or 2 copy engines, meaning simultaneous copy to the device and copy from the device is only possible on certain devices.

Is cudaMemcpy from host to device executed in parallel?

I am curious if cudaMemcpy is executed on the CPU or the GPU when we copy from host to device?
I other words, it the copy a sequential process or is it done in parallel?
Let me explain why I ask this: I have an array of 5 million elements . Now, I want to copy 2 sets of 50,000 elements from different parts of the array. SO, i was thinking will it be faster to first form a large array of all the elements i want to copy on the CPU and then do just 1 large transfer or should i just call 2 cudaMemcpy, one for each set.
If cudaMemcpy is done in parallel, then i think the 2nd approach will be faster as you dont have to copy 100000 elements sequentially on the CPU first
I am curious if cudaMemcpy is executed on the CPU or the GPU when we
copy from host to device?
In the case of the synchronous API call with regular pageable user allocated memory, the answer is it runs on both. The driver must first copy data from the source memory to a DMA mapped source buffer on the host, then signal to the GPU that data is waiting for transfer. The GPU then executes the transfer. The process is repeated as many times as necessary for the complete copy from source memory to the GPU.
The throughput of process can be improved by using pinned memory, which the driver can directly DMA to or from without intermediate copying (although pinning has a large initialization/allocation overhead which needs to be amortised as well).
As to the rest of the question, I suspect that two memory copies directly from the source memory would be more efficient than the alternative, but this is the sort of question that can only be conclusively answered by benchmarking.
I believe a transfer from host to GPU memory is a blocking call. It uses the entire bus and, as such, it doesn't really make sense (even if it was physically possible) to run multiple operations in parallel.
I doubt you'll get any performance gain from concatenating the data before transferring it. The bottleneck will likely be the transfer itself. The copies should be queued and executed sequentially with minimal overhead.

Copying an integer from GPU to CPU

I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?
Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do this? Would copying just 1 integer after every kernel launch slow down my program?
Let me first answer your last question:
Would copying just 1 integer after every kernel launch slow down my program?
A bit - yes. Issuing the command, waiting for GPU to respond, etc, etc... The amount of data (1 int vs 100 ints) probably doesn't really matter in this case. However, you can still achieve speeds of thousands memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would be probably better to do the whole task on a CPU)
what is the best way to do this?
Well, I would suggest simply trying it yourself. As you said: you can either use mapped-pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first one might be better if your kernels still have some work to do after sending the integer back. In that case, the latency of sending it to host could be hidden by the execution of the kernel.
If you use the first method, you will have to call cudaThreadsynchronize() to make sure the kernel ended its execution. Kernel calls are asynchronous.
You can use cudaMemcpyAsync which is also asynchronous, but GPU cannot have kernel running and having cudaMemcpyAsync executed in parallel, unless you use streams.
I never actually tried that, but if your program won't crash if the loop executes too many times, you might try to ignore synchronisation and let it iterate until the special value is seen in RAM. In that solution, the memory transfer might be completely hidden and you would pay an overhead only at the end. You will need however to somehow prevent the loop from iterating too many times, CUDA events may be helpful.
Why not use pinned memory? If your system supports it -- see CUDA C Programming Guide's section on pinned memory.
Copying data to and from the GPU will be much slower than accessing the data from the CPU. If you are not running a significant number of threads for this value then this will result in very slow performance, don't do it.
What you are describing sounds like a serial algorithm, your algorithm needs to be parallelised in order to make it worth doing using CUDA. If you can't rewrite your algorithm to become a single write of multiple data to the GPU, multiple threads, single write of multiple data back to CPU; then your algorithm should be done on CPU.
If you need the value computed in the previous kernel call to launch the next one then is serialized and your choice is to cudaMemcpy(dst,src, size =1, ...);
If all the kernel launch parameters do not depend on the previous launch then you can store all the result of each kernel invocation in GPU memory and then download all the results at once.