Running several streams (instead of threads/blocks) in parallel - cuda

I have a kernel which I want to launch with the configuration "1 block x 32 threads". To increase parallelism I want to start several streams instead of running a bigger "work package" than "1 block x 32 threads". I want to use the GPU in a program where data comes from the network, and I don't want to wait until a bigger "work package" is available.
The code is like:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
The real code is much more complex but I want to keep it simple (15 CPU threads use the GPU).
The code works, but the streams don't run concurrently as expected. The GTX 480 has 15 SMs, where each SM has 32 shader processors. I expect that if I start the kernel 15 times, all 15 streams run in parallel, but this is not the case. I have used the Nvidia Visual Profiler and at most 5 streams run in parallel. Often only one stream runs. The performance is really bad.
I get the best results with a "64 blocks x 1024 threads" configuration. If I instead use a "32 blocks x 1024 threads" configuration with two streams, the streams are executed one after the other and performance drops. I am using CUDA Toolkit 5.5 and Ubuntu 12.04.
Can somebody explain why this is the case and give me some background information? Should it work better on newer GPUs? What is the best way to use the GPU in time-critical applications where you don't want to buffer data? Probably this is not possible, but I am searching for techniques which bring me closer to a solution.
News:
I did some further research. The problem is the last cudaMemcpyAsync(..) (GPU->host copy) call. If I remove it, all streams run concurrently. I think the problem is illustrated in http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf on slide 21. They say that on Fermi there are two copy queues, but this is only true for Tesla and Quadro cards, right? I think the problem is that the GTX 480 has only one copy queue and all copy commands (host->GPU AND GPU->host) are put into this one queue. Everything is non-blocking, and the GPU->host memcopy of the first thread blocks the host->GPU memcopy calls of the other threads.
Here some observations:
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- sleep(10)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> works: streams run concurrently
Thread(i=0..14) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- cudaStreamSynchronize(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
-> doesn't work!!! Maybe cudaStreamSynchronize is put in the copy-queue?
Does someone know a solution for this problem? Something like a blocking kernel call would be cool. The last cudaMemcpyAsync() (GPU->host) should only be called once the kernel has finished.
Edit2:
Here is an example to clarify my problem:
To keep it simple we have 2 streams:
Stream1:
------------
HostToGPU1
kernel1
GPUToHost1
Stream2:
------------
HostToGPU2
kernel2
GPUToHost2
The first stream is started. HostToGPU1 is executed, kernel1 is launched and GPUToHost1 is called. GPUToHost1 blocks because kernel1 is running. In the meantime Stream2 is started. HostToGPU2 is called, CUDA puts it in the queue, but it can't be executed because GPUToHost1 blocks until kernel1 has finished. There are no data transfers at the moment; CUDA just waits for GPUToHost1. So my idea was to call GPUToHost1 only when kernel1 is finished. This seems to be the reason why it works with sleep(..): GPUToHost1 is called after the kernel has finished. A kernel launch which automatically blocks the CPU thread would be cool.
GPUToHost1 itself would not block in the queue if there were no other data transfers at the time, and in my case the data transfers are not time-consuming anyway.

Concurrent kernel execution can be most easily witnessed on Linux.
For a good example and an easy test, refer to the concurrent kernels sample.
Good concurrency among kernels generally requires several things:
a device which supports concurrent kernels, so a cc 2.0 or newer device
kernels that are small enough in terms of number of blocks and other resource usage (registers, shared memory) so that multiple kernels can actually execute. Kernels with larger resource requirements will typically be observed to be running serially. This is expected behavior.
proper usage of streams to enable concurrency
In addition, concurrent kernels often implies copy/compute overlap. In order for copy/compute overlap to work, you must:
be using a GPU with enough copy engines. Some GPUs have one engine, some have 2. If your GPU has one engine, you can overlap one copy operation (i.e. one direction) with kernel execution. If you have 2 copy engines (your GeForce GPU has 1) you can overlap both directions of copying with kernel execution.
use pinned (host) memory for any data that will be copied to or from GPU global memory and that will be the source or target of any copy operation you intend to overlap
use streams properly, along with the async versions of the relevant API calls (e.g. cudaMemcpyAsync)
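Putting those requirements together, here is a minimal sketch of the pattern, assuming a hypothetical kernel myKernel and arbitrary sizes (error checking omitted): it uses pinned host memory, non-default streams and cudaMemcpyAsync so that the copy issued in one stream can overlap the kernel running in the other:
#include <cuda_runtime.h>
#include <cstdio>

// Minimal copy/compute overlap sketch (hypothetical kernel "myKernel",
// arbitrary sizes, no error checking).
__global__ void myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main()
{
    const int N = 1 << 20;
    const int half = N / 2;

    float *h_in, *h_out;                               // pinned host buffers
    cudaHostAlloc((void**)&h_in,  N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, N * sizeof(float), cudaHostAllocDefault);
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    // Each half of the data goes to its own stream, so the H2D copy of one
    // half can overlap the kernel working on the other half.
    for (int i = 0; i < 2; ++i) {
        const int off = i * half;
        cudaMemcpyAsync(d_in + off, h_in + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        myKernel<<<(half + 255) / 256, 256, 0, s[i]>>>(d_in + off, d_out + off, half);
        cudaMemcpyAsync(h_out + off, d_out + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}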
Regarding your observation that the smaller 32x1024 kernels do not execute concurrently, this is likely a resource issue (blocks, registers, shared memory) preventing much overlap. If you have enough blocks in the first kernel to occupy the GPU execution resources, it's not sensible to expect additional kernels to begin executing until the first kernel is finished or mostly finished.
EDIT: Responding to question edits and additional comments below.
Yes, GTX480 has only one copy "queue" (I mentioned this explicitly in my answer, but I called it a copy "engine"). You will only be able to get one cudaMemcpy... operation to run at any given time, therefore only one direction (H2D or D2H) can actually be moving data at any given time, and you will only see one cudaMemcpy... operation overlap with any given kernel. And cudaStreamSynchronize causes the stream to wait until ALL CUDA operations previously issued to that stream are completed.
Note that I don't think the cudaStreamSynchronize you have in your last example should be necessary. Streams have 2 execution characteristics:
cuda operations (API calls, kernel calls, everything) issued to the same stream will always execute sequentially, regardless of your use of the Async API or any other considerations.
cuda operations issued to separate streams, assuming all the necessary requirements have been met, will execute asynchronously to each other.
Due to item 1, in your last case, your final "copy data GPU->Host" operation will not begin until the previous kernel call issued to that stream is complete, even without the cudaStreamSynchronize call. So I think you can get rid of that call, i.e. the 2nd case you have listed should be no different than the final case, and in the 2nd case you should not need the sleep operation either. The cudaMemcpy... issued to the same stream will not begin until all previous CUDA activity in that stream is finished. This is a characteristic of streams.
EDIT2: I'm not sure we're making any progress here. The issue you pointed out in the GTC presentation here (slide 21) is a valid issue, but you can't work around it by inserting additional synchronization operations, nor would a "blocking kernel" help you with that, nor is it a function of having one copy engine or 2. If you want to issue kernels in separate streams but issued in sequence with no other intervening CUDA operations, then that hazard exists. The solution for this, as pointed out on the next slide, is to not issue the kernels sequentially, which is roughly comparable to your 2nd case. I'll state this again:
you have identified that your case 2 gives good concurrency
the sleep operation in that case is not needed for data integrity
If you want to provide a short sample code that demonstrates the issue, perhaps other discoveries can be made.
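In the meantime, here is a hedged sketch of the breadth-first issue order the webinar suggests on the next slide, reusing the hypothetical myKernel and buffers from the sketch above, with NSTREAMS streams s[0..NSTREAMS-1] created the same way: all H2D copies are issued first, then all kernels, then all D2H copies, so no D2H copy sits in the queue directly behind a kernel from the same stream.
// Breadth-first issue order (sketch; hypothetical names as above).
// Assumes N is divisible by NSTREAMS (remainder handling omitted).
const int NSTREAMS = 15;
const int chunk = N / NSTREAMS;

for (int i = 0; i < NSTREAMS; ++i)               // 1) issue all H2D copies
    cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);

for (int i = 0; i < NSTREAMS; ++i)               // 2) issue all kernels
    myKernel<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d_in + i * chunk,
                                                    d_out + i * chunk, chunk);

for (int i = 0; i < NSTREAMS; ++i)               // 3) issue all D2H copies
    cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[i]);

cudaDeviceSynchronize();
Note that this assumes a single host thread issues all the work; with 15 independent host threads, as in the question, you would have to funnel the issue order through one thread (or accept whatever interleaving the threads happen to produce).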

Related

Is sort_by_key in thrust a blocking call?

I repeatedly enqueue a sequence of kernels:
for 1..100:
for 1..10000:
// Enqueue GPU kernels
Kernel 1 - update each element of array
Kernel 2 - sort array
Kernel 3 - operate on array
end
// run some CPU code
output "Waiting for GPU to finish"
// copy from device to host
cudaMemcpy ... D2H(array)
end
Kernel 3 is of order O(N^2) so is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:
thrust::device_ptr<unsigned int> key(dKey);
thrust::device_ptr<unsigned int> value(dValue);
thrust::sort_by_key(key,key+N,value);
It seems that this call to thrust is blocking, as the CPU code only gets executed once the inner loop has finished. I see this because if I remove the call to sort_by_key, the host code (correctly) outputs the "Waiting" string before the inner loop finishes, while it does not if I run the sort.
Is there a way to call thrust::sort_by_key asynchronously?
First of all, consider that there is a kernel launch queue which can hold only so many pending launches. Once the launch queue is full, additional kernel launches, of any kind, are blocking. The host thread will not proceed (beyond those launch requests) until empty queue slots become available. I'm pretty sure 10000 iterations of 3 kernel launches each will fill this queue long before the loop reaches its 10000th iteration. So there will be some latency (I think) with any sort of non-trivial kernel launches if you are launching 30000 of them in sequence. (Eventually, however, as queued kernels complete and the remaining launches fit into the queue, you would see the "Waiting" message before all kernels have actually completed, if there were no other blocking behavior.)
thrust::sort_by_key requires temporary storage (of a size approximately equal to your data set size). This temporary storage is allocated, each time you use it, via a cudaMalloc operation, under the hood. This cudaMalloc operation is blocking. When cudaMalloc is launched from a host thread, it waits for a gap in kernel activity before it can proceed.
To work around item 2, it seems there might be at least 2 possible approaches:
Provide a thrust custom allocator. Depending on the characteristics of this allocator, you might be able to eliminate the blocking cudaMalloc behavior. (but see discussion below)
Use cub SortPairs. The advantage here (as I see it - your example is incomplete) is that you can do the allocation once (assuming you know the worst-case temp storage size throughout the loop iterations) and eliminate the need to do a temporary memory allocation within your loop.
The thrust method (1, above), as far as I know, will still effectively do some kind of temporary allocation/free step at each iteration, even if you supply a custom allocator. If you have a well-designed custom allocator, it might be that this is almost a "no-op", however. The cub method appears to have the drawback of needing to know the max size (in order to completely eliminate the need for an allocation/free step), but I contend the same requirement would be in place for a thrust custom allocator. Otherwise, if you needed to allocate more memory at some point, the custom allocator is effectively going to have to do something like a cudaMalloc, which will throw a wrench in the works.
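For approach 2, here is a hedged sketch of the cub pattern (dKey, dValue and N are taken from the question; dKeyOut, dValueOut and the loop bounds are hypothetical): the required temporary storage size is queried once, allocated once outside the loops, and reused on every iteration, so no cudaMalloc happens inside the loop.
#include <cub/cub.cuh>   // cub::DeviceRadixSort

// Size query: with d_temp == NULL, SortPairs only reports the bytes needed.
void  *d_temp = NULL;
size_t temp_bytes = 0;
cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                dKey, dKeyOut, dValue, dValueOut, N);
cudaMalloc(&d_temp, temp_bytes);        // one-time allocation, outside the loops

for (int outer = 0; outer < 100; ++outer) {
    for (int inner = 0; inner < 10000; ++inner) {
        // Kernel 1 - update each element of array
        cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                        dKey, dKeyOut, dValue, dValueOut, N);
        // Kernel 3 - operate on the sorted (dKeyOut/dValueOut) arrays
    }
    // CPU code and D2H copy as before
}
cudaFree(d_temp);
Note that this overload of SortPairs sorts out-of-place, so the hypothetical dKeyOut/dValueOut output buffers also need to be allocated once up front.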

Concurrent: Short copy, Long kernel

When running concurrent copy & kernel operations:
If I have a kernel run time that is twice as long as a data copy operation, will I get 2 copies per kernel run?
The stream examples I'm seeing show a 1:1 relationship. (Time of copy = time of kernel run.) I'm wondering what happens when there is something different. Is there always one copy operation (max) for every kernel launch? Or does the copy operation run independently of the kernel launch? I.e. I could possibly complete 5 copy operations for every kernel launch, if the run & copy times work out that way.
(I'm trying to figure out how many copy operations to queue up before a kernel launch.)
One to one: (time to copy = kernel run time)
<--stream1Copy--><--stream2Copy-->
..............................<-stream1Kernel->
Two to one: (time to copy = 1/2 kernel run time)
<-stream1Copy-><-stream2Copy-><-stream3Copy->
............................<----------stream1Kernel------------>
You can have more than one copy per kernel launch. Only one copy (per direction on devices with dual copy engines) can be running at a particular time to a particular GPU, but once that one is complete, another can be started immediately. Asynchronous copies issued in streams other than the kernel launch stream in question will run completely asynchronously to that kernel launch, assuming neither stream is stream 0. (This also assumes you are using pinned memory, i.e. cudaHostAlloc, to create the relevant host-side buffers.)
You may want to read the relevant section in the best practices guide.
The reason you frequently see a 1:1 analysis of compute and copy is that it is assumed the copied data will be consumed by (or is produced by) the kernel call, and so logically we can think of the block of data this way. But if it's easier to structure your code as a sequence of copies, there should be no problem with that. Naturally, if you can batch up all your data into a single cudaMemcpy call, that will be slightly more efficient than a sequence of copies that are transferring the same data.
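As a hedged sketch of the "several copies per kernel" case (hypothetical names; the host buffers are assumed to be pinned via cudaHostAlloc): a long kernel is launched into one non-default stream and a sequence of independent copies is queued into another, so the copies proceed one after another while the kernel runs.
// One long kernel overlapping a queue of copies (hypothetical names).
cudaStream_t streamCompute, streamCopy;
cudaStreamCreate(&streamCompute);
cudaStreamCreate(&streamCopy);

longKernel<<<grid, block, 0, streamCompute>>>(d_work);       // long-running kernel

for (int i = 0; i < 5; ++i)                                   // 5 independent chunks
    cudaMemcpyAsync(d_in[i], h_in[i], chunkBytes,
                    cudaMemcpyHostToDevice, streamCopy);       // queued back-to-back

cudaStreamSynchronize(streamCopy);     // all 5 copies done (kernel may still run)
cudaStreamSynchronize(streamCompute);  // kernel done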
The visual profiler will help you see exactly what is going on comparing data copy operations to kernel operations, in a timeline fashion.

When to call cudaDeviceSynchronize?

When is calling the cudaDeviceSynchronize function really needed?
As far as I understand from the CUDA documentation, CUDA kernels are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize, except one before the time measurement. I have found that I get the same result, but with a speedup of 7-12x (depending on the matrix sizes).
So, the question is whether there are any reasons to use cudaDeviceSynchronize apart from time measurement.
For example:
Is it needed before copying data from the GPU back to the host with cudaMemcpy?
If I do matrix multiplications like
C = A * B
D = C * F
should I put cudaDeviceSynchronize between both?
From my experiments it seems that I don't.
Why does cudaDeviceSynchronize slow the program so much?
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behavior) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel starts execution, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes
So in your example, there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any).
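For that debugging use case, a common (hedged) sketch is to check the launch itself with cudaGetLastError and then check errors that occur during execution via the status returned by cudaDeviceSynchronize:
kernel1<<<X,Y>>>(...);                              // launch as in the snippet above
cudaError_t launchErr = cudaGetLastError();         // configuration/launch errors
cudaError_t syncErr   = cudaDeviceSynchronize();    // errors raised during execution
if (launchErr != cudaSuccess || syncErr != cudaSuccess)
    printf("kernel1 failed: %s / %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));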
cudaDeviceSynchronize may cause some slowdown, but 7-12x seems too much. Maybe there is some problem with the time measurement, or maybe the kernels are really fast and the overhead of explicit synchronization is huge relative to the actual computation time.
One situation where using cudaDeviceSynchronize() is appropriate would be when you have several cudaStreams running, and you would like to have them exchange some information. A real-life case of this is parallel tempering in quantum Monte Carlo simulations. In this case, we would want to ensure that every stream has finished running some set of instructions and gotten some results before they start passing messages to each other, or we would end up passing garbage information. The reason using this command slows the program so much is that cudaDeviceSynchronize() forces the program to wait for all previously issued commands in all streams on the device to finish before continuing (from the CUDA C Programming Guide). As you said, kernel execution is normally asynchronous, so while the GPU device is executing your kernel the CPU can continue to work on some other commands, issue more instructions to the device, etc., instead of waiting. However, when you use this synchronization command, the CPU is instead forced to idle until all the GPU work has completed before doing anything else. This behaviour is useful when debugging, since you may have a segfault occurring at seemingly "random" times because of the asynchronous execution of device code (whether in one stream or many). cudaDeviceSynchronize() will force the program to ensure the stream(s)'s kernels/memcpys are complete before continuing, which can make it easier to find out where the illegal accesses are occurring (since the failure will show up during the sync).
When you want your GPU to start processing some data, you typically do a kernel invocation.
When you do so, your device (the GPU) will start doing whatever it is you told it to do. However, unlike in a normal sequential program, your host (the CPU) will continue to execute the next lines of code in your program. cudaDeviceSynchronize makes the host (the CPU) wait until the device (the GPU) has finished executing ALL the threads you have started, and thus your program will continue as if it were a normal sequential program.
In small simple programs you would typically use cudaDeviceSynchronize, when you use the GPU to make computations, to avoid timing mismatches between the CPU requesting the result and the GPU finishing the computation. Using cudaDeviceSynchronize makes it a lot easier to code your program, but there is one major drawback: your CPU is idle all the time while the GPU makes the computation. Therefore, in high-performance computing, you often strive towards having your CPU make computations while it waits for the GPU to finish.
You might also need to call cudaDeviceSynchronize() after launching kernels from kernels (Dynamic Parallelism).
From this post CUDA Dynamic Parallelism API and Principles:
If the parent kernel needs results computed by the child kernel to do its own work, it must ensure that the child grid has finished execution before continuing by explicitly synchronizing using cudaDeviceSynchronize(void). This function waits for completion of all grids previously launched by the thread block from which it has been called. Because of nesting, it also ensures that any descendants of grids launched by the thread block have completed.
...
Note that the view of global memory is not consistent when the kernel launch construct is executed. That means that in the following code example, it is not defined whether the child kernel reads and prints the value 1 or 2. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.
__device__ int v = 0;

__global__ void child_k(void) {
    printf("v = %d\n", v);
}

__global__ void parent_k(void) {
    v = 1;
    child_k<<<1, 1>>>();
    v = 2; // RACE CONDITION
    cudaDeviceSynchronize();
}

Accessing cuda device memory when the cuda kernel is running

I have allocated memory on the device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from the host before the kernel finishes its execution?
The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize a kernel's execution with a stream, that code would be subject to a race condition. i.e. the copy engine might pull from memory that hasn't yet been written by the kernel.
The person who alluded to mapped pinned memory is onto something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams, and queue a kernel execution and a device-to-host cudaMemcpy in each stream. For example, put the kernel that fills the location "0" and cudaMemcpy from that location back to host into stream 0, kernel that fills the location "1" and cudaMemcpy from "1" into stream 1, etc. What will happen then is that the GPU will overlap copying from "0" and executing "1".
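A hedged sketch of that pattern (hypothetical kernel fillChunk, pinned host buffer h_buf, device buffer d_buf, and NCHUNKS pre-created streams): the D2H copy of chunk i can overlap the kernel that fills chunk i+1.
// One kernel + one D2H copy per chunk, each pair in its own stream.
for (int i = 0; i < NCHUNKS; ++i) {
    fillChunk<<<grid, block, 0, stream[i]>>>(d_buf + i * chunkLen, chunkLen);
    cudaMemcpyAsync(h_buf + i * chunkLen, d_buf + i * chunkLen,
                    chunkLen * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();   // or cudaStreamSynchronize(stream[i]) as each chunk is needed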
Check CUDA documentation, it's documented somewhere (in the best practices guide, I think).
You can't access GPU memory directly from the host, regardless of whether a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel is finished writing to it, then the answer depends on the compute capability of your device. But all but the very oldest chips can perform data transfers while the kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel though. You would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device. You can copy one of the buffers while the GPU is working on the other.
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
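For completeness, here is a hedged sketch of that zero-copy idiom (hypothetical kernel produce; N, grid and block are placeholders): the kernel writes through a device alias of a mapped, page-locked host buffer, so its results land directly in host memory.
// Zero-copy sketch: kernel output goes straight into mapped pinned host memory.
cudaSetDeviceFlags(cudaDeviceMapHost);              // before any other CUDA call

float *h_results;                                   // pinned + mapped host buffer
cudaHostAlloc((void**)&h_results, N * sizeof(float), cudaHostAllocMapped);

float *d_results;                                   // device alias of the same memory
cudaHostGetDevicePointer((void**)&d_results, h_results, 0);

produce<<<grid, block>>>(d_results, N);             // writes appear in h_results
// The host may scan h_results while the kernel runs, but only values the
// kernel has already written are meaningful; after the sync everything is valid.
cudaDeviceSynchronize();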
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
When you launch the kernel it is an asynchronous (non-blocking) call. Calling cudaMemcpy next will block until the kernel has finished.
If you want to have the result for debugging purposes, maybe it is possible for you to use the CUDA debugger (cuda-gdb), where you can step through the kernel and inspect the memory.
For small result checks you could also use printf() in the Kernel code.
Or run only a threadblock of size (1,1) if you are interested in that specific result.

CUDA: CPU code in parallel to GPU code

I have a program where I do a bunch of calculations on the GPU, then I do memory operations with those results on the CPU, then I take the next batch of data and do the same all over. Now it would be a lot faster if I could do the first set of calculations and then start with the second batch whilst my CPU churned away at the memory operations. How would I do that?
All CUDA kernel calls (e.g. function<<<blocks, threads>>>()) are asynchronous -- they return control immediately to the calling host thread. Therefore you can always perform CPU work in parallel with GPU work just by putting the CPU work after the kernel call.
If you also need to transfer data from GPU to CPU at the same time, you will need a GPU that has the deviceOverlap field set to true (check using cudaGetDeviceProperties()), and you need to use cudaMemcpyAsync() from a separate CUDA stream.
There are examples demonstrating this functionality in the NVIDIA CUDA SDK -- for example the "simpleStreams" and "asyncAPI" examples.
The basic idea can be something like this:
Do 1st batch of calculations on GPU
Enter a loop: {
Copy results from device mem to host mem
Do next batch of calculations on GPU (the kernel launch is asynchronous and control returns immediately to the CPU)
Process results of the previous iteration on CPU
}
Copy results from last iteration from device mem to host mem
Process results of last iteration
You can get finer control over asynchronous work between CPU and GPU by using cudaMemcpyAsync, cudaStream and cudaEvent.
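A hedged sketch of that loop, using cudaMemcpyAsync, two streams and two device result buffers (hypothetical kernel compute, hypothetical processOnCPU; pinned host buffers assumed): the D2H copy and CPU processing of batch b-1 overlap the kernel for batch b.
// Pipelined sketch: GPU computes batch b while the CPU handles batch b-1.
cudaStream_t kStream, cStream;
cudaStreamCreate(&kStream);
cudaStreamCreate(&cStream);

compute<<<grid, block, 0, kStream>>>(d_out[0], 0);             // first batch
for (int b = 1; b < NBATCHES; ++b) {
    cudaStreamSynchronize(kStream);                            // batch b-1 finished on GPU
    cudaMemcpyAsync(h_out[b - 1], d_out[(b - 1) % 2], bytes,
                    cudaMemcpyDeviceToHost, cStream);           // copy its results
    compute<<<grid, block, 0, kStream>>>(d_out[b % 2], b);      // next batch on GPU
    cudaStreamSynchronize(cStream);                             // host copy complete
    processOnCPU(h_out[b - 1]);                                 // CPU work overlaps kernel
}
cudaStreamSynchronize(kStream);                                 // last batch
cudaMemcpy(h_out[NBATCHES - 1], d_out[(NBATCHES - 1) % 2], bytes,
           cudaMemcpyDeviceToHost);
processOnCPU(h_out[NBATCHES - 1]);
The double-buffered d_out means the kernel for batch b never writes the buffer that is still being copied for batch b-1.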
As @harrism said, you need your device to support deviceOverlap to do memory transfers and execute kernels at the same time, but even if it does not have that option you can at least execute a kernel asynchronously with other computations on the CPU.
Edit: deviceOverlap has been deprecated; one should use the asyncEngineCount property instead.
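A hedged sketch of that capability check (device 0 assumed):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// asyncEngineCount: 0 = no copy/compute overlap, 1 = one copy direction can
// overlap kernels, 2 = both directions can overlap kernels.
printf("asyncEngineCount = %d (deviceOverlap = %d, deprecated)\n",
       prop.asyncEngineCount, prop.deviceOverlap);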