Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a kernel running in one of those streams with an instruction placed in the host code, and later resume it with another instruction in the host code?
I have no idea how to write sample code for this case, for example, to flesh out this question.
To be exact, my question is whether there is an instruction in CUDA that can pause a CUDA kernel running in a CUDA stream and then resume it.
You can use dynamic parallelism, with kernel parameters used to communicate the signals from the host. Then launch a parent kernel with only 1 CUDA thread and let it launch child kernels continuously until the work is done or a signal is received. If a child kernel does not fully occupy the GPU, it will lose performance.
__global__ void parent(int * atomicSignalPause, int * atomicSignalExit, Parameters * prm)
{
    int progress = 0;
    // keep issuing child kernels until an exit signal arrives or the work is done
    while(checkSignalExit(atomicSignalExit) && progress < 100)
    {
        // while not paused, launch the next piece of work
        while(checkSignalPause(atomicSignalPause))
        {
            child<<<X,Y>>>(prm, progress++);
            cudaDeviceSynchronize(); // wait for the child grid before the next step
        }
    }
}
There is no command to pause a stream. For multiple GPUs, you should use unified memory allocation for the communication (between GPUs).
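For example, the two signal flags used by the parent kernel above could live in managed (unified) memory, so that the host and the device(s) address the same allocation. This is only a sketch reusing the same flag names:
// Managed allocations are visible to the host and to all devices.
int *atomicSignalPause = NULL;
int *atomicSignalExit  = NULL;
cudaMallocManaged(&atomicSignalPause, sizeof(int));
cudaMallocManaged(&atomicSignalExit,  sizeof(int));
*atomicSignalPause = 0;   // the host simply writes the flags to signal the kernel
*atomicSignalExit  = 0;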
To overcome the GPU utilization issue, you may build a task queue for child kernels. It pushes work N times (N chosen to be roughly enough to keep the GPU efficient in power/compute); then, for every completed child kernel, it increments a dedicated counter in the parent kernel and pushes a new piece of work, until all work is complete (while trying to keep the number of concurrent kernels at N).
Maybe something like this:
// producer kernel
// N: number of work items that keep the gpu fully utilized
while(hasWork)
{
    // concurrency is a global parameter
    while(checkConcurrencyAtomic(concurrency) < N)
    {
        incrementConcurrencyAtomic(concurrency);
        // a "consumer" parent kernel will get items from the queue;
        // it will decrement concurrency when a work item is done
        bool success = myQueue.tryPush(work, concurrency);
        if(success)
        {
            // update status of the whole work or signal the host
        }
    }
    // synchronization once per ~N work items
    cudaDeviceSynchronize();
    // ... then check for pause signals and other tasks
}
If the total work takes more than a few seconds, these atomic value updates shouldn't be a performance problem, but if you have far too many child kernels to launch, then you can launch more producer/consumer (parent) CUDA threads.
Related
I've got a CUDA kernel that is called many times (1 million is not the limit). Whether we launch the kernel again or not depends on a flag (result_found) that our kernel returns.
for(int i = 0; i < 1000000 /* for example */; ++i) {
    kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
    cudaMemcpy(&result_found, dev_result_found, sizeof(bool), cudaMemcpyDeviceToHost);
    if(result_found) {
        break;
    }
}
The profiler says that cudaMemcpy takes much more time to execute, than actual kernel call (cudaMemcpy: ~88us, cudaLaunch: ~17us).
So, the questions are:
1) Is there any way to avoid calling cudaMemcpy here?
2) Why is it so slow after all? Passing parameters to the kernel (cudaSetupArgument) seems very fast (~0.8 us), while getting the result back is slow. If I remove cudaMemcpy, my program finishes a lot faster, so I think that it's not because of synchronization issues.
1) Is there any way to avoid calling cudaMemcpy here?
Yes. This is a case where dynamic parallelism may help. If your device supports it you can move the entire loop over i on to the GPU and launch further kernels from the GPU. The launching thread can then directly read dev_result_found and return if it has finished. This completely removes cudaMemcpy.
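A rough sketch of that idea (search_driver is a made-up wrapper name; kernel, blocks, threads and dev_result_found are taken from your snippet):
// Launched from a single device thread: the flag is read directly
// from device memory, so it is never copied to the host.
__global__ void search_driver(bool *dev_result_found)
{
    for (int i = 0; i < 1000000; ++i)
    {
        kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
        cudaDeviceSynchronize();    // device-side sync: wait for the child grid
        if (*dev_result_found)      // check the flag without any cudaMemcpy
            return;
    }
}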
An alternative would be to greatly reduce the number of cudaMemcpy calls. At the start of each kernel launch check against dev_result_found. If it is true, return. This way you only need to perform the memcpy every x iterations. While you will launch more kernels than you need to, these will be very cheap as the excess will return immediately.
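In sketch form (the interval of 50 iterations is an arbitrary example):
// Inside the kernel: excess launches return immediately.
__global__ void kernel( /*...*/ bool *dev_result_found)
{
    if (*dev_result_found) return;   // result already found by an earlier launch
    // ... original work ...
}

// On the host: read the flag back only every 50 launches.
for (int i = 0; i < 1000000; ++i) {
    kernel<<<blocks, threads>>>( /*...*/, dev_result_found);
    if (i % 50 == 49) {
        cudaMemcpy(&result_found, dev_result_found, sizeof(bool), cudaMemcpyDeviceToHost);
        if (result_found) break;
    }
}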
I suspect a combination of the two methods will give best performance.
2) Why is it so slow after all?
Hard to say. I'd suggest your numbers may be a bit suspicious - I guess you're using the API trace from the profiler. This measures time as seen by the CPU, so if you launch an asynchronous call (kernel launch) followed by a synchronous call (cudaMemcpy), the cost of synchronisation will be measured with the memcpy.
Still, if your kernel is relatively quick-running the overhead of the copy may be significant. You are also unable to hide any launch overheads, as you cannot schedule the next launch asynchronously.
__global__ void ChildKernel1(void* data){
    //Operate on data
}
__global__ void ChildKernel2(void* data){
    //Operate on data
}
__global__ void ChildKernel3(void* data){
    //Operate on data
}
__global__ void ChildKernel4(void* data){
    //Operate on data
}
__global__ void ParentKernel(void *data){
    ChildKernel1<<<16, 2>>>(data);
    ChildKernel2<<<64, 3>>>(data);
    ChildKernel3<<<32, 4>>>(data);
    ChildKernel4<<<16, 5>>>(data);
}
// In Host Code
ParentKernel<<<256, 64>>>(data);
I want to run all the child kernels in parallel. So what do I do?
Have you read the dynamic parallelism section of the programming guide?
As you've shown it, each thread in your ParentKernel will execute the code consisting of the 4 calls to child kernels. This complicates the answer.
So, with respect to the individual threads, yes, the various kernels may run in parallel, between threads.
But remember that cuda kernels issued by a given thread into the same stream will be serialized. Therefore, with respect to each individual thread in ParentKernel, the individual child kernels launched from that thread will be serialized.
To get the kernels in a single thread to have the possibility to run in parallel, launch them into separate streams.
Finally, the big caveat: Just like asynchronous concurrent execution of kernels launched from the host, device side kernels may only run "in parallel" as resources permit. There is no guarantee of parallel execution of kernels.
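A minimal sketch of what that could look like for the ParentKernel above, with only two of the children shown and the launches restricted to one thread just to keep the example small (device-created streams must use the cudaStreamNonBlocking flag):
__global__ void ParentKernel(void *data)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        cudaStream_t s1, s2;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

        ChildKernel1<<<16, 2, 0, s1>>>(data);   // separate streams, so these two
        ChildKernel2<<<64, 3, 0, s2>>>(data);   // may run concurrently (resources permitting)

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }
}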
I've a program that uses three kernels. In order to get the speedups, I was launching a dummy kernel to force context creation, as follows:
__global__ void warmStart(int* f)
{
*f = 0;
}
which is launched before the kernels I want to time as follows:
int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");
I also read about other, simpler ways to create a context, such as cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives worse times than using the dummy kernel.
The execution times of the program, after forcing the context, are 0.000031 seconds for the dummy kernel and 0.000064 seconds for both cudaDeviceSynchronize() and cudaFree(0). The times were obtained as the mean of 10 individual executions of the program.
Therefore, the conclusion I've reached is that launching a kernel initializes something that is not initialized when creating a context in the canonical way.
So, what's the difference between creating a context in these two ways, using a kernel and using an API call?
I ran the test on a GTX 480, using CUDA 4.0 under Linux.
Each CUDA context has memory allocations that are required to execute a kernel but that are not required to synchronize, allocate memory, or free memory. The initial allocation of the context memory and the resizing of these allocations are deferred until a kernel requires these resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap.
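For illustration only, these are the same per-context allocations that the host can size explicitly with cudaDeviceSetLimit (the sizes below are arbitrary):
cudaDeviceSetLimit(cudaLimitStackSize,      4 * 1024);          // per-thread stack / local memory
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);  // device-side malloc heap
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 * 1024 * 1024);   // printf buffer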
When is calling the cudaDeviceSynchronize function really needed?
As far as I understand from the CUDA documentation, CUDA kernels are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize, except one before the time measurement. I have found that I get the same result but with a speedup between 7-12x (depending on the matrix sizes).
So, the question is whether there are any reasons to use cudaDeviceSynchronize apart from time measurement.
For example:
Is it needed before copying data from the GPU back to the host with cudaMemcpy?
If I do matrix multiplications like
C = A * B
D = C * F
should I put cudaDeviceSynchronize between both?
From my experiment, it seems that I don't.
Why does cudaDeviceSynchronize slow the program down so much?
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behavior) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel start execution, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes
So in your example, there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any).
cudaDeviceSynchronize may cause some slowdown, but 7-12x seems too much. Maybe there is some problem with time measurement, or maybe the kernels are really fast and the overhead of explicit synchronization is huge relative to the actual computation time.
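For the debugging use, a common shape (the macro name is just an example) is to synchronize after every launch, so the failing kernel is the one that gets reported:
#include <cstdio>

#define CHECK_KERNEL() do {                                                  \
    cudaError_t e = cudaGetLastError();                /* launch errors    */\
    if (e == cudaSuccess) e = cudaDeviceSynchronize(); /* execution errors */\
    if (e != cudaSuccess)                                                    \
        printf("CUDA error %s at %s:%d\n",                                   \
               cudaGetErrorString(e), __FILE__, __LINE__);                   \
} while (0)

kernel1<<<X,Y>>>(...); CHECK_KERNEL();   // if kernel1 faults, it is reported here
kernel2<<<X,Y>>>(...); CHECK_KERNEL();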
One situation where using cudaDeviceSynchronize() is appropriate is when you have several cudaStreams running and you would like to have them exchange some information. A real-life case of this is parallel tempering in quantum Monte Carlo simulations. In this case, we would want to ensure that every stream has finished running some set of instructions and gotten some results before they start passing messages to each other, or we would end up passing garbage information.
The reason using this command slows the program down so much is that cudaDeviceSynchronize() forces the program to wait for all previously issued commands in all streams on the device to finish before continuing (from the CUDA C Programming Guide). As you said, kernel execution is normally asynchronous, so while the GPU device is executing your kernel the CPU can continue to work on some other commands, issue more instructions to the device, etc., instead of waiting. However, when you use this synchronization command, the CPU is instead forced to idle until all the GPU work has completed before doing anything else.
This behaviour is useful when debugging, since you may have a segfault occurring at seemingly "random" times because of the asynchronous execution of device code (whether in one stream or many). cudaDeviceSynchronize() will force the program to ensure the streams' kernels/memcpys are complete before continuing, which can make it easier to find out where the illegal accesses are occurring (since the failure will show up during the sync).
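The rough shape of that multi-stream case (the kernel, state and exchange names are placeholders):
stepReplica<<<grid, block, 0, stream0>>>(state0);   // replica advancing in stream 0
stepReplica<<<grid, block, 0, stream1>>>(state1);   // replica advancing in stream 1
cudaDeviceSynchronize();            // both streams have finished, so the results are valid
exchangeReplicas(state0, state1);   // only now is it safe to pass information between them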
When you want your GPU to start processing some data, you typically do a kernel invocation.
When you do so, your device (the GPU) will start doing whatever it is you told it to do. However, unlike in a normal sequential program, your host (the CPU) will continue to execute the next lines of code in your program. cudaDeviceSynchronize makes the host (the CPU) wait until the device (the GPU) has finished executing ALL the threads you have started, and thus your program will continue as if it were a normal sequential program.
In small, simple programs you would typically use cudaDeviceSynchronize when you use the GPU to make computations, to avoid timing mismatches between the CPU requesting the result and the GPU finishing the computation. Using cudaDeviceSynchronize makes it a lot easier to code your program, but there is one major drawback: your CPU is idle all the time while the GPU makes the computation. Therefore, in high-performance computing you often strive towards having your CPU make computations while it waits for the GPU to finish.
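In sketch form, that overlap looks something like this (all the names are placeholders):
kernel<<<grid, block>>>(d_data);   // asynchronous: the call returns immediately
doIndependentCpuWork();            // the CPU keeps computing while the GPU runs
cudaDeviceSynchronize();           // block only once the GPU result is actually needed
consumeGpuResult(d_data);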
You might also need to call cudaDeviceSynchronize() after launching kernels from kernels (Dynamic Parallelism).
From this post CUDA Dynamic Parallelism API and Principles:
If the parent kernel needs results computed by the child kernel to do its own work, it must ensure that the child grid has finished execution before continuing by explicitly synchronizing using cudaDeviceSynchronize(void). This function waits for completion of all grids previously launched by the thread block from which it has been called. Because of nesting, it also ensures that any descendants of grids launched by the thread block have completed.
...
Note that the view of global memory is not consistent when the kernel launch construct is executed. That means that in the following code example, it is not defined whether the child kernel reads and prints the value 1 or 2. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.
__device__ int v = 0;

__global__ void child_k(void) {
    printf("v = %d\n", v);
}

__global__ void parent_k(void) {
    v = 1;
    child_k<<<1, 1>>>();
    v = 2; // RACE CONDITION
    cudaDeviceSynchronize();
}
There is very little information on the dynamic parallelism of Kepler. From the description of this new technology, does it mean the issue of thread control-flow divergence within the same warp is solved?
It allows recursion and launching kernels from device code; does it mean that control paths in different threads can be executed simultaneously?
Take a look at this paper.
Dynamic parallelism, flow divergence and recursion are separate concepts. Dynamic parallelism is the ability to launch threads from within a thread. This means, for example, you may do this:
__global__ void t_father(...) {
    ...
    t_child<<<BLOCKS, THREADS>>>();
    ...
}
I personally investigated this area: when you do something like this, i.e. when t_father launches t_child, the whole GPU's resources are distributed again among them, and t_father waits until all the t_child launches have finished before it can go on (see also Slide 25 of this paper).
Recursion has been available since Fermi and is the ability for a thread to call itself without any other thread/block re-configuration.
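As a tiny, generic illustration of that kind of recursion (not taken from the question):
// A __device__ function calling itself; this needs compute capability 2.0 (Fermi) or higher.
__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);
}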
Regarding flow divergence, I guess we will never see threads within a warp executing different code simultaneously.
No. The warp concept still exists. All the threads in a warp are SIMD (Single Instruction, Multiple Data), which means that at any given time they run one instruction. Even when you call a child kernel, the GPU designates one or more warps to your call. Keep 3 things in mind when you're using dynamic parallelism:
The deepest nesting level you can go to is 24 (CC = 3.5).
The number of dynamic kernels running at the same time is limited (default 4096) but can be increased (see the sketch after this list).
Keep the parent kernel busy after the child kernel call, otherwise there is a good chance you will waste resources.
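The launch-count limit is raised from the host with cudaDeviceSetLimit before the parent kernel is launched; the related synchronization-depth limit is set the same way (the values below are only examples):
// Buffer space for more outstanding (not yet started) child-kernel launches.
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 8192);
// Deepest nesting level at which a device-side cudaDeviceSynchronize may be called
// (still bounded by the 24-level hardware maximum).
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);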
There's a sample CUDA source in this NVIDIA presentation on slide 9.
__global__ void convolution(int x[])
{
    for (int j = 1; j <= x[blockIdx.x]; ++j)
        kernel<<< ... >>>(blockIdx.x, j);
}
It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial compute domains of various sizes (slide 14).
The global compute domain and the partitioning of it are still static, so you can't actually go and change this DURING GPU computation to e.g. spawn more kernel executions because you've not reached the end of your evaluation function yet. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.