CUDA timers - CPU vs. GPU?

I'm trying to understand the difference between timing kernel execution using CUDA timers (events) and regular CPU timing methods (gettimeofday on Linux, etc.).
From reading http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ section 8.1, it seems to me that the only real difference is that when using CPU timers one needs to remember to synchronize the GPU because calls are asynchronous. Presumably the CUDA event APIs do this for you.
So is this really a matter of:
With GPU events you don't need to explicitly call cudaDeviceSynchronize
With GPU events you get an inherently platform-independent timing API, while with the CPU you need to use separate APIs per OS
?
Thanks in advance

You've got it down. Because the GPU operates asynchronously from the CPU, when you launch a GPU kernel the CPU can continue on its merry way. When timing, this means you could reach the end of your timing code (i.e. record the duration) before the GPU returns from its kernel. This is why we synchronize: to make sure the kernel has finished before we move forward with the CPU code. This is particularly important when we need the results from the GPU kernel for a following operation (i.e. steps in an algorithm).
If it helps, you can think of cudaEventSynchronize as a CPU-GPU sync point: the CPU timer depends on both CPU and GPU code, whereas the CUDA timer events only depend on the GPU code. And because those CUDA timing events are compiled by nvcc specifically for CUDA platforms, they're CPU-platform independent but GPU-platform dependent.
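For illustration, here is a minimal sketch of the CPU-timer approach; the kernel name and launch configuration are placeholders, and std::chrono stands in for gettimeofday or any other host clock. The key point is the cudaDeviceSynchronize between the launch and the second timestamp:

    // Host-side timing of a kernel (sketch; myKernel is a placeholder).
    #include <chrono>
    #include <cstdio>

    __global__ void myKernel() { /* work to be timed */ }

    int main() {
        cudaFree(0);                             // force context creation before timing
        auto start = std::chrono::high_resolution_clock::now();
        myKernel<<<256, 256>>>();
        cudaDeviceSynchronize();                 // without this, we'd only time the launch
        auto stop = std::chrono::high_resolution_clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        printf("kernel took %.3f ms (host timer)\n", ms);
        return 0;
    }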

Related

Is it possible for the CPU to context-switch after initiating GPU code (a CUDA kernel) that has not finished?

Basically, in CPU-GPU communication, if a process running on a CPU launches a CUDA kernel, that process can still keep issuing its own code as long as it does not depend on the result of the CUDA kernel.
But is it possible for a process running on the CPU to be context-switched even after launching a CUDA kernel that has not finished?
If it is possible, what happens internally?
CPU threads can context-switch at any time, including during a cudaDeviceSynchronize() call waiting for results from an (asynchronous) kernel launch.
You can further facilitate context-switching during synchronization by calling cudaSetDeviceFlags() with the cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync flags, which will yield the processor quicker than the cudaDeviceScheduleSpin or cudaDeviceScheduleAuto settings.
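For example, a minimal sketch of setting the blocking-sync scheduling flag (the kernel here is just a stand-in for real work); the flag must be set before the CUDA context is created on the device:

    // Ask the runtime to block/yield rather than spin while waiting on the GPU.
    #include <cstdio>

    __global__ void longKernel() { /* placeholder for long-running work */ }

    int main() {
        // Must come before any call that creates the context on this device.
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

        longKernel<<<1, 1>>>();
        cudaDeviceSynchronize();   // the host thread sleeps here instead of spinning,
                                   // making it easy for the OS to context-switch
        printf("done\n");
        return 0;
    }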

CUDA: it is possible for a kernel to return a break to CPU?

I'm writing a C program using CUDA parallelization, and I was wondering if it is possible for a kernel to return a break to the CPU.
My program essentially runs a for loop, and inside that loop I perform several parallel actions; at the start of each iteration I have to check a variable (measuring the improvement of the iteration just completed) which resides on the GPU.
My desire is that the check on that variable returns a break to the CPU in order to exit the for loop (I do the check using a trivial <<<1,1>>> kernel).
I've tried copying that variable back to the CPU and doing the check there but, as I feared, it slows down the overall execution.
Any advice?
There is presently no way for any running code on a CUDA capable GPU to preempt running code on the host CPU. So although it isn't at all obvious what you are asking about, I'm fairly certain the answer is no, simply because there is no host-side preempt or interrupt mechanism available in device code.
There is no connection between CPU code and GPU code.
All you can do while working with CUDA is:
From the CPU side, allocate memory on the GPU
Copy data to the GPU
Launch execution of prewritten instructions on the GPU (the GPU is a black box to the CPU)
Read the data back.
So, thinking about these steps in a loop, all that is left to you is to copy the result back, check it on the CPU, and break the loop if needed.
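A minimal sketch of that loop pattern, with illustrative names (iterate, d_improvement, TOL) that are not from the original question:

    // Each iteration runs on the GPU; the host copies back a scalar and decides
    // whether to stop. Ordering within the default stream means the memcpy only
    // happens after the kernel has finished.
    #include <cstdio>

    __global__ void iterate(float *d_improvement) {
        // ... one iteration of the algorithm; record how much it improved ...
        *d_improvement = 0.0f;   // placeholder value
    }

    int main() {
        const float TOL = 1e-6f;
        float *d_improvement, h_improvement;
        cudaMalloc((void **)&d_improvement, sizeof(float));

        for (int it = 0; it < 1000; ++it) {
            iterate<<<1, 1>>>(d_improvement);
            cudaMemcpy(&h_improvement, d_improvement, sizeof(float),
                       cudaMemcpyDeviceToHost);
            if (h_improvement < TOL)   // the "break" happens on the CPU
                break;
        }
        cudaFree(d_improvement);
        return 0;
    }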

Understanding CUDA dependency check

CUDA C Programming Guide provides the following statements:
For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
‣ Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
I am quite lost here. What is a dependency check? Can I say that a kernel execution on some device memory requires a dependency check on all the previous kernels or memory transfers involving the same device memory? If this is true (maybe it isn't), this dependency check blocks all later kernels from any other stream according to the above statement, and therefore no asynchronous or concurrent execution will happen afterwards, which seems not to be true.
Any explanation or elaboration will be appreciated!
First of all I suggest you visit NVIDIA's webinar site and watch the webinar on Concurrency & Streams.
Furthermore consider the following points:
commands issued to the same stream are treated as dependent
e.g. you would insert a kernel into a stream after a memcpy of some data that this kernel will access; the kernel "depends" on the data being available (see the sketch below)
commands in the same stream are therefore guaranteed to be executed sequentially (or synchronously, which is often used as a synonym)
commands in different streams however are independent and can be run concurrently
so dependencies are only known to the programmer and are expressed using streams (to avoid errors)!
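A minimal sketch of expressing such dependencies with streams (kernelA, kernelB and the sizes are placeholders; the host buffer would need to be pinned for the copy to be truly asynchronous):

    // Same stream = ordered/dependent; different streams = independent.
    __global__ void kernelA(const float *in, float *out, int n) { /* ... */ }
    __global__ void kernelB(float *data, int n) { /* ... */ }

    void launchWork(const float *h_in, float *d_in, float *d_out,
                    float *d_other, int n) {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Same stream: kernelA runs only after the copy it depends on completes.
        cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                        cudaMemcpyHostToDevice, s1);
        kernelA<<<(n + 255) / 256, 256, 0, s1>>>(d_in, d_out, n);

        // Different stream: kernelB is independent of the work in s1 and may
        // run concurrently with it, hardware and resources permitting.
        kernelB<<<(n + 255) / 256, 256, 0, s2>>>(d_other, n);

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }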
The following applies only to devices with compute capability 3.0 or lower (as stated in the guide). If you want to know more about the changes to stream scheduling behaviour with compute capability 3.5, have a look at HyperQ and the corresponding example. At this point I also want to reference this thread where I found the HyperQ examples :)
About your second question: I do not quite understand what you mean by a "kernel execution on some device memory" or a "kernel execution involving device memory", so I reduced your statement to:
A kernel execution requires a dependency check on all the previous kernels and memory transfers.
Better would be:
A CUDA operation requires a dependency check to see whether preceding CUDA operations in the same stream have completed.
I think your problem here is with the expression "started executing". That means there can still be independent (that is on different streams) kernel launches, which will be concurrent with the previous kernels, provided they have all started executing and enough device resources are available.

CUDA: CPU code in parallel to GPU code

I have a program where I do a bunch of calculations on the GPU, then I do memory operations with those results on the CPU, then I take the next batch of data and do the same all over. Now it would be a lot faster if I could do the first set of calculations and then start with the second batch whilst my CPU churns away at the memory operations. How would I do that?
All CUDA kernel calls (e.g. function<<<blocks, threads>>>()) are asynchronous -- they return control immediately to the calling host thread. Therefore you can always perform CPU work in parallel with GPU work just by putting the CPU work after the kernel call.
If you also need to transfer data from GPU to CPU at the same time, you will need a GPU that has the deviceOverlap field set to true (check using cudaGetDeviceProperties()), and you need to use cudaMemcpyAsync() from a separate CUDA stream.
There are examples demonstrating this functionality in the NVIDIA CUDA SDK -- for example, the "simpleStreams" and "asyncAPI" samples.
The basic idea can be something like this:
Do 1st batch of calculations on GPU
Enter a loop: {
Copy results from device mem to host mem
Do next batch of calculations on the GPU (the kernel launch is asynchronous and control returns immediately to the CPU; see the sketch below)
Process results of the previous iteration on CPU
}
Copy results from last iteration from device mem to host mem
Process results of last iteration
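A minimal sketch of that outline (computeBatch, processOnCpu, NBATCH and N are illustrative names, not from the original question); the kernel launch for batch b returns immediately, so the CPU processes batch b-1 while the GPU computes batch b:

    // Overlap CPU post-processing of batch b-1 with GPU computation of batch b.
    #include <cstdio>

    __global__ void computeBatch(float *d_out, int n, int batch) { /* ... */ }
    void processOnCpu(const float *results, int n) { /* ... */ }

    int main() {
        const int N = 1 << 20, NBATCH = 8;
        float *d_out, *h_out;
        cudaMalloc((void **)&d_out, N * sizeof(float));
        cudaMallocHost((void **)&h_out, N * sizeof(float));  // pinned memory for async copies

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // 1st batch of calculations on the GPU
        computeBatch<<<(N + 255) / 256, 256, 0, stream>>>(d_out, N, 0);

        for (int b = 1; b <= NBATCH; ++b) {
            // Copy results of batch b-1 to the host (waits for that kernel).
            cudaMemcpyAsync(h_out, d_out, N * sizeof(float),
                            cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);

            // Launch batch b; the call returns immediately...
            if (b < NBATCH)
                computeBatch<<<(N + 255) / 256, 256, 0, stream>>>(d_out, N, b);

            // ...so this CPU work runs in parallel with the GPU kernel.
            processOnCpu(h_out, N);
        }

        cudaStreamDestroy(stream);
        cudaFreeHost(h_out);
        cudaFree(d_out);
        return 0;
    }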
You can get finer control over asynchronous work between CPU and GPU by using cudaMemcpyAsync, cudaStream and cudaEvent.
As #harrism said you need your device to support deviceOverlap to do memory transfers and execute kernels at the same time but even if it does not have that option you can at least execute a kernel asynchronously with other computations on the CPU.
Edit: deviceOverlap has been deprecated; one should use the asyncEngineCount property instead.

CPU and GPU timer in cuda visual profiler

So there are 2 timers in the CUDA Visual Profiler,
GPU Time: It is the execution time for the method on GPU.
CPU Time: It is sum of GPU time and CPU overhead to launch that Method. At driver generated data level, CPU Time is only CPU overhead to launch the Method for non-blocking Methods; for blocking methods it is sum of GPU time and CPU overhead. All kernel launches by default are non-blocking. But if any profiler counters are enabled kernel launches are blocking. Asynchronous memory copy requests in different streams are non-blocking.
If I have a real program, what's the actual execution time? When I measure the time, there is a GPU timer and a CPU timer as well; what's the difference?
You're almost there -- now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no right answer to this, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you are trying to optimize computation, but they may not include things like waiting that actually can be pretty important. You mention "the actual execution time" -- that's a start. Do you mean the complete execution time of the problem -- from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.
For numbers like that, in Unix-type systems I like to just measure the entire runtime of the program: /bin/time myprog; presumably there's a Windows equivalent. That's nice because it's completely unambiguous. On the other hand, because it's a total, it's far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to results.
If you want elapsed time of some set of computations, CUDA has very handy cudaEvent* functions which can be placed at various parts of the code -- see the CUDA Best Practices Guide, section 2.1.2, Using CUDA GPU Timers -- you can put these before and after important bits of code and print the results.
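For instance, a minimal sketch of bracketing a piece of GPU work with event timers (importantKernel and the launch configuration are placeholders):

    // Time a kernel with cudaEvent timers; elapsed time is reported in ms.
    #include <cstdio>

    __global__ void importantKernel() { /* work to be timed */ }

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);     // enqueue "start" in the default stream
        importantKernel<<<128, 128>>>();
        cudaEventRecord(stop, 0);      // enqueue "stop" right after the kernel
        cudaEventSynchronize(stop);    // wait until the GPU has reached "stop"

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel took %.3f ms (GPU events)\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }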
The GPU timer is based on events.
That means that when an event is recorded, it is placed in a queue on the GPU for serving, so there is a small overhead there too.
From what I have measured, though, the differences are of minor importance.