Difference in time reported by NVVP and counters - cuda

I've been running CUDA kernels and I observe a considerable difference between the kernel execution time reported by the GPU counters and the time reported by NVVP. Why is such a difference usually observed?

Nsight Visual Studio Edition and the Visual Profiler support two mechanisms for capturing the duration of a kernel. Both of these methods will result in a value smaller and more accurate than what is reported by CUevent/cudaEvent. The methods are as follows:
Concurrent Kernel Timing
This is the default mode used by Nsight 2.x and Visual Profiler 5.0 to generate a timeline. The duration of a kernel is defined as the time from when the kernel code starts executing on the device to the time that it completes. This cannot be measured using CUDA events.
Serialized Kernel Timing
This is the default mode used by tools when collecting PM counters for each kernel. The duration of a kernel is defined as the time the GPU processes the launch request until the GPU idles after completion of the kernel. This mode specifically disables concurrent kernel execution. In almost all cases the reported duration will be slightly larger than the concurrent kernel trace duration as it includes time for the GPU to launch the first block and time for the GPU to complete all memory stores.
CUDA Event Range Timing
CUDA event timing is done by calling cu/cudaEventRecord before and after the kernel launch on the same stream. Each event record inserts a command into the GPU push buffer. When the command reaches the GPU it writes a timestamp to memory. It is possible to push two event records without a launch. This allows a developer to measure the GPU time between the two timestamp commands. This method has the following disadvantages, which is why I encourage developers to use the tools (Nsight, Visual Profiler, and CUPTI):
a. The elapsed time between submitting the start event record and the launch can be affected by CPU overhead. Launch overhead is 5-8µs on Linux/TCC and potentially much higher on WDDM.
b. The GPU can context switch between the start event record and the kernel execution.
c. The start event record will include launch overhead including time to update driver buffers that need to be resized, copy parameters, copy texture bindings, ...
d. The elapsed time between submitting the kernel and the end event record can impact the timing.
e. The GPU can context switch between the end of the kernel execution and the end event record.
f. Incorrect use of events will break concurrent kernel execution.
Each of these modes will report a different value for the duration. Furthermore, the definition of duration used by the tools differs from the one available through the use of events.
The NVIDIA tools define duration, as closely as possible, as the time from when the GPU starts working on the kernel to when the GPU completes work on the kernel. If a developer is interested in collecting this information they should look at the CUPTI SDK included with the toolkit.
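For reference, a minimal sketch of the event-range timing pattern described above (the kernel, sizes, and stream below are placeholders, not taken from the question):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel, only here to have something to time.
    __global__ void dummyKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Record events on the same stream immediately before and after the launch.
        cudaEventRecord(start, 0);
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);

        // Wait until the GPU has written the stop timestamp, then read the
        // elapsed time between the two GPU timestamps.
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("event-range duration: %.3f ms\n", ms);

        // As noted above, this interval can include launch overhead and context
        // switches, so it is typically larger than the tool-reported duration.

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }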

Related

Can execution of CUDA kernels from two contexts overlap?

From this, it appears that two kernels from different contexts cannot execute concurrently. In this regard, I am confused when reading CUPTI activity traces from two applications. The traces show kernel_start_timestamp, kernel_end_timestamp and duration (which is kernel_end_timestamp - kernel_start_timestamp).
Application 1:
.......
8024328958006530 8024329019421612 61415082
.......
Application 2:
.......
8024328940410543 8024329048839742 108429199
To make the long timestamp and duration more readable:
Application 1 : kernel X of 61.415 ms ran from xxxxx28.958 s to xxxxx29.019 s
Application 2 : kernel Y of 108.429 ms ran from xxxxx28.940 s to xxxxx29.0488 s
So, the execution of kernel X completely overlaps with that of kernel Y.
I am using the /path_to_cuda_install/extras/CUPTI/sample/activity_trace_async for tracing the applications. I modified CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT to 1. I have only enabled tracing for CUPTI_ACTIVITY_KIND_MEMCPY, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL and CUPTI_ACTIVITY_KIND_OVERHEAD. My applications are calling cuptiActivityFlushAll(0) once in each of their respective logical timesteps.
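In case it matters, the relevant part of the tracing setup looks roughly like this (a sketch following the structure of the activity_trace_async sample; error checking is omitted, and the kernel record struct version may differ with the CUPTI release):

    #include <cstdio>
    #include <cstdlib>
    #include <cstdint>
    #include <cupti.h>

    static const size_t BUF_SIZE = 32 * 1024;

    // CUPTI asks for a host buffer to fill with activity records.
    static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                         size_t *maxNumRecords) {
        *buffer = (uint8_t *)malloc(BUF_SIZE);
        *size = BUF_SIZE;
        *maxNumRecords = 0;  // as many records as fit in the buffer
    }

    // CUPTI hands back a filled buffer; walk the records and print kernel timestamps.
    static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                         uint8_t *buffer, size_t size,
                                         size_t validSize) {
        CUpti_Activity *record = NULL;
        while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
            if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL ||
                record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
                // CUpti_ActivityKernel2 matches the CUDA 6.5-era CUPTI headers;
                // newer releases use a later struct version.
                CUpti_ActivityKernel2 *k = (CUpti_ActivityKernel2 *)record;
                printf("%s %llu %llu %llu\n", k->name,
                       (unsigned long long)k->start,
                       (unsigned long long)k->end,
                       (unsigned long long)(k->end - k->start));
            }
        }
        free(buffer);
    }

    void initTracing() {
        size_t attrValue, attrSize = sizeof(size_t);

        cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);

        // Values from the description above.
        attrValue = 1024;
        cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                                  &attrSize, &attrValue);
        attrValue = 1;
        cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT,
                                  &attrSize, &attrValue);

        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_OVERHEAD);
    }

    // Each logical timestep then ends with a call to cuptiActivityFlushAll(0).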
Are these erroneous CUPTI values that I am seeing due to improper usage or is it something else?
Clarification : MPS not enabled, running on single GPU
UPDATE: bug filed, this seems to be a known problem for CUDA 6.5
Waiting for a chance to test this with CUDA 7 (have a GPU shared between multiple users and need a window of inactivity for temporary switch to CUDA 7)
I don't know how you set up the CUPTI activity traces, but two kernels can share a time span on a single GPU even without the MPS server, though only one of them will be running on the GPU at any instant.
If the CUDA MPS server is not in use, kernels from different contexts cannot overlap. Assuming you are not using the MPS server, a time-sliced scheduler decides which context may access the GPU at any given moment; without MPS, a context can only access the GPU in the time slots that the time-sliced scheduler assigns to it. Thus only kernels from a single context are running on a GPU at any one time (without the MPS server).
Note that it is possible for multiple kernels to share a time span on a GPU, but within that time span only kernels from a single context can access the GPU resources (I am also assuming that you are using a single GPU).
For more information you can also check the MPS Service document
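To illustrate the distinction, here is a small sketch (kernel name and sizes are placeholders): within one context, kernels launched into different streams may run concurrently; the same two kernels launched from two separate processes (two contexts, no MPS) would instead be time-sliced.

    #include <cuda_runtime.h>

    // Placeholder long-running kernel.
    __global__ void spin(float *x, int n, int iters) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            for (int k = 0; k < iters; ++k)
                v = v * 1.0000001f + 0.0000001f;
            x[i] = v;
        }
    }

    int main() {
        const int n = 1 << 16;
        float *d_a, *d_b;
        cudaMalloc(&d_a, n * sizeof(float));
        cudaMalloc(&d_b, n * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Same context, different streams: on a device that supports concurrent
        // kernel execution these two launches may genuinely overlap.
        spin<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n, 100000);
        spin<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n, 100000);
        cudaDeviceSynchronize();

        // If instead these kernels came from two different processes (two
        // contexts) without MPS, the time-sliced scheduler would alternate
        // between the contexts, so only one kernel would be executing at any
        // given instant even if their reported time spans overlap.

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }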

CUDA timers - CPU vs. GPU?

I'm trying to understand the difference between timing kernel execution using CUDA timers (events) and regular CPU timing methods (gettimeofday on Linux, etc.).
From reading http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ section 8.1, it seems to me that the only real difference is that when using CPU timers one needs to remember to synchronize the GPU because calls are asynchronous. Presumably the CUDA event APIs do this for you.
So is this really a matter of:
With GPU events you don't need to explicitly call cudaDeviceSynchronize
With GPU events you get an inherently platform-independent timing API, while with the CPU you need to use separate APIs per OS
?
Thanks in advance
You've got it down. Because the GPU operates asynchronously from the CPU, when you launch a GPU kernel the CPU can continue on its merry way. When timing, this means you could reach the end of your timing code (i.e. record the duration) before the GPU returns from its kernel. This is why we synchronize: to make sure the kernel has finished before we move forward with the CPU code. This is particularly important when we need the results from the GPU kernel for a following operation (i.e. steps in an algorithm).
If it helps, you can think of cudaEventSynchronize as a CPU-GPU sync point: the CPU timer depends on both CPU and GPU code, whereas the CUDA timer events only depend on the GPU code. And because those CUDA timing events are compiled by nvcc specifically for CUDA platforms, they're CPU platform independent but GPU platform dependent.
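A side-by-side sketch of the two approaches (the kernel and sizes are placeholders; std::chrono stands in for gettimeofday so the CPU-side timing stays platform independent):

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));

        // 1) CPU timer. The explicit synchronize is required because the launch
        //    returns immediately; without it you would only measure launch overhead.
        auto t0 = std::chrono::high_resolution_clock::now();
        work<<<(n + 255) / 256, 256>>>(d_x, n);
        cudaDeviceSynchronize();
        auto t1 = std::chrono::high_resolution_clock::now();
        double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

        // 2) CUDA events. The timestamps are written by the GPU, so no
        //    cudaDeviceSynchronize is needed, only a wait on the stop event
        //    before reading the elapsed time on the host.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        work<<<(n + 255) / 256, 256>>>(d_x, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float gpuMs = 0.0f;
        cudaEventElapsedTime(&gpuMs, start, stop);

        printf("CPU timer: %.3f ms, CUDA events: %.3f ms\n", cpuMs, gpuMs);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_x);
        return 0;
    }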

Understanding CUDA dependency check

CUDA C Programming Guide provides the following statements:
For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
‣ Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
I am quite lost here. What is a dependency check? Can I say that a kernel execution on some device memory requires a dependency check on all previous kernels or memory transfers involving the same device memory? If this is true (maybe it is not), then according to the statement above this dependency check blocks all later kernel launches from any other stream, and therefore no asynchronous or concurrent execution would happen afterwards, which does not seem right.
Any explanation or elaboration will be appreciated!
First of all, I suggest you visit NVIDIA's webinar site and watch the webinar on Concurrency & Streams.
Furthermore consider the following points:
commands issued to the same stream are treated as dependent
e.g. you would insert a kernel into a stream after a memcopy of some data this kernel will access. The kernel "depends" on the data being available.
commands in the same stream are therefore guaranteed to be executed sequentially (or synchronously, which is often used as a synonym)
commands in different streams however are independent and can be run concurrently
so dependencies are only known to the programmer and are expressed using streams (to avoid errors)!
The following corresponds only to devices with compute capability 3.0 or lower (as stated in the guide). If you want to know more about the changes to stream scheduling behaviour with compute capability 3.5, have a look at HyperQ and the corresponding example. At this point I also want to reference this thread where I found the HyperQ examples :)
About your second question: I do not quite understand what you mean by a "kernel execution on some device memory" or a "kernel execution involving device memory", so I reduced your statement to:
A kernel execution requires a dependency check on all the previous kernels and memory transfers.
Better would be:
A CUDA operation requires a dependency check to see whether preceding CUDA operations in the same stream have completed.
I think your problem here is with the expression "started executing". That means there can still be independent (that is on different streams) kernel launches, which will be concurrent with the previous kernels, provided they have all started executing and enough device resources are available.
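To make the same-stream vs. different-stream distinction concrete, a small sketch (kernel and buffer names are placeholders; the contents of d_b are left uninitialized since this only illustrates scheduling):

    #include <cuda_runtime.h>

    __global__ void process(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_a, *d_a, *d_outA, *d_b, *d_outB;
        cudaMallocHost(&h_a, bytes);          // pinned host memory for async copies
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_outA, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_outB, bytes);

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Same stream: the kernel "depends" on the copy, so stream ordering
        // guarantees it will not start until the copy has finished.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);
        process<<<(n + 255) / 256, 256, 0, s1>>>(d_a, d_outA, n);

        // Different stream: this launch has no expressed dependency on the work
        // in s1, so it is free to run concurrently with it (resources permitting).
        process<<<(n + 255) / 256, 256, 0, s2>>>(d_b, d_outB, n);

        cudaDeviceSynchronize();

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFreeHost(h_a);
        cudaFree(d_a); cudaFree(d_outA); cudaFree(d_b); cudaFree(d_outB);
        return 0;
    }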

'Flush records'-Warning in Parallel Nsight profiling results

I'm trying to profile my CUDA-Kernels running on a Windows 7 32 bit machine with a NVIDIA GTX 480 board. I'm using the CUDA 4.1 32 bit toolkit and the Parallel Nsight 2.1 edition for VS 2010.
The profiling results of my program always show the same warning on an irregular basis:
Message: Flush records, Event Type: Range, Level: 50
After this event there is always a processing break of several milliseconds. Then the GPU proceeds the computing at the speed it had before.
I haven't found any information about this warning in the CUDA documentation or on the web, and I don't even know if it is a problem that only occurs during profiling.
Has anyone an idea what this warning is about and how to avoid it?
The warning "Flush Record" is used to show when the Nsight CUDA Trace Activity is adding additional overhead to your application. This is to allow you to interpret periods of high CPU activity. There is no way to remove this warning. Your application is not doing anything wrong.
The Nsight CUDA Trace Activity collects timestamps for the start and end of GPU work including kernel launches, memory copies, and memory sets. When an application launches a task on the GPU the tool allocates a trace record for the task and programs the GPU to write a timestamp into the record. The collection of timestamps is done in a way that should not break concurrency and should not stall the CPU. When the work is completed the tool collects the information and streams it to memory. The Flush range includes the time to collect the results and write out the information. This can include time to perform additional kernel launches and copy memory from device to host. The tool will collect results when the application synchronizes a context (cuCtxSynchronize or cuda{Thread, Device}Synchronize) or when it runs out of trace records.
I will enter a bug to improve the user documentation and tool tips.

Time between Kernel Launch and Kernel Execution

I'm trying to optimize my CUDA program by using the Parallel Nsight 2.1 edition for VS 2010.
My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.
One cycle in the program consists of a copy of host data to the device, execution of the kernels, and a copy of the results from the device back to the host.
As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernels in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before the launch of the kernels in the different streams.
What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 us to launch the kernel which is a huge overhead in a processing cycle of less than 1 ms.
Furthermore there is no overlapping of kernel execution and the data copy of the results back to the host, which increases the overhead even more.
Are there any obvious reasons for this behavior?
There are three issues that I can tell by the trace.
Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA Driver API trace enabled. If you were to disable CUDA runtime trace I would guess that you would reduce the width by 50 µs.
Since you are on a GTX 480 on Windows 7 you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work, which introduces a lot of overhead. To reduce this overhead the CUDA driver buffers requests in an internal SW queue and sends the requests to the driver when the queue is full or when it is flushed by a synchronize call. It is possible to use cudaEventQuery to force the driver to flush the work, but this can have other performance implications.
It appears you are submitting your work to streams in a depth first manner. On compute capability 2.x and 3.0 devices you will have better results if you submit to streams in a breadth first manner. In your case you may see overlap between your kernels.
The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting after completion of all of the kernels. Given the API call pattern, you should be able to see transfers starting after each stream completes its launches.
If you are waiting on all streams to complete it is likely faster to do a cudaDeviceSynchronize than 4 cudaStreamSynchronize calls.
The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.
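A sketch of the breadth-first submission and single device-wide synchronize suggested above (buffer sizes and the kernel are placeholders):

    #include <cuda_runtime.h>

    __global__ void compute(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        const int nStreams = 4;
        const int n = 1 << 18;
        const size_t bytes = n * sizeof(float);

        float *h_buf[nStreams], *d_buf[nStreams];
        cudaStream_t streams[nStreams];
        for (int s = 0; s < nStreams; ++s) {
            cudaMallocHost(&h_buf[s], bytes);   // pinned memory for async copies
            cudaMalloc(&d_buf[s], bytes);
            cudaStreamCreate(&streams[s]);
        }

        // Breadth-first submission: issue each phase across all streams before
        // moving on to the next phase, rather than issuing copy+kernel+copy
        // depth-first per stream. On compute capability 2.x/3.0 this tends to
        // expose more overlap between streams.
        for (int s = 0; s < nStreams; ++s)
            cudaMemcpyAsync(d_buf[s], h_buf[s], bytes, cudaMemcpyHostToDevice, streams[s]);
        for (int s = 0; s < nStreams; ++s)
            compute<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
        for (int s = 0; s < nStreams; ++s)
            cudaMemcpyAsync(h_buf[s], d_buf[s], bytes, cudaMemcpyDeviceToHost, streams[s]);

        // One device-wide synchronize instead of four cudaStreamSynchronize calls.
        cudaDeviceSynchronize();

        for (int s = 0; s < nStreams; ++s) {
            cudaStreamDestroy(streams[s]);
            cudaFreeHost(h_buf[s]);
            cudaFree(d_buf[s]);
        }
        return 0;
    }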