'Flush records'-Warning in Parallel Nsight profiling results - cuda

I'm trying to profile my CUDA-Kernels running on a Windows 7 32 bit machine with a NVIDIA GTX 480 board. I'm using the CUDA 4.1 32 bit toolkit and the Parallel Nsight 2.1 edition for VS 2010.
The profiling results of my program repeatedly show the same warning at irregular intervals:
Message: Flush records, Event Type: Range, Level: 50
After this event there is always a processing pause of several milliseconds, after which the GPU resumes computing at its previous speed.
I haven't found any information about this warning in the CUDA documentation or on the web, and I don't even know whether it is a problem that only occurs during profiling.
Does anyone have an idea what this warning is about and how to avoid it?

The warning "Flush Record" is used to show when the Nsight CUDA Trace Activity is adding additional overhead to your application. This is to allow you to interpret periods of high CPU activity. There is no way to remove this warning. Your application is not doing anything wrong.
The Nsight CUDA Trace Activity collects timestamps for the start and end of GPU work including kernels launches, memory copies, and memory sets. When an application launches a task on the GPU the tool allocates a trace record for the task and programs the GPU to write a time stamp into the record. The collection of timestamps is done in a way that should not break concurrency and should not stall the CPU. When the work is completed the tools collects the information and streams it to memory. The Flush range includes the time to collect the results and write out the information. This can include time to perform additional kernel launches and copy memory from device to host. The tool will collect results when the application synchronizes a context (cuCtxSynchronize or cuda{Thread, Device}Synchronize) or when it runs out of trace records.
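In practice that means the Flush range tends to appear around your own synchronization points. A minimal sketch (hypothetical kernel and launch configuration) of where such a flush would typically be triggered while tracing is enabled:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel; only here to give the tracer something to record.
__global__ void myKernel(float *d_data)
{
    if (threadIdx.x == 0)
        d_data[blockIdx.x] += 1.0f;
}

void step(float *d_data, int launches)
{
    // Each launch consumes one trace record while the Trace Activity is running.
    for (int i = 0; i < launches; ++i)
        myKernel<<<128, 256>>>(d_data);

    // A context synchronization like this is a point where the tool may collect
    // the buffered timestamps and copy them device-to-host: the "Flush records"
    // range and the multi-millisecond pause typically show up here.
    cudaDeviceSynchronize();
}
```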
I will enter a bug to improve the user documentation and tool tips.

Related

CUDA kernel launched from Nsight Compute gives inconsistent results

I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet, the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or whether this is expected behavior.
Note: The application also gives correct and consistent results while being profiled by nvprof.
I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information, so things like print statements in the kernel will show up multiple times. Could it be related to that, or is it a value being calculated differently? One other issue is that with Unified Memory (UVM) or zero-copy memory, Nsight Compute is not able to restore those values before each replay. Are you using that in your application? If so, the application replay mode could help; it may be worth trying to see if anything changes.
I was able to resolve the issue by addressing my shared memory initializations. Since Nsight Compute runs a kernel multiple times as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized memory).
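For anyone hitting the same thing, the problematic pattern and the fix look roughly like this (a simplified sketch, not the original kernel):

```cpp
#include <cuda_runtime.h>

__global__ void histogramLike(const int *in, int *out, int n)
{
    __shared__ int bins[64];

    // Fix: shared memory is NOT zero-initialized, so clear it explicitly before
    // accumulating into it. Without this, each Nsight Compute replay of the
    // kernel starts from different garbage values, so the output changes per run.
    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&bins[in[idx] & 63], 1);
    __syncthreads();

    // Fold the per-block result into the global result.
    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        atomicAdd(&out[i], bins[i]);
}
```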

Can execution of CUDA kernels from two contexts overlap?

From this, it appears that two kernels from different contexts cannot execute concurrently. In this regard, I am confused when reading CUPTI activity traces from two applications. The traces show kernel_start_timestamp, kernel_end_timestamp and duration (which is kernel_end_timestamp - kernel_start_timestamp).
Application 1:
.......
8024328958006530 8024329019421612 61415082
.......
Application 2:
.......
8024328940410543 8024329048839742 108429199
To make the long timestamp and duration more readable:
Application 1 : kernel X of 61.415 ms ran from xxxxx28.958 s to xxxxx29.019 s
Application 2 : kernel Y of 108.429 ms ran from xxxxx28.940 s to xxxxx29.0488 s
So, the execution of kernel X completely overlaps with that of kernel Y.
I am using the /path_to_cuda_install/extras/CUPTI/sample/activity_trace_async for tracing the applications. I modified CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT to 1. I have only enabled tracing for CUPTI_ACTIVITY_KIND_MEMCPY, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL and CUPTI_ACTIVITY_KIND_OVERHEAD. My applications are calling cuptiActivityFlushAll(0) once in each of their respective logical timesteps.
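Roughly, the corresponding setup looks like this (a condensed sketch based on the activity_trace_async sample; error checking is omitted and the sample's buffer callbacks are only declared here):

```cpp
#include <cupti.h>

// Buffer callbacks as implemented in the activity_trace_async sample.
void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords);
void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId, uint8_t *buffer,
                              size_t size, size_t validSize);

void initTrace()
{
    // Enable only the activity kinds mentioned above.
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_OVERHEAD);

    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);

    // Shrink the device buffer size and pool limit as described above.
    size_t attrValue = 1024, attrValueSize = sizeof(size_t);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                              &attrValueSize, &attrValue);
    attrValue = 1;
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT,
                              &attrValueSize, &attrValue);
}

// Called once per logical timestep in each application:
//     cuptiActivityFlushAll(0);
```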
Are these erroneous CUPTI values that I am seeing due to improper usage or is it something else?
Clarification : MPS not enabled, running on single GPU
UPDATE: bug filed, this seems to be a known problem for CUDA 6.5
Waiting for a chance to test this with CUDA 7 (have a GPU shared between multiple users and need a window of inactivity for temporary switch to CUDA 7)
I don't know how to set up the CUPTI activity traces. But two kernels can share a time span on a single GPU even without the MPS server, though only one will actually be running on the GPU at any instant.
If the CUDA MPS server is not in use, kernels from different contexts cannot overlap. Assuming you are not using MPS, a time-sliced scheduler decides which context may access the GPU at any given moment: without MPS, a context can only access the GPU during the time slots that the scheduler assigns to it. Thus only kernels from a single context run on the GPU at any one time (without the MPS server).
Note that multiple kernels may appear to share a time span on a GPU, but within that span only kernels from a single context can actually access the GPU's resources (I am also assuming that you are using a single GPU).
For more information you can also check the MPS Service documentation.

Multiple GPUs and Multiple Executables

Suppose I have 4 GPUs and would like to run 50 CUDA programs in parallel. My question is: is the NVIDIA driver smart enough to run the 50 CUDA programs on the different GPUs or do I have to set the CUDA device for each program?
thank you
The first point to make is that you cannot run 50 applications in parallel on 4 GPUs on just about any CUDA platform. If you have a Hyper-Q capable GPU, there is the possibility of up to 32 threads or MPI processes queuing work to the GPU. Otherwise there is a single command queue.
For anything other than the latest Kepler Tesla cards, the CUDA driver only supports a single active context at a time. If you run more than one application on a GPU, the processes will each have a context, and those contexts simply contend with one another on a "first come, first served" basis. If one application blocks the other with a long-running kernel or similar, there is no pre-emption or anything else which makes the process yield to another process. When the GPU is shared with a display manager, there is a watchdog timer that will impose an upper limit of a few seconds before the application gets its context killed. The result is that only one context ever runs on the hardware at a time. Context switching isn't free, and there is a performance penalty to having multiple processes contending for a single device.
Furthermore, every context present on a GPU requires device memory. On the platform you are asking about, Linux, there is no memory paging, so every context's resources must coexist in GPU memory. I don't believe it would be possible to have 12 non-trivial contexts running on any current GPU simultaneously; you would run out of available memory well before that number. Trying to run more applications would result in a context establishment failure.
As for the behaviour of the driver distributing multiple applications on multiple GPUs, AFAIK the linux driver doesn't do any intelligent distribution of processes amongst GPUs, except when one or more of the GPUs are in a non-default compute mode. If no device is specifically requested, the driver will always try and find the first valid, free GPU it can run a process or thread on. If a GPU is busy and marked compute exclusive (either thread or process) or marked prohibited, then the driver will skip over it when trying to find a GPU to run on. If all GPUs are exclusive and occupied or prohibited, then the application will fail with a no valid device available error.
So, in summary: for everything other than Hyper-Q devices, there is no performance gain in doing what you are asking about (quite the opposite), and I would expect it to break if you tried. A much saner approach would be to use compute exclusivity in combination with a resource-managing task scheduler like Torque or one of the (former) Sun Grid Engine versions, which could schedule your processes to run in an orderly fashion according to the availability of GPUs. This is how most general-purpose HPC clusters deal with scheduling in multi-GPU environments.
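If you do end up assigning work explicitly, the per-process device selection itself is trivial. Here is a minimal sketch in which the GPU index comes from a hypothetical scheduler-provided environment variable (ASSIGNED_GPU is not a real variable, just an illustration):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Hypothetical: the scheduler (Torque/SGE) exports which GPU this job owns.
    const char *env = std::getenv("ASSIGNED_GPU");
    int dev = env ? std::atoi(env) : 0;

    int count = 0;
    cudaGetDeviceCount(&count);
    if (count == 0)
        return 1;

    // Bind this process to its GPU before any other CUDA call creates a context.
    cudaSetDevice(dev % count);

    // ... rest of the application ...
    return 0;
}
```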

Difference in time reported by NVVP and counters

I've been running kernels from my CUDA programs. I observe that there is a considerable difference between the kernel execution time reported by GPU counters and by NVVP. Why is such a difference usually observed?
Nsight Visual Studio Edition and the Visual Profiler support two mechanisms for capturing the duration of a kernel. Both of these methods will result in a value smaller and more accurate than what is reported by CUevent/cudaEvent. The methods are as follows:
Concurrent Kernel Timing
This is the default mode used by Nsight 2.x and Visual Profiler 5.0 to generate a timeline. The duration of a kernel is defined as the time from when the kernel code starts executing on the device to the time that it completes. This cannot be measured using CUDA events.
Serialized Kernel Timing
This is the default mode used by tools when collecting PM counters for each kernel. The duration of a kernel is defined as the time the GPU processes the launch request until the GPU idles after completion of the kernel. This mode specifically disables concurrent kernel execution. In almost all cases the reported duration will be slightly larger than the concurrent kernel trace duration as it includes time for the GPU to launch the first block and time for the GPU to complete all memory stores.
CUDA Event Range Timing
CUDA event timing is done by calling cu/cudaEventRecord before and after the kernel launch on the same stream. Each event record inserts a command into the GPU push buffer. When the command reaches the GPU it writes a timestamp to memory. It is possible to push two event records without a launch. This allows a developer to measure the GPU time between the two timestamp commands. This method has the following disadvantages and it is why I encourage developers to use the tools (Nsight, Visual Profiler, and CUPTI):
a. The elapsed time between submitting the start event record and the launch can be affected by CPU overhead. Launch overhead is 5-8µs on Linux/TCC and potentially much higher on WDDM.
b. The GPU can context switch between the start event record and the kernel execution.
c. The start event record will include launch overhead including time to update driver buffers that need to be resized, copy parameters, copy texture bindings, ...
d. The elapsed time between submitting the kernel and the end event record can impact the timing.
e. The GPU can context switch between the end of the kernel execution and the end event record.
f. Incorrect use of events will break concurrent kernel execution.
The duration provided by each of these modes will be different. Furthermore, the definition of duration used by the tools differs from the one available through the use of events.
The NVIDIA tools define duration as best as possible as the time from when the GPU starts working on the kernel to when the GPU completes work on the kernel. If a developer is interested in collecting this information they should look at the CUPTI SDK included with the toolkit.
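For completeness, the event-range timing pattern discussed above looks roughly like this with the runtime API (a minimal sketch with a placeholder kernel):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *d) { d[threadIdx.x] *= 2.0f; }  // placeholder

int main()
{
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Both records and the launch go to the same (default) stream, so the
    // measured interval also absorbs launch overhead and any context switches
    // between the event records and the kernel (points a-e above).
    cudaEventRecord(start, 0);
    work<<<1, 256>>>(d);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("event-range duration: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```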

Time between Kernel Launch and Kernel Execution

I'm trying to optimize my CUDA program by using the Parallel Nsight 2.1 edition for VS 2010.
My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.
One cycle in the program consists of a copy of host data to the device, execution of the kernels, and a copy of the results from the device back to the host.
As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernels in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before the launch of the kernels in the different streams.
What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 µs to launch the kernel, which is a huge overhead in a processing cycle of less than 1 ms.
Furthermore, there is no overlap between kernel execution and the copy of the results back to the host, which increases the overhead even more.
Are there any obvious reasons for this behavior?
There are three issues that I can tell from the trace.
Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA Driver API trace enabled. If you were to disable CUDA runtime trace I would guess that you would reduce the width by 50 µs.
Since you are on a GTX 480 on Windows 7, you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work, which introduces a lot of overhead. To reduce this overhead, the CUDA driver buffers requests in an internal SW queue and sends the requests to the driver when the queue is full or when it is flushed by a synchronize call. It is possible to use cudaEventQuery to force the driver to flush the work, but this can have other performance implications.
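The cudaEventQuery flush trick mentioned above looks roughly like this (a sketch with a placeholder kernel; the event is assumed to be created once elsewhere with cudaEventCreate):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(float *d) { d[threadIdx.x] += 1.0f; }  // placeholder

void launchAndFlush(float *d_data, cudaStream_t stream, cudaEvent_t flushEvent)
{
    myKernel<<<1, 128, 0, stream>>>(d_data);

    // Recording an event and querying it once nudges the CUDA driver to push
    // its internal SW queue to the WDDM driver without blocking the CPU.
    cudaEventRecord(flushEvent, stream);
    cudaEventQuery(flushEvent);   // non-blocking; return value intentionally ignored
}
```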
It appears you are submitting your work to streams in a depth-first manner. On compute capability 2.x and 3.0 devices you will have better results if you submit to streams in a breadth-first manner. In your case you may then see overlap between your kernels.
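Concretely, the difference is just the order in which the work is issued. A breadth-first sketch with a hypothetical kernel and buffers (four streams, as in your trace):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;  // placeholder work
}

// Breadth-first submission: issue all kernels across the streams first, then all
// the result copies. Depth-first issue (kernel followed immediately by its copy,
// one stream at a time) tends to create false dependencies on compute capability
// 2.x/3.0 and prevents the overlap you are looking for.
void submitBreadthFirst(cudaStream_t streams[4], float *d_in[4], float *d_out[4],
                        float *h_out[4], size_t bytes)
{
    for (int i = 0; i < 4; ++i)
        kernelA<<<64, 256, 0, streams[i]>>>(d_in[i], d_out[i]);

    for (int i = 0; i < 4; ++i)
        cudaMemcpyAsync(h_out[i], d_out[i], bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
}
```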
The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting only after completion of all of the kernels. Given the API call pattern, you should be able to see transfers starting after each stream completes its launches.
If you are waiting on all streams to complete, it is likely faster to do a single cudaDeviceSynchronize than 4 cudaStreamSynchronize calls.
The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.