Question
Total GPU time + total CPU overhead is smaller than the total execution time. Why?
Detail
I am studying how frequent global memory access and kernel launch may affect the performance and I have designed a code which has multiple small kernels and ~0.1 million kernel calls in total. Each kernel reads data from global memory, processes them and then writes back to the global memory. As expected, the code runs much slower than the original design which has only one large kernel and very few kernel launches.
The problem arose when I used the command line profiler to get "gputime" (execution time for the GPU kernel or memory copy method) and "cputime" (CPU overhead for a non-blocking method; the sum of gputime and CPU overhead for a blocking method). To my understanding, the sum of all gputimes and all cputimes should exceed the entire execution time (the last "gpuendtimestamp" minus the first "gpustarttimestamp"), but the opposite turns out to be true (sum of gputimes=13.835064 s, sum of cputimes=4.547344 s, total time=29.582793 s). Between the end of one kernel and the start of the next, there is often a large amount of waiting time, larger than the CPU overhead of the next kernel. Most of the operations that suffer from this problem are memcpyDtoH, memcpyDtoD and thrust internal functions such as launch_closure_by_value, fast_scan, etc. What is the probable reason?
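The size of the gap can be worked out directly from the three figures quoted above. This is plain arithmetic on the reported numbers, assuming the total is also in seconds; it is not an additional measurement.

```latex
\begin{align*}
t_{\text{gpu}} + t_{\text{cpu}} &= 13.835064\ \text{s} + 4.547344\ \text{s} = 18.382408\ \text{s} \\
t_{\text{total}} - (t_{\text{gpu}} + t_{\text{cpu}}) &= 29.582793\ \text{s} - 18.382408\ \text{s} = 11.200385\ \text{s}
\end{align*}
```

So about 11.2 s, roughly 38% of the run, is covered by neither figure. If that were spread evenly over ~0.1 million calls, it would be on the order of 100 microseconds of idle time per call.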
System
Windows 7, TCC driver, VS 2010, CUDA 4.2
Thanks for your help!
This is possibly a combination of profiling, which increases latency, and the Windows WDDM subsystem. To overcome the high latency of the latter, the CUDA driver batches GPU operations and submits them in groups with a single Windows kernel call. This can cause large periods of GPU inactivity if CUDA API commands are sitting in an unsubmitted batch.
(Copied @talonmies' comment to an answer, to enable voting and accepting.)
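One workaround often suggested for the batching described in this answer is to issue a cheap, non-blocking query right after a launch, so that the driver submits the pending batch instead of holding it. The sketch below is only an illustration under that assumption: `cudaStreamQuery` is a real runtime call, but that it triggers submission on WDDM is widely reported behavior rather than something stated in this thread. The kernel `scale` and the helper `launch_and_flush` are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Hypothetical small kernel: multiplies each element by a factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the kernel, then issues a non-blocking query on the same stream.
// Assumption: on WDDM the query prompts the driver to submit its pending
// batch; on other driver models it is simply a cheap status check.
cudaError_t launch_and_flush(float *d_data, int n, float factor,
                             cudaStream_t stream)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, stream>>>(d_data, n, factor);

    cudaError_t err = cudaGetLastError();   // catches launch configuration errors
    if (err != cudaSuccess) return err;

    err = cudaStreamQuery(stream);
    // cudaErrorNotReady only means the work has not finished yet.
    return (err == cudaErrorNotReady) ? cudaSuccess : err;
}
```

Whether this helps depends on the driver model and on how much work is queued, so it is worth timing with and without the query.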
Related
I have a set of CUDA kernels. Each kernel completes its job in less than 10 microseconds; however, its launch time is 50-70 microseconds. I suspect the use of texture memory might be the reason, since it is used in my kernels.
Are there any recommendations to reduce the launch time of CUDA kernels? In general, what are the factors that affect the kernel launch time?
You can reduce the overall launch time by launching fewer kernels; e.g. if you launch several kernels in sequence, you could write a new single kernel that does all of that work in a single launch.
From the very little bit of context currently in the question, I suspect this is your problem; you are doing too little work per kernel.
(My next guess is an error in the benchmark; i.e. the times aren't for what you think they are.)
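The suggestion to merge several small kernels into one can be sketched as follows. The kernels here (`add_one`, `square`, `add_one_then_square`) and the host helpers are hypothetical stand-ins, not the asker's code; they only show the shape of the change.

```cuda
#include <cuda_runtime.h>

// Two tiny kernels: each one reads from and writes to global memory.
__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void square(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

// Fused version: one launch, one global read and one global write per element.
__global__ void add_one_then_square(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i] + 1.0f;
        x[i] = v * v;
    }
}

// Two launches in sequence on the default stream.
cudaError_t run_separate(float *d_x, int n)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads>>>(d_x, n);
    square<<<blocks, threads>>>(d_x, n);
    return cudaDeviceSynchronize();
}

// The same work with a single launch.
cudaError_t run_fused(float *d_x, int n)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    add_one_then_square<<<blocks, threads>>>(d_x, n);
    return cudaDeviceSynchronize();
}
```

Fusion is this simple only when the second step needs nothing computed by other blocks in the first step; a step that depends on results from the whole grid still needs a launch boundary or another form of grid-wide synchronization.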
I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.
However, I don't seem to fully understand the implications of the outputs I get, and I don't know how to act on the different profiler outputs.
As an example: a CUDA code that calls a single kernel ~360 times in a for loop. Each time, the kernel performs about 1000 3D texture memory reads for each of 512^2 elements. One thread is allocated per element of the 512^2 grid. Some arithmetic is needed to know which position to read in texture memory. The texture memory read is performed without interpolation, always at the exact data index. 3D texture memory was chosen because the memory reads will be relatively random, so memory coalescing is not expected. I can't find the reference for this, but I definitely read it on SO somewhere.
The description is short, but I hope it gives a small overview of what operations the kernel does (posting the whole kernel would probably be too much, but I can if required).
From now on, I will describe my interpretation of the profiler.
When profiling, if I run Examine GPU Usage I get:
From here I see several things:
Low Memcopy/Compute overlap 0%. This is expected, as I run a big kernel, wait until it has finished and then memcopy. There should not be overlap.
Low Kernel Concurrency 0%. I only have 1 kernel, so this is expected.
Low Memcopy Overlap 0%. Same thing. I only memcopy once at the beginning, and I memcopy once after each kernel. This is expected.
From the kernel execution "bars" at the top and right, I can see:
Most of the time is spent running kernels. There is little memory overhead.
All kernels take the same time (good).
The biggest flag is occupancy, which is always below 45%, with registers being the limiter. However, optimizing occupancy doesn't always seem to be a priority.
I follow my profiling by running Perform Kernel Analysis, getting:
I can see here that
Compute and memory utilization is low in the kernel. The profiler suggests that below 60% is no good.
Most of the time is in computing and L2 cache reading.
Something else?
I continue with Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.
The biggest 3 stall reasons seem to be
Memory dependency. Too many texture memreads? But I need this amount of memreads.
Execution dependency. "can be reduced by increasing instruction level parallelism". Does this mean that I should try to change e.g. a=a+1;a=a*a;b=b+1;b=b*b; to a=a+1;b=b+1;a=a*a;b=b*b;?
Instruction fetch (????)
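The reordering asked about under "Execution dependency" can be illustrated with a small sketch. Compilers typically schedule independent statements within a thread on their own, so reordering that exact source snippet by hand may change little; what usually adds instruction-level parallelism is giving each thread more than one independent chain of work. The kernels `chain_single` and `chain_dual` and the helper `run_dual` are hypothetical and are not taken from the kernel in the question.

```cuda
#include <cuda_runtime.h>

// One element per thread: a single dependency chain (add, then multiply).
__global__ void chain_single(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i];
        a = a + 1.0f;
        a = a * a;      // must wait for the add above
        x[i] = a;
    }
}

// Two elements per thread: chains a and b do not depend on each other,
// so their loads and arithmetic can be in flight at the same time.
__global__ void chain_dual(float *x, int n)
{
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i + stride;
    float a = (i < n) ? x[i] : 0.0f;
    float b = (j < n) ? x[j] : 0.0f;
    a = a + 1.0f;  b = b + 1.0f;
    a = a * a;     b = b * b;
    if (i < n) x[i] = a;
    if (j < n) x[j] = b;
}

// Launches chain_dual with half as many threads as there are elements.
cudaError_t run_dual(float *d_x, int n)
{
    int threads = 256;
    int half = (n + 1) / 2;
    int blocks = (half + threads - 1) / threads;
    chain_dual<<<blocks, threads>>>(d_x, n);
    return cudaDeviceSynchronize();
}
```

More work per thread usually costs more registers per thread, which matters here because the question reports occupancy already limited by registers.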
Questions:
Are there additional tests I can perform to better understand my kernel's execution time limitations?
Is there a way to profile at the instruction level inside the kernel?
Are there more conclusions one can draw from the profiling than the ones I have drawn?
If I were to start trying to optimize the kernel, where would I start?
Are there additional tests I can perform to better understand my kernel's execution time limitations?
Of course! Look at the "Properties" window. Your screenshot is telling you that your kernel (1) is limited by register usage (check it in the 'Kernel Latency' analysis), and (2) has low warp efficiency (less than 100% means thread divergence; check it in 'Divergent Execution').
Is there a way to profile at the instruction level inside the kernel?
Yes, you have available two types of profiling:
'Kernel Profile - Instruction Execution'
'Kernel Profile - PC Sampling' (Only in Maxwell)
Are there more conclusions one can draw from the profiling than the ones I have drawn?
You should check if your kernel has some thread divergence. Also you should check that there is no problem with shared/global memory access patterns.
If I were to start trying to optimize the kernel, where would I start?
I find the Kernel Latency window the most useful one, but I suppose it depends on the type of kernel you are analyzing.
I would like to ask a question about concurrent kernel execution on NVIDIA GPUs. Let me explain my situation. I have code that launches one sparse matrix multiplication for each of two different matrices. These matrix multiplications are performed with the cuSPARSE library. I want both operations to run concurrently, so I use 2 streams to launch them. With the NVIDIA Visual Profiler, I've observed that the two operations (cuSPARSE kernels) overlap completely. The time stamps for both kernels are:
Kernel 1) Start Time: 206.205 ms - End Time: 284.177 ms.
Kernel 2) Start Time: 263.519 ms - End Time: 278.916 ms.
I'm using a Tesla K20c with 13 SMs, each of which can execute up to 16 blocks. Both kernels have 100% occupancy and launch enough blocks:
Kernel 1) 2277 blocks, 32 registers/thread, 1.156 KB shared memory.
Kernel 2) 46555 blocks, 32 registers/thread, 1.266 KB shared memory.
With this configuration, both kernels shouldn't show this behaviour, since both kernels launch enough blocks to fill all SMs of the GPU. However, the NVIDIA Visual Profiler shows that these kernels overlap. Why? Could anyone explain how this behaviour can occur?
Many thanks in advance.
With this configuration, both kernels shouldn't show this behaviour, since both kernels launch enough blocks to fill all SMs of the GPU.
I think this is an incorrect statement. As far as I know, the low-level scheduling behavior of blocks is not specified in CUDA.
Newer devices (cc3.5+ with hyper-Q) can more easily schedule blocks from concurrent kernels at the same time.
So, if you launch 2 kernels (A and B), each with large numbers of blocks, concurrently, then you may observe either
blocks from kernel A execute concurrently with kernel B
all (or nearly all) of the blocks of kernel A execute before kernel B
all (or nearly all) of the blocks of kernel B execute before kernel A
Since there is no specification at this level, there is no direct answer. Any of the above are possible. The low level block scheduler is free to choose blocks in any order, and the order is not specified.
If a given kernel launch "completely saturates" the machine (i.e. uses enough resources to fully occupy the machine while it is executing) then there is no reason to think that the machine has extra capacity for a second concurrent kernel. Therefore there would be no reason to expect much, if any, speed up from running the two kernels concurrently as opposed to sequentially. In such a scenario, whether they execute concurrently or not, we would expect the total execution time for the 2 kernels running concurrently to be approximately the same as the total execution time if the two kernels are launched or scheduled sequentially (ignoring tail effects and launch overheads, and the like).
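The figures in the question give a rough sense of scale for the argument above. The arithmetic below reads the question's timestamps with a decimal point (206.205 ms and so on) and treats 16 blocks per SM as an upper bound on residency; both are assumptions about how to read the question, not new measurements.

```latex
\begin{align*}
\text{resident blocks (upper bound)} &= 13 \times 16 = 208 \\
\text{kernel 1: } 2277 / 208 &\approx 10.9 \text{ waves} \\
\text{kernel 2: } 46555 / 208 &\approx 223.8 \text{ waves} \\
\text{kernel 1 duration} &= 284.177 - 206.205 = 77.972\ \text{ms} \\
\text{kernel 2 duration} &= 278.916 - 263.519 = 15.397\ \text{ms}
\end{align*}
```

Neither kernel can have all of its blocks resident at once, so block slots free up continuously while each kernel runs, and the scheduler is free to fill them with blocks from either kernel. Kernel 2's interval lying entirely inside kernel 1's is consistent with that, though it does not show which order the scheduler actually chose.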
I have a kernel that runs on my GPU (GeForce 690) and uses a single block. It runs in about 160 microseconds. My plan is to launch 8 of these kernels separately, each of which only uses a single block, so each would run on a separate SM, and then they'd all run concurrently, hopefully in about 160 microseconds.
However, when I do that, the total time increases linearly with each kernel: 320 microseconds if I run 2 kernels, about 490 microseconds for 3 kernels, etc.
My question: Do I need to set any flag somewhere to get these kernels to run concurrently? Or do I have to do something that isn't obvious?
As @JackOLantern indicated, concurrent kernels require the use of streams, which are required for all forms of asynchronous activity scheduling on the GPU. Generally speaking, it also requires a GPU of compute capability 2.0 or greater. If you do not use streams in your application, all CUDA API and kernel calls will be executed sequentially, in the order in which they were issued in the code, with no overlap from one call/kernel to the next.
Rather than give a complete tutorial here, please review the concurrent kernels CUDA sample that JackOLantern referenced.
Also note that actually witnessing concurrent execution can be more difficult on Windows, for a variety of reasons. If you run the concurrent kernels sample, it will indicate pretty quickly whether the environment you are in (OS, driver, etc.) is providing concurrent execution.
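As a rough illustration of the stream usage described above (not a substitute for the concurrent kernels sample), the sketch below launches eight single-block kernels, each in its own stream. The kernel `single_block_job`, the helper `launch_in_streams` and the constants are hypothetical placeholders for the asker's real job. Separate streams permit overlap; they do not guarantee it.

```cuda
#include <cuda_runtime.h>

const int kNumJobs = 8;     // assumption: eight independent single-block jobs
const int kThreads = 128;   // assumption: threads per block

// Hypothetical single-block kernel standing in for the real job.
__global__ void single_block_job(float *out, int iters)
{
    float v = (float)threadIdx.x;
    for (int k = 0; k < iters; ++k) v = v * 0.5f + 1.0f;
    out[threadIdx.x] = v;
}

// Launches each job in its own non-default stream, then waits for all of them.
cudaError_t launch_in_streams(float *d_out, int iters)
{
    cudaStream_t streams[kNumJobs];
    int created = 0;
    cudaError_t err = cudaSuccess;
    for (; created < kNumJobs; ++created) {
        err = cudaStreamCreate(&streams[created]);
        if (err != cudaSuccess) break;
    }
    if (err == cudaSuccess) {
        for (int s = 0; s < kNumJobs; ++s) {
            // Fourth launch parameter selects the stream; each job writes
            // to its own slice of the output buffer.
            single_block_job<<<1, kThreads, 0, streams[s]>>>(
                d_out + s * kThreads, iters);
        }
        err = cudaGetLastError();
        cudaError_t sync_err = cudaDeviceSynchronize();
        if (err == cudaSuccess) err = sync_err;
    }
    for (int s = 0; s < created; ++s) cudaStreamDestroy(streams[s]);
    return err;
}
```

Timing this against eight launches on the default stream, or inspecting the timeline in a profiler, shows whether the environment actually overlaps the kernels.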
I'm trying to optimize my CUDA program by using the Parallel Nsight 2.1 edition for VS 2010.
My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.
One cycle in the program consists of a copy of host data to the device, execution of the kernels and a copy of the results from the device to the host.
As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernels in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before the launch of the kernels in the different streams.
What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 us to launch the kernel which is a huge overhead in a processing cycle of less than 1 ms.
Furthermore there is no overlapping of kernel execution and the data copy of the results back to the host, which increases the overhead even more.
Are there any obvious reasons for this behavior?
There are three issues that I can identify from the trace.
Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA Driver API trace enabled. If you were to disable CUDA runtime trace I would guess that you would reduce the width by 50 µs.
Since you are on a GTX 480 on Windows 7, you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work, which introduces a lot of overhead. To reduce this overhead the CUDA driver buffers requests in an internal SW queue and sends the requests to the driver when the queue is full or when it is flushed by a synchronize call. It is possible to use cudaEventQuery to force the driver to flush the work, but this can have other performance implications.
It appears you are submitting your work to streams in a depth first manner. On compute capability 2.x and 3.0 devices you will have better results if you submit to streams in a breadth first manner. In your case you may see overlap between your kernels.
The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting after completion of all of the kernels. Given the API call pattern, you should be able to see transfers starting after each stream completes its launch.
If you are waiting on all streams to complete it is likely faster to do a cudaDeviceSynchronize than 4 cudaStreamSynchronize calls.
The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.
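The breadth-first submission order mentioned in the answer can be sketched as below: one loop per kind of operation across all streams, rather than one loop per stream. This is a generic pattern under assumed names (`process`, `run_breadth_first`, `kStreams`); it does not reproduce the asker's data dependencies, where all kernels wait on a single copy. The host buffers are assumed to be pinned (for example allocated with `cudaMallocHost`), since copies from pageable memory generally cannot overlap with kernels.

```cuda
#include <cuda_runtime.h>

const int kStreams = 4;   // the question uses four streams

// Hypothetical kernel: doubles each input element.
__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Breadth-first issue order. Each stream s works on its own slice of
// n elements starting at offset s * n. The caller owns the streams.
cudaError_t run_breadth_first(const float *h_in, float *h_out,
                              float *d_in, float *d_out,
                              int n, cudaStream_t *streams)
{
    size_t bytes = n * sizeof(float);
    int threads = 256, blocks = (n + threads - 1) / threads;

    // 1) All host-to-device copies.
    for (int s = 0; s < kStreams; ++s)
        cudaMemcpyAsync(d_in + s * n, h_in + s * n, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
    // 2) All kernel launches.
    for (int s = 0; s < kStreams; ++s)
        process<<<blocks, threads, 0, streams[s]>>>(d_in + s * n,
                                                    d_out + s * n, n);
    // 3) All device-to-host copies.
    for (int s = 0; s < kStreams; ++s)
        cudaMemcpyAsync(h_out + s * n, d_out + s * n, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);

    // One device-wide synchronize instead of one call per stream.
    return cudaDeviceSynchronize();
}
```

A depth-first version would put the copy, launch and copy-back for one stream inside a single loop body; on the compute capability 2.x and 3.0 devices discussed above, that ordering can prevent work in different streams from overlapping.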