Is there any way or even possible to get the overall utilization of a GPU during a period of time? - cuda

I am trying to get the information about the overall utilization of a GPU (mine is an NVIDIA Tesla K20, running on Linux) during a period of time.
By "overall" I mean something like, how many streaming multi-processors are scheduled to run, and how many GPU cores are scheduled to run (I suppose if a core is running, it will run at its full speed/frequency?). It would be also nice if I can get the overall utilization measured by flops.
Of course before asking the question here, I've searched and investigated several existing tools/libraries, including NVML (and nvidia-smi built on top of it), CUPTI (and nvprof), PAPI, TAU, and Vampir. However, it seems (but I am not sure yet) none of them could provide me with the needed information. E.g., NVML can report "GPU Utilization" by percent, but according to its document/comment, this utilization is "Percent of time over the past second during which one or more kernels was executing on the GPU", which is apparently not accurate enough. For nvprof, it can report flops for individual kernel (with very high overhead), but I still don't know how well the GPU is utilized.
PAPI seems to be able to get instruction count, but it cannot different float point operation from others. I haven't tried other two tools (TAU and Vampir) yet, but I doubt they can meet my need.
So I am wondering is it even possible to get the overall utilization information of a GPU? If not, what is the best alternative to estimate it? The purpose I am doing this is to find a better schedule for multiple jobs running on top of GPU.
I am not sure if I've described my question clearly enough, so please let me know if there is anything I can add for a better description.
Thank you very much!

nVidia Nsight plugin to Visual Studio has very nice graphical features that give the statistics you want. But I have the feeling that you have a Linux machine so Nsight won't work.
I suggest using nVidia Visual Profiler.
The metrics reference is fairly complete and can be found here. This is how I would gather the data you are interested in:
Active SMX units - look at sm_efficiency. It should be close to 100%. If it's lower, then some of the SMX units are not active.
Active cores / SMX - This depends. K20 has a Quad-warp scheduler with dual instruction issue. A warp fires 32 SM cores. K20 has 192 SP cores and 64 DP cores. You need to look at ipc metric (instructions per cycle). If your program is DP and IPC is 2 then you have 100% utilization (for the entire workload execution). That means that 2 warps scheduled instructions so all your 64 DP cores were active during all the cycles. If your program is SP, your IPC theoretically should be 6. However in practice this is very hard to get. An IPC of 6, means that 3 of the schedulers launched 2 warps each, and gave work to 3 x 2 x 32 = 192 SP cores.
FLOPS - Well, if your program uses floating point operations, then I would look to flop_count_sp and divide it by the elapsed seconds.
Regarding frequency, I wouldn't worry but it doesn't harm to check with nvidia-smi. If your card has enough cooling then it will stay at peak frequency while running.
Check the metrics reference as it will provide you much more useful information.
I think NVprof also supports multiple processes. Check here. You can also filter by process ID. So you can collect these metrics "multi-context" or "single-context". In the metrics reference table, you have a column that states if they can be collected in both the cases.
Note: The metrics are computed using the HW performance counters, and driver level analysis. If nvidia tools cannot provide more than this, then it's not likely that other tools will be able to offer more. But I think that properly combining the metrics can tell you everything you want about your app run.

Related

CPU usage calculation in time profiler instrument

When I use time profiler in instrument, it shows the cpu usage for each core (or logical core) as well as a "cpu usage". I'm wondering how the cpu usage is calculated according to the cpu usage of each core. I tried data from a specific timestamp and it is neither sum of each core nor average. Here is a snapshot of the panel.
The CPU usage is neither the sum nor the average. In contrast to OS CPU usage (say top), profile CPU usage is generally taken from an actual hardware counter in the processor. This also makes it hardware dependent, meaning its exact meaning on an Intel processor is different from that of an AMD processor. So why are these measurements useful? Because the ratios and values are correct when compared to values taken over the same interval / at the same instant, and the average values are what you expect them to be.
When profiling, look at correlations first over intervals and then between intervals. Afterwards, zoom in on more specific registers, such as cache misses or pipeline stalls.
You might check out the Intel optimization documentation. It's pretty good in my experience. I'll post a reference in the comment section if I can find the time.
PS By the way, the "Core 4" and "Core 5 (logical)" are really not accurate above (not your fault). The names imply that the "logical" core is somehow inferior to the non-logical one. When a CPU is executing multiple hardware threads on one core, what Intel in marketing speak calls hyperthreading, there is no difference between Core 4 and Core 5 as they behave identically on the physical core -- meaning they are both "logical".

Interpreting NVIDIA Visual Profiler outputs

I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.
However, I don't seem to fully understand the implications of the outputs I get. I am unprepared to know how to act to different profiler outputs.
As an example: A CUDA code that calls a single Kernel ~360 times in a for loop. Each time, the kernel computes 512^2 times about 1000 3D texture memory reads. A thread is allocated per unit of 512^2. Some arithmetic is needed to know which position to read in texture memory. Texture memory read is performed without interpolation, always in the exact data index. The reason 3D texture memory has been chose is because the memreads will be relatively random, so memory coalescence is not expected. I cant find the reference for this, but definitely read it in SO somewhere.
The description is short , but I hope it gives a small overview of what operations the kernel does (posting the whole kernel would be too much, probably, but I can if required).
From now on, I will describe my interpretation of the profiler.
When profiling, if I run Examine GPU usage I get (click to enlarge):
From here I see several things:
Low Memcopy/Compute overlap 0%. This is expected, as I run a big kernel, wait until it has finished and then memcopy. There should not be overlap.
Low Kernel Concurrency 0%. I just got 1 kernel, this is expected.
Low Memcopy Overlap 0%. Same thing. I only memcopy once in the begging, and I memcopy once after each kernel. This is expected.
From the kernel executions "bars", top and right I can see:
Most of the time is running kernels. There is little memory overhead.
All kernels take the same time (good)
The biggest flag is occupancy, below 45% always, being the registers the limiters. However, optimizing occupancy doesn't seem to be always a priority.
I follow my profiling by running Perform Kernel Analysis, getting:
I can see here that
Compute and memory utilization is low in the kernel. The profiler suggests that below 60% is no good.
Most of the time is in computing and L2 cache reading.
Something else?
I continue by Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.
The biggest 3 stall reasons seem to be
Memory dependency. Too many texture memreads? But I need this amount of memreads.
Execution dependency. "can be reduced by increasing instruction level parallelism". Does this mean that I should try to change e.g. a=a+1;a=a*a;b=b+1;b=b*b; to a=a+1;b=b+1;a=a*a;b=b*b;?
Instruction fetch (????)
Questions:
Are there more additional tests I can perform to understand better my kernels execution time limitations?
Is there a ways to profile in the instruction level inside the kernel?
Are there more conclusions one can obtain by looking at the profiling than the ones I do obtain?
If I were to start trying to optimize the kernel, where would I start?
Are there more additional tests I can perform to understand better my
kernels execution time limitations?
Of course! If you pay attention to "Properties" window. Your screenshot is telling you that your kernel 1. Is limited by register usage (check it on 'Kernel Lantency' analisys), and 2.Warp Efficiency is low (less than 100% means thread divergece) (check it on 'Divergent Execution').
Is there a ways to profile in the instruction level inside the kernel?
Yes, you have available two types of profiling:
'Kernel Profile - Instruction Execution'
'Kernel Profile - PC Sampling' (Only in Maxwell)
Are there more conclusions one can obtain by looking at the profiling
than the ones I do obtain?
You should check if your kernel has some thread divergence. Also you should check that there is no problem with shared/global memory access patterns.
If I were to start trying to optimize the kernel, where would I start?
I find the Kernel Latency window the most useful one, but I suppose it depends on the type of kernel you are analyzing.

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have Nvidia K40, and for some reason, I want my code only uses portion of the Cuda cores(i.e instead of using all 2880 only use 400 cores for examples), is it possible?is it logical to do this either?
In addition, is there any way to see how many cores are being using by GPU when I run my code? In other words, can we check during execution, how many cores are being used by the code, report likes "task manger" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for cuda. Not to say it couldn't be useful for something. For example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this, would be to access the ptx identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to only have 1 block per Streaming Multiprocessor (SM) and then return each kernel based on which kernel you want on which SM's.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTS register for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
Not sure, whether it works with the K40, but for newer Ampere GPUs there is the MIG Multi-Instance-GPU feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know such methods, but would like to get to know.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, many kernels, some of which can be executed in parallel, you want to load GPU fully, most effectively. But it seems on its own GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with 30-blocks grid and 30 SMs, this kernel can occupy entire GPU. I believe I saw such effect. Really this kernel will be faster (maybe 1.5x against 4 256-threads blocks per SM), but this will not be effective when you have another work.
GPU can't know whether we are going to run another kernel after this one with 30 blocks or not - whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist
As to question 3, I suppose GPU profiling tools should show this, Visual Profiler and newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be Task manager, but a statistics for kernels that were executed by your program instead.
As to possibility to move thread blocks between SMs when necessary,
#ChristianSarofeen, I can't find mentions that this is possible. Quite the countrary,
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although starting from some architecture there is such thing as preemption. As I remember NVidia advertised it in the following way. Let's say you made a game that run some heavy kernels (say for graphics rendering). And then something unusual happened. You need to execute some not so heavy kernel as fast as possible. With preemption you can unload somehow running kernels and execute this high priority one. This increases execution time (of this high pr. kernel) a lot.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernels invocation take a lot of time (of course in case of a stream of kernels and if you don't await for results in between). If you call several kernels, it seems possible to send all necessary data for all kernels while the first kernel is executing on GPU. So I believe NVidia means that it runs several kernels in parallel and perform some smart load-balancing between SMs.

Is it fair to compare SSE/AVX units to GPU cores?

I have a presentation to make to people who have (almost) no clue of how a GPU works. I think saying that a GPU has a thousand cores where a CPU only has four to eight of them is a non-sense. But I want to give my audience an element of comparison.
After a few months working with NVidia's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if they have a name for that at Intel). Is it fair ? After all, when looking at an assembly level, those programming models have much in common (at least with GCN, take a look at p2-6 of the ISA manual).
This article states that an Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVidia parlance, how many Cuda-cores does this processor have ? I would say 8 per CPU-core for 32 bits operations, but this is just a guess based on the SIMD width.
Of course there is many other things to take into account when comparing CPU and GPU hardware, but this is not what I'm trying to do. I just have to explain how the thing is working.
PS: All pointers to CPU hardware documentations or CPU/GPU presentations are greatly appreciated !
EDIT:
Thanks for your answers, sadly I had to chose only one of them. I marked Igor's answer because it sticks the most to my initial question and gave me enough informations to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very caution on making this kind of comparison. After all even in the GPU world the term "core" depending on the context has really different capability: the new AMD GCN is quite different from the old VLIW4 one which itself is quite different from the CUDA core one.
Besides that, you will bring more puzzlement than understanding to your audience if you make just one small comparison with CPU and that's it. If I were you I'd still go for a more detailed (can still be quick) comparison. For instance someone used to CPU and with little knowledge of GPU, might wonder how come a GPU can have so many registers though it's so expensive (in the CPU world). An explanation to that question is given at the end of this post as well as some more comparison GPU vs CPU.
This other article gives a nice comparison between these two kind of processing units by explaining how GPUs work but also how they evolved and showing the differences with CPUs. It addresses topics like data flow, memory hierarchy but also for what kind of applications a GPU is useful. After all the power a GPU can developed is accessible (efficiently) only for some types of problems.
And personally, If I had to make a presentation about GPU and had the possibility to make only one reference to CPU it would be this: presenting the problems a GPU can solve efficiently vs those a CPU can handle better.
As a bonus even though it's not related directly to your presentation here is an article that put GPGPU in perspective, showing that some speedup claimed by some people are overrated (this is linked to my last point btw :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
In the total accounting, a Haswell core can perform 8 * 2 * 2 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.
One way I like to compare CPUs and GPUs is by the number of operation/sec that they can perorm. But of course don't compare one cpu core to a multi-core gpu.
A SandyBridge core can perform 2 AVX op/cycles, that is crunch 8 double precision numbers/cycle. Hence, a computer with 16 Sandy-Bridge cores clocked at 2.6 GHz has a peak power of 333 Gflops.
A K20 computing module GK110 has a peak of 1170 Gflops, that is 3.5 times more. This is a fair comparaison in my opinion, and it should be emphasized that the peak performance is much easier to reach on CPU (some applications reach 80%-90% of peak) than on GPU (best cases I know are less than 50% of peak).
So to summerize, I would not go into architecture details, but rather state some shear numbers with the perspective that the peak is often far from reach on GPUs.
It's more fair to compare GPU to vectorized CPU units however if your audience has zero idea of how GPUs work, it seems fair to assume that they have a similar knowledge of vectorized SSE instructions.
For audiences such as these it's important to point out the high level differences, like how blocks of "cores" on the gpu share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.

Nvprof, metrics, the elapsed_cycles variable

I'm having some issues with the CUDA nvprof profiler. Some of the metrics on the site are named differently than in the profiler, and the variables don't seem to be explained anywhere on the site, or for that matter anywhere on the web (I wasn't able to find any valid reference).
I decoded most of those (here: calculating gst_throughput and gld_throughput with nvprof), but I'm still not sure about:
elapsed_cycles
max_warps_per_sm
Anyone knows precisely how to count those?
I'm trying to use the nvprof to assess some 6000 different kernels via cmdline, so it is not really viable for me to use the visual profiler.
Any help appreciated. Thanks very much!
EDIT:
What I'm using:
CUDA 5.0, GTX480 which is cc. 2.0.
What I've already done:
I've made a script that gets the formulas for each of the metrics from the profiler documentation site, resolves dependencies for any given metric, extracts those through nvprof and then counts the results from those. This involved using a (rather large) sed script that changes all the occurrences of variables that appear on the site to the ones with the same meaning that are actually accepted by the profiler. Basically I've emulated grepping metrics via nvprof. I'm just having problems with those:
Why there is a problem with those concrete variables:
max_warps_per_sm - If it is the bound of the cc or another metric/event which I am perhaps somehow missing and is specific for my program (wouldn't be a surprise as some of the variables in the profiler documentation have 3 (!) different names all for the same thing).
elapsed_cycles - I don't have elapsed_cycles in the output of nvprof --query-events. Not even anything containing the words "elapse" and the only one containing "cycle" is "active_cycles". Could that be it ? Is there any other way to count it? Is there any harm done in using "gputime" instead of this variable ? I don't need absolute numbers, I'm using it to find correlations and analyze code so if "gputime"= "elapsed_cycles" * CONSTANT, I'm perfectly okay with that.
You can use the following command that lists all the events available on each device:
nvprof --query-events
This is not very complete, but it's a good start to understand what these events/metrics are. For instance, with CUDA 5.0 and a CC 3.0 GPU, we get:
elapsed_cycles_sm: Elapsed clocks
elapsed_cycles_sm is the number of elapsed clock cycles per multiprocessor. If you want to measure this metric for your program:
nvprof --events elapsed_cycles_sm ./your_program
max_warps_per_sm is quite straightforward: this is the maximum number of resident warps per multiprocessor. This value depends on the Compute Capability (see the chart here). This is a hardware limit, no matter what your kernels are, at any given time, you will never have more resident warps per multiprocessor than this value.
Also, more information is available in the profiler's online documentation, with descriptions and formulae.
UPDATE
According to this answer:
active_cycles: Number of cycles a multiprocessor has at least one active warp.