CPU usage calculation in the Time Profiler instrument - Xcode 7

When I use Time Profiler in Instruments, it shows the CPU usage for each core (or logical core) as well as an overall "CPU usage". I'm wondering how that overall CPU usage is calculated from the per-core usage. I checked the data at a specific timestamp and it is neither the sum of the cores nor their average. Here is a snapshot of the panel.

The CPU usage is neither the sum nor the average. In contrast to OS CPU usage (from top, say), profiler CPU usage is generally taken from an actual hardware counter in the processor. This also makes it hardware dependent, meaning its exact meaning on an Intel processor differs from that on an AMD processor. So why are these measurements useful? Because the ratios and values are correct when compared to values taken over the same interval or at the same instant, and the average values are what you expect them to be.
When profiling, look at correlations first over intervals and then between intervals. Afterwards, zoom in on more specific counters, such as cache misses or pipeline stalls.
You might check out the Intel optimization documentation. It's pretty good in my experience. I'll post a reference in the comment section if I can find the time.
PS: The names "Core 4" and "Core 5 (logical)" above are really not accurate (not your fault). They imply that the "logical" core is somehow inferior to the non-logical one. When a CPU is executing multiple hardware threads on one core, what Intel in marketing speak calls hyperthreading, there is no difference between Core 4 and Core 5: they behave identically on the physical core, meaning they are both "logical".

Related

Is it even possible to get the overall utilization of a GPU over a period of time?

I am trying to get information about the overall utilization of a GPU (mine is an NVIDIA Tesla K20, running on Linux) over a period of time.
By "overall" I mean something like how many streaming multiprocessors are scheduled to run, and how many GPU cores are scheduled to run (I suppose if a core is running, it runs at its full speed/frequency?). It would also be nice if I could get the overall utilization measured in flops.
Of course, before asking the question here, I searched and investigated several existing tools/libraries, including NVML (and nvidia-smi, which is built on top of it), CUPTI (and nvprof), PAPI, TAU, and Vampir. However, it seems (though I am not sure yet) that none of them can provide the information I need. E.g., NVML can report "GPU Utilization" as a percentage, but according to its documentation/comments, this utilization is the "percent of time over the past second during which one or more kernels was executing on the GPU", which is apparently not accurate enough. nvprof can report flops for individual kernels (with very high overhead), but I still don't know how well the GPU as a whole is utilized.
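For concreteness, the coarse NVML figure referred to here comes from nvmlDeviceGetUtilizationRates. A minimal polling sketch (illustrative only, not the asker's code; it assumes the NVML header and library that ship with the driver/CUDA toolkit are available) looks like this:

    /* Minimal NVML polling sketch (illustrative): prints the coarse
     * "GPU Utilization" percentage discussed above once per second.
     * Build (paths may differ): gcc nvml_poll.c -lnvidia-ml -o nvml_poll */
    #include <stdio.h>
    #include <unistd.h>
    #include <nvml.h>

    int main(void) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;

        if (nvmlInit() != NVML_SUCCESS) return 1;
        if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

        for (int i = 0; i < 10; ++i) {
            if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                printf("gpu: %u%%  memory: %u%%\n", util.gpu, util.memory);
            sleep(1);  /* the counter is an average over the last sample period */
        }
        nvmlShutdown();
        return 0;
    }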
PAPI seems to be able to get an instruction count, but it cannot distinguish floating-point operations from other instructions. I haven't tried the other two tools (TAU and Vampir) yet, but I doubt they can meet my needs.
So I am wondering: is it even possible to get overall utilization information for a GPU? If not, what is the best way to estimate it? My purpose is to find a better schedule for multiple jobs running on the GPU.
I am not sure if I've described my question clearly enough, so please let me know if there is anything I can add for a better description.
Thank you very much!
The NVIDIA Nsight plugin for Visual Studio has very nice graphical features that give the statistics you want. But I have the feeling you are on a Linux machine, so Nsight won't work.
I suggest using the NVIDIA Visual Profiler.
The metrics reference is fairly complete and can be found here. This is how I would gather the data you are interested in:
Active SMX units - look at sm_efficiency. It should be close to 100%. If it's lower, then some of the SMX units are not active.
Active cores / SMX - This depends. The K20 has a quad-warp scheduler with dual instruction issue, and each warp instruction feeds 32 cores. Each K20 SMX has 192 SP cores and 64 DP cores. You need to look at the ipc metric (instructions per cycle). If your program is DP and the IPC is 2, then you have 100% utilization (over the entire workload execution): two warps issued instructions each cycle, so all 64 DP cores were active during all cycles. If your program is SP, your IPC should theoretically be 6, although in practice this is very hard to reach. An IPC of 6 means that 3 of the schedulers each issued 2 warps, giving work to 3 x 2 x 32 = 192 SP cores.
FLOPS - If your program uses floating-point operations, look at flop_count_sp and divide it by the elapsed seconds.
Regarding frequency, I wouldn't worry but it doesn't harm to check with nvidia-smi. If your card has enough cooling then it will stay at peak frequency while running.
Check the metrics reference as it will provide you much more useful information.
I think nvprof also supports multiple processes. Check here. You can also filter by process ID, so you can collect these metrics either "multi-context" or "single-context". The metrics reference table has a column that states whether each metric can be collected in both cases.
Note: the metrics are computed using the hardware performance counters and driver-level analysis. If NVIDIA's own tools cannot provide more than this, then it's unlikely that other tools will be able to offer more. But I think that properly combining the metrics can tell you everything you want to know about your application's run.
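To make that workflow concrete, here is a sketch (mine, not part of the answer) of turning those metrics into numbers. The nvprof invocation uses the metric names mentioned above (flop_count_sp, sm_efficiency, ipc); the values plugged in are placeholders, and the utilization estimate follows the K20 reasoning above (32 threads per warp instruction, 192 SP cores per SMX), so treat it as a rough estimate only.

    /* Sketch: turn profiler metrics into rough utilization numbers.
     * Collect the metrics first, e.g.:
     *   nvprof --metrics flop_count_sp,sm_efficiency,ipc ./my_app
     * Then plug the reported values in below (illustrative placeholders). */
    #include <stdio.h>

    int main(void) {
        double flop_count_sp = 1.2e12;  /* reported by nvprof (placeholder) */
        double elapsed_s     = 2.5;     /* wall time of the run in seconds (placeholder) */
        double ipc           = 3.1;     /* instructions per cycle (placeholder) */

        /* Achieved throughput, as suggested in the FLOPS point above. */
        printf("achieved: %.1f GFLOP/s\n", flop_count_sp / elapsed_s / 1e9);

        /* K20 (GK110) reasoning from the answer: each warp instruction feeds
         * 32 cores, an SMX has 192 SP cores, so an IPC of 6 would saturate them. */
        printf("SP core utilization (rough): %.0f%%\n", 100.0 * ipc * 32.0 / 192.0);
        return 0;
    }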

Is it fair to compare SSE/AVX units to GPU cores?

I have a presentation to give to people who have (almost) no clue how a GPU works. I think saying that a GPU has a thousand cores where a CPU has only four to eight of them is nonsense. But I want to give my audience an element of comparison.
After a few months working with NVIDIA's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if there is a name for that at Intel). Is it fair? After all, when looking at the assembly level, those programming models have much in common (at least with GCN; take a look at p2-6 of the ISA manual).
This article states that a Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVIDIA parlance, how many CUDA cores does this processor have? I would say 8 per CPU core for 32-bit operations, but this is just a guess based on the SIMD width.
Of course there are many other things to take into account when comparing CPU and GPU hardware, but this is not what I'm trying to do. I just have to explain how the thing works.
PS: All pointers to CPU hardware documentation or CPU/GPU presentations are greatly appreciated!
EDIT:
Thanks for your answers; sadly I had to choose only one of them. I marked Igor's answer because it sticks closest to my initial question and gave me enough information to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very cautious about making this kind of comparison. After all, even within the GPU world the term "core" has quite different capabilities depending on the context: the new AMD GCN core is quite different from the old VLIW4 one, which itself is quite different from a CUDA core.
Besides that, you will bring more confusion than understanding to your audience if you make just one small comparison with the CPU and leave it at that. If I were you, I'd still go for a more detailed (but still quick) comparison. For instance, someone used to CPUs and with little knowledge of GPUs might wonder how a GPU can have so many registers even though registers are so expensive (in the CPU world). An explanation of that question is given at the end of this post, as well as some more GPU vs CPU comparisons.
This other article gives a nice comparison between these two kinds of processing units by explaining how GPUs work, but also how they evolved, and by showing the differences with CPUs. It addresses topics like data flow and memory hierarchy, but also what kinds of applications a GPU is useful for. After all, the power a GPU can deliver is accessible (efficiently) only for some types of problems.
Personally, if I had to make a presentation about GPUs and could make only one reference to CPUs, it would be this: presenting the problems a GPU can solve efficiently vs. those a CPU handles better.
As a bonus, even though it's not directly related to your presentation, here is an article that puts GPGPU in perspective, showing that some of the speedups claimed by some people are overrated (this is related to my last point, by the way :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions, which operate on 256-bit vectors. Each 256-bit vector can represent 8 single-precision FP numbers, 8 32-bit integers, or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply-add (FMA) instruction technically performs 2 single-precision operations per lane. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per vector lane: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
Putting it all together, a Haswell core can perform 8 * 2 * 2 = 32 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
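As a concrete illustration of that accounting (my sketch, not part of the original answer), a single AVX2 FMA intrinsic already covers the 8-lane and FMA factors, and the per-core peak is just the product of the three numbers:

    /* Illustrative only: one _mm256_fmadd_ps performs 8 multiplies + 8 adds,
     * i.e. 16 single-precision operations, and Haswell can issue two such
     * instructions per core per cycle.  Compile with: gcc -mavx2 -mfma */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m256 a = _mm256_set1_ps(1.0f);
        __m256 b = _mm256_set1_ps(2.0f);
        __m256 c = _mm256_set1_ps(3.0f);
        __m256 d = _mm256_fmadd_ps(a, b, c);   /* d = a * b + c across 8 lanes */

        float out[8];
        _mm256_storeu_ps(out, d);
        printf("lane 0: %.1f\n", out[0]);      /* 1*2 + 3 = 5.0 */

        int lanes = 8, issue_ports = 2, flops_per_fma = 2;
        printf("peak SP ops/cycle/core: %d\n", lanes * issue_ports * flops_per_fma);
        return 0;
    }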
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.
One way I like to compare CPUs and GPUs is by the number of operations per second they can perform. But of course, don't compare one CPU core to a whole multi-core GPU.
A Sandy Bridge core can execute 2 AVX operations per cycle, i.e., crunch 8 double-precision numbers per cycle. Hence, a machine with 16 Sandy Bridge cores clocked at 2.6 GHz has a peak of 333 Gflops.
A K20 compute module (GK110) has a peak of 1170 Gflops, that is, 3.5 times more. This is a fair comparison in my opinion, and it should be emphasized that peak performance is much easier to reach on a CPU (some applications reach 80%-90% of peak) than on a GPU (the best cases I know of are below 50% of peak).
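Spelling out that arithmetic (my sketch; the K20 side assumes the commonly quoted figures of 13 SMX units with 64 DP units each at 706 MHz, which is not stated in the answer itself):

    /* Back-of-envelope peak DP comparison used above (illustrative only). */
    #include <stdio.h>

    int main(void) {
        /* CPU: 16 Sandy Bridge cores, 2.6 GHz, 2 AVX ops/cycle x 4 DP lanes. */
        double cpu_peak = 16 * 2.6e9 * 2 * 4;          /* ~333 GFLOP/s */

        /* GPU: K20 (GK110), 13 SMX x 64 DP units, 706 MHz, 2 flops per FMA. */
        double gpu_peak = 13 * 64 * 0.706e9 * 2;       /* ~1170 GFLOP/s */

        printf("CPU peak: %.0f GFLOP/s\n", cpu_peak / 1e9);
        printf("GPU peak: %.0f GFLOP/s\n", gpu_peak / 1e9);
        printf("ratio: %.1fx\n", gpu_peak / cpu_peak); /* ~3.5x */
        return 0;
    }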
So to summarize, I would not go into architectural details, but rather state a few sheer numbers with the caveat that peak performance is often far out of reach on GPUs.
It's fairer to compare a GPU to vectorized CPU units; however, if your audience has zero idea of how GPUs work, it seems fair to assume they have a similarly limited knowledge of vectorized SSE instructions.
For audiences like this, it's important to point out the high-level differences, like how blocks of "cores" on the GPU share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.

right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU. But I'm not sure how to compare the performance fairly. For example, if I compare the performance of an old CPU with a new GPU, then I will have immense speedup.
Another question: how can I compare my CUDA program with another CUDA program reported in a paper (both run on different GPUs and I cannot access the source code)?
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. The per-core normalization will help smooth out large differences between, say, a 200-core and a 2000-core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
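A hedged sketch of the kind of end-to-end timing recommended above, using CUDA events and including the host-to-device and device-to-host copies (the kernel, sizes, and values are placeholders of my own, not the asker's code):

    /* Sketch: time transfers + kernel together so the reported rate reflects
     * end-to-end work done, not just on-device compute. */
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void my_kernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;            // placeholder workload
    }

    int main() {
        const int n = 1 << 24;
        size_t bytes = n * sizeof(float);
        float *h_in = (float*)malloc(bytes), *h_out = (float*)malloc(bytes);
        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);                      // start BEFORE the H2D copy
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
        my_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);                       // stop AFTER the D2H copy
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("end-to-end: %.3f ms, %.2f M elements/s\n", ms, n / (ms * 1e3));
        return 0;
    }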
The speed of a CUDA program on different GPUs depends on many factors of the GPU, such as memory bandwidth, core clock speed, number of cores, and the number of threads/registers/amount of shared memory available, so it is difficult to compare performance across different GPUs.

Why do GPU based algorithms perform faster

I just implemented an algorithm on the GPU that computes the difference between consecutive elements of an array. I compared it with a CPU-based implementation and noticed that for large arrays, the GPU-based implementation performs faster.
I am curious WHY the GPU-based implementation performs faster. Please note that I know the surface reasoning: a GPU has several cores and can thus do the operation in parallel, i.e., instead of visiting each index sequentially, we can assign a thread to compute the difference for each index.
But can someone give me a deeper reason why GPUs perform faster? What is so different about their architecture that they can beat a CPU-based implementation?
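For reference, a minimal sketch of the kind of kernel described (one thread per output element; my illustration, not the asker's code):

    /* Each thread computes one difference between consecutive elements. */
    __global__ void adjacent_diff(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n - 1)
            out[i] = in[i + 1] - in[i];   // independent of every other output
    }

    // Launch with enough threads to cover the array, e.g.:
    //   adjacent_diff<<<(n + 255) / 256, 256>>>(d_in, d_out, n);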
They don't perform faster, generally.
The point is: Some algorithms fit better into a CPU, some fit better into a GPU.
The execution model of GPUs differs (see SIMD), the memory model differs, the instruction set differs... The whole architecture is different.
There is no obvious way to compare a CPU versus a GPU in general. You can only discuss whether (and why) CPU implementation A of an algorithm is faster or slower than GPU implementation B of that algorithm.
This ended up kind of vague, so here is the tip of the iceberg of concrete reasons: the strong side of the CPU is random memory access, branch prediction, etc. The GPU excels when there's a high amount of computation with high data locality, so that your implementation can achieve a nice compute-to-memory-access ratio. SIMD makes GPU implementations slower than CPU ones where there's a lot of unpredictable branching to many code paths, for example.
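A small illustration of the branching point (my sketch): within a warp, the two sides of a data-dependent branch are executed one after the other with part of the lanes masked off each time, so heavily divergent code loses much of the SIMD advantage.

    /* Divergence sketch: if flag[i] differs across the 32 threads of a warp,
     * the warp executes BOTH branches serially, masking lanes as needed. */
    __global__ void divergent(const int *flag, float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (flag[i])                 // unpredictable, data-dependent branch
            x[i] = x[i] * 2.0f;      // taken by some lanes of the warp
        else
            x[i] = x[i] + 1.0f;      // taken by the others, executed afterwards
    }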
The real reason is that a GPU not only has several cores, it has many cores, typically hundreds of them! Each GPU core, however, is much slower than a low-end CPU core.
But the programming model is not at all like that of multi-core CPUs, so most programs cannot be ported to, or benefit from, GPUs.
While some answers have already been given here and this is an old thread, I just thought I'd add this for posterity and what not:
The main reason that CPUs and GPUs differ so much in performance for certain problems is design decisions about how to allocate the chip's resources. CPUs devote much of their chip space to large caches, instruction decoders, peripheral and system management, and so on. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated). By contrast, GPUs devote their chip space to packing as many floating-point ALUs onto the chip as they can possibly get away with.
The original purpose of GPUs was to multiply matrices as fast as possible (because that is the primary type of computation involved in graphics rendering). Since matrix multiplication is an embarrassingly parallel problem (i.e., each output value is computed completely independently of every other output value) and the code path for each of those computations is identical, chip space can be saved by having several ALUs follow the instructions decoded by a single instruction decoder, since they are all performing the same operations at the same time. By contrast, each of a CPU's cores must have its own separate instruction decoder, since the cores are not following identical code paths, which makes each of a CPU's cores much larger on the die than a GPU's cores.
Since the primary computations performed in matrix multiplication are floating-point multiplication and floating-point addition, GPUs are implemented such that each of these is a single-cycle operation and, in fact, they even contain a fused multiply-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than a typical CPU, where floating-point multiplication is often a many-cycle operation. Again, the trade-off is that the chip space is devoted to floating-point math hardware, and other instructions (such as control flow) are often much slower per core than on a CPU, or sometimes simply don't exist on a GPU at all.
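As a tiny illustration of that fused multiply-add (my sketch; fmaf is the standard CUDA device math function for it):

    /* out = a * x + y with one fused multiply-add per element. */
    __global__ void axpy(float a, const float *x, const float *y,
                         float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = fmaf(a, x[i], y[i]);   // one FMA: multiply and add fused
    }

    // Launch example: axpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, d_out, n);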
Also, since GPU cores run at much lower clock rates than typical CPU cores and don't contain as much complicated circuitry, they don't produce as much heat per core (or use as much power per core.) This allows more of them to be packed into the same space without overheating the chip and also allows a GPU with 1,000+ cores to have similar power and cooling requirements to a CPU with only 4 or 8 cores.

Estimating increase in speed when changing NVIDIA GPU model

I am currently developing a CUDA application that will most certainly be deployed on a GPU much better than mine. Given another GPU model, how can I estimate how much faster my algorithm will run on it?
You're going to have a difficult time, for a number of reasons:
Clock rate and memory speed only have a weak relationship to code speed, because there is a lot more going on under the hood (e.g., thread context switching) that gets improved/changed for almost all new hardware.
Caches have been added to new hardware (e.g., Fermi) and unless you model cache hit/miss rates, you'll have a tough time predicting how this will affect the speed.
Floating point performance in general is very dependent on model (e.g.: Tesla C2050 has better performance than the "top of the line" GTX-480).
Register usage per device can change for different devices, and this can also affect performance; occupancy will be affected in many cases.
Performance can be improved by targeting specific hardware, so even if your algorithm is perfect for your GPU, it could be better if you optimize it for the new hardware.
Now, that said, you can probably make some predictions if you run your app through one of the profilers (such as the NVIDIA Compute Profiler), and you look at your occupancy and your SM utilization. If your GPU has 2 SMs and the one you will eventually run on has 16 SMs, then you will almost certainly see an improvement, but not specifically because of that.
So, unfortunately, it isn't easy to make the type of predictions you want. If you're writing something open source, you could post the code and ask others to test it with newer hardware, but that isn't always an option.
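If it helps, here is a quick sketch (mine, not from the answer) for dumping the device properties that the caveats above hinge on, so you can at least see what changed between the old and the new GPU:

    /* Print the device properties most relevant to the points above. */
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("%s: SMs=%d, core clock=%d MHz, mem clock=%d MHz, "
                   "bus=%d-bit, cc=%d.%d, regs/block=%d, shmem/block=%zu\n",
                   p.name, p.multiProcessorCount, p.clockRate / 1000,
                   p.memoryClockRate / 1000, p.memoryBusWidth,
                   p.major, p.minor, p.regsPerBlock, p.sharedMemPerBlock);
        }
        return 0;
    }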
This can be very hard to predict for certain hardware changes and trivial for others. Highlight the differences between the two cards you're considering.
For example, the change could be as trivial as: if I had purchased one of those EVGA water-cooled behemoths, how much better would it perform than a standard GTX 580? This is just an exercise in computing the difference in the limiting clock speed (memory or GPU clock). I've also encountered this question when wondering whether I should overclock my card.
If you're going to a similar architecture, say GTX 580 to Tesla C2070, you can make a similar comparison of clock speed differences, but you have to be careful about the single/double-precision issue.
If you're doing something much more drastic, say going from a mobile card -- GTX 240M -- to a top of the line card -- Tesla C2070 -- then you may not get any performance improvement at all.
Note: Chris is very correct in his answer, but I wanted to stress this caution because I envision this common work path:
One says to the boss:
So I've heard about this CUDA thing... I think it could make function X much more efficient.
Boss says you can have 0.05% of work time to test out CUDA -- hey we already have this mobile card, use that.
One year later... So CUDA could get us a three fold speedup. Could I buy a better card to test it out? (A GTX 580 only costs $400 -- less than that intern fiasco...)
You spend the $$, buy the card, and your CUDA code runs slower.
Your boss is now upset. You've wasted time and money.
So what happened? Developing on an old card (think 8800, 9800, or even the mobile GTX 2XX with around 30 cores) leads you to optimize and design your algorithm in a very different way than you would to efficiently utilize a card with 512 cores. Caveat emptor: you get what you pay for -- those awesome cards are awesome -- but your code may not run faster.
Warning issued, so what's the walk-away message? When you get that nicer card, be sure to invest time in tuning, testing, and possibly redesigning your algorithm from the ground up.
OK, so with that said, a rule of thumb? GPUs get twice as fast every six months. So if you're moving from a card that's two years old to a top-of-the-line card, claim to your boss that it will run between 4 and 8 times faster (and if you get the full 16-fold improvement, bravo!)