I'm having some issues with the CUDA nvprof profiler. Some of the metrics on the site are named differently than in the profiler, and the variables don't seem to be explained anywhere on the site, or for that matter anywhere on the web (I wasn't able to find any valid reference).
I decoded most of those (here: calculating gst_throughput and gld_throughput with nvprof), but I'm still not sure about:
elapsed_cycles
max_warps_per_sm
Does anyone know precisely how to compute these?
I'm trying to use nvprof to assess some 6000 different kernels via the command line, so it is not really viable for me to use the Visual Profiler.
Any help appreciated. Thanks very much!
EDIT:
What I'm using:
CUDA 5.0, GTX 480, which is compute capability 2.0.
What I've already done:
I've made a script that gets the formula for each metric from the profiler documentation site, resolves the dependencies of any given metric, extracts those through nvprof, and then computes the results from them. This involved a (rather large) sed script that changes every occurrence of a variable name that appears on the site into the equivalent name that is actually accepted by the profiler. Basically, I've emulated querying metrics via nvprof. I'm just having problems with these two:
Why there is a problem with these particular variables:
max_warps_per_sm - I don't know whether this is the hardware bound of the compute capability, or another metric/event that I'm somehow missing and that is specific to my program (which wouldn't be a surprise, as some of the variables in the profiler documentation have 3 (!) different names for the same thing).
elapsed_cycles - I don't have elapsed_cycles in the output of nvprof --query-events. There isn't even anything containing the word "elapse", and the only event containing "cycle" is "active_cycles". Could that be it? Is there any other way to compute it? Is there any harm in using "gputime" instead of this variable? I don't need absolute numbers; I'm using it to find correlations and analyze code, so if "gputime" = "elapsed_cycles" * CONSTANT, I'm perfectly okay with that.
You can use the following command that lists all the events available on each device:
nvprof --query-events
This is not very complete, but it's a good start to understand what these events/metrics are. For instance, with CUDA 5.0 and a CC 3.0 GPU, we get:
elapsed_cycles_sm: Elapsed clocks
elapsed_cycles_sm is the number of elapsed clock cycles per multiprocessor. If you want to measure this metric for your program:
nvprof --events elapsed_cycles_sm ./your_program
max_warps_per_sm is quite straightforward: this is the maximum number of resident warps per multiprocessor. This value depends on the Compute Capability (see the chart here). It is a hardware limit: no matter what your kernels are, at any given time you will never have more resident warps per multiprocessor than this value.
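If you prefer to read the limit programmatically rather than from the chart, a minimal sketch (assuming the CUDA runtime API; device 0 is just an example) can derive it from the device properties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0 as an example

        // Max resident warps per SM = max resident threads per SM / warp size.
        // For a CC 2.0 part such as the GTX 480 this gives 1536 / 32 = 48.
        int max_warps_per_sm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("max_warps_per_sm = %d\n", max_warps_per_sm);
        return 0;
    }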
Also, more information is available in the profiler's online documentation, with descriptions and formulae.
UPDATE
According to this answer:
active_cycles: Number of cycles a multiprocessor has at least one active warp.
Related
I want to write a script to profile my cuda application only using the command tool nvprof. At present, I focus on two metrics: GPU utilization and GPU flops32 (FP32).
GPU utilization is the fraction of the time that the GPU is active. The active time of the GPU can easily be obtained with nvprof --print-gpu-trace, but the elapsed time (without overhead) of the application is not clear to me. I use the visual profiler nvvp to visualize the profiling results and calculate the GPU utilization. It seems that the elapsed time is the interval between the first and last API calls, including the overhead time.
GPU flops32 is the number of FP32 instructions the GPU executes per second while it is active. I followed Greg Smith's suggestion (How to calculate Gflops of a kernel) and found that nvprof is very slow at generating the flop_count_sp_* metrics.
So there are two questions that I want to ask:
How to calculate the elapsed time (without overhead) of a CUDA application using nvprof?
Is there a faster way to obtain the gpu flops32?
Any suggestion would be appreciated.
================ Update =======================
For the first question above, the elapsed time without overhead that I meant is actually the session time minus the overhead time shown in the nvvp results:
[screenshot: nvvp results]
You can use NVIDIA's NVTX library to programmatically mark named ranges or points on your timeline. The length of such a range, properly defined, would constitute your "elapsed time", and would show up very clearly in the nvvp visualization tool. Here is a "CUDA pro tip" blog post about doing this:
CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX
and if you want to do this in a more C++-friendly and RAII way, you can use my CUDA runtime API wrappers, which offer a scoped range marker and other utility functions. Of course, with me being the author, take my recommendation with a grain of salt and see what works for you.
About the "Elapsed time" for the session - that's the time between when you start and stop profiling activity. That can either be when the process comes up, or when you explicitly have profiling start. In my own API wrappers, there's a RAII class for that as well: cuda::profiling::scope or of course you can use the C-style API calls explicitly. (I should really write a sample program doing this, I haven't gotten around to that yet, unfortunately).
background:
I have a kernel that I measure with Windows QPC (264 nanosecond tick rate) at 4 ms. But I am in a friendly dispute with a colleague running my kernel who claims it takes 15 ms+ (we are both doing this after warm-up on a Tesla K40). I suspect his issue is with a custom RHEL, custom CUDA drivers, and his "real time" thread groups, but I am not a Linux expert. I know Windows clocks are less than perfect, but this is too big a discrepancy. (Besides, all our timings of the other kernels I wrote agree with his; it is only the first kernel in the chain where the timings disagree.) Smells to me of something outside the kernel.
question:
Anyway, is there a way with CUDA device events (elapsed time) to add something to the CUDA kernel to measure the ENTIRE kernel time, from when the first block starts to the end of the last block? I think this would get us started in figuring out where the problem is. From my reading, it looks like CUDA device events are done on the host, and I am looking for something internal to the GPU.
The only way to time execution from entirely within a kernel is to use the clock() and clock64() functions that are covered in the programming guide.
Since these functions sample a per-multiprocessor counter, and AFAIK there is no specified relationship between these counters from one SM to the next, there is no way to determine, using these functions alone, which thread/warp/block is "first" to execute and which is "last" to execute, assuming your GPU has more than 1 SM. (Even if there were a specified relationship, such as "they are all guaranteed to be the same value on any given cycle", you would still need additional scaffolding as mentioned below.)
While you could certainly create some additional scaffolding in your code to try to come up with an overall execution time (perhaps adding atomics to figure out which thread/warp/block is first and last), there may still be functional gaps in the method. Given the difficulty, it seems that the best method, based on what you've described, is simply to use the profilers, as discussed by @njuffa in the comments. Any of the profilers can provide you with the execution time of a kernel, on any supported platform, with a trivial set of commands.
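For completeness, here is a minimal sketch of the per-SM counter approach under the caveats above: each block records only its own start and end clock64() samples, so per-block durations are meaningful but cross-block comparisons are not.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void timed_kernel(long long *block_start, long long *block_end)
    {
        long long t0 = clock64();     // per-SM counter sampled at block start

        // ... the kernel's real work would go here ...

        __syncthreads();
        if (threadIdx.x == 0) {
            block_start[blockIdx.x] = t0;
            block_end[blockIdx.x]   = clock64();  // sampled after all threads finish
        }
    }

    int main()
    {
        const int blocks = 60;
        long long *d_start, *d_end;
        cudaMalloc(&d_start, blocks * sizeof(long long));
        cudaMalloc(&d_end,   blocks * sizeof(long long));

        timed_kernel<<<blocks, 256>>>(d_start, d_end);
        cudaDeviceSynchronize();

        long long h_start[blocks], h_end[blocks];
        cudaMemcpy(h_start, d_start, sizeof(h_start), cudaMemcpyDeviceToHost);
        cudaMemcpy(h_end,   d_end,   sizeof(h_end),   cudaMemcpyDeviceToHost);

        // Per-block elapsed cycles; do NOT subtract values taken on different SMs.
        for (int b = 0; b < blocks; ++b)
            printf("block %d: %lld cycles\n", b, h_end[b] - h_start[b]);

        cudaFree(d_start);
        cudaFree(d_end);
        return 0;
    }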
Assume I have an NVIDIA K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e., instead of using all 2880, only use 400 cores, for example). Is it possible? Is it even logical to do this?
In addition, is there any way to see how many cores the GPU is using when I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. Not to say it couldn't be useful for something. For example, if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of streaming multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think that for 99% of cases manual shared memory methods would be better).
The way you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional at the start of the kernels. You would have to have only 1 block per streaming multiprocessor (SM) and then return early from each kernel based on which kernel you want on which SMs (see the sketch after the links below).
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware- and CUDA-version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs:
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly:
https://gist.github.com/allanmac/4751080
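A minimal sketch of what the linked gist boils down to, with the usual caveat that this is a hack and the SM assignment is not guaranteed to be stable; the kernel body is a placeholder:

    #include <cstdio>

    // Read the SM this thread is currently running on, and the number of SMs.
    __device__ unsigned int smid()  { unsigned int r; asm("mov.u32 %0, %%smid;"  : "=r"(r)); return r; }
    __device__ unsigned int nsmid() { unsigned int r; asm("mov.u32 %0, %%nsmid;" : "=r"(r)); return r; }

    // Hypothetical kernel that only keeps blocks landing on SMs [0, sm_limit);
    // blocks that land elsewhere return immediately, leaving those SMs free.
    __global__ void restricted_kernel(int sm_limit)
    {
        if (smid() >= (unsigned int)sm_limit)
            return;

        // ... real work here ...
        if (threadIdx.x == 0)
            printf("block %d running on SM %u of %u\n", blockIdx.x, smid(), nsmid());
    }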
Not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature for partitioning GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but would like to learn about them.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can be executed in parallel, you want to load the GPU fully and most effectively. But it seems that, left on its own, the GPU can occupy all SMs with single blocks of one kernel. I.e., if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I saw such an effect. Granted, this kernel will be faster (maybe 1.5x compared to 4 blocks of 256 threads per SM), but this will not be effective when you have other work.
The GPU can't know whether we are going to run another kernel after this one with 30 blocks or not, i.e. whether it would be more effective to spread it across all SMs or not. So some manual way to express this should exist.
As to question 3, I suppose the GPU profiling tools should show this: Visual Profiler, and the newer Parallel Nsight and Nsight Compute. But I haven't tried them. This will not be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary:
@ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way. Let's say you made a game that runs some heavy kernels (say, for graphics rendering). Then something unusual happens and you need to execute some not-so-heavy kernel as fast as possible. With preemption you can somehow unload the running kernels and execute this high-priority one. This improves the response time of the high-priority kernel a lot.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least in the case of a stream of kernels, and if you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all kernels while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load-balancing between SMs.
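For reference, the launch-cost reduction the quote mentions comes from recording the launch sequence once and replaying it; a minimal sketch of stream capture (kernel names and launch configurations are placeholders) might look like:

    #include <cuda_runtime.h>

    __global__ void kernel_a() { /* placeholder */ }
    __global__ void kernel_b() { /* placeholder */ }

    int main()
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Record the sequence of launches once into a graph...
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        kernel_a<<<64, 256, 0, stream>>>();
        kernel_b<<<64, 256, 0, stream>>>();
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

        // ...then replay it repeatedly with much lower CPU launch cost.
        for (int i = 0; i < 1000; ++i)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        return 0;
    }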
I am trying to get the information about the overall utilization of a GPU (mine is an NVIDIA Tesla K20, running on Linux) during a period of time.
By "overall" I mean something like, how many streaming multi-processors are scheduled to run, and how many GPU cores are scheduled to run (I suppose if a core is running, it will run at its full speed/frequency?). It would be also nice if I can get the overall utilization measured by flops.
Of course before asking the question here, I've searched and investigated several existing tools/libraries, including NVML (and nvidia-smi built on top of it), CUPTI (and nvprof), PAPI, TAU, and Vampir. However, it seems (but I am not sure yet) none of them could provide me with the needed information. E.g., NVML can report "GPU Utilization" by percent, but according to its document/comment, this utilization is "Percent of time over the past second during which one or more kernels was executing on the GPU", which is apparently not accurate enough. For nvprof, it can report flops for individual kernel (with very high overhead), but I still don't know how well the GPU is utilized.
PAPI seems to be able to get instruction count, but it cannot different float point operation from others. I haven't tried other two tools (TAU and Vampir) yet, but I doubt they can meet my need.
So I am wondering is it even possible to get the overall utilization information of a GPU? If not, what is the best alternative to estimate it? The purpose I am doing this is to find a better schedule for multiple jobs running on top of GPU.
I am not sure if I've described my question clearly enough, so please let me know if there is anything I can add for a better description.
Thank you very much!
The NVIDIA Nsight plugin for Visual Studio has very nice graphical features that give the statistics you want. But I have the feeling that you have a Linux machine, so Nsight won't work.
I suggest using the NVIDIA Visual Profiler.
The metrics reference is fairly complete and can be found here. This is how I would gather the data you are interested in:
Active SMX units - look at sm_efficiency. It should be close to 100%. If it's lower, then some of the SMX units are not active.
Active cores / SMX - This depends. The K20 has a quad warp scheduler with dual instruction issue. A warp drives 32 SM cores. Each K20 SMX has 192 SP cores and 64 DP cores. You need to look at the ipc metric (instructions per cycle). If your program is DP and the IPC is 2, then you have 100% utilization (for the entire workload execution). That means that 2 warps scheduled instructions, so all your 64 DP cores were active during all the cycles. If your program is SP, your IPC theoretically should be 6. However, in practice this is very hard to achieve. An IPC of 6 means that 3 of the schedulers issued 2 warps each, and gave work to 3 x 2 x 32 = 192 SP cores.
FLOPS - Well, if your program uses floating-point operations, then I would look at flop_count_sp and divide it by the elapsed time in seconds.
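For example (entirely hypothetical numbers): if nvprof reports flop_count_sp = 5.0e10 for a kernel run that takes 25 ms, the achieved rate is 5.0e10 / 0.025 s = 2.0e12 FLOP/s, i.e. 2 TFLOP/s single precision.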
Regarding frequency, I wouldn't worry, but it doesn't hurt to check with nvidia-smi. If your card has enough cooling, then it will stay at peak frequency while running.
Check the metrics reference as it will provide you much more useful information.
I think nvprof also supports multiple processes. Check here. You can also filter by process ID. So you can collect these metrics "multi-context" or "single-context". In the metrics reference table, there is a column that states whether they can be collected in both cases.
Note: the metrics are computed using the HW performance counters and driver-level analysis. If the NVIDIA tools cannot provide more than this, then it's not likely that other tools will be able to offer more. But I think that properly combining the metrics can tell you everything you want to know about your app's run.
Is it possible to run different threads on different multiprocessors, similar to CPU cores?
Suppose I have 2 large arrays a, b and I want to compute both the sum and the difference. Let's say I have 2 multiprocessors on my device. Is it possible to run both kernel functions (which compute the sum and the difference) concurrently on 2 different multiprocessors?
Using your example of computing both the sum and the difference, the best performance is probably going to be achieved if you compute both at the same time (i.e. in the same kernel, as sketched below).
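A minimal sketch of that idea, interpreting "sum and difference" element-wise purely for illustration (the kernel and array names are placeholders); the point is that a and b are each read from global memory only once:

    // One fused kernel computing both results in a single pass over the inputs.
    __global__ void sum_and_diff(const float *a, const float *b,
                                 float *sum, float *diff, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i];
            float y = b[i];
            sum[i]  = x + y;
            diff[i] = x - y;
        }
    }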
Assuming this is not possible for some reason, then if your arrays are very large then the best performance may be to use the whole GPU (i.e. multiple multiprocessors) to compute the result in which case it doesn't matter too much that you do one after the other.
For both of the above cases I would strongly recommend you check out the reduction sample in the SDK which walks you through a naive implementation up to a pretty quick version with good documentation.
Having said all of that, if the amount of work is sufficiently small that you would not be fully utilising the whole GPU for one of your computations then there are two ways to run different computations on different multiprocessors:
Use "Concurrent Kernels", where multiple kernels run on the same GPU at the same time. See the CUDA Programming Guide for more information and check out the concurrentKernels sample in the SDK, in essence you manual tell the scheduler that there is no dependency between the two (by using CUDA streams) which allows thenm to be executed simultaneously.
Have a switch on the blockIdx to select which code to execute.
The first of these is far preferable if your hardware supports it (you will need Compute Capability 2.0 or greater), since it is far simpler to read and maintain.
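A minimal sketch of option 1, assuming two independent placeholder kernels and Compute Capability 2.0 or greater:

    #include <cuda_runtime.h>

    __global__ void compute_sum(const float *a, const float *b, float *out, int n)  { /* placeholder */ }
    __global__ void compute_diff(const float *a, const float *b, float *out, int n) { /* placeholder */ }

    int main()
    {
        const int n = 1 << 20;
        float *d_a, *d_b, *d_sum, *d_diff;
        cudaMalloc(&d_a,    n * sizeof(float));
        cudaMalloc(&d_b,    n * sizeof(float));
        cudaMalloc(&d_sum,  n * sizeof(float));
        cudaMalloc(&d_diff, n * sizeof(float));
        // ... fill d_a and d_b (omitted) ...

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // The two launches have no dependency on each other, so the hardware
        // scheduler is free to run them concurrently on different multiprocessors.
        compute_sum <<<32, 256, 0, s1>>>(d_a, d_b, d_sum,  n);
        compute_diff<<<32, 256, 0, s2>>>(d_a, d_b, d_diff, n);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_sum); cudaFree(d_diff);
        return 0;
    }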
Yes, using Fermi devices and multiple streams.