Calculating achieved bandwidth and FLOPS/GFLOPS, and evaluating CUDA kernel performance - cuda

Most of the papers show the flops/Gflops and achieved bandwidth for their CUDA kernels. I have also read answers on stackoverflow for the following questions:
How to evaluate CUDA performance?
How Do You Profile & Optimize CUDA Kernels?
How to calculate Gflops of a kernel
Counting FLOPS/GFLOPS in program - CUDA
How to calculate the achieved bandwidth of a CUDA kernel
Most of it seems OK, but I still don't feel comfortable calculating these things. Can anyone write a simple CUDA kernel? Then give the output of deviceQuery. Then compute step by step the flops/Gflops and achieved bandwidth for this kernel. Then show the Visual Profiler results for this kernel. I.e. show the results in detail with all the information obtained step by step for this simple CUDA kernel. That would be really helpful for most of us. Thanks!

Nsight Visual Studio Edition 2.1 and Above
The information you requested is available if you collect the Achieved FLOPS experiment and the Memory Statistics - Buffers experiment.
Visual Profiler 4.2 and Above
Achieved Bandwidth: when you mouse over a kernel in the Timeline, this information is available in the Properties Pane under Memory\DRAM Utilization.
The profiler cannot collect a FLOPS count yet. This can be done by running cuobjdump -sass to view the assembly code. Step through the kernel and count single and double precision floating-point instructions, multiplying FMA and DFMA operations by 2. Each instruction should also be multiplied by the number of predicated-true threads. You also have to account for control flow. This is not fun and requires someone with a strong knowledge of the instruction set. This may be better accomplished by single stepping the assembly in the debugger. The duration of the kernel is available in the Visual Profiler Properties Pane and Details Pane as Duration.
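As a sanity check on that arithmetic, here is a small hedged sketch; all counts below are made-up placeholders for what you would obtain from the SASS walk-through and the profiler's Duration field:

#include <cstdio>

int main() {
    // Hypothetical totals, already multiplied by the predicated-true thread counts:
    double fp_add_mul_ops = 1.20e9;  // FADD/FMUL-type instructions executed
    double fp_fma_ops     = 0.80e9;  // FFMA instructions executed
    double total_flops    = fp_add_mul_ops + 2.0 * fp_fma_ops;  // each FMA counts as 2 FLOPs

    double duration_s = 3.2e-3;      // kernel Duration reported by the profiler, in seconds
    double gflops     = total_flops / duration_s / 1.0e9;
    printf("Achieved: %.1f GFLOP/s\n", gflops);  // ~875 GFLOP/s with these placeholder numbers
    return 0;
}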

You could follow the calculations of Mark Harris in Optimizing Parallel Reductions in CUDA. There he uses the amount of input data as the base and divides it by the kernel execution time. In the examples he uses 2^22 ints, i.e. 0.016777216 GB of input data. The first kernel took 8.054 ms, which is an achieved bandwidth of 2.083 GB/s.
After several optimizations he approaches 62.671 GB/s and compares it to the peak bandwidth of the GPU used, which is 86.4 GB/s.
Although he used ints, you can easily adapt the same approach to flops/Gflops.
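As a concrete, hedged sketch of the same bookkeeping using cudaEvent timing (the kernel, sizes, and launch configuration are only illustrative): Harris divides the input bytes by the kernel time, while for a copy kernel like the one below you would normally count bytes read plus bytes written:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: reads n floats and writes n floats.
__global__ void copy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 22;                    // 2^22 elements, echoing Harris's example size
    const size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    // Achieved bandwidth = (bytes read + bytes written) / elapsed time, in GB/s (1 GB = 1e9 bytes).
    double gbps = (2.0 * bytes) / (ms * 1.0e6);
    printf("%.3f ms, %.3f GB/s\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}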

Related

How to profile the CUDA application only by nvprof

I want to write a script to profile my CUDA application using only the command-line tool nvprof. At present, I focus on two metrics: GPU utilization and GPU flops32 (FP32).
GPU utilization is the fraction of the time that the GPU is active. The active time of the GPU can easily be obtained from nvprof --print-gpu-trace, while the elapsed time (without overhead) of the application is not clear to me. I use the visual profiler nvvp to visualize the profiling results and calculate the GPU utilization. It seems that the elapsed time is the interval between the first and last API call, including the overhead time.
GPU flops32 is the number of FP32 instructions the GPU executes per second while it is active. I follow Greg Smith's suggestion (How to calculate Gflops of a kernel) and find that nvprof is very slow at generating the flop_count_sp_* metrics.
So there are two questions that I want to ask:
How to calculate the elapsed time (without overhead) of a CUDA application using nvprof?
Is there a faster way to obtain the gpu flops32?
Any suggestion would be appreciated.
================ Update =======================
For the first question above, the elapsed time without overhead that I mean is actually the session time minus the overhead time shown in the nvvp results:
[screenshot: nvvp results]
You can use NVIDIA's NVTX library to programmatically mark named ranges or points on your timeline. The length of such a range, properly defined, would constitute your "elapsed time", and would show up very clearly in the nvvp visualization tool. Here is a "CUDA pro tip" blog post about doing this:
CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX
and if you want to do this in a more C++-friendly and RAII way, you can use my CUDA runtime API wrappers, which offer a scoped range marker and other utility functions. Of course, with me being the author, take my recommendation with a grain of salt and see what works for you.
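As a minimal sketch of the plain C NVTX approach (not the wrapper library above; the kernel and range name are made up, and you link against -lnvToolsExt), everything between push and pop shows up as one named bar on the nvvp timeline, whose length is the "elapsed time" of the region you care about:

#include <cuda_runtime.h>
#include <nvToolsExt.h>   // NVTX header shipped with the CUDA toolkit

__global__ void work_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    nvtxRangePushA("measured region");              // start of the named range
    work_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();
    nvtxRangePop();                                 // end of the named range

    cudaFree(d_x);
    return 0;
}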
About the "Elapsed time" for the session - that's the time between when you start and stop profiling activity. That can either be when the process comes up, or when you explicitly have profiling start. In my own API wrappers, there's a RAII class for that as well: cuda::profiling::scope or of course you can use the C-style API calls explicitly. (I should really write a sample program doing this, I haven't gotten around to that yet, unfortunately).

Interpreting NVIDIA Visual Profiler outputs

I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.
However, I don't seem to fully understand the implications of the outputs I get, and I am unsure how to act on the different profiler outputs.
As an example: a CUDA code that calls a single kernel ~360 times in a for loop. Each time, the kernel performs about 1000 3D texture memory reads for each of 512^2 elements, with one thread allocated per element. Some arithmetic is needed to know which position to read in texture memory. The texture read is performed without interpolation, always at the exact data index. 3D texture memory was chosen because the memory reads will be relatively random, so memory coalescing is not expected. I can't find the reference for this, but I definitely read it on SO somewhere.
The description is short, but I hope it gives a small overview of what operations the kernel does (posting the whole kernel would probably be too much, but I can if required).
From now on, I will describe my interpretation of the profiler.
When profiling, if I run Examine GPU usage I get:
From here I see several things:
Low Memcopy/Compute overlap 0%. This is expected, as I run a big kernel, wait until it has finished and then memcopy. There should not be overlap.
Low Kernel Concurrency 0%. I just got 1 kernel, this is expected.
Low Memcopy Overlap 0%. Same thing. I only memcopy once at the beginning, and I memcopy once after each kernel. This is expected.
From the kernel execution "bars", top and right, I can see:
Most of the time is running kernels. There is little memory overhead.
All kernels take the same time (good)
The biggest flag is occupancy, always below 45%, with registers being the limiter. However, optimizing occupancy doesn't always seem to be a priority.
I follow my profiling by running Perform Kernel Analysis, getting:
I can see here that
Compute and memory utilization is low in the kernel. The profiler suggests that below 60% is no good.
Most of the time is in computing and L2 cache reading.
Something else?
I continue by Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.
The biggest 3 stall reasons seem to be
Memory dependency. Too many texture memreads? But I need this amount of memreads.
Execution dependency. "Can be reduced by increasing instruction level parallelism". Does this mean that I should try to change e.g. a=a+1;a=a*a;b=b+1;b=b*b; to a=a+1;b=b+1;a=a*a;b=b*b; (see the sketch after this list)?
Instruction fetch (????)
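To illustrate the execution-dependency item above, here is a purely illustrative device-code sketch (variable names are placeholders; in practice the compiler already reorders independent statements, so the real gain usually comes from giving it independent work, e.g. processing two elements per thread):

// Dependent chain: the second statement on each variable must wait for the first.
__device__ float dependent(float a, float b) {
    a = a + 1.0f;
    a = a * a;      // waits on the previous a
    b = b + 1.0f;
    b = b * b;      // waits on the previous b
    return a + b;
}

// Interleaved: the a- and b-chains are independent, so while one result is still
// in flight the warp scheduler can issue an instruction from the other chain.
__device__ float interleaved(float a, float b) {
    a = a + 1.0f;
    b = b + 1.0f;
    a = a * a;
    b = b * b;
    return a + b;
}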
Questions:
Are there more additional tests I can perform to understand better my kernels execution time limitations?
Is there a ways to profile in the instruction level inside the kernel?
Are there more conclusions one can obtain by looking at the profiling than the ones I do obtain?
If I were to start trying to optimize the kernel, where would I start?
Are there more additional tests I can perform to understand better my kernel's execution time limitations?
Of course! If you pay attention to the "Properties" window, your screenshot is telling you that your kernel 1. is limited by register usage (check it in the 'Kernel Latency' analysis), and 2. has low warp efficiency (less than 100% means thread divergence) (check it in the 'Divergent Execution' analysis).
Is there a ways to profile in the instruction level inside the kernel?
Yes, you have available two types of profiling:
'Kernel Profile - Instruction Execution'
'Kernel Profile - PC Sampling' (Only in Maxwell)
Are there more conclusions one can obtain by looking at the profiling than the ones I do obtain?
You should check if your kernel has some thread divergence. Also you should check that there is no problem with shared/global memory access patterns.
If I were to start trying to optimize the kernel, where would I start?
I find the Kernel Latency window the most useful one, but I suppose it depends on the type of kernel you are analyzing.

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have an Nvidia K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, only use 400 cores, for example). Is it possible? Is it logical to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. Not to say it couldn't be useful for something. For example, you may want to run multiple kernels on the same GPU and for some reason want to allocate some number of streaming multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think that for 99% of cases manual shared memory methods would be better).
How you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to have only 1 block per streaming multiprocessor (SM) and then return from each kernel based on which SMs you want that kernel to run on.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
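A hedged sketch of that approach (block-to-SM assignment is not a supported API, this assumes at least one resident block per SM, and the intrinsics are just inline-PTX reads of the special registers documented at the first link, in the spirit of the gist above):

#include <cstdio>

__device__ unsigned int smid() {       // current SM index (%smid)
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__device__ unsigned int nsmid() {      // number of SMs on the device (%nsmid)
    unsigned int n;
    asm volatile("mov.u32 %0, %%nsmid;" : "=r"(n));
    return n;
}

// Blocks that land on an SM outside the allowed range return immediately,
// so only the chosen SMs do any real work.
__global__ void restricted_kernel(unsigned int max_sm) {
    if (smid() >= max_sm) return;      // run only on SMs [0, max_sm)
    if (threadIdx.x == 0)
        printf("block %d active on SM %u of %u\n", blockIdx.x, smid(), nsmid());
    // ... real work for the allowed SMs would go here ...
}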
Not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but would like to learn about them.
As to question 2, I suppose sometimes this can be useful. When you have a complicated execution graph with many kernels, some of which can be executed in parallel, you want to load the GPU as fully and effectively as possible. But it seems that the GPU on its own can occupy all SMs with single blocks of one kernel: i.e. if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I saw such an effect. That kernel alone really will be faster (maybe 1.5x compared to 4 blocks of 256 threads per SM), but this will not be effective when you have other work.
The GPU can't know whether we are going to run another kernel after this 30-block one or not, i.e. whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist.
As to question 3, I suppose GPU profiling tools should show this: Visual Profiler and the newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary, @ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way. Let's say you made a game that runs some heavy kernels (say for graphics rendering). Then something unusual happens and you need to execute some not-so-heavy kernel as fast as possible. With preemption you can suspend the currently running kernels and execute this high-priority one, which improves its response time a lot.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least in the case of a stream of kernels where you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all of them while the first kernel is executing on the GPU. So I believe NVidia means that it runs several kernels in parallel and performs some smart load-balancing between SMs.
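For reference, a minimal hedged sketch of the stream-capture route to CUDA Graphs described in the quoted passage (kernel, sizes, and iteration counts are made up; the cudaGraphInstantiate signature shown is the pre-CUDA-12 one):

#include <cuda_runtime.h>

__global__ void step_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short sequence of launches into a graph...
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int s = 0; s < 3; ++s)
        step_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // ...then relaunch the whole sequence with a single call, so much of the
    // per-kernel CPU launch setup is paid once at capture/instantiation time.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}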

CUDA Visual Profiler v4.1, Possible to make Counter plot and to find detailed info about a kernel [duplicate]

I would like to know how to profile a __device__ function which is inside a __global__ function with Nsight 2.2 on Visual Studio 2010. I need to know which function is consuming a lot of resources and time. I have CUDA 5.0 on CC 2.0.
Nsight Visual Studio Edition 3.0 CUDA Profiler introduces source correlated experiments. The Profile CUDA Activity supports the following source level experiments:
Instruction Count - Collects instructions executed, thread instructions executed, active thread histogram, predicated thread histogram for every user instruction in the kernel. Information on syscalls (printf) is not collected.
Divergent Branch - Collects branch taken, branch not taken, and divergence count for flow control instructions.
Memory Transactions - Collects transaction counts, ideal transaction counter, and requested bytes for global, local, and shared memory instructions.
This information is collected per SASS instruction. If the kernel is compiled with -lineinfo (--generate-line-info) the information can be rolled up to PTX and high level source code. Since this data is rolled up from SASS some statistics may not be intuitive to the high level source. For example a branch statistic may show 100% not taken when you expected 100% taken. If you look at the SASS code you may see that the compiler reversed the conditional.
Please also note that on optimized builds the compiler is sometimes unable to maintain line table information.
At this time hardware performance counters and timing is only available at the kernel level.
Device code timing can be done using clock() and clock64() as mentioned in comments. This is a very advanced technique which requires both ability to understand SASS and interpret results with respect to the SM warp schedulers.
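A minimal hedged sketch of the clock()/clock64() technique mentioned above (the kernel body is a placeholder; clock64() reads a per-SM cycle counter, so only differences taken within one block are meaningful, and the caveats about warp scheduling still apply):

__global__ void timed_kernel(const float *in, float *out, long long *cycles, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    long long start = clock64();                 // SM cycle counter before the region
    if (i < n) out[i] = in[i] * in[i] + 1.0f;    // region being timed (placeholder work)
    long long stop = clock64();                  // and after it

    if (threadIdx.x == 0)
        cycles[blockIdx.x] = stop - start;       // one raw sample per block
}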

Parallelism in GPU - CUDA / OpenCL

I have a general question about parallelism in CUDA or OpenCL code on the GPU. I use an NVIDIA GTX 470.
I read briefly through the CUDA programming guide, but did not find related answers, hence I am asking here.
I have a top-level function which calls the CUDA kernel (for the same kernel I also have an OpenCL version of it). This top-level function itself is called 3 times in a 'for loop' from my main function, for 3 different data sets (image data R, G, B),
and the actual codelet also has processing over all the pixels in the image/frame, so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
So what I want to understand is: does this CUDA and C code create multiple threads for different functionality/functions in the codelet and the top-level code and execute them in parallel, exploiting task parallelism? If yes, who creates them, as there is no threading library explicitly included in the code or linked with it.
OR
It creates threads/tasks for different 'for loop' iterations which are independent and thus achieving data parallelism.
If it does this kind of parallelism, does it exploit this just by noting that different for loop iterations have no dependencies and hence can be scheduled in parallel?
Because I don't see any special compiler constructs/intrinsics (such as parallel for loops as in OpenMP) which tell the compiler/scheduler to schedule such for loops / functions in parallel?
Any reading material would help.
Parallelism on GPUs is SIMT (Single Instruction, Multiple Threads). For CUDA kernels, you specify a grid of blocks where every block has N threads. The CUDA library does all the work, and the CUDA compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver, and furthermore the thread scheduler on the GPU, how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top-level function (or host function) only executes the kernel call, which is asynchronous and returns immediately. No threading library is needed because nvcc creates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
OpenCL follows the same paradigm, but you compile your kernels (if they are not precompiled) at runtime. You specify the number of threads to execute the kernel and the library does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
Highly data parallel. Kernel is written to do a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
If you launch multiple kernels, they will not automatically be parallelized (i.e. there is no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to do it by hand - in Cuda the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, they will be executed in sequence (there is an OpenCL feature allowing queues to run out-of-order, but I don't know if any implementation supports it.) However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
If you have actual for loops in your kernels, they will not in themselves be parallelized, the parallelism comes from specifying a grid size, which will cause the kernel to be invoked in parallel for each element in that grid (so if you have for loops inside your kernel they will be executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (Cuda) or getGlobalId() (OpenCL) to identify which data item to process in that particular thread.
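As a small hedged illustration of that last point (the kernel and sizes are made up): the per-pixel loops disappear from the kernel and are replaced by the grid, with threadIdx/blockIdx selecting the element each thread works on:

#include <cuda_runtime.h>

// One thread per pixel: the grid replaces the outer 'for' loops of the CPU version.
__global__ void scale_pixels(const float *in, float *out, int width, int height, float gain) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = gain * in[y * width + x];
}

// Host-side launch: enough blocks to cover the whole image.
void launch_scale(const float *d_in, float *d_out, int width, int height, float gain) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale_pixels<<<grid, block>>>(d_in, d_out, width, height, gain);
}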
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.