Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code? - cuda

Assume I have an Nvidia K40 and, for some reason, I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, use only 400 cores, for example). Is this possible? Is it even logical to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores the code is using, with a report like "Task Manager" in Windows or top in Linux?

It is possible, but the concept in a way goes against fundamental best practices for CUDA. That is not to say it couldn't be useful for something, for example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors (SMs) to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think that in 99% of cases manual shared memory methods would be better).
One way to do this is to read the PTX special registers %nsmid and %smid and put a conditional at the start of each kernel. You would need to have only 1 block per SM, and then have each block return early depending on which kernel you want on which SMs.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so the behavior would have to be measured before implementation and could be hardware and CUDA version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs:
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly:
https://gist.github.com/allanmac/4751080
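A minimal sketch of the idea, not a recommended pattern. It assumes that %smid stays constant for the lifetime of a block (the PTX documentation describes the register as volatile), that at least one block lands on an SM whose id is below the limit, and that SM ids below the limit actually exist on the device (the numbering is not guaranteed to be contiguous). The names `current_smid`, `scale_on_some_sms`, `run_limited` and `max_sm` are made up for illustration. Blocks on "forbidden" SMs exit at once, and the remaining blocks pull work from a global counter so that nothing is skipped.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid (id of the SM running this thread).
__device__ __forceinline__ unsigned int current_smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Blocks that land on an SM with id >= max_sm return immediately. The other
// blocks fetch tiles from a global counter until no tiles are left.
__global__ void scale_on_some_sms(float* data, int n, unsigned int max_sm,
                                  int* next_tile, int num_tiles) {
    if (current_smid() >= max_sm) return;  // assumed identical for the whole block

    __shared__ int tile;
    while (true) {
        if (threadIdx.x == 0) tile = atomicAdd(next_tile, 1);
        __syncthreads();
        int t = tile;
        __syncthreads();  // all threads have read `tile` before it is overwritten
        if (t >= num_tiles) break;
        int i = t * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }
}

// Hypothetical host helper: returns true if every element was doubled.
bool run_limited(unsigned int max_sm) {
    const int n = 1 << 20, threads = 256;
    const int num_tiles = (n + threads - 1) / threads;
    std::vector<float> h(n, 1.0f);

    float* d = nullptr;
    int* counter = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMalloc(&counter, sizeof(int));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(counter, 0, sizeof(int));

    int sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);

    // Launch several blocks per SM so that every SM is likely to receive one.
    scale_on_some_sms<<<4 * sms, threads>>>(d, n, max_sm, counter, num_tiles);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n && ok; ++i) ok = (h[i] == 2.0f);

    cudaFree(d);
    cudaFree(counter);
    return ok;
}

int main() {
    printf("restricted to SM ids below 4: %s\n", run_limited(4) ? "ok" : "failed");
    return 0;
}
```

Whether the restricted run really leaves the other SMs idle has to be confirmed with a profiler; the sketch only shows the mechanism.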

I don't think this works on the K40, but newer GPUs starting with Ampere (for example the A100) have the Multi-Instance GPU (MIG) feature for partitioning a GPU.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

I don't know of any such method, but I would like to learn of one.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can run in parallel, you want to load the GPU fully and as effectively as possible. But it seems that, left to itself, the GPU can occupy all SMs with single blocks of one kernel. That is, if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I have seen this effect. The kernel itself really will be faster this way (maybe 1.5x compared with four 256-thread blocks per SM), but it is not effective when you have other work to run.
The GPU can't know whether or not we are going to run another kernel after this 30-block one, so it can't know whether spreading it across all SMs is the more effective choice. So some manual way to tell it should exist.
As to question 3, I suppose the GPU profiling tools should show this: Visual Profiler and the newer Parallel Nsight and Nsight Compute. But I haven't tried. This will not be a Task Manager, but rather statistics for the kernels that your program executed.
As to the possibility of moving thread blocks between SMs when necessary,
@ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
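Short of a profiler, one rough way to see how a grid was spread over the SMs is to have every block record the SM it ran on. This is only a sketch: it reads the %smid special register through inline PTX, and the names `record_smid` and `count_sms_used` are hypothetical. It reports which SMs received at least one block, not how busy their cores were.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid (id of the SM running this thread).
__device__ __forceinline__ unsigned int current_smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Thread 0 of every block stores the id of the SM the block is running on.
__global__ void record_smid(unsigned int* smid_of_block) {
    if (threadIdx.x == 0) smid_of_block[blockIdx.x] = current_smid();
}

// Hypothetical helper: number of distinct SMs that received a block.
int count_sms_used(int blocks, int threads) {
    unsigned int* d = nullptr;
    cudaMalloc(&d, blocks * sizeof(unsigned int));
    record_smid<<<blocks, threads>>>(d);

    std::vector<unsigned int> h(blocks);
    // A blocking cudaMemcpy on the default stream also waits for the kernel.
    cudaMemcpy(h.data(), d, blocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    std::set<unsigned int> distinct(h.begin(), h.end());
    return static_cast<int>(distinct.size());
}

int main() {
    int sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    printf("device has %d SMs\n", sms);
    printf("grid of %d blocks touched %d SMs\n", sms, count_sms_used(sms, 256));
    return 0;
}
```

Running it with a grid as large as the SM count is a quick check of the "30 blocks on 30 SMs" behavior described above.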
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
That said, starting from some architecture there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way. Let's say you made a game that runs some heavy kernels (say, for graphics rendering), and then something unusual happens: you need to execute some not-so-heavy kernel as fast as possible. With preemption you can somehow unload the running kernels and execute the high-priority one. This gets the high-priority kernel finished much sooner.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least for a stream of kernels where you don't wait for results in between). If you launch several kernels, it seems possible to send all the necessary data for all of them while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load balancing between SMs.

Related

Getting total execution time of all kernels on a CUDA stream

I know how to time the execution of one CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often made up of a series of kernels (CUB::DeviceRadixSort algorithms, for example, launch many kernels to get the job done). If you're running your algorithm on a system with a lot of other streams and kernels also in flight, it's not uncommon for the gaps between individual kernel launches to be highly variable based on what other work gets scheduled in-between launches on your stream. If I'm trying to make my algorithm work faster, I don't care so much about how long it spends sitting waiting for resources. I care about the time it spends actually executing.
So the question is, is there some way to do something like the event API and insert a marker in the stream before the first kernel launches, and read it back after your last kernel launches, and have it tell you the actual amount of time spent executing on the stream, rather than the total end-to-end wall-clock time? Maybe something in CUPTI can do this?
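For reference, the event-based measurement that the question starts from can be sketched as below. It brackets a series of launches on one stream with two events, so the result is the elapsed time between the two events on that stream, including any gaps between the kernels. The kernel `busy` and the helper `time_stream_ms` are placeholders invented for this sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one step of a multi-kernel algorithm.
__global__ void busy(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 1.0f;
}

// Time `launches` back-to-back kernels on stream `s` using a pair of events.
float time_stream_ms(cudaStream_t s, float* d, int n, int launches) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, s);                  // marker before the first kernel
    for (int k = 0; k < launches; ++k)
        busy<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaEventRecord(stop, s);                   // marker after the last kernel

    cudaEventSynchronize(stop);                 // wait until `stop` has been reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);
    printf("10 kernels, start to stop: %.3f ms\n", time_stream_ms(s, d, n, 10));

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```

Getting pure execution time without the gaps would mean summing per-kernel durations, which is what a profiler timeline reports.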
You can use Nsight Systems or Nsight Compute.
(https://developer.nvidia.com/tools-overview)
In Nsight Systems, you can profile the timeline of each stream. You can also use Nsight Compute to profile each CUDA kernel in detail. I guess Nsight Compute is the better choice, because you can inspect various GPU performance metrics and get hints about how to optimize the kernel.

Jetson TK1 Multiple Streams parallel execution

Considering that the TK1 has a single SM, is it really possible to run streams concurrently? I have been unable to do so, even with the latest versions of the CUDA libraries.
So is it really possible? Any sample code would be great. The sample code under cuBLAS also runs sequentially, as shown in the Visual Profiler.
Also, some better insight into what streams are good for on a single SM would be welcome.
[Already asked on the NVIDIA dev forum; the forum isn't very active, I think.]
With a single Kepler SM, it is not possible to run several streams concurrently. Kepler does not support preemption. This is not related to the CUDA version but to the capability of the SM. Something related to preemption was discussed for Pascal at GTC 2016, but nothing before that.
Regarding the actual use of streams with a single SM, some async functions may behave slightly differently between stream 0 and other streams. Hence, I assume some corner cases of async memcopy and execution might benefit from streams on a single SM, since the TK1 device query reports concurrent copy and execution with 1 copy engine (even though ZeroCopy might be a better approach on the TK1).
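The copy/execute case mentioned above can be sketched as follows: the data is split into chunks, and each chunk gets its own stream for the upload, the kernel and the download, so that a copy for one chunk may overlap with the kernel of another. This is an illustration under assumptions, not something measured on a TK1: the helper `run_chunked` and the kernel `add_one` are made up, the element count is assumed to be divisible by the number of streams, and host memory is pinned because asynchronous copies need that to overlap.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical helper: process the array in `num_streams` chunks, one stream
// per chunk. Assumes n is divisible by num_streams. Returns true on success.
bool run_chunked(int num_streams) {
    const int n = 1 << 22;
    const int chunk = n / num_streams;

    float* h = nullptr;
    float* d = nullptr;
    cudaMallocHost(&h, n * sizeof(float));  // pinned host memory
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(num_streams);
    for (auto& s : streams) cudaStreamCreate(&s);

    for (int k = 0; k < num_streams; ++k) {
        const int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[k]);
        add_one<<<(chunk + 255) / 256, 256, 0, streams[k]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[k]);
    }
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; i < n && ok; ++i) ok = (h[i] == 2.0f);

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    printf("chunked copy and compute on 4 streams: %s\n",
           run_chunked(4) ? "ok" : "failed");
    return 0;
}
```

Whether the copies and kernels actually overlap on a given device is best checked on the profiler timeline.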

Interpreting NVIDIA Visual Profiler outputs

I have recently started playing with the NVIDIA Visual Profiler (CUDA 7.5) to time my applications.
However, I don't seem to fully understand the implications of the outputs I get. I don't know how to act on the different profiler outputs.
As an example: a CUDA code that calls a single kernel ~360 times in a for loop. Each time, the kernel computes 512^2 times about 1000 3D texture memory reads. One thread is allocated per unit of the 512^2. Some arithmetic is needed to know which position to read in texture memory. The texture read is performed without interpolation, always at the exact data index. 3D texture memory was chosen because the memory reads will be relatively random, so memory coalescing is not expected. I can't find the reference for this, but I definitely read it on SO somewhere.
The description is short, but I hope it gives a small overview of what the kernel does (posting the whole kernel would probably be too much, but I can if required).
From now on, I will describe my interpretation of the profiler.
When profiling, if I run Examine GPU usage I get:
From here I see several things:
Low Memcopy/Compute overlap 0%. This is expected, as I run a big kernel, wait until it has finished and then memcopy. There should not be overlap.
Low Kernel Concurrency 0%. I just got 1 kernel, this is expected.
Low Memcopy Overlap 0%. Same thing. I only memcopy once at the beginning, and I memcopy once after each kernel. This is expected.
From the kernel executions "bars", top and right I can see:
Most of the time is running kernels. There is little memory overhead.
All kernels take the same time (good)
The biggest flag is occupancy, which is always below 45%, with registers being the limiter. However, optimizing occupancy doesn't always seem to be a priority.
I follow my profiling by running Perform Kernel Analysis, getting:
I can see here that
Compute and memory utilization is low in the kernel. The profiler suggests that below 60% is no good.
Most of the time is in computing and L2 cache reading.
Something else?
I continue with Perform Latency Analysis, as the profiler suggests that the biggest bottleneck is there.
The biggest 3 stall reasons seem to be
Memory dependency. Too many texture memreads? But I need this amount of memreads.
Execution dependency. "can be reduced by increasing instruction level parallelism". Does this mean that I should try to change e.g. a=a+1;a=a*a;b=b+1;b=b*b; to a=a+1;b=b+1;a=a*a;b=b*b;?
Instruction fetch (????)
Questions:
Are there additional tests I can perform to better understand my kernel's execution time limitations?
Is there a way to profile at the instruction level inside the kernel?
Are there more conclusions one can obtain by looking at the profiling than the ones I do obtain?
If I were to start trying to optimize the kernel, where would I start?
Are there additional tests I can perform to better understand my
kernel's execution time limitations?
Of course! Pay attention to the "Properties" window. Your screenshot is telling you that your kernel (1) is limited by register usage (check it in the 'Kernel Latency' analysis) and (2) has low warp efficiency (less than 100% means thread divergence; check it in 'Divergent Execution').
Is there a way to profile at the instruction level inside the kernel?
Yes, you have available two types of profiling:
'Kernel Profile - Instruction Execution'
'Kernel Profile - PC Sampling' (Only in Maxwell)
Are there more conclusions one can obtain by looking at the profiling
than the ones I do obtain?
You should check if your kernel has some thread divergence. Also you should check that there is no problem with shared/global memory access patterns.
If I were to start trying to optimize the kernel, where would I start?
I find the Kernel Latency window the most useful one, but I suppose it depends on the type of kernel you are analyzing.

Independent kernel not executing concurrently

I'm implementing a Radon-like transform in CUDA, but I can't seem to get all performance out of my GeForce TITAN (EDIT: apparently I do, see comments). In order to optimize this, I thought of executing the kernels concurrently since they require only minimal data transfers, but I can't manage to get kernels to execute at the same time.
A typical profile run looks like this:
This is with "concurrent kernel support" enabled, compiling and generating code for sm_35 using CUDA 5.5 (RC). Overlap is minimal, and hardly worth it.
I've read a bit about concurrent kernel execution, and tried different things to get it right:
Launch the kernel in different streams
Interleave kernel launches, e.g. first launch kernel A n times using n streams, then launch kernel B n times using the same n streams, etc (although this might not be necessary any more for Kepler; the hardware managed to partially overlap kernels even when launched non-interleaved)
Make sure that kernels don't use the same global memory (although I don't know whether that matters)
Make sure that kernels don't use too much shared memory (the rotation kernels don't use any)
I don't get why the rotation kernels don't overlap more. Am I resource constrained, and if so, how can I find this out? If I use more diverse kernels it manages to parallelize a bit more, for example in this one,
but I think it should do better...
EDIT: removed the 20% figure since I cannot reproduce it, and it seems to be wrong as well
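A small self-contained experiment for checking whether kernels in different streams overlap at all on a given device is sketched below. It launches a tiny kernel that only spins for a fixed number of clock cycles, once with every launch in the same stream and once with one stream per launch, and prints both timings. The names `spin` and `elapsed_ms` are invented here, and the timing relies on the legacy default stream synchronizing with streams made by cudaStreamCreate, so it assumes the program is not built with per-thread default streams. Real kernels that already fill the GPU leave no room for a second kernel, so little overlap is then expected.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread that busy-waits for roughly `cycles` clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launch `launches` spin kernels, either all in the default stream or one per
// stream, and return the elapsed time in milliseconds.
float elapsed_ms(int launches, bool use_streams) {
    const long long cycles = 10000000;  // arbitrary amount of work per kernel

    std::vector<cudaStream_t> streams(launches);
    for (auto& s : streams) cudaStreamCreate(&s);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events go into the legacy default stream, which waits for, and is waited
    // on by, the blocking streams created above.
    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i) {
        if (use_streams) spin<<<1, 1, 0, streams[i]>>>(cycles);
        else             spin<<<1, 1>>>(cycles);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (auto& s : streams) cudaStreamDestroy(s);
    return ms;
}

int main() {
    printf("8 kernels, one stream : %.2f ms\n", elapsed_ms(8, false));
    printf("8 kernels, 8 streams  : %.2f ms\n", elapsed_ms(8, true));
    return 0;
}
```

If the second number is close to the first divided by the number of launches, the device is running the kernels concurrently; if the two are about equal, something is serializing them.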

Concurrent GPU kernel execution from multiple processes

I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context, targeting the same GPU. According to the Fermi white paper[1], application-level context switching takes less than 25 microseconds, but the launches are effectively serialized as they launch on the GPU, so Fermi wouldn't work well for this. According to the Kepler white paper[2], there is something called Hyper-Q that allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.
My questions: Has anyone tried this on a Kepler GPU and verified that its kernels are run concurrently when scheduled from distinct processes? Is this just a CUDA feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
In response to the first question, NVIDIA has published some hyper-Q results in a blog here. The blog is pointing out that the developers who were porting CP2K were able to get to accelerated results more quickly because hyper-Q allowed them to use the application's MPI structure more or less as-is and run multiple ranks on a single GPU, and get higher effective GPU utilization that way. As mentioned in the comments, this (hyper-Q) feature is only available on K20 processors currently, as it is dependent on the GK110 GPU.
I've run simultaneous kernels on the Fermi architecture. It works wonderfully and, in fact, is often the only way to get high occupancy from your hardware. I used OpenCL, and you need to run a separate command queue from a separate CPU thread in order to do this. Note that the ability to dispatch new data-parallel kernels from within another kernel is Dynamic Parallelism, not Hyper-Q; Hyper-Q is the set of hardware work queues described in the question. Both were introduced with Kepler (GK110).