Timing of cuda kernel from inside of kernel? - cuda

background:
I have a kernel that I measure with Windows QPC (264-nanosecond tick rate) at 4 ms. But I am in a friendly dispute with a colleague running my kernel who claims it takes 15 ms+ (we are both doing this after warm-up on a Tesla K40). I suspect his issue is with a custom RHEL, custom CUDA drivers, and his "real time" thread groups, but I am not a Linux expert. I know Windows clocks are less than perfect, but this is too big a discrepancy. (Besides, all our timings of the other kernels I wrote agree with his timings; it is only the first kernel in the chain whose time disagrees.) Smells to me of something outside the kernel.
question:
Anyway, is there a way with CUDA device events (elapsed time) to add to the CUDA kernel to measure the ENTIRE kernel time, from when the first block starts to the end of the last block? I think this would get us started in figuring out where the problem is. From my reading, it looks like CUDA device events are recorded on the host, and I am looking for something internal to the GPU.

The only way to time execution from entirely within a kernel is to use the clock() and clock64() functions that are covered in the programming guide.
Since these functions sample a per-multiprocessor counter, and AFAIK there is no specified relationship between these counters from one SM to the next, there is no way, using these functions alone, to determine which thread/warp/block is "first" to execute and which is "last" to execute, assuming your GPU has more than 1 SM. (Even if there were a specified relationship, such as "they are all guaranteed to be the same value on any given cycle", you would still need additional scaffolding, as mentioned below.)
While you could certainly create some additional scaffolding in your code to try to come up with an overall execution time (perhaps adding atomics to figure out which thread/warp/block is first and last), there may still be functional gaps in the method. Given the difficulty, it seems that the best method, based on what you've described, is simply to use the profilers as discussed by @njuffa in the comments. Any of the profilers can provide you with the execution time of a kernel, on any supported platform, with a trivial set of commands.
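For illustration, here is a minimal sketch of what such scaffolding might look like, assuming compute capability 3.5+ (e.g. a K40) for the 64-bit atomicMin/atomicMax, and bearing in mind the caveat above that clock64() counters are per-SM and not synchronized, so the result is only an approximation. The kernel body and sizes are hypothetical placeholders:

    #include <cstdio>

    // Device-global timestamps: minimum block start and maximum block end.
    __device__ unsigned long long d_start = ~0ULL;
    __device__ unsigned long long d_end   = 0ULL;

    __global__ void timedKernel(float *data, int n)
    {
        if (threadIdx.x == 0)
            atomicMin(&d_start, (unsigned long long)clock64());

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;               // stand-in for the real work

        __syncthreads();                   // whole block done before stamping the end
        if (threadIdx.x == 0)
            atomicMax(&d_end, (unsigned long long)clock64());
    }

    int main()
    {
        int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        timedKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();

        unsigned long long h_start, h_end;
        cudaMemcpyFromSymbol(&h_start, d_start, sizeof(h_start));
        cudaMemcpyFromSymbol(&h_end, d_end, sizeof(h_end));

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // clockRate is in kHz, so cycles / clockRate gives milliseconds (approximate).
        printf("approx. kernel span: %.3f ms\n",
               (h_end - h_start) / (double)prop.clockRate);
        cudaFree(d_data);
        return 0;
    }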

Related

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have an Nvidia K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, use only 400 cores, for example). Is it possible? Is it even logical to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. That is not to say it couldn't be useful for something. For example, you might want to run multiple kernels on the same GPU and for some reason allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think that for 99% of cases manual shared memory methods would be better).
The way you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional at the beginning of the launched kernels. You would have to have only 1 block per Streaming Multiprocessor (SM) and then have each kernel's blocks return early based on which kernel you want on which SMs.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware- and CUDA-version dependent. However, since you asked, and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly...
https://gist.github.com/allanmac/4751080
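As a rough illustration of that pattern (a minimal sketch only; the SM whitelist below is hypothetical, and this behavior is not guaranteed by the programming model):

    // Read the SM id of the current block via inline PTX.
    __device__ __forceinline__ unsigned int get_smid()
    {
        unsigned int smid;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        return smid;
    }

    __global__ void restrictedKernel(float *out, int n)
    {
        // Hypothetical policy: only do work on SMs 0..3; blocks that land
        // elsewhere simply exit.
        if (get_smid() >= 4)
            return;

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = (float)i;             // placeholder work
    }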
MIG is not available on the K40, but for newer Ampere (and later) GPUs there is the Multi-Instance GPU (MIG) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but I would like to learn about them.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs with many kernels, some of which can be executed in parallel, you want to load the GPU fully and most effectively. But it seems that, on its own, the GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I saw such an effect. Really this kernel will be faster (maybe 1.5x versus 4 blocks of 256 threads per SM), but it will not be effective when you have other work.
The GPU can't know whether we are going to run another kernel after this 30-block one or not, i.e. whether it would be more effective to spread it over all SMs or not. So some manual way to indicate this should exist.
As to question 3, I suppose GPU profiling tools should show this: the Visual Profiler and the newer Nsight Systems and Nsight Compute. But I haven't tried them for this. This will not be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary, @ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way. Let's say you made a game that runs some heavy kernels (say for graphics rendering). Then something unusual happens and you need to execute some not-so-heavy kernel as fast as possible. With preemption you can somehow suspend the running kernels and execute this high-priority one first, which greatly reduces the time until its result is available.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least in the case of a stream of kernels where you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all kernels while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load balancing between SMs.
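For reference, a minimal sketch of how a small kernel sequence could be captured into a graph and replayed (stream-capture style, CUDA 10+; the kernels, sizes, and iteration count are hypothetical placeholders):

    __global__ void kernelA(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
    __global__ void kernelB(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

    int main()
    {
        int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        dim3 block(256), grid((n + 255) / 256);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture the two-kernel workflow once...
        cudaGraph_t graph;
        cudaGraphExec_t graphExec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        kernelA<<<grid, block, 0, stream>>>(d_data, n);
        kernelB<<<grid, block, 0, stream>>>(d_data, n);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

        // ...then launch it repeatedly with low CPU overhead.
        for (int i = 0; i < 100; ++i)
            cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaFree(d_data);
        return 0;
    }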

Independent kernel not executing concurrently

I'm implementing a Radon-like transform in CUDA, but I can't seem to get full performance out of my GeForce TITAN (EDIT: apparently I do, see comments). In order to optimize this, I thought of executing the kernels concurrently, since they require only minimal data transfers, but I can't manage to get the kernels to execute at the same time.
A typical profile run looks like this:
This is with "concurrent kernel support" enabled, compiling and generating code for sm_35 using CUDA 5.5 (RC). Overlap is minimal, and hardly worth it.
I've read a bit about concurrent kernel execution, and tried different things to get it right:
Launch the kernels in different streams (see the sketch after this list)
Interleave kernel launches, e.g. first launch kernel A n times using n streams, then launch kernel B n times using the same n streams, etc. (although this might not be necessary any more for Kepler; the hardware managed to partially overlap kernels even when launched non-interleaved)
Make sure that the kernels don't use the same global memory (although I don't know whether that matters)
Make sure that the kernels don't use too much shared memory (the rotation kernels don't use any)
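For concreteness, the multi-stream, breadth-first launch pattern looks roughly like this (a minimal sketch; kernelA/kernelB, the buffers, and the launch configuration are hypothetical placeholders, not my actual transform kernels):

    __global__ void kernelA(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
    __global__ void kernelB(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }

    int main()
    {
        const int NSTREAMS = 4, n = 1 << 20;
        cudaStream_t streams[NSTREAMS];
        float *d_buf[NSTREAMS];

        for (int i = 0; i < NSTREAMS; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&d_buf[i], n * sizeof(float));
        }

        dim3 block(256), grid((n + 255) / 256);

        // Breadth-first: all A launches first, then all B launches, each
        // stream getting its own independent data, so overlapping work is
        // available to the hardware.
        for (int i = 0; i < NSTREAMS; ++i)
            kernelA<<<grid, block, 0, streams[i]>>>(d_buf[i], n);
        for (int i = 0; i < NSTREAMS; ++i)
            kernelB<<<grid, block, 0, streams[i]>>>(d_buf[i], n);

        for (int i = 0; i < NSTREAMS; ++i)
            cudaStreamSynchronize(streams[i]);
        return 0;
    }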
I don't get why the rotation kernels don't overlap more. Am I resource constrained, and if so, how can I find this out? If I use more diverse kernels it manages to parallelize a bit more, for example in this one,
but I think it should do better...
EDIT: removed the 20% figure since I cannot reproduce it, and it seems to be wrong as well

CUDA contexts, streams, and events on multiple GPUs

TL;DR version: "What's the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?" with a side of "I can't have been the first person to ask this; anything I should read up on?"
Full version:
I would like to know the best way to design context, etc. handling in an application that uses CUDA on a system with multiple GPUs. I've been trying to find literature that talks about guidelines for when context reuse vs. recreation is appropriate, but so far haven't found anything that outlines best practices, rules of thumb, etc.
The general overview of what we're needing to do is:
Requests come in to a central process.
That process forks to handle a single request.
Data is loaded from the DB (relatively expensive).
Then the following is repeated an arbitrary number of times based on the request (dozens):
A few quick kernel calls to compute data that is needed for later kernels.
One slow kernel call (10 sec).
Finally:
Results from the kernel calls are collected and processed on the CPU, then stored.
At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that's not huge, it is precluding us from moving other quicker tasks to the GPU.
I am trying to figure out the best way to manage contexts, etc. so that we can use the machine efficiently. I think that in the single-gpu case, it's relatively simple:
Create a context before starting any of the GPU work.
Launch the kernels for the first set of data.
Record an event for after the final kernel call in the series.
Prepare the second set of data on the CPU while the first is computing on the GPU.
Launch the second set, repeat.
Ensure that each event gets synchronized before collecting the results and storing them.
That seems like it should do the trick, assuming proper use of overlapped memory copies.
However, I'm unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.
The host program is Python 2.7, using PyCUDA to access the GPU. Currently it's not multi-threaded, and while I'd rather keep it that way ("now you have two problems" etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it's time to block on data, but for our needs efficient use of the hardware is more important. Since we'll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn't using it is important.
I don't think that we have any explicit reason to use Exclusive compute modes (ie. we're not filling up the memory of the card with one work item), so I don't think that solutions that involve long-standing contexts are off the table.
Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!
Caveat: I'm not a PyCUDA user (yet).
With CUDA 4.0+ you don't even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).
If you need to synchronize between GPUs, you will need to potentially create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event inserted in another stream to do sophisticated dependencies.
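A minimal sketch of that round-robin pattern in CUDA runtime API terms (PyCUDA has equivalents; the kernel, buffer sizes, and item count below are hypothetical placeholders):

    __global__ void slowKernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = sqrtf(d[i]) + 1.0f;     // stand-in for the 10 sec kernel
    }

    int main()
    {
        const int NGPU = 2, NITEMS = 24, n = 1 << 20;
        size_t bytes = n * sizeof(float);

        cudaStream_t streams[NGPU];
        float *d_buf[NGPU], *h_buf[NGPU];

        for (int g = 0; g < NGPU; ++g) {
            cudaSetDevice(g);                     // no explicit context management needed
            cudaStreamCreate(&streams[g]);
            cudaMalloc(&d_buf[g], bytes);
            cudaMallocHost(&h_buf[g], bytes);     // pinned memory for async copies
        }

        dim3 block(256), grid((n + 255) / 256);

        for (int item = 0; item < NITEMS; ++item) {
            int g = item % NGPU;                  // round-robin the items over the GPUs
            cudaSetDevice(g);
            cudaMemcpyAsync(d_buf[g], h_buf[g], bytes, cudaMemcpyHostToDevice, streams[g]);
            slowKernel<<<grid, block, 0, streams[g]>>>(d_buf[g], n);
            cudaMemcpyAsync(h_buf[g], d_buf[g], bytes, cudaMemcpyDeviceToHost, streams[g]);
            // The CPU is free here to prepare the next item while the GPUs work.
        }

        for (int g = 0; g < NGPU; ++g) {
            cudaSetDevice(g);
            cudaStreamSynchronize(streams[g]);
        }
        return 0;
    }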
So I suspect the answer today is quite a lot simpler than talonmies' excellent pre-CUDA-4.0 answer.
You might also find this answer useful.
(Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies' answer is still relevant.

CPU and GPU timer in cuda visual profiler

So there are 2 timers in the CUDA Visual Profiler:
GPU Time: the execution time for the method on the GPU.
CPU Time: the sum of GPU time and the CPU overhead to launch that method. At the driver-generated data level, CPU Time is only the CPU overhead to launch the method for non-blocking methods; for blocking methods it is the sum of GPU time and CPU overhead. All kernel launches are non-blocking by default, but if any profiler counters are enabled, kernel launches become blocking. Asynchronous memory copy requests in different streams are non-blocking.
If I have a real program, what's the actual execution time? I measure the time, and there is a GPU timer and a CPU timer as well; what's the difference?
You're almost there -- now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no right answer to this, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you are trying to optimize computation, but they may not include things like waiting that actually can be pretty important. You mention “the actual execution time” — that's a start. Do you mean the complete execution time of the problem — from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.
For numbers like that, in Unix-type systems I like to just measure the entire runtime of the program: /bin/time myprog; presumably there's a Windows equivalent. That's nice because it's completely unambiguous. On the other hand, because it's a total, it's far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to results.
If you want the elapsed time of some set of computations, CUDA has very handy cudaEvent* functions which can be placed at various parts of the code — see the CUDA Best Practices Guide, section 2.1.2, Using CUDA GPU Timers — these you can put before and after important bits of code and print the results.
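Roughly like this (a minimal sketch; myKernel, its arguments, and the launch configuration are hypothetical placeholders for the important bit of code you want to time):

    // Time a region of GPU work with CUDA events; the result is in milliseconds.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<grid, block>>>(d_data, n);   // the work being measured
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);             // wait until the stop event has occurred
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);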
The GPU timer is based on events. That means that when an event is created it will be put in a queue on the GPU for servicing, so there is a small overhead there too. From what I have measured, though, the differences are of minor importance.

CUDA: Stop all other threads

I have a problem that is seemingly solvable by enumerating all possible solutions and then finding the best. In order to do so, I devised a backtracking algorithm that enumerates and stores the best solution if found. It works fine so far.
Now, I wanted to port this algorithm to CUDA. Therefore, I created a procedure that generates some distinct basic cases. These basic cases should be processed in parallel on the GPU. If one of the CUDA-threads finds an optimal solution, all the other threads can - of course - stop their work.
So, I want something like the following: the thread that finds the optimal solution should stop all running CUDA threads of my program, thus finishing the calculation.
After some quick searching, I found that threads can only communicate if they are in the same block. (So I suppose it's impossible to stop other blocks' threads.)
The only method I could think of is to have a dedicated flag optimum_found, which is checked at the beginning of every kernel. If an optimal solution is found, this flag is set to 1, so all future threads know that they do not have to work. But of course, threads that are already running do not notice this flag if they do not check it at every iteration.
So, is there a possibility to stop all remaining CUDA-threads?
I think that your method of having a dedicated flag could work provided that it was a memory location in global memory. That way you can check this, as you said, at the beginning of each kernel call.
Kernel calls should generally be relatively short anyway, so letting the other threads in a batch finish even though an optimal solution was found by one of those threads shouldn't affect your performance too much.
That said, I am fairly sure there is no CUDA call that can kill off other actively executing threads.
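A minimal sketch of that global-flag approach (the "search" here is a dummy stand-in; in the real code each thread would work through its own basic case):

    #include <cstdio>

    // Flag in global memory that every thread can poll.
    __device__ volatile int optimum_found = 0;

    __global__ void searchKernel(long long target, int iters)
    {
        long long idx = blockIdx.x * blockDim.x + threadIdx.x;

        for (int step = 0; step < iters; ++step) {
            if (optimum_found)              // poll the flag and bail out early
                return;

            // Dummy "search": pretend this thread found the optimum on this step.
            if (idx * iters + step == target) {
                optimum_found = 1;          // signal all other threads to stop
                __threadfence();            // make the write visible device-wide
            }
        }
    }

    int main()
    {
        searchKernel<<<64, 256>>>(1234567LL, 1000);
        cudaDeviceSynchronize();
        return 0;
    }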
I think Ian has the right idea here. Optimum performance would come from minimal memory transfers and branching. Writing to global memory and checking flags (branching) goes against the CUDA best practices guide and will reduce your speedup.
You might want to look at callbacks. The main CPU thread can make sure all threads run in the right order. CPU callback threads (read: post-processing) can handle additional overhead and call the related API functions, as well as dispose of all of the sub-thread data. This feature can be found in the CUDA samples and compiles on CUDA compute capability 2.x. Hope this helps.