Is it possible to kill a running CUDA kernel?

Let us say I have numerous CUDA kernels that I can ask the GPU to execute. I don't want to modify the kernel code in any way (for example, to include a trap).
Is there a way to kill such a running kernel?
I intend to auto-generate kernels (Genetic Programming). These kernels may well take a very long time to complete. If I can kill a kernel while it is running, I can maintain a timer on the host and kill the kernel when it expires.

cudaDeviceReset() will kill any running kernel(s).
It will also wipe out any allocations done on the device, so you will need to re-allocate any data areas if you intend to use them again.
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.
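As a rough illustration of that approach, a host-side watchdog might look like the sketch below (my_kernel, the launch configuration, the polling interval and the timeout are placeholders of my own): it launches the kernel asynchronously, polls a completion event, and resets the device if the time budget runs out.

```
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Placeholder kernel standing in for an auto-generated one.
__global__ void my_kernel() { /* ... */ }

bool run_with_timeout(double timeout_seconds)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    my_kernel<<<1, 1>>>();          // asynchronous launch
    cudaEventRecord(done);          // completion marker in the same stream

    auto start = std::chrono::steady_clock::now();
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        if (elapsed.count() > timeout_seconds) {
            cudaDeviceReset();      // kills the running kernel and wipes all device allocations
            return false;           // caller must re-allocate and re-upload any data
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    cudaEventDestroy(done);
    return true;
}
```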

Related

How do I detect the presence of nvprof inside a CUDA program?

I have a small CUDA program that I want to profile with nvprof. The problem is that I want to write the program in such a way that
When I run nvprof my_prog, it will invoke cudaProfilerStart and cudaProfilerStop.
When I run my_prog, it will not invoke either of the above APIs, and therefore avoids the profiling overhead.
The problem hence becomes how to make my code aware of the presence of nvprof when it runs, without an additional command-line argument.
Have you measured and verified that cudaProfilerStart/Stop calls introduce measurable overheads when nvprof is not attached? I highly doubt that this is the case.
If this is a problem, you can use #ifdef directives to exclude these calls from your release builds.
There is no way of detecting whether nvprof is running, since that would rather defeat the purpose of profiling: if the profiled application "senses" the profiler and changes its behavior, the measurements no longer reflect a normal run.
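For the #ifdef suggestion above, a minimal sketch might look like this (ENABLE_PROFILING is a hypothetical macro you would define yourself, e.g. via nvcc -DENABLE_PROFILING, only in builds intended to run under nvprof):

```
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

#ifdef ENABLE_PROFILING
#define PROFILE_START() cudaProfilerStart()
#define PROFILE_STOP()  cudaProfilerStop()
#else
#define PROFILE_START() ((void)0)   // compiles away entirely in normal builds
#define PROFILE_STOP()  ((void)0)
#endif

__global__ void work() { /* region of interest */ }

int main()
{
    PROFILE_START();                // no-op unless ENABLE_PROFILING is defined
    work<<<1, 1>>>();
    cudaDeviceSynchronize();
    PROFILE_STOP();
    return 0;
}
```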

What are the factors that affect CUDA kernel launch time?

I have a set of CUDA kernels. Each kernel completes its job in less than 10 microseconds; however, its launch time is 50-70 microseconds. I suspect the use of texture memory might be the reason, since my kernels use it.
Are there any recommendations to reduce the launch time of CUDA kernels? In general, what are the factors that affect the kernel launch time?
You can reduce the overall launch time by launching fewer kernels; e.g. if you launch several kernels in sequence, you could write a new single kernel that does all of that work in a single launch.
From the very little bit of context currently in the question, I suspect this is your problem; you are doing too little work per kernel.
(My next guess would be an error in the benchmark, i.e. the times are not measuring what you think they are.)
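A minimal sketch of the fusion idea, using two hypothetical elementwise kernels of my own as stand-ins: two tiny kernels launched back to back pay the launch overhead twice, while the fused version does the same work in one launch.

```
// Two tiny kernels launched in sequence: two launch overheads.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
__global__ void offset(float *x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Fused version: same work, one launch, one trip through the launch overhead.
__global__ void scale_offset(float *x, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * a + b;
}
```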

CUDA: is it possible for a kernel to return a break to the CPU?

I'm writing a C program using CUDA parallelization, and I was wondering if it is possible for a kernel to return a break to the CPU.
My program essentially does a for loop, and inside that loop I take several parallel actions; at the start of each iteration I have to check a variable (measuring the improvement of the iteration just done) which resides on the GPU.
My desire is that the check on that variable returns a break to the CPU in order to exit the for loop (I do the check using a trivial <<<1,1>>> kernel).
I've tried copying that variable back to the CPU and doing the check on the CPU but, as I feared, it slows down the overall execution.
Any advice?
There is presently no way for any running code on a CUDA-capable GPU to preempt running code on the host CPU. So although it isn't at all obvious what you are asking about, I'm fairly certain the answer is no, simply because there is no host-side preempt or interrupt mechanism available in device code.
There is no connection between CPU code and GPU code.
All you can do while working with CUDA is:
From the CPU side, allocate memory on the GPU
Copy data to the GPU
Launch execution of prewritten instructions on the GPU (the GPU is a black box to the CPU)
Read data back.
So, thinking about these steps in a loop, all that is left to you is to check the result on the CPU and break the loop if needed.
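As a rough sketch of that pattern (the kernel names, threshold and launch configuration are placeholders of my own), the host copies back a single flag after each iteration and breaks when it indicates convergence:

```
#include <cuda_runtime.h>

// Hypothetical placeholders: one kernel does an iteration's work, and a tiny
// <<<1,1>>> kernel writes 1 into *stop when the improvement is small enough.
__global__ void iteration_kernel(float *data, float *improvement, int n) { /* ... */ }
__global__ void check_kernel(const float *improvement, int *stop, float threshold)
{
    if (*improvement < threshold) *stop = 1;
}

void solve(float *d_data, float *d_improvement, int n, float threshold, int max_iters)
{
    int *d_stop;
    cudaMalloc((void **)&d_stop, sizeof(int));
    cudaMemset(d_stop, 0, sizeof(int));

    for (int it = 0; it < max_iters; ++it) {
        iteration_kernel<<<(n + 255) / 256, 256>>>(d_data, d_improvement, n);
        check_kernel<<<1, 1>>>(d_improvement, d_stop, threshold);

        int h_stop = 0;
        cudaMemcpy(&h_stop, d_stop, sizeof(int),
                   cudaMemcpyDeviceToHost);   // copy the flag back (synchronizes)
        if (h_stop) break;                    // the break itself happens on the CPU side
    }
    cudaFree(d_stop);
}
```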

Independent kernels not executing concurrently

I'm implementing a Radon-like transform in CUDA, but I can't seem to get the full performance out of my GeForce TITAN (EDIT: apparently I do; see comments). In order to optimize this, I thought of executing the kernels concurrently, since they require only minimal data transfers, but I can't manage to get kernels to execute at the same time.
A typical profile run looks like this:
This is with "concurrent kernel support" enabled, compiling and generating code for sm_35 using CUDA 5.5 (RC). Overlap is minimal, and hardly worth it.
I've read a bit about concurrent kernel execution, and tried different things to get it right:
Launch the kernel in different streams
Interleave kernel launches, e.g. first launch kernel A n times using n streams, then launch kernel B n times using the same n streams, etc (although this might not be necessary any more for Kepler; the hardware managed to partially overlap kernels even when launched non-interleaved)
Make sure that kernels don't use the same global memory (although I don't know whether that matters)
Make sure that kernels don't use too much shared memory (the rotation kernels don't use any)
I don't get why the rotation kernels don't overlap more. Am I resource-constrained, and if so, how can I find out? If I use more diverse kernels, the hardware manages to overlap a bit more, as in this profile, but I think it should do better...
EDIT: removed the 20% figure since I cannot reproduce it, and it seems to be wrong as well
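For reference, the interleaved multi-stream launch pattern described in the list above looks roughly like this sketch (kernelA/kernelB, the buffer size, the stream count and the launch configurations are placeholders, not the original code):

```
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { /* ... */ }   // placeholder kernels
__global__ void kernelB(float *x) { /* ... */ }

int main()
{
    const int n_streams = 4;
    cudaStream_t streams[n_streams];
    float *buf[n_streams];

    for (int i = 0; i < n_streams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&buf[i], 1 << 20);   // separate buffers so kernels don't share data
    }

    // Interleaved launches: all A's first, then all B's, each in its own stream,
    // so the hardware is free to overlap independent kernels.
    for (int i = 0; i < n_streams; ++i)
        kernelA<<<64, 256, 0, streams[i]>>>(buf[i]);
    for (int i = 0; i < n_streams; ++i)
        kernelB<<<64, 256, 0, streams[i]>>>(buf[i]);

    cudaDeviceSynchronize();
    for (int i = 0; i < n_streams; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(buf[i]);
    }
    return 0;
}
```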

CPU and GPU timers in the CUDA Visual Profiler

So there are two timers in the CUDA Visual Profiler:
GPU Time: the execution time for the method on the GPU.
CPU Time: the sum of GPU time and the CPU overhead to launch that method. At the driver-generated data level, CPU Time is only the CPU overhead to launch the method for non-blocking methods; for blocking methods it is the sum of GPU time and CPU overhead. All kernel launches are non-blocking by default, but if any profiler counters are enabled, kernel launches become blocking. Asynchronous memory copy requests in different streams are non-blocking.
If I have a real program, what's the actual execution time? When I measure the time there is a GPU timer and a CPU timer as well; what's the difference?
You're almost there. Now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no right answer to this, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you are trying to optimize computation, but they may not include things like waiting, which can actually be pretty important. You mention "the actual execution time"; that's a start. Do you mean the complete execution time of the problem, from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.
For numbers like that, on Unix-type systems I like to just measure the entire runtime of the program: /bin/time myprog; presumably there's a Windows equivalent. That's nice because it's completely unambiguous. On the other hand, because it's a total, it's far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to results.
If you want the elapsed time of some set of computations, CUDA has the very handy cudaEvent* functions, which can be placed at various parts of the code (see the CUDA Best Practices Guide, section 2.1.2, "Using CUDA GPU Timers"); you can put these before and after important bits of code and print the results.
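A minimal sketch of that cudaEvent pattern (the kernel and launch configuration are stand-ins of my own):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void important_bit() { /* code being timed */ }

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);          // marker before the work
    important_bit<<<64, 256>>>();
    cudaEventRecord(stop);           // marker after the work
    cudaEventSynchronize(stop);      // wait until the GPU reaches the stop marker

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // GPU-side elapsed time in milliseconds
    printf("elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```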
The GPU timer is based on events.
That means that when an event is created it is placed in a queue on the GPU for servicing, so there is a small overhead there too.
From what I have measured, though, the differences are of minor importance.