Kernels not running concurrently in CUDA - cuda

I have a kernel that runs on my GPU (GeForce 690) and uses a single block. It runs in about 160 microseconds. My plan is to launch 8 of these kernels separately, each of which only uses a single block, so each would run on a separate SM, and then they'd all run concurrently, hopefully in about 160 microseconds.
However, when I do that, the total time increases linearly with each kernel: 320 microseconds if I run 2 kernels, about 490 microseconds for 3 kernels, etc.
My question: Do I need to set any flag somewhere to get these kernels to run concurrently? Or do I have to do something that isn't obvious?

As #JackOLantern indicated concurrent kernels require the usage of streams, which are required for all forms of asynchronous activity scheduling on the GPU. It also requires a GPU of compute capability 2.0 or greater, generally speaking. If you do not use streams in your application, all cuda API and kernel calls will be executed sequentially, in the order in which they were issued in the code, with no overlap from one call/kernel to the next.
Rather than give a complete tutorial here, please review the concurrent kernels cuda sample that JackOlantern referenced.
Also note that actually witnessing concurrent execution can be more difficult on windows, for a variety of reasons. If you run the concurrent kernels sample, it will indicate pretty quickly if the environment you are in (OS, driver, etc.) is providing concurrent execution.

Related

Profile concurrent CUDA kernels

I am interested in getting memory performance counter of concurrent cuda kernels. I tried to use several nvprof options like, --metrics all and --print-gpu-trace. The output seems to indicate that kernels are not concurrent any more. And concurrent performance metrics of each kernel look almost exactly the same as those running each kernel alone. I think that these concurrent kernels ran in sequence. How could I get memory performance metrics counter of concurrent kernels, for example L2 cache?
You cannot do per-kernel profiling while having the kernels execute concurrently. You can however try the following workarounds:
Do only tracing. If you don't specify --metrics or --events, nvprof will only do a tracing run. In this case, nvprof will run the kernels concurrently, but you will only get kernel timings - not metric/event data.
If you own an NVIDIA Tesla GPU (as opposed to GeForce or Quadro), you can use the CUPTI library's cuptiSetEventCollectionMode(CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS) API to sample the metrics you want while the kernels are running concurrently. However, this will only allow you to get the aggregate metric/event data in that sampling interval - which means that you will not be able to correlate this data to individual kernels. CUPTI ships with a code sample called event_sampling, that demonstrates how to use this API.
Profile the metrics/events you want, and let the kernels serialize. For some metrics/events, you may be able to simply sum up the values to estimate behavior during concurrent execution.

Concurrent Kernels with CUSPARSE Library

I would like to ask you a question about the concurrent kernel execution in Nvidia GPUs. I explain us my situation. I have an code which launchs 1 sparse matrix multiplication for 2 different matrix (one for each one). These matrix multiplications are performed with the cuSPARSE Library. I want both operations can be concurrently performed, so I use 2 streams to launch them. With Nvidia Visual profiler, I´ve observed that both operations (cuSPARSE kernels) are completely overlaped. The time stamps for both kernels are:
Kernel 1) Start Time: 206,205 ms - End Time: 284,177 ms.
Kernel 2) Start Time: 263,519 ms - End Time: 278,916 ms.
I´m using a Tesla K20c with 13 SMs which can execute up 16 blocks per SM. Both kernels have 100% occupancy and launch an enough amount of blocks:
Kernel 1) 2277 blocks, 32 Register/Thread, 1,156 KB shared memory.
Kernel 2) 46555 blocks, 32 Register/Thread, 1,266 KB shared memory.
With this configuration, both kernels shouldn´t show this behaviour, since both kernels launch an enough number of blocks to fill all SMs of the GPU. However, Nvidia Visual Profiler shows that these kernels are being overlaped. Why?. Anyone could explain me why this behaviour can occur?
Many thanks in advance.
With this configuration, both kernels shouldn´t show this behaviour, since both kernels launch an enough number of blocks to fill all SMs of the GPU.
I think this is an incorrect statement. As far as I know, the low-level scheduling behavior of blocks is not specified in CUDA.
Newer devices (cc3.5+ with hyper-Q) can more easily schedule blocks from concurrent kernels at the same time.
So, if you launch 2 kernels (A and B), each with large numbers of blocks, concurrently, then you may observe either
blocks from kernel A execute concurrently with kernel B
all (or nearly all) of the blocks of kernel A execute before kernel B
all (or nearly all) of the blocks of kernel B execute before kernel A
Since there is no specification at this level, there is no direct answer. Any of the above are possible. The low level block scheduler is free to choose blocks in any order, and the order is not specified.
If a given kernel launch "completely saturates" the machine (i.e. uses enough resources to fully occupy the machine while it is executing) then there is no reason to think that the machine has extra capacity for a second concurrent kernel. Therefore there would be no reason to expect much, if any, speed up from running the two kernels concurrently as opposed to sequentially. In such a scenario, whether they execute concurrently or not, we would expect the total execution time for the 2 kernels running concurrently to be approximately the same as the total execution time if the two kernels are launched or scheduled sequentially (ignoring tail effects and launch overheads, and the like).

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have Nvidia K40, and for some reason, I want my code only uses portion of the Cuda cores(i.e instead of using all 2880 only use 400 cores for examples), is it possible?is it logical to do this either?
In addition, is there any way to see how many cores are being using by GPU when I run my code? In other words, can we check during execution, how many cores are being used by the code, report likes "task manger" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for cuda. Not to say it couldn't be useful for something. For example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this, would be to access the ptx identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to only have 1 block per Streaming Multiprocessor (SM) and then return each kernel based on which kernel you want on which SM's.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTS register for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
Not sure, whether it works with the K40, but for newer Ampere GPUs there is the MIG Multi-Instance-GPU feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know such methods, but would like to get to know.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, many kernels, some of which can be executed in parallel, you want to load GPU fully, most effectively. But it seems on its own GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with 30-blocks grid and 30 SMs, this kernel can occupy entire GPU. I believe I saw such effect. Really this kernel will be faster (maybe 1.5x against 4 256-threads blocks per SM), but this will not be effective when you have another work.
GPU can't know whether we are going to run another kernel after this one with 30 blocks or not - whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist
As to question 3, I suppose GPU profiling tools should show this, Visual Profiler and newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be Task manager, but a statistics for kernels that were executed by your program instead.
As to possibility to move thread blocks between SMs when necessary,
#ChristianSarofeen, I can't find mentions that this is possible. Quite the countrary,
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although starting from some architecture there is such thing as preemption. As I remember NVidia advertised it in the following way. Let's say you made a game that run some heavy kernels (say for graphics rendering). And then something unusual happened. You need to execute some not so heavy kernel as fast as possible. With preemption you can unload somehow running kernels and execute this high priority one. This increases execution time (of this high pr. kernel) a lot.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernels invocation take a lot of time (of course in case of a stream of kernels and if you don't await for results in between). If you call several kernels, it seems possible to send all necessary data for all kernels while the first kernel is executing on GPU. So I believe NVidia means that it runs several kernels in parallel and perform some smart load-balancing between SMs.

Multiple processes launching CUDA kernels in parallel

I know that NVIDIA gpus with compute capability 2.x or greater can execute u pto 16 kernels concurrently.
However, my application spawns 7 "processes" and each of these 7 processes launch CUDA kernels.
My first question is that what would be the expected behavior of these kernels. Will they execute concurrently as well or, since they are launched by different processes, they would execute sequentially.
I am confused because the CUDA C programming guide says:
"A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context."
This brings me to my second question, what are CUDA "contexts"?
Thanks!
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware.
So to answer your first question, if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.
You would be better to launch a single worker thread holding a GPU context and use messaging from the other threads to push work onto the GPU. Alternatively there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
To add to the answer of #talonmies
In the newer architectures, by the use of MPS multiple processes can launch multiple kernels concurrently. So, now it is definitely possible which was not sometime before. For a detailed understanding read this article.
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
Additionally, you can also see maximum number of concurrent kernels allowed per cuda compute capability type supported by different GPUs. Here is a link to that:
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
For example a GPU with cuda compute capability of 7.5 can have maximum of 128 Cuda kernels launched to it.
Do you really need to have separate threads and contexts?
I believe that best practice is a usage one context per GPU, because multiple contexts on single GPU bring a sufficient overhead.
To execute many kernels concrurrenlty you should create few CUDA streams in one CUDA context and queue each kernel into its own stream - so they will be executed concurrently, if there are enough resources for it.
If you need to make the context accessible from few CPU threads - you can use cuCtxPopCurrent(), cuCtxPushCurrent() to pass them around, but only one thread will be able to work with the context at any time.

CUDA: why is there a large amount of GPU idle time?

Question
total GPU time + total CPU overhead is smaller than the total execution time. Why?
Detail
I am studying how frequent global memory access and kernel launch may affect the performance and I have designed a code which has multiple small kernels and ~0.1 million kernel calls in total. Each kernel reads data from global memory, processes them and then writes back to the global memory. As expected, the code runs much slower than the original design which has only one large kernel and very few kernel launches.
The problem arose as I used command line profiler to get "gputime" (execution time for the GPU kernel or memory copy method) and "cputime" (CPU overhead for non-blocking method, the sum of gputime and CPU overhead for blocking method ). To my understanding, the sum of all gputimes and all cputimes should exceed the entire execution time (the last "gpuendtimestamp" minus the first "gpustarttimestamp"), but it turns out the contrary is true (sum of gputimes=13.835064 s,
sum of cputimes=4.547344 s, total time=29.582793). Between the end of one kernel and the start of the next, there is often a large amount of waiting time, larger than the CPU overhead of the next kernel. Most of the kernels suffer from this problem are: memcpyDtoH, memcpyDtoD and thrust internel functions such as launch_closure_by_value, fast_scan, etc. What is the probable reason?
System
Windows 7, TCC driver, VS 2010, CUDA 4.2
Thanks for your help!
This is possibly a combination of profiling, which increases latency, and the Windows WDDM subsystem. To overcome the high latency of the latter, the CUDA driver batches GPU operations and submits them in groups with a single Windows kernel call. This can cause large periods of GPU inactivity if CUDA API commands are sitting in an unsubmitted batch.
(Copied #talonmies' comment to an answer, to enable voting and accepting.)