Multiple kernels in cuda 4.0 - cuda

Is it possible to launch multiple kernels on multiple GPUs concurrently from a single thread in cuda 4.0?

To use multiple GPUs from a single thread, you can switch between cuda contexts (each of which is bound is bound to a GPU) and launch kernels asynchronously. In effect you will be running multiple kernels across multiple GPUs this way.
However if you have cards with compute capability > 2.0, you can also run kernels concurrently as shown in the comments above. You can find the post about concurrent kernel execution over here.
Ofcourse you can use both if you have multiple cards with compute capability >= 2.0.

yes.
If there are 2 devices you can run kernel1<<<>>> at device0 and kernel2<<<>>> at device1. there is an option setdevice() with which you choose the device on which the kernel will be executed.
google it, it is in the cuda library 4.0

Related

Does CUDA allow multiple applications on same gpu at the same time?

I have Tesla K20m GPU card from NVIDIA. In CUDA 5.0 onwards multiple processes from the same application on same GPU is allowed. Does CUDA allow execution of different applications on same GPU at the same time?
Depends what do you mean by 'at the same time'. If you mean 'two applications have CUDA contexts on same card at the same time' then yes.
Though you may want to use MPS to get full benefits and reduce context switching. See also this question.
Multiple applications may run at the same time on the same GPU. Namely, multiple applications can have a CUDA context at the same time and launch kernels, copy memory, etc...
But kernels from different CUDA contexts cannot be executed simultaneously on the same GPU. Meaning, at the very same slice of time, only kernels from a single CUDA context may be executed on a GPU. This may cause a GPU underutilization if kernels do not occupy the entire GPU resources (memory + compute), and some of the resources may be left unused.
MPS enables that by actually having a server with a single CUDA context, and all client processes communicate with the GPU device through this server, and eventually using its single CUDA context. This enables actual concurrency between kernel launches from different (logical) CUDA contexts.

Multiple processes launching CUDA kernels in parallel

I know that NVIDIA gpus with compute capability 2.x or greater can execute u pto 16 kernels concurrently.
However, my application spawns 7 "processes" and each of these 7 processes launch CUDA kernels.
My first question is that what would be the expected behavior of these kernels. Will they execute concurrently as well or, since they are launched by different processes, they would execute sequentially.
I am confused because the CUDA C programming guide says:
"A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context."
This brings me to my second question, what are CUDA "contexts"?
Thanks!
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware.
So to answer your first question, if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.
You would be better to launch a single worker thread holding a GPU context and use messaging from the other threads to push work onto the GPU. Alternatively there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
To add to the answer of #talonmies
In the newer architectures, by the use of MPS multiple processes can launch multiple kernels concurrently. So, now it is definitely possible which was not sometime before. For a detailed understanding read this article.
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
Additionally, you can also see maximum number of concurrent kernels allowed per cuda compute capability type supported by different GPUs. Here is a link to that:
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
For example a GPU with cuda compute capability of 7.5 can have maximum of 128 Cuda kernels launched to it.
Do you really need to have separate threads and contexts?
I believe that best practice is a usage one context per GPU, because multiple contexts on single GPU bring a sufficient overhead.
To execute many kernels concrurrenlty you should create few CUDA streams in one CUDA context and queue each kernel into its own stream - so they will be executed concurrently, if there are enough resources for it.
If you need to make the context accessible from few CPU threads - you can use cuCtxPopCurrent(), cuCtxPushCurrent() to pass them around, but only one thread will be able to work with the context at any time.

Is it possible to run two kernels belonging to different application run concurrently?

I want to run two kernels belonging to two different CUDA programs concurrently. I don't want context switching to happen. I want both the programs to share the device. I have 14 SMs in my GPU and one program uses 2 SMs and another uses 1 SM. But I am not able to run them concurrently. The second program waits until the first one finishes its execution. The GPU device is nVidia Tesla M2050. With deviceQuery it says the card supports concurrent kernel execution (Is this for kernels belonging to one program/process/context?). Also, the device is in Default computing mode. It is not set to any exclusive mode.
Is there any way I can run them concurrently?
First take a look at https://stackoverflow.com/a/14896945/56875.
GPUs can't execute in parallel kernels from two cuda contexts.
I haven't tried it before, but on linux you can try to cheat by using CUDA-proxy which is a single process that creats single context but allows other cuda processes to use proxy's context.
As #talonmies says in the comment - this is not currently possible. Only one CUDA context can be on the given device at any time.

Concurrent GPU kernel execution from multiple processes

I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context, targeting the same GPU. According to the Fermi white paper[1], application-level context switching is less then 25 microseconds, but the launches are effectively serialized as they launch on the GPU -- so Fermi wouldn't work well for this. According to the Kepler white paper[2], there is something called Hyper-Q that allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.
My questions: Has anyone tried this on a Kepler GPU and verified that its kernels are run concurrently when scheduled from distinct processes? Is this just a CUDA feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
In response to the first question, NVIDIA has published some hyper-Q results in a blog here. The blog is pointing out that the developers who were porting CP2K were able to get to accelerated results more quickly because hyper-Q allowed them to use the application's MPI structure more or less as-is and run multiple ranks on a single GPU, and get higher effective GPU utilization that way. As mentioned in the comments, this (hyper-Q) feature is only available on K20 processors currently, as it is dependent on the GK110 GPU.
I've run simultaneous kernels from Fermi architecture it works wonderfully and in fact, is often the only way to get high occupancy from your hardware. I used OpenCL and you need to run a separate command queue from a separate cpu thread in order to do this. Hyper-Q is the ability to dispatch new data parallel kernels from within another kernel. This is only on Kepler.

Parallelism in GPU - CUDA / OpenCL

I have a general questions about parallelism in CUDA or OpenCL code on GPU. I use NVIDIA GTX 470.
I read briefly in the Cuda programming guide, but did not find related answers hence asking here.
I have a top level function which calls the CUDA kernel(For same kernel I have a OpenCL version of it). This top level function itself is called 3 times in a 'for loop' from my main function, for 3 different data sets(Image data R,G,B)
and the actual codelet also has processing over all the pixels in the image/frame so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
So what i want to understand is does does this CUDA and C code create multiple threads for different functionality/functions in the codelet and top level code and executes them in
parallel and exploits task parallelism. If yes, who creates it as there is no threading library explicitly included in code or linked with.
OR
It creates threads/tasks for different 'for loop' iterations which are independent and thus achieving data parallelism.
If it does this kind of parallelism, does it exploit this just by noting that different for loop iterations have no dependencies and hence can be scheduled in parallel?
Because I don't see any special compiler constructs/intrinsics(parallel for loops as in openMP) which tells the compiler/scheduler to schedule such for loops / functions in parallel?
Any reading material would help.
Parallelism on GPUs is SIMT (Single Instruction Multiple Threads). For CUDA Kernels, you specify a grid of blocks where every block has N threads. The CUDA library does all the trick and the CUDA Compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver and further more the thread scheduler on the GPU how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top level function (or host function) executes only the kernel call which is asyncronous and returns emediatly. No threading library is needed because nvcc creates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
OpenCL follows the same paradigm but you compile yor kernel (if they are not precompiled) at runtime. Specify the number of threads to execute the kernel and the lib does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
Highly data parallel. Kernel is written to do a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
If you launch multiple kernels, they will not be automatically be parallelized (i.e. no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to do it by hand - in Cuda the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, they will be executed in sequence (there is an OpenCL feature allowing queues to run out-of-order, but I don't know if any implementation supports it.) However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
If you have actual for loops in your kernels, they will not in themselves be parallelized, the parallelism comes from specifying a grid size, which will cause the kernel to be invoked in parallel for each element in that grid (so if you have for loops inside your kernel they will be executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (Cuda) or getGlobalId() (OpenCL) to identify which data item to process in that particular thread.
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.