OpenCL and CUDA kernels on same GPU - cuda

I'm new to this technology. I have an application which consists of OpenCL kernel and CUDA kernel. I want to execute OpenCL kernel and CUDA kernel one after another on the same GPU(Tesla M2050). Is it possible to execute.?
If it is possible, do we need to take care of any memory management.?
Thanks in advance

Yes it is possible to run OpenCL kernels and CUDA Kernels from the same application. Each has its own schedulers. Memory management will be taken care by the GPU Driver.

Related

What is the relationship between NVIDIA GPUs' CUDA cores and OpenCL computing units?

My computer has a GeForce GTX 960M which is claimed by NVIDIA to have 640 CUDA cores. However, when I run clGetDeviceInfo to find out the number of computing units in my computer, it prints out 5 (see the figure below). It sounds like CUDA cores are somewhat different from what OpenCL considers as computing units? Or maybe a group of CUDA cores form an OpenCL computing unit? Can you explain this to me?
What is the relationship between NVIDIA GPUs' CUDA cores and OpenCL computing units?
Your GTX 960M is a Maxwell device with 5 Streaming Multiprocessors, each with 128 CUDA cores, for a total of 640 CUDA cores.
The NVIDIA Streaming Multiprocessor is equivalent to an OpenCL Compute Unit. The previously linked answer will also give you some useful information that may help with your kernel sizing question in the comments.
The CUDA architecture is a close match to the OpenCL architecture.
A CUDA device is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). A multiprocessor corresponds to an OpenCL compute unit.
A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for each OpenCL work-group. A kernel is executed over an OpenCLNDRange by a grid of thread blocks. As illustrated in Figure 2-1, each of the thread blocks that execute a kernel is therefore uniquely identified by its work-group ID, and each thread by its global ID or by a combination of its local ID and work-group ID.
Copied from OpenCL Programming Guide for the CUDA Architecture http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf

A cuda wrapper to execute openCL

I'm involved in a project where I have to do gpu programming, one of my constraint is to do it on a nvidia device (thus in CUDA).
But I haven't access to a device equipped with nvidia gpu.
So I would like to know if there is any wrapper that exist which could allow me to write a CUDA code but executed as an openCL code to make it work on an amd gpu ?
ps : gpuocelot could fit well IF I would not have to do it on windows system.
Is the "CUDA" constraint an actual one? Because GPU programming on NVIDIA hardware doesn't necessarily imply CUDA. You have other possible solutions such as:
OpenCL which you mentioned already, which is quite complex and cumbersome to use, but which opens you up plenty of possible back-ends.
Thrust which permits you to target NVIDIA GPUs with a CUDA back-end, or CPUs with an OpenMP and a TBB back-end.
OpenACC with the PGI compiler which permits (AFAIK) to target both NVIDIA and AMD GPUs.
If it were me and the code permitting, I would try to develop using Thrust. But that's up to you.
You could take a look at GPU Ocelot. According to its website:
Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs at full speed without recompilation.

Does CUDA allow multiple applications on same gpu at the same time?

I have Tesla K20m GPU card from NVIDIA. In CUDA 5.0 onwards multiple processes from the same application on same GPU is allowed. Does CUDA allow execution of different applications on same GPU at the same time?
Depends what do you mean by 'at the same time'. If you mean 'two applications have CUDA contexts on same card at the same time' then yes.
Though you may want to use MPS to get full benefits and reduce context switching. See also this question.
Multiple applications may run at the same time on the same GPU. Namely, multiple applications can have a CUDA context at the same time and launch kernels, copy memory, etc...
But kernels from different CUDA contexts cannot be executed simultaneously on the same GPU. Meaning, at the very same slice of time, only kernels from a single CUDA context may be executed on a GPU. This may cause a GPU underutilization if kernels do not occupy the entire GPU resources (memory + compute), and some of the resources may be left unused.
MPS enables that by actually having a server with a single CUDA context, and all client processes communicate with the GPU device through this server, and eventually using its single CUDA context. This enables actual concurrency between kernel launches from different (logical) CUDA contexts.

CUDA How to launch a new kernel call in one kernel function?

I am new to CUDA programming. Now, I have a problem to handle: I am trying to use CUDA parallel programming to handle a set of datasets. And for each datasets, there are some matrix calculation needed to be done.
My design is like this:
Launch N threads to handle each dataset as they are independent to each other and the method to handle them are the same.
In each thread in 1, I want to use a new function and this function also works like a kernel as they are matrix calc... e.g. call M threads to parallel handle matrix calculation..
Does anyone know whether it is possible or not?
You can launch a kernel from a thread in another kernel if you use CUDA dynamic parallelism and your GPU supports it. GPUs that support CUDA dynamic parallelism currently are of compute capability 3.5.
You can discover the compute capability of your device from the CUDA deviceQuery sample.
You can learn more about how to use CUDA dynamic parallelism from the CUDA programming guide section.

Running parallel CUDA tasks

I am about to create GPU-enabled program using CUDA technology. It is supposed to be C# Emgu or C++ Cuda toolkit (not yet decided).
I need to use all GPU power (I have card with 16 GPU cores). How do I run 16 tasks in parallel?
First of. 16 GPU cores is, on pre 6xx series, equal to 16*8=128 cores. On 6xx series it is 16*32=512 cores. That does not mean you should limit yourself to 128/512 tasks.
Second: emgu seems to be a OpenCV wrapper for .NET, and is related to image processing. It generally has nothing to do with GPU programming. Might be some algorithms have been gpu accelerated, but I don't know anything about that. The alternative to CUDA in this is OpenCL, not OpenCV. If you will be using CUDA technology like you say, you have no alternative to CUDA, as only CUDA is CUDA.
When it comes to starting tasks, you only tell the GPU how many threads you wish to run. Actually, you tell the GPU how many blocks, and how many threads pr. block you wish to run. This is done when you call the cuda function itself. You don't want to limit yourself to 128/512 threads either, but experiment.
Don't know your knowledge on GPGPU programming, but remember that you can not run tasks as you do on the CPU. You can not run 128 different tasks, all threads have to run the exact same instructions (except for when branching, which should generally be avoided).
Generally speaking, you want sufficient threads to fill all the streaming multiprocessors. At a minimum that is .25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR.
Specifically in CUDA now, suppose you have some CUDA kernel __global__ void square_array(float *a, int N)...
Now when you launch the kernel you specify the number of blocks and the number of threads per block
square_array <<< n_blocks, n_threads_per_block >>> (a, N);
Note: you need to get more framiliar with the CUDA parallel programming model as you not approaching to in a manor which will use all your GPU power. Consider reading Programming Massively Parallel Processors, A Hands-on Approach.