Parallelism in GPU - CUDA / OpenCL - cuda

I have a general questions about parallelism in CUDA or OpenCL code on GPU. I use NVIDIA GTX 470.
I read briefly in the Cuda programming guide, but did not find related answers hence asking here.
I have a top level function which calls the CUDA kernel(For same kernel I have a OpenCL version of it). This top level function itself is called 3 times in a 'for loop' from my main function, for 3 different data sets(Image data R,G,B)
and the actual codelet also has processing over all the pixels in the image/frame so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
So what i want to understand is does does this CUDA and C code create multiple threads for different functionality/functions in the codelet and top level code and executes them in
parallel and exploits task parallelism. If yes, who creates it as there is no threading library explicitly included in code or linked with.
OR
It creates threads/tasks for different 'for loop' iterations which are independent and thus achieving data parallelism.
If it does this kind of parallelism, does it exploit this just by noting that different for loop iterations have no dependencies and hence can be scheduled in parallel?
Because I don't see any special compiler constructs/intrinsics(parallel for loops as in openMP) which tells the compiler/scheduler to schedule such for loops / functions in parallel?
Any reading material would help.

Parallelism on GPUs is SIMT (Single Instruction Multiple Threads). For CUDA Kernels, you specify a grid of blocks where every block has N threads. The CUDA library does all the trick and the CUDA Compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver and further more the thread scheduler on the GPU how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top level function (or host function) executes only the kernel call which is asyncronous and returns emediatly. No threading library is needed because nvcc creates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
OpenCL follows the same paradigm but you compile yor kernel (if they are not precompiled) at runtime. Specify the number of threads to execute the kernel and the lib does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.

What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
Highly data parallel. Kernel is written to do a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.

If you launch multiple kernels, they will not be automatically be parallelized (i.e. no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to do it by hand - in Cuda the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, they will be executed in sequence (there is an OpenCL feature allowing queues to run out-of-order, but I don't know if any implementation supports it.) However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
If you have actual for loops in your kernels, they will not in themselves be parallelized, the parallelism comes from specifying a grid size, which will cause the kernel to be invoked in parallel for each element in that grid (so if you have for loops inside your kernel they will be executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (Cuda) or getGlobalId() (OpenCL) to identify which data item to process in that particular thread.
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.

Related

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have Nvidia K40, and for some reason, I want my code only uses portion of the Cuda cores(i.e instead of using all 2880 only use 400 cores for examples), is it possible?is it logical to do this either?
In addition, is there any way to see how many cores are being using by GPU when I run my code? In other words, can we check during execution, how many cores are being used by the code, report likes "task manger" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for cuda. Not to say it couldn't be useful for something. For example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this, would be to access the ptx identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to only have 1 block per Streaming Multiprocessor (SM) and then return each kernel based on which kernel you want on which SM's.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTS register for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
Not sure, whether it works with the K40, but for newer Ampere GPUs there is the MIG Multi-Instance-GPU feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know such methods, but would like to get to know.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, many kernels, some of which can be executed in parallel, you want to load GPU fully, most effectively. But it seems on its own GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with 30-blocks grid and 30 SMs, this kernel can occupy entire GPU. I believe I saw such effect. Really this kernel will be faster (maybe 1.5x against 4 256-threads blocks per SM), but this will not be effective when you have another work.
GPU can't know whether we are going to run another kernel after this one with 30 blocks or not - whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist
As to question 3, I suppose GPU profiling tools should show this, Visual Profiler and newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be Task manager, but a statistics for kernels that were executed by your program instead.
As to possibility to move thread blocks between SMs when necessary,
#ChristianSarofeen, I can't find mentions that this is possible. Quite the countrary,
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although starting from some architecture there is such thing as preemption. As I remember NVidia advertised it in the following way. Let's say you made a game that run some heavy kernels (say for graphics rendering). And then something unusual happened. You need to execute some not so heavy kernel as fast as possible. With preemption you can unload somehow running kernels and execute this high priority one. This increases execution time (of this high pr. kernel) a lot.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernels invocation take a lot of time (of course in case of a stream of kernels and if you don't await for results in between). If you call several kernels, it seems possible to send all necessary data for all kernels while the first kernel is executing on GPU. So I believe NVidia means that it runs several kernels in parallel and perform some smart load-balancing between SMs.

Multiple processes launching CUDA kernels in parallel

I know that NVIDIA gpus with compute capability 2.x or greater can execute u pto 16 kernels concurrently.
However, my application spawns 7 "processes" and each of these 7 processes launch CUDA kernels.
My first question is that what would be the expected behavior of these kernels. Will they execute concurrently as well or, since they are launched by different processes, they would execute sequentially.
I am confused because the CUDA C programming guide says:
"A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context."
This brings me to my second question, what are CUDA "contexts"?
Thanks!
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware.
So to answer your first question, if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.
You would be better to launch a single worker thread holding a GPU context and use messaging from the other threads to push work onto the GPU. Alternatively there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
To add to the answer of #talonmies
In the newer architectures, by the use of MPS multiple processes can launch multiple kernels concurrently. So, now it is definitely possible which was not sometime before. For a detailed understanding read this article.
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
Additionally, you can also see maximum number of concurrent kernels allowed per cuda compute capability type supported by different GPUs. Here is a link to that:
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
For example a GPU with cuda compute capability of 7.5 can have maximum of 128 Cuda kernels launched to it.
Do you really need to have separate threads and contexts?
I believe that best practice is a usage one context per GPU, because multiple contexts on single GPU bring a sufficient overhead.
To execute many kernels concrurrenlty you should create few CUDA streams in one CUDA context and queue each kernel into its own stream - so they will be executed concurrently, if there are enough resources for it.
If you need to make the context accessible from few CPU threads - you can use cuCtxPopCurrent(), cuCtxPushCurrent() to pass them around, but only one thread will be able to work with the context at any time.

What happen if a CUDA kernel is called from multiple pthreads simultaneously?

I have a CUDA kernel that do my hard work, but I also have some hard work that need to be done in the CPU (calculations with two positions of the same array) that I could not write in CUDA (because CUDA threads are not synchronous, I need to perform a hard work on a position X of an array and after do z[x] = y[x] - y[x - 1], where y is the array resultant of a CUDA kernel where each thread works on one position of this array and z is another array storing the result). So I'm doing this in the CPU.
I have several CPU threads to do the CPU side work, but each one is calling a CUDA kernel passing some data. My question is: what happens on the GPU side when multiple CPU threads are making GPU calls? Would be better if I do the CUDA kernel call once and then create multiple CPU threads to do the CPU side work?
Kernel calls are queued and executed one by one in single stream.
However you can specify stream during kernel execution - then CUDA operations in different streams may run concurrently and operations from different streams may be interleaved.
Default stream is 0.
See: CUDA Streams and Concurrency
Things are similar when different processes use the same card.
Also remember that kernels are executed asynchronously from CPU stuff.
On CUDA 4.0 and later, multiple threads can share the same CUDA context, hence no need for cuPush/PopContext anymore. You just need to call cudaSetDevice for each thread. Then, mentioned #dzonder, you can run multple kernels simulatenously from different threads with streams.

Running parallel CUDA tasks

I am about to create GPU-enabled program using CUDA technology. It is supposed to be C# Emgu or C++ Cuda toolkit (not yet decided).
I need to use all GPU power (I have card with 16 GPU cores). How do I run 16 tasks in parallel?
First of. 16 GPU cores is, on pre 6xx series, equal to 16*8=128 cores. On 6xx series it is 16*32=512 cores. That does not mean you should limit yourself to 128/512 tasks.
Second: emgu seems to be a OpenCV wrapper for .NET, and is related to image processing. It generally has nothing to do with GPU programming. Might be some algorithms have been gpu accelerated, but I don't know anything about that. The alternative to CUDA in this is OpenCL, not OpenCV. If you will be using CUDA technology like you say, you have no alternative to CUDA, as only CUDA is CUDA.
When it comes to starting tasks, you only tell the GPU how many threads you wish to run. Actually, you tell the GPU how many blocks, and how many threads pr. block you wish to run. This is done when you call the cuda function itself. You don't want to limit yourself to 128/512 threads either, but experiment.
Don't know your knowledge on GPGPU programming, but remember that you can not run tasks as you do on the CPU. You can not run 128 different tasks, all threads have to run the exact same instructions (except for when branching, which should generally be avoided).
Generally speaking, you want sufficient threads to fill all the streaming multiprocessors. At a minimum that is .25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR.
Specifically in CUDA now, suppose you have some CUDA kernel __global__ void square_array(float *a, int N)...
Now when you launch the kernel you specify the number of blocks and the number of threads per block
square_array <<< n_blocks, n_threads_per_block >>> (a, N);
Note: you need to get more framiliar with the CUDA parallel programming model as you not approaching to in a manor which will use all your GPU power. Consider reading Programming Massively Parallel Processors, A Hands-on Approach.

Kepler CUDA dynamic parallelism and thread divergence

There is very little information on dynamic parallelism of Kepler, from the description of this new technology, does it mean the issue of thread control flow divergence in the same warp is solved?
It allows recursion and lunching kernel from device code, does it mean that control path in different thread can be executed simultaneously?
Take a look to this paper
Dynamic parallelism, flow divergence and recursion are separated concepts. Dynamic parallelism is the ability to launch threads within a thread. This mean for example you may do this
__global__ void t_father(...) {
...
t_child<<< BLOCKS, THREADS>>>();
...
}
I personally investigated in this area, when you do something like this, when t_father launches the t_child, the whole vga resources are distributed again among those and t_father waits until all the t_child have finished before it can go on (look also this paper Slide 25)
Recursion is available since Fermi and is the ability for a thread to call itself without any other thread/block re-configuration
Regarding the flow divergence, I guess we will never see thread within a warp executing different code simultaneously..
No. Warp concept still exists. All the threads in a warp are SIMD (Single Instruction Multiple Data) that means at the same time, they run one instruction. Even when you call a child kernel, GPU designates one or more warps to your call. Have 3 things in your mind when you're using dynamic parallelism:
The deepest you can go is 24 (CC=3.5).
The number of dynamic kernels running at the same time is limited ( default 4096) but can be increased.
Keep parent kernel busy after child kernel call otherwise with a good chance you waste resources.
There's a sample cuda source in this NVidia presentation on slide 9.
__global__ void convolution(int x[])
{
for j = 1 to x[blockIdx]
kernel<<< ... >>>(blockIdx, j)
}
It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial dompute domains of various sizes (slide 14).
The global compute domain and the partitioning of it are still static, so you can't actually go and change this DURING GPU computation to e.g. spawn more kernel executions because you've not reached the end of your evaluation function yet. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.