Running parallel CUDA tasks - cuda

I am about to create a GPU-enabled program using CUDA technology. It will use either C# with Emgu or the C++ CUDA toolkit (not yet decided).
I need to use all of the GPU's power (I have a card with 16 GPU cores). How do I run 16 tasks in parallel?

First off: the 16 "GPU cores" are multiprocessors. On pre-Fermi cards, 16 multiprocessors equal 16*8=128 CUDA cores. On Fermi-class cards with 32 cores per multiprocessor it is 16*32=512 cores. That does not mean you should limit yourself to 128/512 tasks.
Second: Emgu seems to be an OpenCV wrapper for .NET, and is related to image processing. It generally has nothing to do with GPU programming. Some of its algorithms may have been GPU accelerated, but I don't know anything about that. The alternative to CUDA here is OpenCL, not OpenCV. If you will be using CUDA technology like you say, you have no alternative to CUDA, as only CUDA is CUDA.
When it comes to starting tasks, you only tell the GPU how many threads you wish to run. More precisely, you tell the GPU how many blocks, and how many threads per block, you wish to run. This is done when you call the CUDA function itself. You don't want to limit yourself to 128/512 threads either, but experiment.
I don't know how familiar you are with GPGPU programming, but remember that you cannot run tasks as you do on the CPU. You cannot run 128 different tasks: all threads have to run the exact same instructions (except when branching, which should generally be avoided).

Generally speaking, you want sufficient threads to fill all the streaming multiprocessors. At a minimum that is .25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR.
Specifically in CUDA, suppose you have some CUDA kernel __global__ void square_array(float *a, int N)...
When you launch the kernel, you specify the number of blocks and the number of threads per block:
square_array <<< n_blocks, n_threads_per_block >>> (a, N);
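As a rough sketch of how the two launch numbers above might be chosen, the following plain Python (no GPU required) computes a block count for N elements and the minimum thread count from the 0.25 rule of thumb. The function names and the example figures (16 multiprocessors, 1536 threads per multiprocessor, 256 threads per block) are assumptions made for illustration; they are not taken from the question.

```python
def blocks_needed(n_elements, threads_per_block):
    """Ceiling division: enough blocks so that every element gets a thread."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def min_threads_to_fill(multiprocessors, max_threads_per_mp, fraction=0.25):
    """Rule-of-thumb lower bound on the total number of threads to launch."""
    return int(fraction * multiprocessors * max_threads_per_mp)


# Hypothetical problem size and launch configuration.
n = 100_000
threads_per_block = 256
n_blocks = blocks_needed(n, threads_per_block)

print(n_blocks)                       # 391
print(min_threads_to_fill(16, 1536))  # 6144
```

The last block is usually only partly used, which is why kernels typically guard with a check such as `if (idx < N)`.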
Note: you need to get more familiar with the CUDA parallel programming model, as you are not approaching this in a manner which will use all your GPU power. Consider reading Programming Massively Parallel Processors: A Hands-on Approach.

Related

What does "GPU under-utilization due to low occupancy" mean?

I am using NUMBA and cupy to perform GPU coding. Now I have switched my code from a V100 NVIDIA card to A100, but then, I got the following warnings:
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
NumbaPerformanceWarning:Host array used in CUDA kernel will incur copy overhead to/from device.
Does anyone know what the two warnings really suggest? How should I improve my code then?
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
A GPU is subdivided into SMs. Each SM can hold a complement of threadblocks (which is like saying it can hold a complement of threads). In order to "fully utilize" the GPU, you would want each SM to be "full", which roughly means each SM has enough threadblocks to fill its complement of threads. An A100 GPU has 108 SMs. If your kernel has fewer than 108 threadblocks in the kernel launch (i.e. the grid), then your kernel will not be able to fully utilize the GPU. Some SMs will be empty. A threadblock cannot be resident on 2 or more SMs at the same time. Even 108 (one per SM) may not be enough. An A100 SM can hold 2048 threads, which is at least two threadblocks of 1024 threads each. Anything less than 2*108 threadblocks in your kernel launch may not fully utilize the GPU. When you don't fully utilize the GPU, your performance may not be as good as possible.
The solution is to expose enough parallelism (enough threads) in your kernel launch to fully "occupy" or "utilize" the GPU. 216 threadblocks of 1024 threads each is sufficient for an A100. Anything less may not be.
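The arithmetic behind the warning can be sketched in plain Python. This only mirrors the comparison printed in the warning text (grid size against 2 * SM count); it is not Numba's actual implementation, and the function names are made up for illustration. It also ignores other limits on resident blocks, such as registers and shared memory.

```python
def grid_is_likely_too_small(grid_size, sm_count, blocks_per_sm=2):
    """True when the grid has fewer blocks than blocks_per_sm * sm_count."""
    return grid_size < blocks_per_sm * sm_count


def blocks_to_fill(sm_count, max_threads_per_sm, threads_per_block):
    """Blocks needed so every SM holds its full complement of threads."""
    blocks_per_sm = max_threads_per_sm // threads_per_block
    return sm_count * blocks_per_sm


A100_SM_COUNT = 108          # figure quoted in the answer above
A100_THREADS_PER_SM = 2048   # figure quoted in the answer above

print(grid_is_likely_too_small(27, A100_SM_COUNT))    # True
print(grid_is_likely_too_small(216, A100_SM_COUNT))   # False
print(blocks_to_fill(A100_SM_COUNT, A100_THREADS_PER_SM, 1024))  # 216
print(blocks_to_fill(A100_SM_COUNT, A100_THREADS_PER_SM, 256))   # 864
```

With smaller blocks, more of them are needed to reach the same thread count, which is why a grid of 27 blocks falls well short on this GPU.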
For additional understanding here, I recommend the first 4 sections of this course.
NumbaPerformanceWarning:Host array used in CUDA kernel will incur copy overhead to/from device.
One of the cool things about a numba kernel launch is that I can pass to it a host data array:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel[blocks, threads](a)
and numba will "do the right thing". In the above example it will:
1. create a device array for storage of a in device memory, let's call this d_a
2. copy the data from a to d_a (Host->Device)
3. launch your kernel, where the kernel is actually using d_a
4. when the kernel is finished, copy the contents of d_a back to a (Device->Host)
That's all very convenient. But what if I were doing something like this:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel1[blocks, threads](a)
my_kernel2[blocks, threads](a)
What numba will do is it will perform steps 1-4 above for the launch of my_kernel1 and then perform steps 1-4 again for the launch of my_kernel2. In most cases this is probably not what you want as a numba cuda programmer.
The solution in this case is to "take control" of data movement:
a = numpy.ones(32, dtype=numpy.int64)
d_a = numba.cuda.to_device(a)
my_kernel1[blocks, threads](d_a)
my_kernel2[blocks, threads](d_a)
a = d_a.copy_to_host()
This eliminates unnecessary copying and will, in many cases, make your program run faster. (For trivial examples involving a single kernel launch, there probably will be no difference.)
For additional understanding, probably any online tutorial such as this one, or just the numba cuda docs, will be useful.

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have an Nvidia K40 and, for some reason, I want my code to use only a portion of the CUDA cores (e.g. instead of using all 2880, use only 400 cores). Is this possible? Is it even sensible to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores are being used by the code, in a report like "Task Manager" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for cuda. Not to say it couldn't be useful for something. For example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this, would be to access the ptx identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to only have 1 block per Streaming Multiprocessor (SM) and then return each kernel based on which kernel you want on which SM's.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX register for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
Not sure, whether it works with the K40, but for newer Ampere GPUs there is the MIG Multi-Instance-GPU feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but would like to learn about them.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can be executed in parallel, you want to load the GPU fully and as effectively as possible. But it seems that, left on its own, the GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I saw such an effect. The kernel itself will indeed be faster this way (maybe 1.5x compared with packing 4 blocks of 256 threads onto each SM), but this will not be efficient when you have other work to run.
The GPU can't know whether we are going to run another kernel after this one with 30 blocks, i.e. whether it would be more effective to spread it across all SMs or not. So some manual way to say this should exist.
As to question 3, I suppose GPU profiling tools should show this: Visual Profiler and the newer Parallel Nsight and Nsight Compute. But I haven't tried. This will not be like Task Manager; instead it gives statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary,
@ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
However, starting from some architecture there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way. Let's say you made a game that runs some heavy kernels (say for graphics rendering). Then something unusual happens, and you need to execute some not so heavy kernel as fast as possible. With preemption you can somehow unload the running kernels and execute this high-priority one. This greatly shortens the time until the high-priority kernel completes.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (assuming a stream of kernels where you don't wait for results in between). If you call several kernels, it seems possible to send all necessary data for all kernels while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load-balancing between SMs.

Concurrent GPU kernel execution from multiple processes

I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context, targeting the same GPU. According to the Fermi white paper[1], application-level context switching takes less than 25 microseconds, but the launches are effectively serialized as they launch on the GPU, so Fermi wouldn't work well for this. According to the Kepler white paper[2], there is something called Hyper-Q that allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.
My questions: Has anyone tried this on a Kepler GPU and verified that its kernels are run concurrently when scheduled from distinct processes? Is this just a CUDA feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
In response to the first question, NVIDIA has published some hyper-Q results in a blog here. The blog is pointing out that the developers who were porting CP2K were able to get to accelerated results more quickly because hyper-Q allowed them to use the application's MPI structure more or less as-is and run multiple ranks on a single GPU, and get higher effective GPU utilization that way. As mentioned in the comments, this (hyper-Q) feature is only available on K20 processors currently, as it is dependent on the GK110 GPU.
I've run simultaneous kernels on the Fermi architecture. It works wonderfully and, in fact, is often the only way to get high occupancy from your hardware. I used OpenCL, and you need to run a separate command queue from a separate CPU thread in order to do this. Note that the ability to dispatch new data-parallel kernels from within another kernel is Dynamic Parallelism rather than Hyper-Q; Hyper-Q provides the multiple simultaneous connections to the GPU described in the question. Both were introduced with Kepler (GK110).

Calculating achieved bandwidth and flops/Gflops, and evaluate CUDA kernel performance

Most of the papers show the flops/Gflops and achieved bandwidth for their CUDA kernels. I have also read answers on stackoverflow for the following questions:
How to evaluate CUDA performance?
How Do You Profile & Optimize CUDA Kernels?
How to calculate Gflops of a kernel
Counting FLOPS/GFLOPS in program - CUDA
How to calculate the achieved bandwidth of a CUDA kernel
Most of it seems OK, but I still do not feel comfortable calculating these things. Can anyone write a simple CUDA kernel, then give the output of deviceQuery, then compute step by step the flops/Gflops and achieved bandwidth for this kernel, and then show the Visual Profiler results for it? I.e. show the results in detail, with all the information obtained step by step for this simple CUDA kernel. That would be really helpful for most of us. Thanks!
Nsight Visual Studio Edition 2.1 and Above
The information you requested is available if you collect Achieved FLOPS experiment and Memory Statistics - Buffers experiment.
Visual Profiler 4.2 and Above
Achieved Bandwidth: When you mouse over a kernel in the Timeline, this information is available in the Properties Pane under Memory\DRAM Utilization.
The profiler cannot collect a FLOPS count yet. This can be done by running cuobjdump -sass to view the assembly code. Step through the kernel and count single and double precision floating point instructions, multiplying FMA and DFMA operations by 2. Each instruction should also be multiplied by the number of predicated-true threads. You also have to account for control flow. This is not fun and requires someone with a strong knowledge of the instruction set. It may be better accomplished by single stepping the assembly in the debugger. The duration of the kernel is available in the Visual Profiler Properties Pane and Details Pane as Duration.
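Once the instruction counts and the kernel duration are known, the final arithmetic is simple. The sketch below uses plain Python and entirely hypothetical counts (one million threads, each executing 10 plain floating point instructions and 20 FMAs, in 0.5 ms); the function name is an assumption for illustration. Counts are per executed thread, i.e. already multiplied by the predicated-true threads.

```python
def achieved_gflops(plain_fp_instructions, fma_instructions, duration_s):
    """GFLOP/s from thread-level instruction counts; each FMA counts as 2 FLOPs."""
    flops = plain_fp_instructions + 2 * fma_instructions
    return flops / duration_s / 1e9


# Hypothetical kernel: every thread runs the same straight-line code.
threads = 1_000_000
fadd_per_thread = 10   # assumed count of non-FMA floating point instructions
ffma_per_thread = 20   # assumed count of FMA instructions
duration_s = 0.5e-3    # assumed kernel duration read from the profiler

gflops = achieved_gflops(threads * fadd_per_thread,
                         threads * ffma_per_thread,
                         duration_s)
print(round(gflops, 3))  # 100.0
```

Real kernels with branches need per-path counts, which is the control-flow caveat mentioned above.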
You could follow the calculations of Mark Harris in Optimizing Parallel Reduction in CUDA. There he takes the size of the input data as the basis and divides it by the kernel execution time. In the examples he used 2^22 ints, so he has 0.016777216 GB of input data. The first kernel took 8.054 ms, which is an achieved bandwidth of 2.083 GB/s.
After several optimizations he reached 62.671 GB/s and compares it to the peak bandwidth of the GPU used, which is 86.4 GB/s.
Although he used ints, you can easily adapt that to flops/Gflops.
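The first of these figures can be reproduced in a few lines of plain Python. This is a sketch assuming 4 bytes per int and 1 GB = 10^9 bytes, which is what the 0.016777216 GB figure implies; the function name is made up for illustration.

```python
def achieved_bandwidth_gbs(bytes_moved, duration_s):
    """Achieved bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / 1e9 / duration_s


n_ints = 2 ** 22
bytes_read = n_ints * 4      # 16,777,216 bytes = 0.016777216 GB
first_kernel_s = 8.054e-3    # 8.054 ms, from the text above
peak_gbs = 86.4              # peak bandwidth quoted above

bw = achieved_bandwidth_gbs(bytes_read, first_kernel_s)
print(round(bw, 3))                    # 2.083
print(round(100 * bw / peak_gbs, 1))   # 2.4 (percent of peak)
```

For a kernel that also writes a large output, the bytes written would be added to `bytes_moved` before dividing.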

Parallelism in GPU - CUDA / OpenCL

I have a general question about parallelism in CUDA or OpenCL code on a GPU. I use an NVIDIA GTX 470.
I read briefly through the CUDA programming guide, but did not find related answers, hence asking here.
I have a top-level function which calls the CUDA kernel (I also have an OpenCL version of the same kernel). This top-level function itself is called 3 times in a 'for loop' from my main function, for 3 different data sets (image data R, G, B),
and the actual codelet also processes all the pixels in the image/frame, so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
So what I want to understand is: does this CUDA and C code create multiple threads for different functionality/functions in the codelet and top-level code and execute them in
parallel, exploiting task parallelism? If yes, who creates them, as there is no threading library explicitly included in the code or linked with it?
OR
Does it create threads/tasks for different 'for loop' iterations which are independent, thus achieving data parallelism?
If it does this kind of parallelism, does it exploit this just by noting that different for loop iterations have no dependencies and hence can be scheduled in parallel?
I ask because I don't see any special compiler constructs/intrinsics (parallel for loops as in OpenMP) which tell the compiler/scheduler to schedule such for loops / functions in parallel.
Any reading material would help.
Parallelism on GPUs is SIMT (Single Instruction, Multiple Threads). For CUDA kernels, you specify a grid of blocks where every block has N threads. The CUDA library does the work, and the CUDA compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver, and in turn the thread scheduler on the GPU, how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top-level function (or host function) executes only the kernel call, which is asynchronous and returns immediately. No threading library is needed because nvcc creates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
OpenCL follows the same paradigm, but you compile your kernels (if they are not precompiled) at runtime. Specify the number of threads to execute the kernel and the library does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
This is highly data parallel: the kernel is written to do a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
If you launch multiple kernels, they will not automatically be parallelized (i.e. no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to do it by hand - in Cuda the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, they will be executed in sequence (there is an OpenCL feature allowing queues to run out-of-order, but I don't know if any implementation supports it.) However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
If you have actual for loops in your kernels, they will not in themselves be parallelized. The parallelism comes from specifying a grid size, which will cause the kernel to be invoked in parallel for each element in that grid (so if you have for loops inside your kernel, they will be executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (Cuda) or get_global_id() (OpenCL) to identify which data item to process in that particular thread.
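To make the index arithmetic concrete, here is a CPU-only emulation in plain Python. It is only a sketch: the "launch" runs the kernel function sequentially for every (block, thread) pair, whereas a GPU would run them in parallel, and the names simulate_launch and square_kernel are hypothetical, not part of CUDA or OpenCL.

```python
def simulate_launch(n_blocks, threads_per_block, kernel, *args):
    """Call kernel once per (block, thread) pair, sequentially on the CPU."""
    for block_idx in range(n_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def square_kernel(block_idx, block_dim, thread_idx, data, n):
    """Each 'thread' handles exactly one element, chosen by its global index."""
    i = block_idx * block_dim + thread_idx  # same formula as in a CUDA kernel
    if i < n:                               # guard: the last block may overhang
        data[i] = data[i] * data[i]


data = [float(x) for x in range(10)]
# 3 blocks of 4 threads = 12 "threads" for 10 elements; 2 threads do nothing.
simulate_launch(3, 4, square_kernel, data, len(data))
print(data)
```

Note that the kernel body contains no loop over the data: the loop lives in the launch, which is the part the GPU replaces with parallel hardware threads.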
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.