multiple processes of video processing to GPU cores - cuda

Can we assign a number of processes (i.e. 100-500 processes) to GPU, each process running on a GPU core?
In my video processing application I have to use the ffmpeg library to process video and audio. If there are more than 100, or even 500, such independent processes, I guess it would be faster to assign each process to its own GPU core. However, I don't know whether this can be done, and if it can, which libraries and tools are necessary. CUDA?

Can we assign a number of processes (i.e. 100-500 processes) to GPU, each process running on a GPU core?
No, you can't. In general it's not possible to schedule anything on a GPU core per se. This level of "scheduling" is handled mainly by the mechanics of the CUDA architecture and runtime system.
The basic idea is to expose parallelism at a fairly low level in your code (e.g. at the loop level) and with proper use of a GPU acceleration syntax (such as CUDA, OpenACC, OpenCL, etc.) the GPU can often make such elements of your program run faster.
But the GPU is not designed to be a drop-in replacement for CPU cores. There is the scheduling factor that I mentioned already, as well as the fact that codes generally need to be compiled for the GPU specifically.
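As a rough illustration of what "loop-level parallelism" means here, below is a minimal CUDA sketch of a per-pixel operation (the kernel, gain factor, and frame size are made-up examples, not anything ffmpeg-specific): the loop over pixels becomes one thread per pixel, and the hardware decides which cores run which threads.

#include <cuda_runtime.h>

// Hypothetical example: scale every pixel of a frame by a gain factor.
// On the CPU this would be a simple for-loop; on the GPU each thread
// handles one pixel.
__global__ void scale_pixels(const unsigned char *in, unsigned char *out,
                             float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global pixel index
    if (i < n) {
        float v = in[i] * gain;
        out[i] = v > 255.0f ? 255 : (unsigned char)v;
    }
}

int main()
{
    const int n = 1920 * 1080;                 // one hypothetical frame
    unsigned char *d_in, *d_out;
    cudaMalloc(&d_in,  n);
    cudaMalloc(&d_out, n);
    // ... copy frame data into d_in with cudaMemcpy ...

    int block = 256;
    int grid  = (n + block - 1) / block;       // enough blocks to cover all pixels
    scale_pixels<<<grid, block>>>(d_in, d_out, 1.2f, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}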

Related

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have an Nvidia K40 and, for some reason, I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, only use 400 cores, for example). Is this possible? Is it even logical to do this?
In addition, is there any way to see how many cores the GPU is using while I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. Not to say it couldn't be useful for something. For example, you might want to run multiple kernels on the same GPU and for some reason allocate a certain number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
The way you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional at the start of each kernel. You would have to launch only one block per Streaming Multiprocessor (SM) and then have each block return early depending on which kernel you want on which SMs (a minimal sketch follows the links below).
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so the behavior would have to be measured before implementation and could be hardware and CUDA version dependent. However, since you asked, and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly...
https://gist.github.com/allanmac/4751080
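A minimal sketch of that idea, following the approach in the gist above (the kernel and the SM limit are hypothetical, and as noted the block-to-SM mapping is not guaranteed to be stable):

#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX special registers %smid / %nsmid via inline assembly
// (same idea as the gist linked above).
__device__ __forceinline__ unsigned int smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__device__ __forceinline__ unsigned int nsmid()
{
    unsigned int n;
    asm volatile("mov.u32 %0, %%nsmid;" : "=r"(n));
    return n;
}

// Hypothetical kernel that only does work when its block happens to be
// resident on an SM below some limit; other blocks return immediately.
__global__ void restricted_kernel(int sm_limit)
{
    if (smid() >= sm_limit)
        return;                                    // "give back" this SM
    if (threadIdx.x == 0)
        printf("block %d running on SM %u of %u\n",
               blockIdx.x, smid(), nsmid());
    // ... real work for the allowed SMs would go here ...
}

int main()
{
    restricted_kernel<<<60, 128>>>(4);   // e.g. allow work only on SMs 0..3
    cudaDeviceSynchronize();
    return 0;
}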
Not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature for partitioning GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but would like to learn about them.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can run in parallel, you want to load the GPU as fully and effectively as possible. But it seems that, left to itself, the GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I have seen this effect. That kernel really will be faster (maybe 1.5x versus 4 blocks of 256 threads per SM), but it will not be efficient when you have other work to run.
The GPU can't know whether we are going to run another kernel after this 30-block one, i.e. whether it would be more effective to spread it over all SMs or not, so some manual way to express this should exist.
As to question 3, I suppose the GPU profiling tools should show this: Visual Profiler and the newer Nsight tools (Nsight Systems and Nsight Compute). But I haven't tried them for this. It will not be a Task Manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary, @ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it roughly as follows. Let's say you made a game that runs some heavy kernels (say, for graphics rendering), and then something unusual happens: you need to execute some not-so-heavy kernel as fast as possible. With preemption you can suspend the kernels that are already running and execute this high-priority one, which greatly reduces how long the high-priority kernel has to wait.
I also found the following:
CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least in the case of a stream of kernels, when you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all of them while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load balancing between SMs.
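For reference, a minimal stream-capture sketch of the CUDA Graphs API quoted above (the kernels are placeholders; note that the exact cudaGraphInstantiate signature differs slightly between CUDA versions, this uses the CUDA 10/11 form):

#include <cuda_runtime.h>

__global__ void stage1(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void stage2(float *d) { d[threadIdx.x] *= 2.0f; }

int main()
{
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaMemset(d, 0, 256 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the sequence of launches once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    stage1<<<1, 256, 0, stream>>>(d);
    stage2<<<1, 256, 0, stream>>>(d);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay it many times with low per-launch CPU cost.
    for (int i = 0; i < 1000; ++i)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}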

CUDA vs OpenCL performance comparison

I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two identical kernels for each platform (they differ only in the platform-specific keywords). They only read and write global memory, each thread a different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the configuration for OpenCL - 50,000 global work size and 250 local work size.
The OpenCL code runs faster. Is this possible or am I timing it wrong? My understanding is that NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.
It would be great if you could suggest why I might be seeing this and perhaps some differences between CUDA and OpenCL as implemented by NVIDIA?
Kernels executing on a modern GPU are almost never compute bound, and are almost always memory bandwidth bound. (Because there are so many compute cores running compared to the available path to memory.)
This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the given algorithm.
In practice this makes it very difficult to predict (or even understand) what performance to expect ahead of time.
The differences you observed are likely due to subtle differences in the memory access patterns between the two kernels that result from different optimizations made by the OpenCL vs CUDA toolchain.
To learn how to optimize your GPU kernels it pays to learn the details of the memory caching hardware available to you, and how to use it to best advantage. (e.g., making strategic use of "local" memory caches vs always going directly to "global" memory in OpenCL.)
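As a rough illustration, CUDA's __shared__ memory is the counterpart of OpenCL "local" memory. Below is a minimal sketch (the kernel name and sizes are made up, and block-boundary neighbours are ignored for brevity) of staging a tile on chip so that neighbouring reads are served from fast shared memory instead of extra global-memory accesses:

#include <cuda_runtime.h>

// Each block loads a tile once (coalesced), then every thread reuses its
// neighbours from the on-chip copy. Launch with 256 threads per block to
// match the tile size.
__global__ void neighbour_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int t = threadIdx.x;                            // index inside the tile

    tile[t] = (g < n) ? in[g] : 0.0f;               // one coalesced load
    __syncthreads();                                // tile is now complete

    if (g < n) {
        float left  = (t > 0)              ? tile[t - 1] : 0.0f;
        float right = (t < blockDim.x - 1) ? tile[t + 1] : 0.0f;
        out[g] = left + tile[t] + right;            // reads served from shared memory
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    neighbour_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}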

Concurrent GPU kernel execution from multiple processes

I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context, targeting the same GPU. According to the Fermi white paper[1], application-level context switching is less than 25 microseconds, but the launches are effectively serialized as they launch on the GPU -- so Fermi wouldn't work well for this. According to the Kepler white paper[2], there is something called Hyper-Q that allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.
My questions: Has anyone tried this on a Kepler GPU and verified that its kernels are run concurrently when scheduled from distinct processes? Is this just a CUDA feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
In response to the first question, NVIDIA has published some hyper-Q results in a blog here. The blog is pointing out that the developers who were porting CP2K were able to get to accelerated results more quickly because hyper-Q allowed them to use the application's MPI structure more or less as-is and run multiple ranks on a single GPU, and get higher effective GPU utilization that way. As mentioned in the comments, this (hyper-Q) feature is only available on K20 processors currently, as it is dependent on the GK110 GPU.
I've run simultaneous kernels on the Fermi architecture and it works wonderfully; in fact, it is often the only way to get high occupancy from your hardware. I used OpenCL, and you need to run a separate command queue from a separate CPU thread in order to do this. (Note that Hyper-Q is about multiple hardware work queues accepting work from multiple streams, threads, or MPI processes at once; the ability to dispatch new data-parallel kernels from within another kernel is dynamic parallelism. Both arrived with Kepler GK110.)
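On the CUDA side, the analogue of separate OpenCL command queues is separate streams. A minimal sketch (the kernel and sizes are arbitrary) where independent launches may overlap if the hardware has enough free SMs, registers, and shared memory:

#include <cuda_runtime.h>

// A deliberately small kernel so that several instances can be resident at once.
__global__ void busy_work(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 0.999f + 0.001f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 16;
    const int nstreams = 4;
    cudaStream_t streams[nstreams];
    float *buffers[nstreams];

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
    }

    // Independent launches in independent streams: the hardware is free to
    // run them concurrently if resources allow.
    for (int s = 0; s < nstreams; ++s)
        busy_work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n, 10000);

    cudaDeviceSynchronize();

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buffers[s]);
    }
    return 0;
}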

CUDA development on different cards?

I'm just starting to learn CUDA development (using version 4) and was wondering whether it is possible to develop on a different card than the one I plan to use. As I learn, it would be nice to know this so I can keep an eye out for differences that are going to impact me.
I have a mid-2010 MacBook Pro with an Nvidia GeForce 320M graphics card (it's a pretty basic integrated laptop card), but I plan to run my code on EC2's NVIDIA Tesla "Fermi" M2050 GPUs. I'm wondering if it's possible to develop locally on my laptop and then run it on EC2 with minimal changes (I'm doing this for a personal project and don't want to spend $2.4 for development).
A specific question is: I heard that recursion is supported on newer cards (and maybe not on my laptop's). What if I run a recursion on my laptop's GPU? Will it throw an error, or will it run but not utilize the hardware features? (I don't need the specific answer to this, but this is the kind of thing I'm getting at.)
If this is going to be a problem, are there emulators for features not available on my current card? Or will the SDK emulate it for me?
Sorry if this question is too basic.
Yes, it's a pretty common practice to use different GPUs for development and production. NVIDIA GPU generations are backward compatible, so if your program runs on the older card (the 320M, compute capability 1.x), it will almost certainly run on the M2050 (CC 2.0).
If you want maximum performance, however, you should profile your program on the same architecture you are going to use it on, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provides a much worse view of what's going on than running on a no-matter-how-old GPU.
Regarding recursion: an attempt to compile a program with obvious recursion for the 1.3 architecture produces a compile-time error:
nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)
In more complex cases the program might compile (I don't know how smart the compiler is in detecting recursions), but certainly won't work: in 1.x architecture there was no call stack, and all function calls were actually inlined, so recursion is technically impossible.
However, I would strongly recommend avoiding recursion at any cost: it goes against the GPGPU programming paradigm and would almost certainly lead to very poor performance. Most algorithms are easily rewritten without recursion, and the iterative version is usually preferable not only on the GPU but on the CPU as well.
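For example, the factorial from the compile error above can be rewritten as a plain loop in a device function, which builds fine for 1.x devices since no call stack is required. A minimal sketch (names are hypothetical):

#include <cuda_runtime.h>

// The recursive factorial rewritten as a simple loop.
__device__ unsigned long long factorial(int n)
{
    unsigned long long result = 1;
    for (int i = 2; i <= n; ++i)
        result *= i;
    return result;
}

__global__ void factorials(unsigned long long *out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        out[i] = factorial(i);
}

int main()
{
    unsigned long long *d_out;
    cudaMalloc(&d_out, 21 * sizeof(unsigned long long));  // 20! still fits in 64 bits
    factorials<<<1, 21>>>(d_out, 21);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}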
The CUDA version itself is not that important at first. More important is the compute capability of your card.
If you program your kernels for CC 1.0 and make them scalable, you won't have any problems in the future.
Choose for yourself the minimum CC level you need for your application.
Calculate the necessary launch parameters from the device properties and use PTX JIT compilation (a sketch of such a property query is shown after this answer):
If your kernel can handle arbitrarily sized input data and the launch configuration scales across thousands of threads, it will scale across future versions.
In my projects all kernels used a fixed number of threads per block, equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor, to reach 100% occupancy.
Some kernels need a power-of-two number of threads per block, so I handled that case as well, since the equation above does not guarantee a power-of-two block size for every CC version.
Some kernels used shared memory, and its size was also deduced from the CC-level properties.
This data was obtained using cudaGetDeviceProperties in a utility class, and with PTX JIT compilation my kernels worked without any changes on all devices. I programmed on a CC 1.1 device and ran tests on the latest CUDA cards without any changes!
All kernels were programmed to work with 64-bit-length input data and to utilize all dimensions of the 3D grid. (I am pretty sure I will still be working on this project in a year, so this was necessary.)
All my kernels except one stayed within the CC 1.0 register limit while keeping 100% occupancy, so if the card's CC was below 1.2 I added a maxrregcount option for that kernel to still enforce 100% occupancy.
This does not guarantee the best possible performance!
For the best possible performance, each kernel should be analyzed with respect to its parameters and resources.
This may not be practical for all applications and requirements.
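A minimal sketch of the property query described above (the "resident blocks per SM" value is an assumption here, set to 8, since older runtimes don't expose it directly; newer ones have prop.maxBlocksPerMultiProcessor):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("device: %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("SMs: %d, max threads/SM: %d, max threads/block: %d\n",
           prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes, registers per block: %d\n",
           prop.sharedMemPerBlock, prop.regsPerBlock);

    // Launch-parameter calculation in the spirit of the answer above:
    // threads per block = resident threads per SM / resident blocks per SM.
    const int assumedResidentBlocksPerSM = 8;   // assumption, see lead-in
    int threadsPerBlock =
        prop.maxThreadsPerMultiProcessor / assumedResidentBlocksPerSM;
    if (threadsPerBlock > prop.maxThreadsPerBlock)
        threadsPerBlock = prop.maxThreadsPerBlock;

    printf("chosen threads per block: %d\n", threadsPerBlock);
    return 0;
}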
The NVIDIA Kepler K20 GPU, available in Q4 2012 with CUDA 5, will support recursive algorithms.

What is difference between monolithic and micro kernel?

Could anyone please explain, with examples, the difference between a monolithic kernel and a microkernel? Also, what other classifications of kernels are there?
A monolithic kernel is a single large process running entirely in a single address space. It is a single static binary file. All kernel services exist and execute in the kernel address space, and the kernel can invoke functions directly. Examples of monolithic-kernel-based OSs: Unix, Linux.
In microkernels, the kernel is broken down into separate processes, known as servers. Some of the servers run in kernel space and some run in user space. All servers are kept separate and run in different address spaces. Servers invoke "services" from each other by sending messages via IPC (inter-process communication). This separation has the advantage that if one server fails, the other servers can still work. Examples of microkernel-based OSs: Mac OS X and Windows NT.
Monolithic kernel design is much older than the microkernel idea, which appeared at the end of the 1980's.
Unix and Linux kernels are monolithic, while QNX, L4 and Hurd are microkernels. Mach started out as a microkernel (Mac OS X's kernel, which builds on Mach, is not a pure microkernel), but it later evolved into a hybrid kernel. MINIX (before version 3) wasn't a pure microkernel either, because device drivers were compiled as part of the kernel.
Monolithic kernels are usually faster than microkernels. The first microkernel Mach was 50% slower than most monolithic kernels, while later ones like L4 were only 2% or 4% slower than the monolithic designs.
Monolithic kernels are big in size, while microkernels are small in size - they usually fit into the processor's L1 cache (first generation microkernels).
In monolithic kernels the device drivers reside in kernel space, while in microkernels the device drivers reside in user space.
Since monolithic kernels' device drivers reside in kernel space, monolithic kernels are considered less secure than microkernels: a failure (exception) in a driver may lead to a crash (displayed as a BSOD in Windows). Microkernels are more secure than monolithic kernels, hence more often used in military devices.
Monolithic kernels use signals and sockets to implement inter-process communication (IPC), microkernels use message queues. 1st gen microkernels didn't implement IPC well and were slow on context switches - that's what caused their poor performance.
Adding a new feature to a monolithic system means recompiling the whole kernel or the corresponding kernel module (for modular monolithic kernels), whereas with microkernels you can add new features or patches without recompiling.
Monolithic kernel
All the parts of a kernel, such as the scheduler, file system, memory management, networking stack, and device drivers, are maintained in one unit within the kernel in a monolithic kernel.
Advantages
•Faster processing
Disadvantages
•Crash Insecure
•Porting Inflexibility
•Kernel Size explosion
Examples
•MS-DOS, Unix, Linux
Micro kernel
Only the very important parts, like IPC (inter-process communication), the basic scheduler, basic memory handling, and basic I/O primitives, are put into the kernel. Communication happens via message passing. The rest are maintained as server processes in user space.
Advantages
•Crash Resistant, Portable, Smaller Size
Disadvantages
•Slower Processing due to additional Message Passing
Examples
•Windows NT
1. Monolithic kernel (pure monolithic): all
All kernel services come from a single component.
(-) Adding/removing services is not possible; little/zero flexibility.
(+) Inter-component communication is better.
e.g.: traditional Unix
2. Microkernel: few
A few services (memory management, CPU management, IPC, etc.) come from the core kernel; other services (file management, I/O management, etc.) come from different layers/components.
Split approach: some services run in privileged (kernel) mode and some in normal (user) mode.
(+) Flexible for changes/upgrades.
(-) Communication overhead.
e.g.: QNX etc.
3. Modular kernel (modular monolithic): most
A combination of the micro and monolithic kernel approaches: a collection of modules, which can be static or dynamic; drivers come in the form of modules.
e.g.: Linux, modern OSs
In the spectrum of kernel designs the two extreme points are monolithic kernels and microkernels.
The (classical) Linux kernel, for instance, is a monolithic kernel (and so is every commercial OS to date as well, though they might claim otherwise): its code is compiled into a single binary image that runs as a single entity implementing all of the above services. To exemplify the encapsulation of the Linux kernel, we remark that it does not even have access to any of the standard C libraries. Indeed, the Linux kernel cannot use rudimentary C library functions such as printf; instead it implements its own printing function (called printk). This seclusion and self-containment provide the Linux kernel with its main advantage: the kernel resides in a single address space, enabling all features to communicate in the fastest way possible without resorting to any type of message passing.
In particular, a monolithic kernel implements all of the device drivers of the system. This, however, is the main drawback of a monolithic kernel: introducing any new, unsupported hardware requires rewriting the kernel (in the relevant parts), recompiling it, and re-installing the entire OS. More importantly, if any device driver crashes, the entire kernel suffers as a result.
This un-modular approach to hardware additions and hardware crashes is the main argument for the other extreme design approach for kernels. A microkernel is, in a sense, a minimalistic kernel that houses only the very basics of OS services (like process management and file system management). In a microkernel the device drivers lie outside of the kernel, so device drivers can be added and removed while the OS is running, requiring no alterations to the kernel.
A monolithic kernel has all kernel services bundled together with the kernel core, so it is heavy, which has a negative impact on speed and performance. A microkernel, on the other hand, is lightweight, which increases performance and speed.
I answered the same question on a WordPress site.
For the difference between monolithic, microkernel and exokernel in tabular form, you can visit here