Could anyone please explain, with examples, the difference between monolithic and microkernels? Also, what other classifications of kernels are there?
A monolithic kernel is a single large process running entirely in a single address space. It is a single static binary file. All kernel services exist and execute in the kernel address space, and the kernel can invoke functions directly. Examples of monolithic-kernel-based OSs: Unix, Linux.
In microkernels, the kernel is broken down into separate processes, known as servers. Some of the servers run in kernel space and some run in user space. All servers are kept separate and run in different address spaces. Servers invoke "services" from each other by sending messages via IPC (interprocess communication). This separation has the advantage that if one server fails, the other servers can keep working. Often-cited examples of microkernel-based OSs are Mac OS X and Windows NT, though both are more accurately described as hybrid kernels built around microkernel ideas.
The monolithic kernel design is much older than the microkernel idea, which appeared at the end of the 1980s.
Unix and Linux kernels are monolithic, while QNX, L4, and Hurd are microkernels. Mach (the kernel that Mac OS X's kernel is based on) started out as a microkernel but was later converted into a hybrid kernel. Minix (before version 3) wasn't a pure microkernel because device drivers were compiled as part of the kernel.
Monolithic kernels are usually faster than microkernels. Mach, the first microkernel, was up to 50% slower than most monolithic kernels, while later ones such as L4 came within roughly 2-4% of monolithic designs.
Monolithic kernels are big, while microkernels are small: first-generation microkernels usually fit into the processor's L1 cache.
In monolithic kernels, the device drivers reside in kernel space; in microkernels, the device drivers live in user space.
Since their device drivers reside in kernel space, monolithic kernels are less secure than microkernels: a failure (exception) in a driver can lead to a crash (displayed as a BSOD in Windows). Microkernels, being more secure, are therefore more often used in military devices.
Monolithic kernels use signals and sockets to implement inter-process communication (IPC); microkernels use message queues (see the message-queue sketch at the end of this answer). First-generation microkernels implemented IPC poorly and were slow at context switches, which is what caused their poor performance.
Adding a new feature to a monolithic system means recompiling the whole kernel (or, for modular monolithic kernels, the corresponding kernel module), whereas with microkernels you can add new features or patches without recompiling the kernel.
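To make the message-passing idea concrete, here is a minimal user-space sketch in C using POSIX message queues. This is only an analogy: real microkernels such as L4 and QNX use their own kernel-level IPC primitives, and the queue name /fs_server is made up for the example.

/* Sketch of message-queue IPC (user-space analogy only). A microkernel
 * "server" would receive requests like this and answer them; here one
 * process plays both ends to keep the sketch self-contained. */
#include <fcntl.h>      /* O_* flags */
#include <mqueue.h>     /* POSIX message queues; link with -lrt on Linux */
#include <stdio.h>

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 64 };
    mqd_t q = mq_open("/fs_server", O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    const char request[] = "open /etc/passwd";
    mq_send(q, request, sizeof request, 0);   /* client -> server */

    char buf[64];
    mq_receive(q, buf, sizeof buf, NULL);     /* the server side would read here */
    printf("server received: %s\n", buf);

    mq_close(q);
    mq_unlink("/fs_server");
    return 0;
}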
Monolithic kernel
All the parts of a kernel, like the scheduler, file system, memory management, networking stack, and device drivers, are maintained in one unit within the kernel in a monolithic kernel.
Advantages
•Faster processing
Disadvantages
•Crash-prone (a single faulty component can bring down the whole kernel)
•Porting inflexibility
•Kernel size explosion
Examples
•MS-DOS, Unix, Linux
Micro kernel
Only the most essential parts, like IPC (interprocess communication), the basic scheduler, basic memory handling, and basic I/O primitives, are put into the kernel. Communication happens via message passing. The rest are maintained as server processes in user space.
Advantages
•Crash Resistant, Portable, Smaller Size
Disadvantages
•Slower Processing due to additional Message Passing
Examples
•Windows NT
1. Monolithic Kernel (Pure Monolithic): all
All kernel services come from a single component.
(-) Addition/removal of features is not possible; little to zero flexibility
(+) Inter-component communication is better (direct function calls)
e.g.: Traditional Unix
2. Micro Kernel: few
A few services (memory management, CPU management, IPC, etc.) come from the core kernel; other services (file management, I/O management, etc.) come from separate layers/components.
Split approach: some services run in privileged (kernel) mode, some in normal (user) mode.
(+) Flexible for changes/upgrades
(-) Communication overhead
e.g.: QNX, etc.
3. Modular Kernel (Modular Monolithic): most
A combination of the micro and monolithic kernel ideas.
A collection of modules; modules can be static or dynamic (loaded at runtime).
Drivers come in the form of modules, as the sketch below illustrates.
e.g.: Linux, most modern OSs
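To illustrate the modular approach, here is a minimal sketch of a Linux loadable kernel module (the name and messages are made up). It can be loaded with insmod and removed with rmmod at runtime, with no kernel recompilation:

/* hello.c - minimal loadable kernel module sketch */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

static int __init hello_init(void)
{
    printk(KERN_INFO "hello: loaded into the running kernel\n");
    return 0;
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: removed without a kernel rebuild\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");

Note that the module prints through printk, the kernel's own printing function, since the standard C library is not available in kernel space.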
In the spectrum of kernel designs, the two extreme points are monolithic kernels and microkernels.
The (classical) Linux kernel, for instance, is a monolithic kernel (and so is every commercial OS to date, though some might claim otherwise), in that its code is compiled and linked into a single binary image, giving rise to a single program in a single address space that implements all of the above services.
To exemplify the encapsulation of the Linux kernel, we remark that it does not even have access to the standard C library. Indeed, the Linux kernel cannot use rudimentary C library functions such as printf; instead it implements its own printing function, called printk. This seclusion and self-containment provide the Linux kernel with its main advantage: the kernel resides in a single address space, enabling all features to communicate in the fastest way possible without resorting to any type of message passing.
In particular, a monolithic kernel implements all of the device drivers of the system. This, however, is the main drawback of a monolithic kernel: introducing support for new, previously unsupported hardware requires rewriting the kernel (in the relevant parts), recompiling it, and rebooting into the new kernel. More importantly, if any device driver crashes, the entire kernel suffers as a result.
This unmodular approach to hardware additions and hardware crashes is the main argument for the other extreme design approach for kernels. A microkernel is, in a sense, a minimalistic kernel that houses only the most basic OS services, such as scheduling, basic memory management, and IPC. In a microkernel, the device drivers lie outside of the kernel, allowing device drivers to be added and removed while the OS is running, with no alteration of the kernel required.
A monolithic kernel carries all kernel services along with the kernel core and is therefore heavy. A microkernel, on the other hand, is lightweight and smaller, although the message passing between its servers typically costs some speed (see the performance notes above).
I answered the same question on a WordPress site.
For the difference between monolithic, microkernel, and exokernel designs in tabular form, you can visit here.
Related
I always thought that the Hyper-Q technology is nothing but streams in the GPU. Later I found out I was wrong (am I?). So I did some reading about Hyper-Q and got more confused.
I was going through an article, and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
Point B above says that multiple connections can be created to a single GPU from the host. Does that mean I can create multiple contexts on a single GPU through different applications? Does it mean that I will have to execute all applications on different streams? What if all my connections are memory- and compute-resource consuming: who manages the resource (memory/cores) scheduling?
Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues only 8 are used by default to save resources. Set the CUDA_DEVICE_MAX_CONNECTIONS environment variable to a higher value if you need more.
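As a hedged sketch (the kernel body, stream count, and cycle count are illustrative), the following CUDA code launches one small kernel per stream. On a Hyper-Q device the streams map to separate hardware work queues and can overlap; on Fermi they all funnel through the single queue:

#include <cuda_runtime.h>

/* Spin for roughly 'cycles' clocks so overlap is visible in a profiler */
__global__ void busy(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main(void)
{
    const int NSTREAMS = 8;                /* the default connection count */
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NSTREAMS; ++i)     /* independent work per stream */
        busy<<<1, 1, 0, streams[i]>>>(1000000LL);

    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}

Running this under a profiler on a GK110 should show the kernels overlapping; raising CUDA_DEVICE_MAX_CONNECTIONS lets more than 8 of them do so.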
Can we assign a number of processes (e.g. 100-500 processes) to a GPU, with each process running on a GPU core?
In my video-processing application, I have to use the ffmpeg library to process video and audio. If there are more than 100 or even 500 such independent processes, I guess it would be faster to assign each process to a GPU core. However, I don't know if we can do that, and if we can, what libraries and tools are necessary. CUDA?
Can we assign a number of processes (e.g. 100-500 processes) to a GPU, with each process running on a GPU core?
No, you can't. In general it's not possible to schedule anything on a GPU core per se. This level of "scheduling" is handled mainly by the mechanics of the CUDA architecture and runtime system.
The basic idea is to expose parallelism at a fairly low level in your code (e.g. at the loop level) and with proper use of a GPU acceleration syntax (such as CUDA, OpenACC, OpenCL, etc.) the GPU can often make such elements of your program run faster.
But the GPU is not designed to be a drop-in replacement for CPU cores. There is the scheduling factor that I mentioned already, as well as the fact that codes generally need to be compiled for the GPU specifically.
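For instance, here is a hedged sketch of exposing parallelism at the loop level in CUDA (sizes and names are illustrative; cudaMallocManaged assumes a CUDA 6+ toolkit): each GPU thread takes one iteration of what would otherwise be a CPU for-loop.

#include <cuda_runtime.h>

/* Replaces: for (i = 0; i < n; ++i) c[i] = a[i] + b[i]; one thread per iteration */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    float *a, *b, *c;                        /* unified memory keeps the sketch short */
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);   /* enough threads to cover n */
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}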
Suppose I have 4 GPUs and would like to run 50 CUDA programs in parallel. My question is: is the NVIDIA driver smart enough to run the 50 CUDA programs on the different GPUs or do I have to set the CUDA device for each program?
thank you
The first point to make is that you cannot run 50 applications in parallel on 4 GPUs on just about any CUDA platform. If you have a Hyper-Q capable GPU, there is the possibility of up to 32 threads or MPI processes queuing work to the GPU. Otherwise there is a single command queue.
For anything other than the latest Kepler Tesla cards, the CUDA driver only supports a single active context at a time. If you run more than one application on a GPU, the processes' contexts simply contend with one another on a "first come, first served" basis. If one application blocks the other with a long-running kernel or similar, there is no pre-emption or anything else that makes one process yield to another. When the GPU is shared with a display manager, a watchdog timer imposes an upper limit of a few seconds before an application gets its context killed. The result is that only one context ever runs on the hardware at a time. Context switching isn't free, and there is a performance penalty when multiple processes contend for a single device.
Furthermore, every context present on a GPU requires device memory. On the platform you are asking about, Linux, there is no memory paging, so every context's resources must coexist in GPU memory. I don't believe it would be possible to have 12 non-trivial contexts running on any current GPU simultaneously; you would run out of available memory well before that number. Trying to run more applications would result in a context establishment failure.
As for the behaviour of the driver distributing multiple applications over multiple GPUs, AFAIK the Linux driver doesn't do any intelligent distribution of processes amongst GPUs, except when one or more of the GPUs are in a non-default compute mode. If no device is specifically requested, the driver will always try to find the first valid, free GPU it can run a process or thread on. If a GPU is busy and marked compute-exclusive (either thread- or process-exclusive) or marked prohibited, the driver will skip over it when trying to find a GPU to run on. If all GPUs are exclusive and occupied, or prohibited, the application will fail with a "no valid device available" error.
So in summary, for everything other than Hyper-Q devices, there is no performance gain in doing what you are asking about (quite the opposite), and I would expect it to break if you tried. A much saner approach would be to use compute exclusivity in combination with a resource-managing task scheduler like Torque or one of the (former) Sun Grid Engine versions, which could schedule your processes to run in an orderly fashion according to the availability of GPUs. This is how most general-purpose HPC clusters deal with scheduling in multi-GPU environments.
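If you do want to pin each process to a specific GPU yourself (for example, from a rank assigned by the scheduler), a hedged sketch looks like this. The RANK environment variable is an assumption about what your launcher provides; MPI or the batch system would supply something equivalent:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) { fprintf(stderr, "no CUDA devices\n"); return 1; }

    const char *rank_str = getenv("RANK");   /* hypothetical launcher-provided rank */
    int rank = rank_str ? atoi(rank_str) : 0;

    cudaSetDevice(rank % ndev);              /* round-robin processes over the GPUs */

    int dev = -1;
    cudaGetDevice(&dev);
    printf("rank %d using GPU %d\n", rank, dev);
    return 0;
}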
With CPU and memory it's simple.
A process has a large virtual address space, which is partially mapped into physical memory. When the current process attempts to access a page that is not in physical memory, the OS steps in, chooses a victim page to evict (e.g. round-robin/FIFO), writes it out to disk, then reads the required page back in from swap, and control is returned to the process. This is straightforward, because the process cannot continue without that page.
GPU kernels are a different story.
Let's consider a use case:
A high-priority [CPU] process, namely X, launches a GPU kernel (a blocking call). At this moment, it is reasonable for the OS to switch contexts and give the CPU to a different process, namely Z. For the sake of the example, let process Z also do something heavy with the GPU.
Now, what does the GPU driver do? Does it stop the kernel that belongs to the higher-priority X? Does it inform the OS that Z isn't prioritized enough to offload the kernels of X? In general, what happens when two processes need GPU resources, but the available GPU memory is sufficient to serve only one of them at a time?
CUDA GPUs context-switch cooperatively at a coarse granularity (think "memcpy" or "kernel launch"). If there is enough memory for both contexts, the hardware is happy to cooperatively context switch between them at a slight performance cost. (But because it's cooperative, long-running kernels will interfere with other kernels' execution.)
Modern GPUs do support virtual memory (i.e. memory protection through address translation), but they do NOT support demand paging. That means every piece of memory accessible to the GPU (device memory and mapped pinned memory) must be physically present and mapped after allocation.
The Windows Display Driver Model (WDDM) introduced in Windows Vista does paging at a very coarse granularity. The driver is required to track which "memory objects" are needed to execute a given command buffer, and the OS ensures that they are present. The OS can swap them out when not needed. The wrinkle with CUDA is that since pointers can be stored, all memory objects associated with the CUDA address space must be resident in order to run a CUDA kernel. So the paging doesn't work as well for CUDA as it does for graphics applications, which WDDM was designed to run.
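Because everything must be physically resident, a process can at least query how much device memory is actually free before committing to a large allocation. A minimal sketch (the 512 MiB figure is arbitrary):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);       /* what this context can still map */
    printf("free: %zu MiB of %zu MiB\n", free_b >> 20, total_b >> 20);

    size_t want = (size_t)512 << 20;         /* hypothetical 512 MiB buffer */
    if (want > free_b) {
        fprintf(stderr, "allocation would not fit; there is no demand paging to save us\n");
        return 1;
    }
    void *d = NULL;
    if (cudaMalloc(&d, want) != cudaSuccess) return 1;
    cudaFree(d);
    return 0;
}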
I am using a GPU cluster without GPUDirect support. From this briefing, the following is done when transferring GPU data across nodes:
GPU writes to pinned sysmem1
CPU copies from sysmem1 to sysmem2
Infiniband driver copies from sysmem2
Now I am not sure whether the second step is implicit when I transfer sysmem1 across Infiniband using MPI. Assuming it is, my current programming model is something like this:
cudaMemcpy(hostmem, devicemem, size, cudaMemcpyDeviceToHost);
MPI_Send(hostmem,...)
Is my above assumption true and will my programming model work without causing communication issues?
Yes, you can use CUDA and MPI independently (i.e. without GPUDirect), just as you describe.
Move the data from device to host
Transfer the data as you ordinarily would, using MPI
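Put together, a hedged sketch of that staging pattern (buffer names and sizes are illustrative):

#include <cuda_runtime.h>
#include <mpi.h>

/* Sender: stage device data through a host buffer, then ship it with plain MPI */
void send_device_buffer(const float *d_buf, float *h_buf, int n, int dst)
{
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_FLOAT, dst, 0, MPI_COMM_WORLD);
}

/* Receiver: the mirror image of the same two steps */
void recv_device_buffer(float *d_buf, float *h_buf, int n, int src)
{
    MPI_Recv(h_buf, n, MPI_FLOAT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
}

Allocating h_buf with cudaMallocHost (pinned memory) would speed up the cudaMemcpy step, matching the pinned sysmem1 staging the briefing describes.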
You might be interested in this presentation, which explains CUDA-aware MPI and gives a side-by-side example on slide 11 of non-CUDA MPI and CUDA MPI.