What is the difference between Nvidia Hyper Q and Nvidia Streams? - cuda

I always thought that Hyper-Q technology is nothing but the streams in GPU. Later I found I was wrong(Am I?). So I was doing some reading about Hyper-Q and got confused more.
I was going through one article and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
In aforementioned points, Point B says that there can be multiple connected created to a single GPU from host. Does it mean I can create multiple context on a simple GPU through different applications? Does it mean that I will have to execute all applications on different streams?What if all my connections are memory and compute resource consuming, who manages the resource (memory/cores) scheduling?

Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues only 8 are used by default to save resources. Set the CUDA_​DEVICE_​MAX_​CONNECTIONS environment variable to a higher value if you need more.

Related

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have Nvidia K40, and for some reason, I want my code only uses portion of the Cuda cores(i.e instead of using all 2880 only use 400 cores for examples), is it possible?is it logical to do this either?
In addition, is there any way to see how many cores are being using by GPU when I run my code? In other words, can we check during execution, how many cores are being used by the code, report likes "task manger" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for cuda. Not to say it couldn't be useful for something. For example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this, would be to access the ptx identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to only have 1 block per Streaming Multiprocessor (SM) and then return each kernel based on which kernel you want on which SM's.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTS register for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
Not sure, whether it works with the K40, but for newer Ampere GPUs there is the MIG Multi-Instance-GPU feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know such methods, but would like to get to know.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, many kernels, some of which can be executed in parallel, you want to load GPU fully, most effectively. But it seems on its own GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with 30-blocks grid and 30 SMs, this kernel can occupy entire GPU. I believe I saw such effect. Really this kernel will be faster (maybe 1.5x against 4 256-threads blocks per SM), but this will not be effective when you have another work.
GPU can't know whether we are going to run another kernel after this one with 30 blocks or not - whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist
As to question 3, I suppose GPU profiling tools should show this, Visual Profiler and newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be Task manager, but a statistics for kernels that were executed by your program instead.
As to possibility to move thread blocks between SMs when necessary,
#ChristianSarofeen, I can't find mentions that this is possible. Quite the countrary,
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although starting from some architecture there is such thing as preemption. As I remember NVidia advertised it in the following way. Let's say you made a game that run some heavy kernels (say for graphics rendering). And then something unusual happened. You need to execute some not so heavy kernel as fast as possible. With preemption you can unload somehow running kernels and execute this high priority one. This increases execution time (of this high pr. kernel) a lot.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernels invocation take a lot of time (of course in case of a stream of kernels and if you don't await for results in between). If you call several kernels, it seems possible to send all necessary data for all kernels while the first kernel is executing on GPU. So I believe NVidia means that it runs several kernels in parallel and perform some smart load-balancing between SMs.

Multiple GPUs and Multiple Executables

Suppose I have 4 GPUs and would like to run 50 CUDA programs in parallel. My question is: is the NVIDIA driver smart enough to run the 50 CUDA programs on the different GPUs or do I have to set the CUDA device for each program?
thank you
The first point to make is that you cannot run 50 applications in parallel on 4 GPUs on just about any CUDA platform. If you have a Hyper-Q capable GPU, there is the possibility of up to 32 threads or MPI processes queuing work to the GPU. Otherwise there is a single command queue.
For anything other than the latest Kepler Tesla cards, CUDA driver only supports a single active context at a time. If you run more that one application on a GPU, the processes will both have contexts which just contend with one another in a "first come, first serve" basis. If one application blocks the other with a long running kernel or similar, there is no pre-emption or anything else which makes the process yield to another process. When the GPU is shared with a display manager, there is a watchdog timer that will impose an upper limit of a few seconds before the application will get its context killed. The result is that only one context ever runs on the hardware at a time. Context switching isn't free, and there is a performance penalty to having multiple processes contending for a single device.
Furthermore, every context present on a GPU requires device memory. On the platform you are asking about, linux, there is no memory paging, so every context's resources must coexist in GPU memory. I don't believe it would be possible to have 12 non-trivial contexts running on any current GPU simultaneously - you would run out of available memory well before that number. Trying to run more applications would result in an context establishment failure.
As for the behaviour of the driver distributing multiple applications on multiple GPUs, AFAIK the linux driver doesn't do any intelligent distribution of processes amongst GPUs, except when one or more of the GPUs are in a non-default compute mode. If no device is specifically requested, the driver will always try and find the first valid, free GPU it can run a process or thread on. If a GPU is busy and marked compute exclusive (either thread or process) or marked prohibited, then the driver will skip over it when trying to find a GPU to run on. If all GPUs are exclusive and occupied or prohibited, then the application will fail with a no valid device available error.
So in summary,for everything other than Hyper-Q devices, there is no performance gain in doing what you are asking about (quite the opposite) and I would expected it to break if you tried. A much saner approach would be to use compute exclusivity in combination with a resource managing task scheduler like Torque or one of the (former) Sun Grid Engine versions, which could schedule your processes to run in an orderly fashion according to the availability of GPUs. This is how most general purpose HPC clusters deal with scheduling in multi-gpu environments.

CUDA inter-kernel communication between different streams

Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want to have 1 kernel A send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A is running in 1 CUDA thread and I want a multiple GPU thread implementation for kernel B.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated thru device memory, but how to sync them?
The CUDA Programming Model only supports communication between threads in the same thread block (CUDA C Programming Guide at the end of section 2.2 Thread Hierarchy). This cannot be reliably implemented through the current CUDA API. If you try you may find partial success. However, this will fail on different OSes, different executions of your application, and this will be broken by future driver updates and new hardware (GK110 supports enhanced concurrency model).
If I correctly caught your question, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As I know, there is no reliable facilities for inter-kernel synchronization provided by CUDA. And I'm unaware about any suitable trick that can be applied here.
CUDA C Programming Gide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. From the top of my head, calling cudaDeviceSynchronize for every stream in turn should do the trick but it may not be that easy.
Your data must be in global memory
You need to get the data address on the host
You must send this data back to the second kernel
your code must be something similar to this:
*dataToExchange_h,*dataToExchange_d;
cudaMalloc((void**)dataToExchange, sizeof(data));
kernel1<<< M1,N1,0,stream1>>>(dataToExchange);
cudaStreamSynchronize(stream1);
kernel2<<< M2,N2,0,stream2>>>(dataToExchange);
But note that stream synchronization slow down the process, so you should avoid it as much as possible.
You can also get stream synchronization through cuda events, it less obvious and does not give special advantage, but it's useful to know it ;-)

Multi-GPU programming strategies using CUDA

I need some advice on a project that I am going to undertake. I am planning to run simple kernels (yet to decide, but I am hinging on embarassingly parallel ones) on a Multi-GPU node using CUDA 4.0 by following the strategies listed below. The intention is to profile the node, by launching kernels in different strategies that CUDA provide on a multi-GPU environment.
Single host thread - multiple devices (shared context)
Single host thread - concurrent execution of kernels on a single device (shared context)
Multiple host threads - (Equal) Multiple devices (independent contexts)
Single host thread - Sequential kernel execution on one device
Multiple host threads - concurrent execution of kernels on one device (independent contexts)
Multiple host threads - sequential execution of kernels on one device (independent contexts)
Am I missing out any categories? What is your opinion about the test categories that I have chosen and any general advice w.r.t multi-GPU programming is welcome.
Thanks,
Sayan
EDIT:
I thought that the previous categorization involved some redundancy, so modified it.
Most workloads are light enough on CPU work that you can juggle multiple GPUs from a single thread, but that only became easily possible starting with CUDA 4.0. Before CUDA 4.0, you would call cuCtxPopCurrent()/cuCtxPushCurrent() to change the context that is current to a given thread. But starting with CUDA 4.0, you can just call cudaSetDevice() to set the current context to correspond to a given device.
Your option 1) is a misnomer, though, because there is no "shared context" - the GPU contexts are still separate and device memory and objects such as CUDA streams and CUDA events are affiliated with the GPU context in which they were created.
Multiple host threads - equal multiple devices, independent contexts is a winner if you can get away with it. This is assuming that you can get truly independent units of work. This should be true since your problem is embarassingly parallel.
Caveat emptor: I have not personally built a large scale multi-GPU system. I have built a successful single GPU system w/ 3 orders of magnitude acceleration relative to CPUs. Thus, the advice is generalization of the synchronization costs I've seen as well as discussion with my colleagues who have built multi-GPU systems.

What is difference between monolithic and micro kernel?

Could anyone please explain with examples difference between monolithic and micro kernel? Also other classifications of the kernel?
Monolithic kernel is a single large process running entirely in a single address space. It is a single static binary file. All kernel services exist and execute in the kernel address space. The kernel can invoke functions directly. Examples of monolithic kernel based OSs: Unix, Linux.
In microkernels, the kernel is broken down into separate processes, known as servers. Some of the servers run in kernel space and some run in user-space. All servers are kept separate and run in different address spaces. Servers invoke "services" from each other by sending messages via IPC (Interprocess Communication). This separation has the advantage that if one server fails, other servers can still work efficiently. Examples of microkernel based OSs: Mac OS X and Windows NT.
Monolithic kernel design is much older than the microkernel idea, which appeared at the end of the 1980's.
Unix and Linux kernels are monolithic, while QNX, L4 and Hurd are microkernels. Mach was initially a microkernel (not Mac OS X), but later converted into a hybrid kernel. Minix (before version 3) wasn't a pure microkernel because device drivers were compiled as part of the kernel.
Monolithic kernels are usually faster than microkernels. The first microkernel Mach was 50% slower than most monolithic kernels, while later ones like L4 were only 2% or 4% slower than the monolithic designs.
Monolithic kernels are big in size, while microkernels are small in size - they usually fit into the processor's L1 cache (first generation microkernels).
In monolithic kernels, the device drivers reside in the kernel space while in the microkernels the device drivers are user-space.
Since monolithic kernels' device drivers reside in the kernel space, monolithic kernels are less secure than microkernels, and failures (exceptions) in the drivers may lead to crashes (displayed as BSODs in Windows). Microkernels are more secure than monolithic kernels, hence more often used in military devices.
Monolithic kernels use signals and sockets to implement inter-process communication (IPC), microkernels use message queues. 1st gen microkernels didn't implement IPC well and were slow on context switches - that's what caused their poor performance.
Adding a new feature to a monolithic system means recompiling the whole kernel or the corresponding kernel module (for modular monolithic kernels), whereas with microkernels you can add new features or patches without recompiling.
Monolithic kernel
All the parts of a kernel like the Scheduler, File System, Memory Management, Networking Stacks, Device Drivers, etc., are maintained in one unit within the kernel in Monolithic Kernel
Advantages
•Faster processing
Disadvantages
•Crash Insecure
•Porting Inflexibility
•Kernel Size explosion
Examples
•MS-DOS, Unix, Linux
Micro kernel
Only the very important parts like IPC(Inter process Communication), basic scheduler, basic memory handling, basic I/O primitives etc., are put into the kernel. Communication happen via message passing. Others are maintained as server processes in User Space
Advantages
•Crash Resistant, Portable, Smaller Size
Disadvantages
•Slower Processing due to additional Message Passing
Examples
•Windows NT
1.Monolithic Kernel (Pure Monolithic) :all
All Kernel Services From single component
(-) addition/removal is not possible, less/Zero flexible
(+) inter Component Communication is better
e.g. :- Traditional Unix
2.Micro Kernel :few
few services(Memory management ,CPU management,IPC etc) from core kernel, other services(File management,I/O management. etc.) from different layers/component
Split Approach [Some services is in privileged(kernel) mode and some are in Normal(user) mode]
(+)flexible for changes/up-gradations
(-)communication overhead
e.g.:- QNX etc.
3.Modular kernel(Modular Monolithic) :most
Combination of Micro and Monolithic kernel
Collection of Modules -- modules can be --> Static + Dynamic
Drivers come in the form of Modules
e.g. :- Linux Modern OS
In the spectrum of kernel designs the two extreme
points are monolithic kernels and microkernels.
The (classical) Linux
kernel for instance is a monolithic kernel (and so is every commercial OS
to date as well - though they might claim otherwise);
In that its code is a
single C file giving rise to a single process that implements all of the above
services.
To exemplify the encapsulation of the Linux kernel we remark that
the Linux kernel does not even have access to any of the standard C libraries.
Indeed the Linux kernel cannot use rudimentary C library functions such as
printf. Instead it implements its own printing function (called prints).
This seclusion of the Linux kernel and self-containment provide Linux kernel
with its main advantage: the kernel resides in a single address space1
enabling
all features to communicate in the fastest way possible without resorting to
any type of message passing.
In particular, a monolithic kernel implements all of the device drivers
of the system.This however is the main drawback of a monolithic kernel:
introduction of any new unsupported hardware requires a rewrite of the
kernel (in the relevant parts), recompilation of it, and re-installing the entire
OS.More importantly, if any device driver crashes the entire kernel suffers
as a result.
This un-modular approach to hardware additions and hardware crashes
is the main argument for supporting the other extreme design approach
for kernels. A microkernel is in a sense a minimalistic kernel that houses
only the very basic of OS services (like process management and file system
management). In a microkernel the device drivers lie outside of the kernel
allowing for addition and removal of device drivers while the OS is running
and require no alternations of the kernel.
Monolithic kernel has all kernel services along with kernel core part, thus are heavy and has negative impact on speed and performance. On the other hand micro kernel is lightweight causing increase in performance and speed.
I answered same question at wordpress site.
For the difference between monolithic, microkernel and exokernel in tabular form, you can visit here