I am using a GPU cluster without GPUDirect support. From this briefing, the following is done when transferring GPU data across nodes:
GPU writes to pinned sysmem1
CPU copies from sysmem1 to sysmem2
Infiniband driver copies from sysmem2
Now I am not sure whether the second step happens implicitly when I transfer sysmem1 across Infiniband using MPI. Assuming it does, my current programming model is something like this:
cudaMemcpy(hostmem, devicemem, size, cudaMemcpyDeviceToHost);
MPI_Send(hostmem, ...);
Is my above assumption true and will my programming model work without causing communication issues?
Yes, you can use CUDA and MPI independently (i.e. without GPUDirect), just as you describe.
Move the data from device to host
Transfer the data as you ordinarily would, using MPI
You might be interested in this presentation, which explains CUDA-aware MPI and gives a side-by-side example of non-CUDA MPI and CUDA-aware MPI on slide 11; a minimal sketch of the plain (non-CUDA-aware) pattern follows.
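For concreteness, here is what those two steps might look like in code. This is a minimal sketch, assuming a hypothetical device buffer devbuf of count floats; none of these names come from the question itself.

#include <mpi.h>
#include <cuda_runtime.h>

// Stage a device buffer through host memory, then send it with plain MPI.
void send_gpu_buffer(const float *devbuf, int count, int dest, MPI_Comm comm)
{
    float *hostbuf;
    // Pinned (page-locked) host memory speeds up the device-to-host copy;
    // it is the same kind of sysmem the driver stages through internally.
    cudaMallocHost((void **)&hostbuf, count * sizeof(float));

    // Step 1: move the data from device to host.
    cudaMemcpy(hostbuf, devbuf, count * sizeof(float), cudaMemcpyDeviceToHost);

    // Step 2: hand the host buffer to MPI as usual. Any further
    // sysmem-to-sysmem copy is done inside the MPI/Infiniband stack.
    MPI_Send(hostbuf, count, MPI_FLOAT, dest, 0, comm);

    cudaFreeHost(hostbuf);
}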
I use MPI to spawn multiple processes, each of which corresponds to one GPU device. I used MPI_Send to transfer data before, but it is too slow.
I found that transfers using cudaMemcpyPeer are very fast, but I don't know whether I can use cudaMemcpyPeer or cudaMemcpyPeerAsync to transfer data in an MPI environment.
The solution for this case is to use CUDA-aware MPI. It is a special version of MPI that understands CUDA usage. In particular, it allows you to use CUDA device pointers as buffer pointers in calls such as MPI_Send, MPI_Recv, and MPI_Sendrecv, and it will use the fastest means CUDA provides (such as peer transfers between two GPUs in the same machine, when possible) to do the data movement.
Various MPI distributions such as OpenMPI and MVAPICH have CUDA-enabled versions.
You can find more info about it by reading this blog. You can also find questions about it here on the cuda tag such as this one.
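As a hedged sketch of what that looks like in practice (assuming a CUDA-aware build of OpenMPI or MVAPICH2, and illustrative names like devbuf and count), the device pointer goes straight into the MPI call:

#include <mpi.h>
#include <cuda_runtime.h>

// Exchange a device buffer between ranks 0 and 1 with CUDA-aware MPI.
void exchange(float *devbuf, int count, int rank, MPI_Comm comm)
{
    if (rank == 0) {
        // No explicit cudaMemcpy: the library moves the data itself and can
        // use peer-to-peer copies between GPUs when the topology allows it.
        MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, comm);
    } else if (rank == 1) {
        MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, comm, MPI_STATUS_IGNORE);
    }
}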
In my application I need to transform each line of an image, apply a filter, and transform it back.
I want to be able to make multiple FFT at the same time using the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:
CUDA's FFT library, cuFFT, can only be called from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
On this topic (running FFTW on the GPU vs. using cuFFT), Robert Crovella says
"cufft routines can be called by multiple host threads".
I believed that doing all these FFTs in parallel would increase performance, but Robert comments
"[if] the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"
So,
Is this it? Is there no gain in performing more than one FFT at a time?
Is there any library that supports calls from the device?
Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang, or as referred to in the previous topic by Robert)?
Or is the best option to call multiple host threads?
(this two-link limit is killing me...)
My objective is to get some discussion on what the best solution to this problem is, since many have faced similar situations.
This might become obsolete once NVIDIA implements device calls in cuFFT (something they said they are working on, with no expected release date, according to the discussion on the NVIDIA forum in the first link).
So, is this it? Is there no gain in performing more than one FFT at a time?
If the individual FFTs are large enough to fully utilize the device, there is no gain in performing more than one FFT at a time. You can still use standard methods like overlapping copy and compute to get the most performance out of the machine.
If the FFTs are small, then the batched plan is a good way to get the most performance. If you go this route, I recommend using CUDA 5.5, as there have been some API improvements.
Is there any library that supports calls from the device?
The cuFFT library cannot be called from device code.
There are other CUDA libraries, of course, such as ArrayFire, which may have options I'm not familiar with.
Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang, or as referred to in the previous topic by Robert)?
Or is the best option to call multiple host threads?
A batched plan is preferred over multiple host threads: the API can do a better job of resource management that way, and you will have more API-level visibility (such as through the resource estimation functions in CUDA 5.5) into what is possible.
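As a minimal sketch of that batched approach (assuming a contiguous image of height lines of width complex samples already on the device; the names are illustrative), one cufftPlanMany() plan transforms every line in a single call:

#include <cufft.h>

// Run a 1D FFT on every line of the image with one batched plan.
void fft_all_lines(cufftComplex *d_image, int width, int height)
{
    cufftHandle plan;
    int n[1] = { width };               // each transform covers one line

    // batch = height: one transform per line, lines stored contiguously.
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, width,       // input layout (defaults apply)
                  NULL, 1, width,       // output layout
                  CUFFT_C2C, height);

    cufftExecC2C(plan, d_image, d_image, CUFFT_FORWARD);
    cufftDestroy(plan);
}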
Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want to have 1 kernel A send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A is running in 1 CUDA thread and I want a multiple GPU thread implementation for kernel B.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated through device memory, but how do I sync them?
The CUDA Programming Model only supports communication between threads in the same thread block (CUDA C Programming Guide at the end of section 2.2 Thread Hierarchy). This cannot be reliably implemented through the current CUDA API. If you try you may find partial success. However, this will fail on different OSes, different executions of your application, and this will be broken by future driver updates and new hardware (GK110 supports enhanced concurrency model).
If I correctly caught your question, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As I know, there is no reliable facilities for inter-kernel synchronization provided by CUDA. And I'm unaware about any suitable trick that can be applied here.
The CUDA C Programming Guide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. Off the top of my head, calling cudaStreamSynchronize for each stream in turn should do the trick, but it may not be that easy.
Your data must be in global memory
You need to hold the data's device address on the host
You must pass that address to the second kernel
Your code should look something like this:
int *dataToExchange_d;                                 // shared device buffer
cudaMalloc((void **)&dataToExchange_d, sizeof(data));  // allocate in global memory

kernel1<<< M1, N1, 0, stream1 >>>(dataToExchange_d);
cudaStreamSynchronize(stream1);   // make sure kernel1 is done before kernel2 reads
kernel2<<< M2, N2, 0, stream2 >>>(dataToExchange_d);
But note that stream synchronization slows down the process, so you should avoid it as much as possible.
You can also get stream synchronization through CUDA events; it's less obvious and gives no special advantage, but it's useful to know ;-)
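For completeness, here is a sketch of that event-based variant, reusing the names from the snippet above (kernel1, kernel2, the streams, and the launch configurations are all assumptions carried over from it): kernel2 waits on an event recorded in stream1, so the host never blocks.

cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

kernel1<<< M1, N1, 0, stream1 >>>(dataToExchange_d);
cudaEventRecord(done, stream1);          // mark kernel1's completion in stream1
cudaStreamWaitEvent(stream2, done, 0);   // stream2 stalls until the event fires
kernel2<<< M2, N2, 0, stream2 >>>(dataToExchange_d);

cudaEventDestroy(done);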
I am interested in using CUDA to program a multi-GPU application.
As far as I know, one can use multiple GPUs to execute two or more kernels simultaneously in parallel. Each kernel's data resides on the GPU it is executing on.
But what if I want my data and kernel operation to span several cards. How does one do this?
The simpleMultiGPU example in the CUDA SDK is not what I want, since it basically launches the same kernel on multiple GPUs. No inter-GPU communication is present, which is what I am interested in.
It sounds like you're interested in Unified Virtual Addressing (UVA) and P2P communication. Consult http://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf . You should not be communicating between different CUDA blocks anyway, but the techniques I mention should at least allow you to read and write data across multiple GPUs and access the data in more flexible ways.
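As a hedged sketch of those techniques (device IDs 0 and 1 and the buffer names are assumptions, not from the question): enable peer access once, then copy directly between the two GPUs.

#include <cuda_runtime.h>

// Copy a buffer that lives on GPU 0 directly to a buffer on GPU 1.
void p2p_copy(const float *d_src_on_gpu0, float *d_dst_on_gpu1, size_t bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 1, 0);  // can GPU 1 reach GPU 0?

    if (can_access) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);        // one-time setup per pair
    }
    // With UVA, cudaMemcpyPeer moves the data GPU-to-GPU; if P2P is not
    // available it falls back to staging through host memory internally.
    cudaMemcpyPeer(d_dst_on_gpu1, 1, d_src_on_gpu0, 0, bytes);
}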
Could anyone please explain, with examples, the difference between monolithic and microkernels? Also, what are the other classifications of kernels?
A monolithic kernel is a single large process running entirely in a single address space. It is a single static binary file. All kernel services exist and execute in the kernel address space, and the kernel can invoke functions directly. Examples of monolithic-kernel-based OSs: Unix, Linux.
In microkernels, the kernel is broken down into separate processes known as servers. Some of the servers run in kernel space and some run in user space. All servers are kept separate and run in different address spaces. Servers invoke "services" from each other by sending messages via IPC (inter-process communication). This separation has the advantage that if one server fails, the other servers can still work. Examples of microkernel-based OSs: Mac OS X and Windows NT.
Monolithic kernel design is much older than the microkernel idea, which appeared at the end of the 1980s.
The Unix and Linux kernels are monolithic, while QNX, L4, and Hurd are microkernels. Mach (the kernel, not Mac OS X itself) started as a microkernel but was later converted into a hybrid kernel. Minix (before version 3) wasn't a pure microkernel because device drivers were compiled as part of the kernel.
Monolithic kernels are usually faster than microkernels. The first microkernel, Mach, was 50% slower than most monolithic kernels, while later ones like L4 were only 2% to 4% slower than monolithic designs.
Monolithic kernels are big, while microkernels are small: first-generation microkernels usually fit into the processor's L1 cache.
In monolithic kernels the device drivers reside in kernel space, while in microkernels the device drivers reside in user space.
Since their device drivers reside in kernel space, monolithic kernels are less secure than microkernels: failures (exceptions) in drivers may lead to crashes (displayed as BSODs in Windows). Microkernels are more secure, hence more often used in military devices.
Monolithic kernels use signals and sockets to implement inter-process communication (IPC); microkernels use message queues (a toy sketch of the difference follows this list). First-generation microkernels implemented IPC poorly and were slow at context switches, which is what caused their poor performance.
Adding a new feature to a monolithic system means recompiling the whole kernel or the corresponding kernel module (for modular monolithic kernels), whereas with microkernels you can add new features or patches without recompiling.
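To make that contrast concrete, here is a toy sketch of how a read request flows in each design. Every name in it is hypothetical (this is not real kernel code for any OS): in a monolithic kernel the service is a direct function call, while in a microkernel it becomes a message to a server process.

#include <stddef.h>

enum { OP_READ = 1, FS_SERVER_PORT = 7 };   /* hypothetical constants */
struct msg { int op; int fd; char *buf; size_t len; long result; };

/* Stubs standing in for real kernel facilities. */
long fs_read(int fd, char *buf, size_t len);   /* in-kernel file system */
void ipc_send(int port, struct msg *m);        /* microkernel IPC primitives */
void ipc_recv(int port, struct msg *m);

/* Monolithic: the file system is a direct call in the same address space. */
long read_monolithic(int fd, char *buf, size_t len)
{
    return fs_read(fd, buf, len);
}

/* Microkernel: the request becomes a message to a user-space FS server. */
long read_microkernel(int fd, char *buf, size_t len)
{
    struct msg m = { OP_READ, fd, buf, len, 0 };
    ipc_send(FS_SERVER_PORT, &m);   /* costs a context switch */
    ipc_recv(FS_SERVER_PORT, &m);   /* wait for the server's reply */
    return m.result;
}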
Monolithic kernel

All the parts of a kernel, such as the scheduler, file system, memory management, networking stacks, and device drivers, are maintained in one unit within the kernel.

Advantages
• Faster processing

Disadvantages
• Crash insecure
• Porting inflexibility
• Kernel size explosion

Examples
• MS-DOS, Unix, Linux

Microkernel

Only the very important parts, such as IPC (inter-process communication), the basic scheduler, basic memory handling, and basic I/O primitives, are put into the kernel. Communication happens via message passing. The other parts are maintained as server processes in user space.

Advantages
• Crash resistant, portable, smaller size

Disadvantages
• Slower processing due to the additional message passing

Examples
• Windows NT
1. Monolithic kernel (pure monolithic): all
All kernel services come from a single component.
(-) Addition/removal of services is not possible; little to zero flexibility.
(+) Inter-component communication is better.
e.g.: traditional Unix

2. Microkernel: few
A few services (memory management, CPU management, IPC, etc.) come from the core kernel; other services (file management, I/O management, etc.) come from different layers/components.
Split approach: some services run in privileged (kernel) mode and some in normal (user) mode.
(+) Flexible for changes/upgrades.
(-) Communication overhead.
e.g.: QNX, etc.

3. Modular kernel (modular monolithic): most
A combination of the micro and monolithic kernel approaches: a collection of modules, which can be static or dynamic. Drivers come in the form of modules.
e.g.: Linux, modern OSs
In the spectrum of kernel designs the two extreme points are monolithic kernels and microkernels.

The (classical) Linux kernel, for instance, is a monolithic kernel (and so is every commercial OS to date as well, though they might claim otherwise), in that its code is compiled into a single image giving rise to a single process that implements all of the above services.

To exemplify the encapsulation of the Linux kernel, we remark that the Linux kernel does not even have access to any of the standard C libraries. Indeed, the Linux kernel cannot use rudimentary C library functions such as printf; instead it implements its own printing function (called printk). This seclusion and self-containment provide the Linux kernel with its main advantage: the kernel resides in a single address space, enabling all features to communicate in the fastest way possible without resorting to any type of message passing.

In particular, a monolithic kernel implements all of the device drivers of the system. This, however, is the main drawback of a monolithic kernel: introducing any new, unsupported hardware requires a rewrite of the kernel (in the relevant parts), recompilation of it, and re-installation of the entire OS. More importantly, if any device driver crashes, the entire kernel suffers as a result.

This un-modular approach to hardware additions and hardware crashes is the main argument for the other extreme design approach for kernels. A microkernel is, in a sense, a minimalistic kernel that houses only the very basics of OS services (like process management and file system management). In a microkernel the device drivers lie outside of the kernel, so device drivers can be added and removed while the OS is running, with no alterations to the kernel.
A monolithic kernel carries all kernel services along with the kernel core and is therefore heavy, which has a negative impact on speed and performance. A microkernel, on the other hand, is lightweight, which increases performance and speed.
I answered the same question on a WordPress site.
For the difference between monolithic, microkernel, and exokernel designs in tabular form, you can visit here.