Multi-GPU vs GPU cluster - CUDA

I am using CUDA programming for effective and fast computation. During my study I found that multi-GPU systems and GPU clusters are further ways to speed up computation, but I am confused between these two terms.
What is the actual difference between these two in terms of CUDA programming?

I assume you mean a PC with multiple GPUs versus many PCs each with a single GPU (a cluster).
If that is the case, then for a multi-GPU PC you can easily use the CUDA library itself, and if you connect the GPUs with an SLI bridge, you will see improvements in performance.
If you want to use a cluster of GPUs, you can use CUDA-aware MPI. It is a combined solution of the MPI standard and the CUDA library. I suggest you check this blog post: https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/
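For the single-machine, multi-GPU case, here is a minimal sketch (the kernel and sizes are made up for illustration) of the usual pattern: loop over the visible devices with cudaSetDevice, launch work on each, then synchronize each device.

```cpp
// Minimal multi-GPU sketch for one PC: each GPU runs the same kernel on its
// own buffer. Kernel launches are asynchronous, so the GPUs work concurrently.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    const int n = 1 << 20;                       // elements per GPU (arbitrary)
    std::vector<float*> buffers(deviceCount, nullptr);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                      // subsequent calls target this GPU
        cudaMalloc(&buffers[dev], n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(buffers[dev], n, 2.0f);
    }

    // Wait for every GPU to finish, then release its buffer.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(buffers[dev]);
    }
    printf("Ran the kernel on %d GPU(s)\n", deviceCount);
    return 0;
}
```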

Related

Can I use cudaMemcpyPeer to transfer data between different gpus assigned by MPI?

I use MPI to spawn multiple processes; each process corresponds to one GPU device. I previously used MPI_Send to transfer data, but it was too slow.
I found that transfers using cudaMemcpyPeer are very fast, but I don't know whether I can use cudaMemcpyPeer or cudaMemcpyPeerAsync to transfer data in an MPI environment.
The solution for this case is to use CUDA-aware MPI. It is a special version of MPI that understands CUDA usage. In particular, it allows you to use CUDA device pointers as buffer pointers in calls such as MPI_Send, MPI_Recv, and MPI_Sendrecv, and will use the fastest possible means provided by CUDA (such as peer transfers between two GPUs in the same machine, when possible) to do the data movement.
Various MPI distributions, such as Open MPI and MVAPICH, have CUDA-enabled versions.
You can find more info about it by reading this blog. You can also find questions about it here under the cuda tag, such as this one.
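To illustrate what "CUDA device pointers as buffer pointers" means in practice, here is a minimal sketch. It assumes an MPI build with CUDA-aware support (for example a CUDA-enabled Open MPI or MVAPICH2), at least two ranks, and one GPU per rank; the buffer size is arbitrary.

```cpp
// Sketch only: requires a CUDA-aware MPI build. Rank 0 sends a device buffer
// to rank 1; the device pointers are passed straight to MPI with no explicit
// cudaMemcpy staging through host memory.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank);                 // assumes one GPU per rank on this node
    const int n = 1 << 20;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0) {
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

With a CUDA-aware MPI, the library itself is free to use peer-to-peer copies (like cudaMemcpyPeer) or GPUDirect under the hood when both ranks are on the same machine.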

Using atomic arithmetic operations in CUDA Unified Memory multi-GPU or multi-processor

I am trying to implement a CUDA program that uses Unified Memory. I have two unified arrays and sometimes they need to be updated atomically.
The question below has an answer for a single-GPU environment, but I am not sure how to extend that answer to multi-GPU platforms.
Question: cuda atomicAdd example fails to yield correct output
I have 4 Tesla K20s, if you need this information, and all of them update a part of those arrays, which must be done atomically.
I would appreciate any help/recommendations.
To summarize comments into an answer:
You can perform this sort of address-space-wide atomic operation using atomicAdd_system.
However, you can only do this on devices of compute capability 6.x or newer (7.2 or newer if using Tegra).
Specifically, this means you have to compile for the correct compute capability, such as -arch=sm_60 or similar.
You state in the question that you are using Tesla K20 cards -- these are compute capability 3.5 and do not support any of the system-wide atomic functions.
As always, this information is neatly summarized in the relevant section of the Programming Guide.
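As a minimal sketch of the approach (assuming devices of compute capability 6.x or newer and compilation with -arch=sm_60 or similar, so it would not run on the K20s from the question), every GPU atomically increments the same managed counter with atomicAdd_system:

```cpp
// Sketch only: system-wide atomics require compute capability 6.x or newer
// and compilation for that architecture (e.g. nvcc -arch=sm_60).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void count(int *counter)
{
    atomicAdd_system(counter, 1);   // atomic across all GPUs and the CPU
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int *counter = nullptr;
    cudaMallocManaged(&counter, sizeof(int));   // Unified Memory allocation
    *counter = 0;

    // Launch the kernel on every GPU against the same managed counter.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        count<<<32, 256>>>(counter);
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    printf("counter = %d (expected %d)\n", *counter, deviceCount * 32 * 256);
    cudaFree(counter);
    return 0;
}
```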

Do all GPUs use the same architecture?

I have some experience with NVIDIA CUDA and am now thinking about learning OpenCL too. I would like to be able to run my programs on any GPU. My question is: does every GPU use the same architecture as NVIDIA (multiprocessors, SIMT structure, global memory, local memory, registers, caches, ...)?
Thank you very much!
Starting with your stated goal:
"I would like to be able to run my programs on any GPU."
Then yes, you should learn OpenCL.
In answer to your overall question, other GPU vendors do use different architectures than Nvidia GPUs. In fact, GPU designs from a single vendor can vary by quite a bit, depending on the model.
This is one reason that a given OpenCL code may perform quite differently (depending on your performance metric) from one GPU to the next. In fact, to achieve optimized performance on any GPU, an algorithm should be "profiled" by varying, for example, local memory size, to find the best algorithm settings for a given hardware design.
But even with these hardware differences, the goal of OpenCL is to provide a level of core functionality that is supported by all devices (CPUs, GPUs, FPGAs, etc.) and to include "extensions" which allow vendors to expose unique hardware features. Although OpenCL cannot hide significant differences in hardware, it does guarantee portability. This makes it much easier for a developer to start with an OpenCL program tuned for one device and then develop a program optimized for another architecture.
To complicate matters when identifying hardware differences, the terminology used by CUDA is different from that used by OpenCL; for example, the following terms are roughly equivalent in meaning:
CUDA              OpenCL
Thread            Work-item
Thread block      Work-group
Global memory     Global memory
Constant memory   Constant memory
Shared memory     Local memory
Local memory      Private memory
More comparisons and discussion can be found here.
You will find that the kinds of abstraction provided by OpenCL and CUDA are very similar. You can also usually count on your hardware having similar features: global mem, local mem, streaming multiprocessors, etc...
Switching from CUDA to OpenCL, you may be confused by the fact that many of the same concepts have different names (for example: CUDA "warp" == OpenCL "wavefront").
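One way to see the CUDA-side view of the hardware features mentioned above (multiprocessors, shared memory, warp size) on your own device is to query them with the runtime API; a small sketch:

```cpp
// Sketch: print the per-device hardware features that have rough OpenCL
// counterparts (compute units, local memory, etc.).
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  Shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Warp size:             %d threads\n", prop.warpSize);
        printf("  Global memory:         %zu MB\n", prop.totalGlobalMem >> 20);
    }
    return 0;
}
```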

What hardware setup is required to use MPI with CUDA?

I am new to MPI. I want to use CUDA with MPI. I have three PCs, each with one GPU, which I want to use for some simple processing (matrix multiplication).
But I am not sure what hardware setup is required to use MPI with CUDA.
Please enlighten me.
Update
I am asking this because many places mention clusters with InfiniBand. I do not have such a setup; I only have the ordinary LAN that we have in offices.
Above all, the basic idea is to get a feel for how MPI and CUDA work together and to do small test runs, irrespective of performance.
One or more machines with nVidia GPUs that are capable of CUDA.
MPI and CUDA don't have anything to do with each other. You simply use CUDA within each MPI process.
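To make that concrete, here is a minimal sketch of the pattern with a plain (non-CUDA-aware) MPI over an ordinary LAN; the kernel, sizes, and reduction are made up for illustration. Each rank drives its local GPU with ordinary CUDA calls and exchanges data through host buffers.

```cpp
// Sketch: plain MPI plus CUDA, no special hardware needed. Each rank does
// local GPU work, then communicates results via host memory.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    std::vector<float> host(n, static_cast<float>(rank));

    // Local GPU work on each rank (one GPU per PC assumed).
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(host.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    // Ordinary MPI communication with host buffers over the LAN.
    float total = 0.0f;
    MPI_Reduce(host.data(), &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```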
But, by way of a follow-up to the OP's original question, if I may?
I realize that #gpuguy's question was about hardware, but isn't it true that he must be running one of the OS options the NVIDIA CUDA compiler supports (i.e., Linux, Windows, macOS)?
There is no open-source equivalent of CUDA, is there?

Multiple GPUs with Cuda Thrust?

How do I use Thrust with multiple GPUs?
Is it simply a matter of using cudaSetDevice(deviceId)
and then running the relevant Thrust code?
With CUDA 4.0 or later, cudaSetDevice(deviceId) followed by your Thrust code should work.
Just keep in mind that you will need to create and operate on separate vectors on each device (unless you have devices that support peer-to-peer memory access and PCI-express bandwidth is sufficient for your task).
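A minimal sketch of that pattern, assuming at least one CUDA device and no peer-to-peer access, might look like this; each device gets its own thrust::device_vector and its own reduction:

```cpp
// Sketch: per-device Thrust work. The vector must be created, used, and
// destroyed while its device is the current device.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                        // select GPU before allocating
        thrust::device_vector<int> v(1 << 20, 1);  // lives on the current device
        int sum = thrust::reduce(v.begin(), v.end());
        printf("Device %d: sum = %d\n", dev, sum);
    }                                              // v is freed with dev still current
    return 0;
}
```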