How to use GPUDirect RDMA with InfiniBand - cuda

I have two machines. There are multiple Tesla cards on each machine. There is also an InfiniBand card on each machine. I want to communicate between GPU cards on different machines through InfiniBand. Just point-to-point unicast would be fine. I definitely want to use GPUDirect RDMA so I can spare myself extra copy operations.
I am aware that there is now a driver available from Mellanox for its InfiniBand cards, but it doesn't offer a detailed development guide. I am also aware that OpenMPI supports the feature I am asking about, but OpenMPI is too heavyweight for this simple task, and it does not support multiple GPUs in a single process.
I wonder if I could get any help with directly using the driver to do the communication. Code sample, tutorial, anything would be good. Also, I would appreciate it if anyone could help me find the code dealing with this in OpenMPI.

For GPUDirect RDMA to work, you need the following installed:
Mellanox OFED installed (from http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers )
Recent NVIDIA CUDA suite installed
Mellanox-NVIDIA GPUDirect plugin (from the link you gave above - posting as guest prevents me from posting links :( )
All of the above should be installed (in the order listed above), and the relevant kernel modules loaded.
After that, you should be able to register memory allocated in GPU video memory for RDMA transactions. Sample code would look like this:
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

void *gpu_buffer;
struct ibv_mr *mr;
const size_t size = 64 * 1024;
cudaMalloc(&gpu_buffer, size);   // TODO: check errors
// pd: a protection domain previously allocated with ibv_alloc_pd()
mr = ibv_reg_mr(pd, gpu_buffer, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
This will create (on a GPUDirect RDMA-enabled system) a memory region with a valid memory key that you can use for RDMA transactions with your HCA.
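To give an idea of how that memory region is then used (this part is my sketch, not from the original answer): posting an RDMA WRITE from the registered GPU buffer could look roughly like this, assuming a connected RC queue pair qp already exists and that remote_addr and remote_rkey were exchanged with the peer out of band:
struct ibv_sge sge = {
    .addr   = (uintptr_t)gpu_buffer,
    .length = (uint32_t)size,
    .lkey   = mr->lkey,
};
struct ibv_send_wr wr = {0}, *bad_wr = NULL;
wr.opcode              = IBV_WR_RDMA_WRITE;
wr.sg_list             = &sge;
wr.num_sge             = 1;
wr.send_flags          = IBV_SEND_SIGNALED;      // request a completion on the send CQ
wr.wr.rdma.remote_addr = remote_addr;            // peer's buffer address (exchanged out of band)
wr.wr.rdma.rkey        = remote_rkey;            // peer's memory key (exchanged out of band)
ibv_post_send(qp, &wr, &bad_wr);                 // TODO: check return value, then poll the CQ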
For more details about using RDMA and InfiniBand verbs in your code, you can refer to this document.

Related

Can I use cudaMemcpyPeer to transfer data between different GPUs assigned by MPI?

I use MPI to launch multiple processes, and each process corresponds to one GPU device. I used MPI_Send to transfer data before, but it is too slow.
I found that transfers using cudaMemcpyPeer are very fast, but I don't know whether I can use cudaMemcpyPeer or cudaMemcpyPeerAsync to transfer data in the MPI environment.
The solution for this case is to use CUDA-aware MPI. It is a special build of MPI that understands CUDA usage. In particular, it allows you to use CUDA device pointers as buffer pointers in calls such as MPI_Send, MPI_Recv, and MPI_Sendrecv, and it will use the fastest means CUDA provides (such as peer transfers between two GPUs in the same machine, when possible) to do the data movement.
Various MPI distributions such as OpenMPI and MVAPICH have CUDA-enabled versions.
You can find more info about it by reading this blog. You can also find questions about it here on the cuda tag such as this one.
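As a rough sketch of what that looks like in code (the one-GPU-per-rank mapping and buffer size below are just assumptions for illustration, not from the answer above), a CUDA-aware MPI build lets you pass device pointers straight to the usual calls:
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);                 // assumes one GPU per rank on the node

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0)        // device pointer passed directly; the library picks the fastest path
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}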

NVLink or PCIe: how to specify the interconnect?

My cluster is equipped with both NVLink and PCIe. All the GPUs (V100) can communicate directly through either PCIe or NVLink. To my knowledge, both the PCIe switch and NVLink can support a direct link using CUDA.
Now I want to compare the peer-to-peer communication performance of PCIe and NVLink. However, I don't know how to specify which one is used; it seems CUDA always chooses automatically. Could anyone help me?
If two GPUs in CUDA have a direct NVLink connection between them, and you enable Peer-to-Peer transfers, those transfers will flow over NVLink. There is no method of any kind in CUDA to alter this behavior.
If you do not enable Peer-to-Peer transfers, then data transfers (e.g. cudaMemcpy, cudaMemcpyAsync, cudaMemcpyPeerAsync) between those two devices will flow from the source GPU over PCIe to the CPU socket (perhaps traversing intermediate PCIe switches, perhaps also flowing over a socket-level link such as QPI), and then over PCIe from the CPU socket to the other GPU. At least one CPU socket will always be involved, even if a shorter direct path exists across the PCIe fabric. This behavior is also not modifiable in any fashion available to the programmer.
Both methodologies are demonstrated using the p2pBandwidthLatencyTest CUDA sample code.
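For reference, the usual way to check for and enable Peer-to-Peer access looks roughly like this (dst_on_gpu1, src_on_gpu0, nbytes and stream are placeholders):
int can01 = 0, can10 = 0;
cudaDeviceCanAccessPeer(&can01, 0, 1);        // can device 0 access device 1's memory?
cudaDeviceCanAccessPeer(&can10, 1, 0);
if (can01 && can10) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);         // flags must be 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
}
// With peer access enabled, this copy goes GPU-to-GPU directly (over NVLink if present):
cudaMemcpyPeerAsync(dst_on_gpu1, 1, src_on_gpu0, 0, nbytes, stream);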
The accepted answer -- from an NVIDIA employee -- was correct in 2018. But at some point, NVIDIA added an (undocumented?) option to the driver.
On Linux, you can now put this in /etc/modprobe.d/disable-nvlink.conf:
options nvidia NVreg_NvLinkDisable=1
This will disable NVLink the next time the driver is loaded, forcing GPU peer-to-peer communication to use the PCIe interconnect. This option exists in driver 515.65.01 (CUDA 11.7.1); I am not sure when it was added.
As for "there is no reason to allow the end-user to choose the slower path", the very existence of this SO question suggests otherwise. In my case, we buy not one server, but dozens... And in the process of choosing our configuration, it is nice to use a single prototype system to benchmark our application using either NVLink or PCIe.

cuda-mpi programming model without GPUDirect

I am using a GPU cluster without GPUDirect support. From this briefing, the following is done when transferring GPU data across nodes:
GPU writes to pinned sysmem1
CPU copies from sysmem1 to sysmem2
InfiniBand driver copies from sysmem2
Now I am not sure whether the second step is implicit when I transfer sysmem1 across InfiniBand using MPI. Assuming it is, my current programming model is something like this:
cudaMemcpy(hostmem, devicemem, size, cudaMemcpyDeviceToHost);
MPI_Send(hostmem, ...);
Is my above assumption true and will my programming model work without causing communication issues?
Yes, you can use CUDA and MPI independently (i.e. without GPUDirect), just as you describe.
Move the data from device to host
Transfer the data as you ordinarily would, using MPI
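In code, the two sides of that staged pattern look roughly like this (dest_rank, src_rank and tag are placeholders; allocating hostmem with cudaMallocHost gives pinned memory and faster copies, though pageable memory also works):
// Sender: stage the device buffer through host memory, then use plain MPI.
cudaMemcpy(hostmem, devicemem, size, cudaMemcpyDeviceToHost);
MPI_Send(hostmem, size, MPI_BYTE, dest_rank, tag, MPI_COMM_WORLD);

// Receiver: receive into host memory, then copy up to the device.
MPI_Recv(hostmem, size, MPI_BYTE, src_rank, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(devicemem, hostmem, size, cudaMemcpyHostToDevice);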
You might be interested in this presentation, which explains CUDA-aware MPI and gives a side-by-side example of non-CUDA MPI and CUDA-aware MPI on slide 11.

How to choose a non-busy CUDA device?

I'm working on a cluster with a lot of nodes, and each node has two GPUs. In the cluster, I can't launch "nvidia-smi" to check which device is busy. My code selects the best device (with cudaChooseDevice) in terms of capability, but when the cluster assigns me the same node for two different jobs, I end up with two tasks running on the same GPU.
My question is: is there a way to check at runtime whether the device is busy or not?
Thanks
Your cluster managers should install and use cluster management (job-scheduling) software that allows them to assign and track GPUs just like CPUs and memory. There are a number of job schedulers that can do this. Even without explicit GPU support in the job-scheduler, it's possible to build job entry/exit scripts that will assign GPUs properly.
You can effectively include the same functionality that nvidia-smi uses by embedding NVML in your applications. Any query or data item reported on by nvidia-smi can be accessed programmatically through NVML.
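A minimal sketch of that approach (error checking mostly omitted; treating "no running compute processes and zero reported utilization" as idle is just one possible policy):
#include <nvml.h>

int pick_free_gpu(void)
{
    int best = -1;
    unsigned int count = 0;
    if (nvmlInit() != NVML_SUCCESS) return -1;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;
        unsigned int nprocs = 0;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetUtilizationRates(dev, &util);
        // With a zero-sized buffer, this call just reports how many compute processes exist.
        nvmlDeviceGetComputeRunningProcesses(dev, &nprocs, NULL);
        if (nprocs == 0 && util.gpu == 0) { best = (int)i; break; }
    }
    nvmlShutdown();
    return best;   // -1 means no idle device found
}
Note that NVML enumerates devices in PCI bus order, which may differ from the CUDA runtime's device ordering, so map between the two (for example by PCI bus ID) before calling cudaSetDevice.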
It's also not clear to me why you could not launch a script for your job which checks which devices are busy using nvidia-smi, then picks an idle device.
But keep in mind that any runtime check you might do would be subject to the behavior of other applications. If those applications (whether launched by you or other users) have unusual behavior, your runtime check can easily be defeated.

Dynamically detecting a CUDA-enabled NVIDIA card and only then initializing the CUDA runtime: how to do it?

I have an application with an algorithm that is accelerated with CUDA. There is also a standard CPU implementation of it. We plan to release this application for various platforms, so most of the time there won't be an NVIDIA card to run the accelerated CUDA code. What I want is to first check whether the user's system has a CUDA-enabled NVIDIA card and, only if it does, initialize the CUDA runtime. If the system does not support CUDA, then I want to execute the CPU path. This question is very similar to mine, but I don't want to use any libraries other than the plain CUDA runtime. OpenCL is an alternative, but there isn't enough time to implement an OpenCL version of the algorithm for the first release. Without any CUDA existence check, the program will surely crash because it can't find the needed .dlls for the CUDA runtime, and we surely don't want that. So I need advice on how to handle this initialization step.
Use the calls cudaGetDeviceCount and cudaGetDeviceProperties to find CUDA devices on the running system. First find out how many there are, then loop through all the available devices and inspect the properties to decide which ones qualify. What I mean by "qualify" depends on your application: do you want to require a certain compute capability? Or a certain amount of memory? If there is more than one suitable device, you might want to sort on some criterion, then select the device with cudaSetDevice. If there are no devices, or none that are sufficient, fall back on the CPU code path.
I'd also suggest having some mechanism to force CUDA mode off, in case some user's environment just doesn't work due to driver issues, or an old board, or something else. You can use a command-line option, or an environment variable, or whatever...
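A sketch of that detection logic (the environment variable name and the minimum compute capability below are arbitrary choices for illustration):
#include <cuda_runtime.h>
#include <stdlib.h>

// Returns the index of a usable CUDA device, or -1 to signal the CPU fallback path.
int select_cuda_device(void)
{
    if (getenv("MYAPP_FORCE_CPU"))                 // hypothetical opt-out switch
        return -1;

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
        return -1;                                 // no driver or no CUDA device present

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        if (prop.major >= 3)                       // "qualifies": pick whatever criteria you need
            return i;
    }
    return -1;
}

// Later:
//   int dev = select_cuda_device();
//   if (dev >= 0) { cudaSetDevice(dev); /* GPU path */ } else { /* CPU path */ }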
EDIT:
Regarding DLLs, you should package cudart[whatever].dll with your application. That will ensure that the program starts, and at least the CUDA query functions will operate.