NVLink or PCIe: how to specify the interconnect? - cuda

My cluster is equipped with both NVLink and PCIe. All the GPUs (V100) can communicate directly over either PCIe or NVLink; to my knowledge, both a PCIe switch and NVLink can support direct peer-to-peer links under CUDA.
Now I want to compare the peer-to-peer communication performance of PCIe and NVLink. However, I don't know how to select a particular interconnect; it seems CUDA always chooses one automatically. Could anyone help me?

If two GPUs in CUDA have a direct NVLink connection between them, and you enable Peer-to-Peer transfers, those transfers will flow over NVLink. There is no method of any kind in CUDA to alter this behavior.
If you do not enable Peer-to-Peer transfers, then data transfers (e.g. cudaMemcpy, cudaMemcpyAsync, cudaMemcpyPeerAsync) between those two devices will flow from the source GPU over PCIe to the CPU socket (perhaps traversing intermediate PCIe switches, and perhaps also flowing over a socket-level link such as QPI), and then over PCIe from the CPU socket to the other GPU. At least one CPU socket will always be involved, even if a shorter direct path exists across the PCIe fabric. This behavior is also not modifiable in any fashion available to the programmer.
Both methodologies are demonstrated using the p2pBandwidthLatencyTest CUDA sample code.
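As a rough illustration of the two paths, here is a minimal sketch; the device IDs 0 and 1 and the 64 MiB buffer size are arbitrary choices, and error checking is omitted. With peer access enabled, cudaMemcpyPeer transfers directly between the GPUs (over NVLink when the pair is NVLink-connected); with it disabled, the same call is staged through host memory over PCIe. Timing each path with CUDA events would approximate the comparison that p2pBandwidthLatencyTest performs.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int src = 0, dst = 1;        // arbitrary device IDs for illustration
        const size_t bytes = 64 << 20;     // 64 MiB test buffer

        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, dst, src);
        printf("GPU %d can access GPU %d directly: %d\n", dst, src, canAccess);

        void *a = NULL, *b = NULL;
        cudaSetDevice(src); cudaMalloc(&a, bytes);
        cudaSetDevice(dst); cudaMalloc(&b, bytes);

        if (canAccess) {
            // Path 1: peer access enabled -> direct transfer (NVLink if the
            // two GPUs are NVLink-connected, otherwise the PCIe fabric).
            cudaSetDevice(dst); cudaDeviceEnablePeerAccess(src, 0);
            cudaSetDevice(src); cudaDeviceEnablePeerAccess(dst, 0);
            cudaMemcpyPeer(b, dst, a, src, bytes);

            cudaSetDevice(dst); cudaDeviceDisablePeerAccess(src);
            cudaSetDevice(src); cudaDeviceDisablePeerAccess(dst);
        }

        // Path 2: peer access not enabled -> the same call is staged through
        // host memory, travelling over PCIe via at least one CPU socket.
        cudaMemcpyPeer(b, dst, a, src, bytes);

        cudaSetDevice(src); cudaFree(a);
        cudaSetDevice(dst); cudaFree(b);
        return 0;
    }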

The accepted answer -- from an NVIDIA employee -- was correct in 2018. But at some point, NVIDIA added an (undocumented?) option to the driver.
On Linux, you can now put this in /etc/modprobe.d/disable-nvlink.conf:
options nvidia NVreg_NvLinkDisable=1
This will disable NVLink the next time the driver is loaded, forcing GPU peer-to-peer communication to use the PCIe interconnect. This option is present in driver 515.65.01 (CUDA 11.7.1); I am not sure when it was added.
As for "there is no reason to allow the end-user to choose the slower path", the very existence of this SO question suggests otherwise. In my case, we buy not one server, but dozens... And in the process of choosing our configuration, it is nice to use a single prototype system to benchmark our application using either NVLink or PCIe.

Related

Can Unified Memory Migration use NVLink?

I want to check whether unified memory migration across different GPUs (as previously discussed on this page) can now leverage NVLink in later versions of CUDA and newer GPU architectures.
Yes, unified memory migration can use NVLink.
For a unified memory allocation, this would typically occur when one GPU accesses that allocation, then another GPU accesses that allocation. If those 2 GPUs are in a direct NVLink relationship, the migration of pages from the first to the second will flow over NVLink.
In addition, although you didn't ask about it, NVLink also provides a path for peer-mapped pages, where they do not migrate, but instead a mapping is provided from one GPU to another. The pages may stay on the first GPU, and the 2nd GPU will access them using memory read or write cycles which take place over NVLink.
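For illustration, here is a minimal sketch that triggers both behaviors explicitly on a system assumed to have at least two managed-memory-capable GPUs; the device IDs and buffer size are arbitrary, and error checking is omitted. The prefetch calls force the migration that would otherwise happen on demand, and the cudaMemAdvise calls set up the peer-mapping alternative described above.

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256 << 20;            // 256 MiB managed buffer
        float *data = NULL;
        cudaMallocManaged(&data, bytes);

        // Populate the pages on GPU 0, then explicitly migrate them to GPU 1.
        // On an NVLink-connected pair, this migration flows over NVLink.
        cudaMemPrefetchAsync(data, bytes, 0, 0);   // to device 0, default stream
        cudaDeviceSynchronize();
        cudaMemPrefetchAsync(data, bytes, 1, 0);   // pages migrate 0 -> 1
        cudaDeviceSynchronize();

        // Alternative to migration (the peer-mapping case described above):
        // keep the pages resident on GPU 0 and let GPU 1 access them remotely.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0);
        cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, 1);

        cudaFree(data);
        return 0;
    }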

CUDA Peer-to-Peer across I/O hubs

Is there an SBIOS entry or other configuration change that will enable peer-to-peer to work for CUDA across the QPI links that connect I/O hubs (or sockets, in the case of CPUs that integrate the I/O hub - Sandy Bridge and higher)?
No. The QPI link uses a protocol which does not cover all of the features of the PCIe protocol, in particular some features used by the P2P protocol.
A specific difference is documented in an Intel datasheet here:
"The IOH does not support non-contiguous byte enables from PCI Express for remote peer-to-peer MMIO transactions. This is an additional restriction over the PCI Express standard requirements to prevent incompatibility with Intel QuickPath Interconnect." (page 135)
So P2P requires a continuous PCIe fabric between the two devices: both devices need to be on the same PCIe root complex. This particular requirement was publicized by NVIDIA in the CUDA 4.0 timeframe, when GPUDirect v2.0 (Peer-to-Peer) was first introduced.
Note that, in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is a query of the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
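A minimal sketch of that runtime query, checking every device pair (error checking omitted):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                if (i == j) continue;
                int ok = 0;
                cudaDeviceCanAccessPeer(&ok, i, j);   // can device i access device j?
                printf("P2P %d -> %d: %s\n", i, j, ok ? "supported" : "not supported");
            }
        }
        return 0;
    }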

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI? Specifically:
1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here. In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine to be transferred across a network interface: With GPUDirect, GPU memory goes to Host memory then straight to the network interface card. Without GPUDirect, GPU memory goes to Host memory in one address space, then the CPU has to do a copy to get the memory into another Host memory address space, then it can go out to the network card.
2) Do AMD GPUs allow P2P memory transfers when two GPUs are shared on the same PCIe bus, as described here. In case the graphic is lost at some point, here is a description of the impact of GPUDirect on transferring data between GPUs on the same PCIe bus: With GPUDirect, data can move directly between GPUs on the same PCIe bus, without touching host memory. Without GPUDirect, data always has to go back to the host before it can get to another GPU, regardless of where that GPU is located.
Edit: BTW, I'm not entirely sure how much of GPUDirect is vaporware and how much of it is actually useful. I've never actually heard of a GPU programmer using it for something real. Thoughts on this are welcome too.
Although this question is pretty old, I would like to add my answer as I believe the current information here is incomplete.
As stated in the answer by @Ani, you can allocate host memory using CL_MEM_ALLOC_HOST_PTR, and you will most likely get pinned host memory that avoids the second copy, depending on the implementation. For instance, the NVIDIA OpenCL Best Practices Guide states:
"OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in pinned memory by the driver for best performance."
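As an illustration of that pattern, here is a minimal sketch that stages an upload through a CL_MEM_ALLOC_HOST_PTR buffer. It assumes an existing context, command queue and destination device buffer, omits error checking, and the helper name upload_via_pinned_staging is just for illustration.

    #include <CL/cl.h>
    #include <cstring>

    // Copy 'bytes' of host data into 'device_buf', staging through a buffer
    // that the driver is likely to place in pinned memory.
    void upload_via_pinned_staging(cl_context ctx, cl_command_queue q,
                                   cl_mem device_buf, const void *src, size_t bytes) {
        cl_int err = CL_SUCCESS;
        cl_mem staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);

        // Map the staging buffer to get a host pointer into the (pinned) allocation.
        void *p = clEnqueueMapBuffer(q, staging, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
        std::memcpy(p, src, bytes);

        // Transfer from the pinned staging area into the device buffer.
        clEnqueueWriteBuffer(q, device_buf, CL_TRUE, 0, bytes, p, 0, NULL, NULL);

        clEnqueueUnmapMemObject(q, staging, p, 0, NULL, NULL);
        clReleaseMemObject(staging);
    }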
The thing I find missing from the previous answers is the fact that AMD offers DirectGMA technology. This technology enables you to transfer data between the GPU and any other peripheral on the PCIe bus (including other GPUs) directly, without having to go through system memory. It is similar to NVIDIA's GPUDirect RDMA (which is not available on all platforms).
In order to use this technology you must:
1) have a compatible AMD GPU (not all of them support DirectGMA); you can use either the OpenCL, DirectX or OpenGL extensions provided by AMD;
2) have the peripheral's driver (network card, video capture card, etc.) either expose a physical address to which the GPU's DMA engine can read/write, or be able to program the peripheral's DMA engine to transfer data to/from the GPU's exposed memory.
I used this technology to transfer data directly from video capture devices to GPU memory, and from GPU memory to a proprietary FPGA. Both cases were very efficient and did not involve any extra copying.
See also: Interfacing OpenCL with PCIe devices
I think you may be looking for the CL_MEM_ALLOC_HOST_PTR flag in clCreateBuffer. While the OpenCL specification says of this flag that "This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory", it is uncertain what AMD's implementation (or other implementations) might do with it.
Here's an informative thread on the topic: http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440
Hope this helps.
Edit: I do know that NVIDIA's OpenCL SDK implements this as allocation in pinned/page-locked memory. I am fairly certain this is what AMD's OpenCL SDK does when running on the GPU.
As pointed out by @ananthonline and @harrism, many of the features of GPUDirect have no direct equivalent in OpenCL. However, if you are trying to reduce memory transfer overhead, as mentioned in the first sentence of your question, zero copy memory might help. Normally, when an application creates a buffer on the GPU, the contents of the buffer are copied from CPU memory to GPU memory en masse. With zero copy memory, there is no upfront copy; instead, data is copied over as it is accessed by the GPU kernel.
Zero copy does not make sense for all applications. Here is advice from the AMD APP OpenCL Programming Guide on when to use it:
"Zero copy host resident memory objects can boost performance when host memory is accessed by the device in a sparse manner or when a large host memory buffer is shared between multiple devices and the copies are too expensive. When choosing this, the cost of the transfer must be greater than the extra cost of the slower accesses."
Table 4.3 of the Programming Guide describes which flags to pass to clCreateBuffer to take advantage of zero copy (either CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_PERSISTENT_MEM_AMD, depending on whether you want device-accessible host memory or host-accessible device memory). Note that zero copy support is dependent on both the OS and the hardware; it appears to not be supported under Linux or older versions of Windows.
AMD APP OpenCL Programming Guide: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
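A minimal sketch of the zero-copy path using CL_MEM_ALLOC_HOST_PTR (CL_MEM_USE_PERSISTENT_MEM_AMD would be the host-accessible device memory variant). It assumes an existing context, queue, and a kernel whose first argument is a buffer; error checking is omitted and the helper name run_on_zero_copy_input is just for illustration. Whether the access is truly zero-copy depends on the platform, as noted above.

    #include <CL/cl.h>
    #include <cstring>

    // Give a kernel direct access to host-resident data instead of copying it
    // to the device up front; data moves only as the kernel reads it.
    void run_on_zero_copy_input(cl_context ctx, cl_command_queue q, cl_kernel k,
                                const float *host_data, size_t count) {
        cl_int err = CL_SUCCESS;
        const size_t bytes = count * sizeof(float);
        cl_mem zc = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                   bytes, NULL, &err);

        // Fill the buffer through a mapping; there is no bulk upload.
        void *p = clEnqueueMapBuffer(q, zc, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
        std::memcpy(p, host_data, bytes);
        clEnqueueUnmapMemObject(q, zc, p, 0, NULL, NULL);

        // The kernel reads the buffer in place.
        clSetKernelArg(k, 0, sizeof(cl_mem), &zc);
        size_t global_size = count;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish(q);
        clReleaseMemObject(zc);
    }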

How to choose device when running a CUDA executable?

I'm connecting to a GPU cluster from the outside and I have no idea how to select the device on which to run my CUDA programs.
I know there are two Tesla GPUs in the cluster, and I'd like to choose one of them.
Any ideas how? How do you choose the device you want to use when there are many connected to your computer?
The canonical way to select a device in the runtime API is using cudaSetDevice. That will configure the runtime to perform lazy context establishment on the nominated device. Prior to CUDA 4.0, this call didn't actually establish a context, it just told the runtime which GPU to try and use. Since CUDA 4.0, this call will establish a context on the nominated GPU at the time of calling. There is also cudaChooseDevice, which will select amongst available devices to find one which matches criteria supplied by the caller.
You can enumerate the available GPUs on a system with cudaGetDeviceCount, and retrieve their particulars using cudaGetDeviceProperties. The SDK deviceQuery example shows full details of how to do this.
You may need to be careful, however, about how you select GPUs in a multi-GPU system, depending on the host and driver configuration. With both the Linux driver and the Windows TCC driver, GPUs can be marked "compute exclusive", meaning that the driver will limit each GPU to one active context at a time, or "compute prohibited", meaning that no CUDA program can establish a context on that device. If your code attempts to establish a context on a compute prohibited device, or on a compute exclusive device which is in use, the result will be an invalid device error. In a multi-GPU system where the policy is to use compute exclusivity, the correct approach is not to try to select a particular GPU, but simply to allow lazy context establishment to happen implicitly: the driver will automagically select a free GPU for your code to run on. The compute mode status of any device can be checked by reading the cudaDeviceProp.computeMode field via the cudaGetDeviceProperties call. Note that you are free to query the properties of unavailable or prohibited GPUs, but any operation which would require context establishment will fail.
See the runtime API documentation for all of these calls.
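A minimal sketch tying these calls together (error checking omitted; the choice of device 0 is arbitrary):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s, computeMode=%d\n", d, prop.name, prop.computeMode);
        }

        // Explicitly nominate a device; the context is established lazily on
        // first use. On compute-exclusive systems, it may be better to skip
        // this call and let the runtime pick a free GPU implicitly.
        cudaSetDevice(0);    // arbitrary choice for illustration
        return 0;
    }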
You can set the environment variable CUDA_VISIBLE_DEVICES to a comma-separated list of device IDs to make only those devices visible to an application. Use this either to mask out devices or to change the visibility order of devices so that the CUDA runtime enumerates them in a specific order.