I was wondering if someone might be able to help me figure out whether the new Titan V from NVIDIA supports GPUDirect. As far as I can tell, it seems limited to Tesla and Quadro cards.
Thank you for taking the time to read this.
GPUDirect Peer-to-Peer (P2P) is supported between any 2 "like" CUDA GPUs (of compute capability 2.0 or higher), if the system topology supports it, and subject to other requirements and restrictions. In a nutshell, the system topology requirement is that both GPUs participating must be enumerated under the same PCIE root complex. If in doubt, "like" means identical. Other combinations may be supported (e.g. 2 GPUs of the same compute capability) but this is not specified, or advertised as supported. If in doubt, try it out. Finally, these things must be "discoverable" by the GPU driver. If the GPU driver cannot ascertain these facts, and/or the system is not part of a whitelist maintained in the driver, then P2P support will not be possible.
Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is a query of the runtime via cudaDeviceCanAccessPeer. So the statement here "is supported" should not be construed to refer to a particular GPU type. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
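If in doubt, try it out: a minimal check along these lines (roughly what the CUDA simpleP2P sample does) reports the pairwise capability directly. All calls below are standard CUDA runtime API; error handling is abbreviated for brevity:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    // Query P2P capability for every ordered pair of GPUs in the system.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j,
                   canAccess ? "supported" : "NOT supported");
        }
    }
    return 0;
}
```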
GPUDirect RDMA is only supported on Tesla and possibly some Quadro GPUs.
So, if you had a system with 2 Titan V GPUs plugged into PCIE slots connected to the same root complex (except in the case of Skylake CPUs, it is usually sufficient to say "connected to the same CPU socket"), and the system (i.e. core logic) was recognized by the GPU driver, I would expect P2P to work between those 2 GPUs.
I would not expect GPUDirect RDMA to work to a Titan V, under any circumstances.
YMMV. If in doubt, try it out, before making any large purchasing decisions.
I have read the documentation carefully but am still confused by the large amount of information for different CUDA versions.
Is there only one default stream on the entire device, or is there one default stream per process on the host CPU? If the answer depends on the version of CUDA, could you also list the behaviour for the different CUDA versions?
By default, CUDA has a per-process default stream. There is a compiler flag, --default-stream per-thread, which changes the behaviour to a per-host-thread default stream; see the documentation.
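As a quick illustration, here is a minimal sketch (the busy-wait kernel and thread count are made up for the example) of what the flag changes:

```
// Compiled with `nvcc --default-stream per-thread`, each of the two host
// threads below gets its own default stream, so their kernel launches may
// overlap. Compiled without the flag, both launches go to the single
// per-process (legacy) default stream and serialize.
#include <thread>
#include <cuda_runtime.h>

__global__ void spin()  // hypothetical busy-wait kernel, to make overlap visible
{
    for (volatile int i = 0; i < (1 << 22); ++i) { }
}

void worker()
{
    spin<<<1, 1>>>();  // launched on this host thread's default stream
}

int main()
{
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    cudaDeviceSynchronize();  // wait for both kernels regardless of stream
    return 0;
}
```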
Note that streams and host threads are programming-level abstractions for hardware details. Even with a single process, there is a limited number of streams you can use concurrently, depending on the hardware. For example, on the Fermi architecture, all streams were multiplexed into a single hardware queue, but since Kepler there are 32 separate hardware queues (see CUDA Streams: Best Practices and Common Pitfalls).
Since the programming guide does not talk about multiple processes in this part, I believe these abstractions do not define the behaviour of multi-process scenarios. For multi-process, the right term is "CUDA context", which is created for each process and even for each host thread (when using the runtime API). As for how many contexts can be active on a device at the same time: the guide says in 3.4 Compute modes that in the default mode, "Multiple host threads can use the device". Since the following exclusive-process mode talks about CUDA contexts instead, I assume the description of the default mode covers multiple host threads from multiple processes as well.
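You can query which compute mode a device is in at runtime; a small sketch using the standard runtime attribute query:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int mode = 0;
    // Query the compute mode of device 0.
    cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    switch (mode) {
        case cudaComputeModeDefault:
            printf("Default: multiple host threads/processes may use the device\n");
            break;
        case cudaComputeModeExclusiveProcess:
            printf("Exclusive-process: one CUDA context (one process) at a time\n");
            break;
        case cudaComputeModeProhibited:
            printf("Prohibited: no contexts can be created on this device\n");
            break;
    }
    return 0;
}
```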
For more info about multi-process concurrency see e.g. How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?, Unleash legacy MPI codes with Kepler's Hyper-Q and CUDA Streams: Best Practices and Common Pitfalls.
Finally, note that multi-process concurrency works this way since the Kepler architecture, which is the oldest supported architecture nowadays. Since the Pascal architecture there is support for compute preemption (see 3.4 Compute modes for details).
I have access to a computation server which uses an old version of the NVIDIA driver (346) and CUDA (7.0), with applications depending on that specific version of CUDA.
Is it possible to upgrade the driver and keep the old cuda?
I could find minimum driver versions, but not a maximum one.
CUDA generally doesn't enforce any maximum driver version.
Older CUDA toolkits are usable with newer drivers.
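One quick way to sanity-check this on a given machine is to compare versions at runtime; a minimal sketch using the standard CUDA runtime API calls:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // max CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // version of the CUDA runtime this app linked against
    // The driver must support a CUDA version >= the toolkit's runtime version
    // (e.g. 7000 means 7.0).
    printf("Driver supports CUDA <= %d.%d, runtime is %d.%d\n",
           driverVer / 1000, (driverVer % 100) / 10,
           runtimeVer / 1000, (runtimeVer % 100) / 10);
    return 0;
}
```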
The only thing somewhat relevant here is that eventually, from time to time, NVIDIA GPU architectures become "deprecated", and this usually happens first at the driver level. That is, a particular GPU may only be supported up to a certain driver level, at which point support ceases. These GPUs are then in a "legacy" status.
So if your GPU is old enough, it will not be supported by newer/latest drivers. But if you currently have CUDA 7 running correctly, you must have at least a Fermi GPU, which is still supported by the newest/latest drivers. However, Fermi is probably/likely the next GPU family to go into legacy status, at some point in the future.
I have some experience with NVIDIA CUDA and am now thinking about learning OpenCL too. I would like to be able to run my programs on any GPU. My question is: does every GPU use the same architecture as NVIDIA's (multiprocessors, SIMT structure, global memory, local memory, registers, caches, ...)?
Thank you very much!
Starting with your stated goal:
"I would like to be able to run my programs on any GPU."
Then yes, you should learn OpenCL.
In answer to your overall question, other GPU vendors do use different architectures than Nvidia GPUs. In fact, GPU designs from a single vendor can vary by quite a bit, depending on the model.
This is one reason that a given OpenCL code may perform quite differently (depending on your performance metric) from one GPU to the next. In fact, to achieve optimized performance on any GPU, an algorithm should be "profiled" by varying, for example, local memory size, to find the best algorithm settings for a given hardware design.
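Expressed in CUDA terms (the vocabulary this question starts from), such a parameter sweep might look like the following sketch, which times a made-up SAXPY kernel at several block sizes; in OpenCL you would vary the work-group size the same way:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;  // contents left uninitialized; we only care about timing
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep the block size and time each configuration with CUDA events.
    for (int block = 64; block <= 1024; block *= 2) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        saxpy<<<grid, block>>>(y, x, 2.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d : %.3f ms\n", block, ms);
    }
    return 0;
}
```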
But even with these hardware differences, the goal of OpenCL is to provide a level of core functionality that is supported by all devices (CPUs, GPUs, FPGAs, etc.) and to include "extensions" which allow vendors to expose unique hardware features. Although OpenCL cannot hide significant differences in hardware, it does guarantee portability. This makes it much easier for a developer to start with an OpenCL program tuned for one device and then develop a program optimized for another architecture.
To complicate matters with identifying hardware differences, the terminology used by CUDA is different from that used by OpenCL; for example, the following are roughly equivalent in meaning (see the annotated kernel below):
CUDA:            OpenCL:
Thread           Work-item
Thread block     Work-group
Global memory    Global memory
Constant memory  Constant memory
Shared memory    Local memory
Local memory     Private memory
More comparisons and discussion can be found here.
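To make the table above concrete, here is a trivial CUDA kernel with the rough OpenCL-C equivalent of each construct noted in comments (a sketch for orientation, not a complete OpenCL port):

```
// Each CUDA construct below is annotated with its rough OpenCL-C equivalent.
__global__ void scale(float *out, const float *in, float a, int n)
{                                    // OpenCL: __kernel void scale(__global float *out, ...)
    __shared__ float tile[256];      // OpenCL: __local float tile[256];

    int lid = threadIdx.x;                            // OpenCL: get_local_id(0)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: get_global_id(0)

    if (gid < n) tile[lid] = in[gid];
    __syncthreads();                 // OpenCL: barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < n) out[gid] = a * tile[lid];
}
```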
You will find that the kinds of abstraction provided by OpenCL and CUDA are very similar. You can also usually count on your hardware having similar features: global memory, local memory, streaming multiprocessors, etc.
Switching from CUDA to OpenCL, you may be confused by the fact that many of the same concepts have different names (for example: CUDA "warp" == OpenCL "wavefront").
Is there an SBIOS entry or other configuration change that will enable peer-to-peer to work for CUDA across the QPI links that connect I/O hubs (or sockets, in the case of CPUs that integrate the I/O hub - Sandy Bridge and higher)?
No. The QPI link has a protocol which does not entirely cover all features of the PCIE protocol, and in particular some features used by the P2P protocol.
A specific difference is documented in an Intel datasheet here.
"The IOH does not support non-contiguous byte enables from PCI Express for remote peer-to-peer MMIO transactions. This is an additional restriction over the PCI Express standard requirements to prevent incompatibility with Intel QuickPath Interconnect." (page 135)
So P2P requires a continuous PCIE fabric between the two devices. Both devices need to be on the same PCIE root complex. This particular requirement was publicized by NVIDIA in the CUDA 4.0 timeframe when GPUDirect v2.0 (Peer-to-Peer) was first introduced.
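When the topology requirement is met, enabling and using P2P looks like the following sketch (buffer sizes are illustrative; error checking omitted):

```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    void *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);  // allow device 0 to access device 1

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);  // allow device 1 to access device 0

    // With peer access enabled, this copy can go directly over the PCIE
    // fabric between the two GPUs, without staging through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaDeviceSynchronize();
    return 0;
}
```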
Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is a query of the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.