Can Unified Memory Migration use NVLink? - cuda

I want to check whether unified memory migration across different GPUs (as previously discussed in this page) can now leverage NVLink on later versions of CUDA and newer GPU architectures.

Yes, unified memory migration can use NVLink.
For a unified memory allocation, migration typically occurs when one GPU accesses the allocation and then another GPU accesses it. If those two GPUs have a direct NVLink connection, the pages migrate from the first GPU to the second over NVLink.
In addition, although you didn't ask about it, NVLink also provides a path for peer-mapped pages, where the pages do not migrate; instead, a mapping is provided from one GPU to the other. The pages may stay on the first GPU, and the second GPU accesses them using memory read or write cycles that travel over NVLink.
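Here is a minimal CUDA C++ sketch of that access pattern (the kernel, sizes, and device numbering are illustrative; error checking is omitted; it assumes at least two GPUs with demand-paged unified memory support). Whether the migration actually flows over NVLink depends on the topology; nvidia-smi topo -m shows whether the two GPUs are directly linked.

// Sketch: one managed allocation touched first by GPU 0, then by GPU 1.
// If the two GPUs are directly connected by NVLink, the page migration
// triggered by the second access travels over NVLink.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *data, int n, int val) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += val;   // first access on each GPU faults/migrates the pages
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;
    cudaMallocManaged((void**)&data, n * sizeof(int));   // unified (managed) allocation

    // Reports whether direct peer access (over PCIe or NVLink) is possible;
    // it does not by itself distinguish the link type.
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);

    cudaSetDevice(0);
    touch<<<(n + 255) / 256, 256>>>(data, n, 1);   // pages populate/migrate to GPU 0
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    touch<<<(n + 255) / 256, 256>>>(data, n, 2);   // pages migrate GPU 0 -> GPU 1 (over NVLink if linked)
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}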

Related

What are CUDA shared memory capabilities

I am a newbie in CUDA. As far as I know, I must use global memory to make blocks communicate with each other, but my understanding of streams and of the memory hierarchy is stuck somewhere. From what I have read, streams queue multiple kernels in sequence and can be used to apply different kernels to different blocks.
Now I need to exchange arrays between two or more blocks. Can a kernel be used to swap or exchange data held in shared memory between blocks, without involving global/device memory?
If I allocate a block for each sub-population to calculate fitness using some kernel and shared memory, can I transfer data between blocks?
No. Shared memory has block scope. It is not portable between blocks. Global memory or heap memory is portable and could potentially be used to hold data to be accessed by multiple blocks.
However, the standard execution model in CUDA doesn't support grid-level synchronization. Since CUDA 9, and on the newest generations of hardware, there is support for a grid-level synchronization mechanism if you use cooperative groups; however, neither PyCUDA nor Numba exposes that facility as far as I am aware.
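As a concrete illustration of exchanging data between blocks through global memory, here is a minimal CUDA C++ sketch (kernel names and sizes are made up for the example; the same structure applies from PyCUDA or Numba). The boundary between the two kernel launches is the grid-wide synchronization point.

// Sketch: blocks exchange data through global memory. The end of the first
// kernel launch guarantees every block's data is visible before any block
// reads it in the second kernel.
#include <cuda_runtime.h>

__global__ void produce(float *blockResults) {
    __shared__ float partial;                              // shared memory is private to this block
    if (threadIdx.x == 0) partial = blockIdx.x * 1.0f;     // stand-in for a per-block fitness value
    __syncthreads();
    if (threadIdx.x == 0) blockResults[blockIdx.x] = partial;  // publish to global memory
}

__global__ void consume(const float *blockResults, float *out, int numBlocks) {
    // Each block reads the value published by its neighbour block.
    int neighbour = (blockIdx.x + 1) % numBlocks;
    if (threadIdx.x == 0) out[blockIdx.x] = blockResults[neighbour];
}

int main() {
    const int numBlocks = 64, threads = 128;
    float *blockResults, *out;
    cudaMalloc((void**)&blockResults, numBlocks * sizeof(float));
    cudaMalloc((void**)&out, numBlocks * sizeof(float));

    produce<<<numBlocks, threads>>>(blockResults);
    consume<<<numBlocks, threads>>>(blockResults, out, numBlocks);  // launch boundary = grid-wide sync
    cudaDeviceSynchronize();

    cudaFree(blockResults);
    cudaFree(out);
    return 0;
}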

Any guarantees that Torch won't mess up with an already allocated CUDA array?

Assume we allocated some array on our GPU through means other than PyTorch, for example by creating a GPU array using numba.cuda.device_array. Will PyTorch, when allocating later GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array? In general, since PyTorch and Numba use the same CUDA runtime, and thus I assume the same mechanism for memory management, are they automatically aware of memory regions used by other CUDA programs, or does each one of them see the entire GPU memory as its own? If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
EDIT: figured this would be an important assumption: assume that all allocations are done by the same process.
Will PyTorch, when allocating later GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array?
No.
are they automatically aware of memory regions used by other CUDA programs ...
They are not "aware", but each process gets its own separate context ...
... or does each one of them see the entire GPU memory as its own?
... and contexts have their own address spaces and isolation. So neither, but there is no risk of memory corruption.
If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
If by "aware" you mean "safe", then that happens automatically. If by "aware" you imply some sort of interoperability, then that is possible on some platforms, but it is not automatic.
... assume that all allocations are done by the same process.
That is a different situation. In general, the same process implies a shared context, and shared contexts share a memory space, but all the normal address-space protection rules and facilities apply, so there is no loss of safety.
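To make the same-process case concrete, here is a minimal CUDA C++ sketch in which two hypothetical "libraries" (stand-ins for Numba and PyTorch, which both ultimately allocate through the same CUDA driver) allocate in one process: the allocations come from one shared address space, never overlap, and a raw device pointer from one can be consumed by code from the other. Names are made up for the example and error checking is omitted.

// Sketch: two independent allocators in the same process share one context
// and one address space; the allocator hands out non-overlapping ranges.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float *p, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

float *libraryA_alloc(int n) { float *p; cudaMalloc((void**)&p, n * sizeof(float)); return p; }
float *libraryB_alloc(int n) { float *p; cudaMalloc((void**)&p, n * sizeof(float)); return p; }

int main() {
    const int n = 1 << 20;
    float *a = libraryA_alloc(n);
    float *b = libraryB_alloc(n);
    printf("A: [%p, %p)  B: [%p, %p)\n", (void*)a, (void*)(a + n), (void*)b, (void*)(b + n));

    // Code from "library B" can operate on "library A"'s buffer because the
    // process shares one address space; non-overlap is ensured by the allocator.
    fill<<<(n + 255) / 256, 256>>>(a, n, 1.0f);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    return 0;
}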

Atomic operations in CUDA kernels on mapped pinned host memory: to do or not to do?

In the CUDA programming guide it is stated that atomic operations on mapped pinned host memory "are not atomic from the point of view of the host or other devices." What I take from this sentence is that if the host memory region is accessed by only one GPU, it is fine to do atomics on the mapped pinned host memory (even from within multiple simultaneous kernels).
On the other hand, in The CUDA Handbook by Nicholas Wilt, page 128, it is stated that:
Do not try to use atomics on mapped pinned host memory, either for the host (locked compare-exchange) or the device (atomicAdd()). On the CPU side, the facilities to enforce mutual exclusion for locked operations are not visible to peripherals on the PCI express bus. Conversely, on the GPU side, atomic operations only work on local device memory locations because they are implemented using the GPU's local memory controller.
Is it safe to do atomics from inside a CUDA kernel on mapped pinned host memory? Can we rely on the PCIe bus to preserve the atomicity of the atomics' read-modify-write?
The caution is intended for people who are using mapped pinned memory to coordinate execution between the CPU and GPU, or between multiple GPUs. When I wrote that, I did not expect anyone to use such a mechanism in the single-GPU case because CUDA provides so many other, better ways to coordinate execution between the CPU(s) and a single GPU.
If there is strictly a producer/consumer relationship between the CPU and GPU (i.e. the producer is updating the memory location and the consumer is passively reading it), that can be expected to work under certain circumstances.
If the GPU is the producer, the CPU would see updates to the memory location as they get posted out of the GPU’s L2 cache. But the GPU code may have to execute memory barriers to force that to happen; and even if that code works on x86, it’d likely break on ARM without heroic measures because ARM does not snoop bus traffic.
If the CPU is the producer, the GPU would have to bypass the L2 cache because it is not coherent with CPU memory.
If the CPU and GPU are trying to update the same memory location concurrently, there is no mechanism to ensure atomicity between the two. Doing CPU atomics will ensure that the update is atomic with respect to CPU code, and doing GPU atomics will ensure that the update is atomic with respect to the GPU that is doing the update.
All of the foregoing discussion assumes there is only one GPU; if multiple GPUs are involved, all bets are off. Although atomics are provided for in the PCI Express 3.0 bus specification, I don’t believe they are supported by NVIDIA GPUs. And support in the underlying platform also is not guaranteed.
It seems to me that whatever a developer may be trying to accomplish by doing atomics on mapped pinned memory, there’s probably a method that is faster, more likely to work, or both.
Yes, this works atomically from a single GPU. So if no other CPU or GPU is accessing the memory it will be atomic. Atomics are implemented in the L2 cache and the CROP (on various GPUs), and both can handle system memory accesses.
It will be slow, though. This memory is not cached on the GPU.
When Nick says, "the facilities to enforce mutual exclusion for locked operations are not visible to peripherals on the PCI express bus", it makes me think he's referring to the lack of atomicity when accessing that memory from both processors, which is correct.
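For the single-GPU case the second answer describes, a minimal CUDA C++ sketch looks like the following (the kernel and sizes are illustrative and error checking is omitted). The atomic is atomic only with respect to this one GPU, and the CPU reads the result only after the kernel has completed.

// Sketch: atomicAdd on mapped (zero-copy) pinned host memory from a single GPU.
// Atomic with respect to that GPU only; concurrent CPU updates are not covered,
// and it is slow because the memory is uncached on the GPU and crosses the bus.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void count(unsigned int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(counter, 1u);   // read-modify-write performed by the GPU on host memory
}

int main() {
    unsigned int *hostCounter = nullptr, *devPtr = nullptr;
    // Pinned, mapped host allocation; on older setups call
    // cudaSetDeviceFlags(cudaDeviceMapHost) before any allocation.
    cudaHostAlloc((void**)&hostCounter, sizeof(unsigned int), cudaHostAllocMapped);
    *hostCounter = 0;
    cudaHostGetDevicePointer((void**)&devPtr, hostCounter, 0);  // device-side alias of the host allocation

    const int n = 1 << 20;
    count<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();             // CPU reads only after the kernel has finished

    printf("counter = %u (expected %d)\n", *hostCounter, n);
    cudaFreeHost(hostCounter);
    return 0;
}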

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI? Specifically:
1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here? In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine transferred across a network interface: with GPUDirect, GPU memory goes to host memory and then straight to the network interface card. Without GPUDirect, GPU memory goes to host memory in one address space, then the CPU has to do a copy to get the memory into another host memory address space, and only then can it go out to the network card.
2) Do AMD GPUs allow P2P memory transfers when two GPUs share the same PCIe bus, as described here? In case the graphic is lost at some point, here is a description of the impact of GPUDirect on transferring data between GPUs on the same PCIe bus: with GPUDirect, data can move directly between GPUs on the same PCIe bus without touching host memory. Without GPUDirect, data always has to go back to the host before it can get to another GPU, regardless of where that GPU is located.
Edit: BTW, I'm not entirely sure how much of GPUDirect is vaporware and how much of it is actually useful. I've never actually heard of a GPU programmer using it for something real. Thoughts on this are welcome too.
Although this question is pretty old, I would like to add my answer as I believe the current information here is incomplete.
As stated in the answer by @Ani, you could allocate host memory using CL_MEM_ALLOC_HOST_PTR, and you will most likely get pinned host memory that avoids the second copy, depending on the implementation. For instance, the NVIDIA OpenCL Best Practices Guide states:
OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in pinned memory by the driver for best performance.
The thing I find missing from the previous answers is the fact that AMD offers DirectGMA technology. This technology enables you to transfer data between the GPU and any other peripheral on the PCI bus (including other GPUs) directly, without having to go through system memory. It is more similar to NVIDIA's GPUDirect RDMA (not available on all platforms).
In order to use this technology you must:
have a compatible AMD GPU (not all of them support DirectGMA). You can use either the OpenCL, DirectX, or OpenGL extensions provided by AMD.
have the peripheral's driver (network card, video capture card, etc.) either expose a physical address that the GPU's DMA engine can read from and write to, or be able to program the peripheral's DMA engine to transfer data to/from the GPU-exposed memory.
I used this technology to transfer data directly from video capture devices to GPU memory and from GPU memory to a proprietary FPGA. Both cases were very efficient and did not involve any extra copying.
Interfacing OpenCL with PCIe devices
I think you may be looking for the CL_MEM_ALLOC_HOST_PTR flag in clCreateBuffer. While the OpenCL specification states that this flag "specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory", it is uncertain what AMD's implementation (or other implementations) might do with it.
Here's an informative thread on the topic http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440
Hope this helps.
Edit: I do know that nVidia's OpenCL SDK implements this as allocation in pinned/page-locked memory. I am fairly certain this is what AMD's OpenCL SDK does when running on the GPU.
As pointed out by @ananthonline and @harrism, many of the features of GPUDirect have no direct equivalent in OpenCL. However, if you are trying to reduce memory transfer overhead, as mentioned in the first sentence of your question, zero-copy memory might help. Normally, when an application creates a buffer on the GPU, the contents of the buffer are copied from CPU memory to GPU memory en masse. With zero-copy memory, there is no upfront copy; instead, data is copied over as it is accessed by the GPU kernel.
Zero copy does not make sense for all applications. Here is advice from the AMD APP OpenCL Programming Guide on when to use it:
Zero copy host resident memory objects can boost performance when host memory is accessed by the device in a sparse manner or when a large host memory buffer is shared between multiple devices and the copies are too expensive. When choosing this, the cost of the transfer must be greater than the extra cost of the slower accesses.
Table 4.3 of the Programming Guide describes which flags to pass to clCreateBuffer to take advantage of zero copy (either CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_PERSISTENT_MEM_AMD, depending on whether you want device-accessible host memory or host-accessible device memory). Note that zero copy support is dependent on both the OS and the hardware; it appears to not be supported under Linux or older versions of Windows.
AMD APP OpenCL Programming Guide: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
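As a rough illustration of the CL_MEM_ALLOC_HOST_PTR path mentioned above, here is a host-side sketch in plain C against the standard OpenCL API (the function name is made up for the example; context/queue setup and error handling are omitted; whether this is truly zero-copy or merely pinned depends on the vendor implementation).

/* Sketch: ask the OpenCL implementation for host-accessible (likely pinned)
 * memory with CL_MEM_ALLOC_HOST_PTR, then map it so the CPU can fill it. */
#include <CL/cl.h>

cl_mem make_zero_copy_buffer(cl_context ctx, cl_command_queue queue, size_t bytes) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);

    /* Map the buffer to get a CPU pointer; on a zero-copy implementation this
     * does not copy, it just exposes the same allocation to the host. */
    void *host_view = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, bytes, 0, NULL, NULL, &err);

    /* ... fill host_view with input data here ... */

    clEnqueueUnmapMemObject(queue, buf, host_view, 0, NULL, NULL);
    return buf;  /* pass buf to kernels as usual; the device then reads it across the bus or DMAs it */
}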

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know "Unified Virtual Addressing" is really more about using multiple devices, abstracting from explicit memory management. Think of it as a single virtual GPU, everything else still valid.
Using host memory automatically is already possible with device-mapped-memory. See cudaMalloc* in the reference manual found at the nvidia cuda website.
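If you do end up needing overlap rather than a single upload/download, the stream-based pattern mentioned above looks roughly like this in CUDA C++ (chunk count, sizes, and the kernel are illustrative; error checking is omitted).

// Sketch: overlapping transfers and kernel work with streams and pinned host memory.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main() {
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost((void**)&h, n * sizeof(float));   // pinned host memory: required for truly async copies
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaStream_t streams[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < chunks; ++c) {
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}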
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, and that will slow down performance, as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing the pointer to the device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.
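To tie this back to the question's request for an example of accessing RAM from the GPU: with mapped pinned memory plus UVA (CUDA 4.0+, 64-bit host, Fermi or newer), the host pointer returned by cudaHostAlloc can be used directly in a kernel. A minimal CUDA C++ sketch, with an illustrative kernel and error checking omitted:

// Sketch: the GPU reads and writes host RAM through a mapped pinned allocation.
// Every access crosses the bus, so this is slow but occasionally convenient.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void incr(int *hostData, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) hostData[i] += 1;   // each access goes out to system memory over the bus
}

int main() {
    const int n = 1024;
    int *hostData = nullptr;
    // Pinned, mapped host memory; on older setups call
    // cudaSetDeviceFlags(cudaDeviceMapHost) before any allocation.
    cudaHostAlloc((void**)&hostData, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hostData[i] = i;

    // With UVA the host pointer itself is valid on the device; without UVA you
    // would obtain the device alias via cudaHostGetDevicePointer instead.
    incr<<<(n + 255) / 256, 256>>>(hostData, n);
    cudaDeviceSynchronize();

    printf("hostData[42] = %d\n", hostData[42]);   // expected: 43
    cudaFreeHost(hostData);
    return 0;
}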