GPU Memory Allocation under CUDA 8 and Pascal Architecture

The Pascal architecture has brought an amazing feature for CUDA developers by upgrading the unified memory behavior, allowing them to allocate managed buffers far larger than the memory physically available on the system.
I am just curious how this is implemented under the hood. I have tested it by "cudaMallocManaging" a huge buffer, and nvidia-smi doesn't show anything (unless the buffer size is below the available GDDR).

First of all I suggest you do proper CUDA error checking on all CUDA API calls. It would seem from your description that you are not.
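A common way to do that is to wrap every runtime call in a checking macro, for example (a minimal sketch; the macro name and the 1 GiB size are arbitrary):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if any CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

int main() {
    void *buf = nullptr;
    CUDA_CHECK(cudaMallocManaged(&buf, 1ull << 30));  // 1 GiB managed allocation
    CUDA_CHECK(cudaFree(buf));
    return 0;
}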
Demand paging in unified memory (UM), which allows allocations larger than the GPU's physical DRAM, will only work with:
Pascal (or future) GPUs
CUDA 8 (or future) toolkit
Other than that, your setup should probably work. If it's not working for you with CUDA 8 (not CUDA 8RC) and a Pascal GPU, make sure that you meet the requirements (e.g. OS) for UM and also do proper error checking. Rather than trying to infer what is happening from nvidia-smi, run an actual test on the allocated memory.
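For example, a minimal oversubscription test might allocate more managed memory than the GPU has, touch every element from a kernel, and verify the result on the host (a sketch; error checking as suggested above is assumed, and 16 GiB is just an arbitrary size larger than the GPU's DRAM):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(unsigned char *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) p[i] = 1;        // touch every page from the GPU
}

int main() {
    const size_t n = 16ull << 30;               // e.g. 16 GiB, larger than GPU DRAM
    unsigned char *p = nullptr;
    cudaMallocManaged((void **)&p, n);          // oversubscribed managed allocation
    fill<<<1024, 256>>>(p, n);
    cudaDeviceSynchronize();
    printf("first = %d, last = %d\n", p[0], p[n - 1]);  // verify from the host
    cudaFree(p);
    return 0;
}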
For a more general description of the feature I refer you to this blog article.

Related

How to access Managed memory simultaneously by CPU and GPU in compute capability 5.0?

Since simultaneous access to managed memory on devices of compute capability lower than 6.x is not possible (CUDA Toolkit Documentation), is there a way to simultaneously access managed memory by CPU and GPU with compute capability 5.0, or any method that can make the CPU access managed memory while a GPU kernel is running?
is there a way to simultaneously access managed memory by CPU and GPU with compute capability 5.0
No.
or any method that can make the CPU access managed memory while a GPU kernel is running.
Not on a compute capability 5.0 device.
You can have "simultaneous" CPU and GPU access to data using CUDA zero-copy techniques.
A full tutorial on both Unified memory as well as Pinned/Mapped/Zero-copy memory is well beyond the scope of what I can write in an answer here. Unified Memory has its own section in the programming guide. Both of these topics are extensively covered here on the cuda tag on SO as well as many other places on the web. Any questions will likely be answerable with a google search.
In a nutshell, zero-copy memory on a 64-bit OS is allocated via a host pinning API such as cudaHostAlloc(). The memory so allocated is host memory and always remains there, but it is accessible to the GPU. GPU access to this memory occurs across the PCIe bus, so it is much slower than normal global memory access. The pointer returned by the allocation (on a 64-bit OS) is usable in both host and device code. You can study CUDA sample codes that use zero-copy techniques, such as simpleZeroCopy.
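A minimal zero-copy sketch along those lines (assuming a 64-bit OS with UVA, so the pointer from cudaHostAlloc can be passed straight to the kernel; the kernel and sizes are just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                    // each access crosses the PCIe bus
}

int main() {
    const int n = 1 << 20;
    int *h = nullptr;
    // Pinned, mapped host allocation: stays in host memory, visible to the GPU.
    cudaHostAlloc((void **)&h, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = i;

    increment<<<(n + 255) / 256, 256>>>(h, n);  // pass the host pointer directly (64-bit OS / UVA)
    cudaDeviceSynchronize();

    printf("h[0] = %d\n", h[0]);                // CPU sees the GPU's writes
    cudaFreeHost(h);
    return 0;
}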
By contrast, ordinary unified memory (UM) is data that is migrated to the processor that is using it. In the pre-Pascal UM regime, this migration is triggered by kernel launches and synchronizing operations; simultaneous access by host and device in this regime is not possible. For Pascal and later devices in a proper post-Pascal UM regime (basically, 64-bit Linux only, CUDA 8+), the data is migrated on demand, even during kernel execution, thus allowing a limited form of "simultaneous" access. Unified memory has various behavior modes, and some of them will cause a unified memory allocation to "decay" into a pinned/zero-copy host allocation under some circumstances.
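If portability matters, this capability can be queried at runtime rather than inferred from the architecture name (a short sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, concurrent = 0;
    cudaGetDevice(&dev);
    // Non-zero only on devices (and platforms) that support concurrent
    // CPU/GPU access to managed memory, i.e. the post-Pascal UM regime.
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    printf("concurrentManagedAccess = %d\n", concurrent);
    return 0;
}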

Programming CUDA architecture

While programming on the CUDA architecture I ran into a problem: the device resources are too limited. In other words, the stack and heap are too small.
While researching this, I found the function
cudaDeviceSetLimit(cudaLimitStackSize, limit_stack)
that enlarges the stack size, and a similar one for the heap. However, the limits are still too small.
I wonder: how can I store more data on the device?
The stack and heap are provided for convenience. However, you may allocate memory using cudaMalloc if your GPU is recent enough; in that case, the limit is the GPU's on-board memory.
Should you want more, you would need a custom memory allocator managing a large array of system memory and sharing it with the GPU (see cudaHostRegister). Then the limit would be your system memory.
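A sketch of both approaches (the sizes are arbitrary illustrative values, and some platforms require the registered pointer to be page-aligned):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // 1) Enlarge the built-in device stack and heap (still bounded).
    cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);          // per-thread stack, bytes
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256ull << 20);  // device-side malloc heap

    // 2) For larger data, allocate ordinary device memory from the host ...
    void *d_buf = nullptr;
    cudaMalloc(&d_buf, 1ull << 30);                             // limited by GPU on-board memory

    // ... or pin a large block of system memory and share it with the GPU.
    size_t big = 4ull << 30;                                    // limited by system memory
    void *h_buf = malloc(big);
    cudaHostRegister(h_buf, big, cudaHostRegisterMapped);
    void *d_view = nullptr;
    cudaHostGetDevicePointer(&d_view, h_buf, 0);                // device-visible pointer

    cudaHostUnregister(h_buf);
    free(h_buf);
    cudaFree(d_buf);
    return 0;
}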

CUDA vs OpenCL performance comparison

I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two identical kernels for each platform (they differ in the platform specific keywords). They only read and write global memory, each thread different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the configuration for OpenCL - 50,000 global work size and 250 local work size.
The OpenCL code runs faster. Is this possible, or am I timing it wrong? My understanding is that NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.
It would be great if you could suggest why I might be seeing this, and perhaps point out some differences between CUDA and OpenCL as implemented by NVIDIA.
Kernels executing on a modern GPU are almost never compute bound and are almost always memory-bandwidth bound, because there are so many compute cores relative to the available paths to memory.
This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the given algorithm.
In practice this makes it very difficult to predict (or even understand) what performance to expect ahead of time.
The differences you observed are likely due to subtle differences in the memory access patterns between the two kernels that result from different optimizations made by the OpenCL vs CUDA toolchain.
To learn how to optimize your GPU kernels it pays to learn the details of the memory caching hardware available to you, and how to use it to best advantage. (e.g., making strategic use of "local" memory caches vs always going directly to "global" memory in OpenCL.)
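To make the access-pattern point concrete, here is a hedged CUDA sketch of two kernels that move exactly the same amount of data but differ only in how adjacent threads address memory; on most GPUs the strided version is markedly slower:

#include <cuda_runtime.h>

// Adjacent threads read adjacent elements: accesses coalesce into few memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Adjacent threads read elements 'stride' apart: each warp touches many cache lines.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (int)(((long long)i * stride) % n);
    if (i < n) out[i] = in[j];
}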

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI? Specifically:
1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here. In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine to be transferred across a network interface: With GPUDirect, GPU memory goes to Host memory then straight to the network interface card. Without GPUDirect, GPU memory goes to Host memory in one address space, then the CPU has to do a copy to get the memory into another Host memory address space, then it can go out to the network card.
2) Do AMD GPUs allow P2P memory transfers when two GPUs are shared on the same PCIe bus, as described here. In case the graphic is lost at some point, here is a description of the impact of GPUDirect on transferring data between GPUs on the same PCIe bus: With GPUDirect, data can move directly between GPUs on the same PCIe bus, without touching host memory. Without GPUDirect, data always has to go back to the host before it can get to another GPU, regardless of where that GPU is located.
Edit: BTW, I'm not entirely sure how much of GPUDirect is vaporware and how much of it is actually useful. I've never actually heard of a GPU programmer using it for something real. Thoughts on this are welcome too.
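For reference, on the CUDA side the peer-to-peer part of GPUDirect is exposed through the runtime API roughly like this (a sketch assuming two peer-capable GPUs on the same PCIe root):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);     // can GPU 0 reach GPU 1 directly?
    printf("peer access 0 -> 1: %d\n", canAccess);

    if (canAccess) {
        size_t bytes = 64ull << 20;
        void *d0 = nullptr, *d1 = nullptr;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);          // enable direct access to device 1
        cudaMalloc(&d0, bytes);

        cudaSetDevice(1);
        cudaMalloc(&d1, bytes);

        // Copy GPU 1 -> GPU 0 over PCIe without staging in host memory.
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);

        cudaFree(d1);
        cudaSetDevice(0);
        cudaFree(d0);
    }
    return 0;
}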
Although this question is pretty old, I would like to add my answer as I believe the current information here is incomplete.
As stated in the answer by @Ani, you can allocate host memory using CL_MEM_ALLOC_HOST_PTR and you will most likely get pinned host memory that avoids the second copy, depending on the implementation. For instance, the NVIDIA OpenCL Best Practices Guide states:
OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in pinned memory by the driver for best performance.
The thing I find missing from the previous answers is that AMD offers the DirectGMA technology. It enables you to transfer data between the GPU and any other peripheral on the PCIe bus (including other GPUs) directly, without having to go through system memory. It is similar to NVIDIA's GPUDirect RDMA (which is not available on all platforms).
In order to use this technology you must:
Have a compatible AMD GPU (not all of them support DirectGMA). You can use the OpenCL, DirectX, or OpenGL extensions provided by AMD.
Have the peripheral driver (network card, video capture card, etc.) either expose a physical address that the GPU DMA engine can read from and write to, or be able to program the peripheral's DMA engine to transfer data to/from the GPU-exposed memory.
I used this technology to transfer data directly from video capture devices to GPU memory, and from GPU memory to a proprietary FPGA. Both cases were very efficient and did not involve any extra copying.
Interfacing OpenCL with PCIe devices
I think you may be looking for the CL_MEM_ALLOC_HOST_PTR flag in clCreateBuffer. While the OpenCL specification states that this flag "specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory", it is uncertain what AMD's implementation (or other implementations) might do with it.
Here's an informative thread on the topic http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440
Hope this helps.
Edit: I do know that NVIDIA's OpenCL SDK implements this as an allocation in pinned/page-locked memory. I am fairly certain this is what AMD's OpenCL SDK does when running on the GPU.
As pointed out by @ananthonline and @harrism, many of the features of GPUDirect have no direct equivalent in OpenCL. However, if you are trying to reduce memory transfer overhead, as mentioned in the first sentence of your question, zero-copy memory might help. Normally, when an application creates a buffer on the GPU, the contents of the buffer are copied from CPU memory to GPU memory en masse. With zero-copy memory, there is no upfront copy; instead, data is copied over as it is accessed by the GPU kernel.
Zero copy does not make sense for all applications. Here is advice from the AMD APP OpenCL Programming Guide on when to use it:
Zero copy host resident memory objects can boost performance when host memory is accessed by the device in a sparse manner or when a large host memory buffer is shared between multiple devices and the copies are too expensive. When choosing this, the cost of the transfer must be greater than the extra cost of the slower accesses.
Table 4.3 of the Programming Guide describes which flags to pass to clCreateBuffer to take advantage of zero copy (either CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_PERSISTENT_MEM_AMD, depending on whether you want device-accessible host memory or host-accessible device memory). Note that zero copy support is dependent on both the OS and the hardware; it appears to not be supported under Linux or older versions of Windows.
AMD APP OpenCL Programming Guide: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
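A hedged host-side sketch of the zero-copy pattern described above, using CL_MEM_ALLOC_HOST_PTR with map/unmap (the context and command queue are assumed to exist already; whether the buffer is truly zero-copy still depends on the driver and platform):

#include <CL/cl.h>

// 'context' and 'queue' are assumed to have been created beforehand.
cl_mem make_zero_copy_buffer(cl_context context, cl_command_queue queue, size_t bytes) {
    cl_int err = CL_SUCCESS;

    // Ask the implementation to back the buffer with host-accessible (ideally pinned) memory.
    cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                bytes, NULL, &err);

    // Map the buffer to get a host pointer; on a zero-copy path no bulk copy happens here.
    void *host_ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                        0, bytes, 0, NULL, NULL, &err);

    // ... fill host_ptr with input data ...

    // Unmap before using the buffer as a kernel argument.
    clEnqueueUnmapMemObject(queue, buf, host_ptr, 0, NULL, NULL);
    return buf;
}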

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know "Unified Virtual Addressing" is really more about using multiple devices, abstracting from explicit memory management. Think of it as a single virtual GPU, everything else still valid.
Using host memory automatically is already possible with device-mapped-memory. See cudaMalloc* in the reference manual found at the nvidia cuda website.
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map main memory via the CUDA API for direct access from GPU threads, and it will slow down performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing a pointer to device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent access.
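As a small illustration of what UVA does buy you: the runtime can tell from a pointer alone which address space it belongs to, so direction-agnostic copies and pointer queries become possible (a sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *h = nullptr, *d = nullptr;

    cudaMallocHost((void **)&h, bytes);   // pinned host memory, in the unified address space
    cudaMalloc((void **)&d, bytes);

    // With UVA the runtime infers the copy direction from the pointers themselves.
    cudaMemcpy(d, h, bytes, cudaMemcpyDefault);

    // A pointer can also be queried for which device / memory space it refers to.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d);
    printf("device = %d\n", attr.device);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}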