Where can I find information about Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.

Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is mostly useful when working with multiple devices: host memory and the memory of every device share one virtual address space, so the runtime can work out from a pointer value alone where an allocation lives. Everything above is still valid.
Accessing host memory directly from a kernel is already possible with device-mapped host memory. See cudaHostAlloc (with the cudaHostAllocMapped flag) and cudaHostGetDevicePointer in the reference manual on the NVIDIA CUDA website.

CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API before GPU threads can access it directly, and doing so will reduce performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing a pointer to device memory. UVA only guarantees that the address spaces of the devices (and of CPU memory) do not overlap; it does not provide coherent accessibility.
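Since the question asks for an example, here is a minimal sketch of the mapped (zero-copy) host memory approach both answers refer to. The kernel name scale_in_place and the helper mapped_host_memory_demo are made up for illustration; the CUDA calls are cudaSetDeviceFlags, cudaHostAlloc, cudaHostGetDevicePointer and cudaFreeHost from the runtime API.

```cuda
#include <cuda_runtime.h>

// Kernel that reads and writes host RAM through a mapped pointer.
// Every access to data[] goes over the PCIe bus, not to device memory.
__global__ void scale_in_place(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical helper: returns 0 if the kernel updated host memory
// as expected, -1 on any CUDA error or wrong result.
int mapped_host_memory_demo(int n)
{
    // Allow mapped host allocations. Call this before any other CUDA work.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return -1;

    // Pinned host allocation that is also mapped into the device address space.
    float *host_ptr = nullptr;
    if (cudaHostAlloc((void **)&host_ptr, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) host_ptr[i] = (float)i;

    // Device-side alias of the same host buffer.
    float *dev_ptr = nullptr;
    if (cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0) != cudaSuccess) {
        cudaFreeHost(host_ptr);
        return -1;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_in_place<<<blocks, threads>>>(dev_ptr, n, 2.0f);

    // No cudaMemcpy: once the kernel has finished, the host sees the results.
    int status = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : -1;
    for (int i = 0; i < n && status == 0; ++i) {
        if (host_ptr[i] != 2.0f * (float)i) status = -1;
    }

    cudaFreeHost(host_ptr);
    return status;
}
```

Calling mapped_host_memory_demo(1024) from main should return 0 on a device that supports mapped host memory. As the answers say, this is convenient rather than fast: the data is fetched across the bus each time the kernel touches it.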

Related

CUDA malloc, mmap/mremap

CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, and then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which hide some of the manual labor and let you resize an allocation in a single vector-style .resize() operation. There's no magic, however: thrust is just a template library built on top of CUDA C (for the CUDA device backend), so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
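The manual allocate, copy, free sequence could look roughly like the sketch below. The helper name grow_device_buffer and its signature are hypothetical, not part of CUDA; only cudaMalloc, cudaMemcpy and cudaFree are real runtime calls.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: grow the device allocation behind *dev_ptr from
// old_bytes to new_bytes, preserving the first old_bytes of content.
// On failure the original allocation is left untouched.
cudaError_t grow_device_buffer(void **dev_ptr, size_t old_bytes, size_t new_bytes)
{
    if (new_bytes <= old_bytes) return cudaSuccess;  // nothing to grow

    // 1. Create the new, larger allocation.
    void *bigger = nullptr;
    cudaError_t err = cudaMalloc(&bigger, new_bytes);
    if (err != cudaSuccess) return err;

    // 2. Copy the old contents device-to-device.
    if (*dev_ptr != nullptr && old_bytes > 0) {
        err = cudaMemcpy(bigger, *dev_ptr, old_bytes, cudaMemcpyDeviceToDevice);
        if (err != cudaSuccess) {
            cudaFree(bigger);
            return err;
        }
    }

    // 3. Free the old allocation and hand back the new pointer.
    cudaFree(*dev_ptr);
    *dev_ptr = bigger;
    return cudaSuccess;
}
```

Note that both allocations exist at the same time during the copy, so growing a buffer this way temporarily needs old_bytes + new_bytes of free device memory.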
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.

configure local (shared) memory for OpenCL using Nvidia platforms

I want to optimize my local memory access pattern within my OpenCL kernel. I read somewhere about configurable local memory, e.g. that we should be able to configure how much is used for local memory and how much is used for automatic caching.
I also read here that the bank size can be chosen for the latest (Kepler) Nvidia hardware:
http://www.acceleware.com/blog/maximizing-shared-memory-bandwidth-nvidia-kepler-gpus. This point seems to be crucial for storing double precision values in local memory.
Does Nvidia provide the functionality of setting up the local memory exclusively for CUDA users? I can't find similar methods for OpenCL. So is this maybe called in a different way or does it really not exist?
Unfortunately there is no way to control the L1 cache/local memory configuration when using OpenCL. This functionality is only exposed through CUDA (in the runtime API, via cudaDeviceSetCacheConfig or cudaFuncSetCacheConfig).
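For comparison, this is roughly what the CUDA-side calls look like. It is only a sketch: the kernel uses_shared and the helper prefer_shared_memory are invented for the example, and the settings are preferences that the driver is free to ignore on hardware where the split is not configurable.

```cuda
#include <cuda_runtime.h>

// Example kernel (hypothetical) that stages data in shared memory,
// which is what OpenCL calls local memory. Launch with 256 threads per block.
__global__ void uses_shared(float *out)
{
    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical helper: ask for a larger shared memory partition.
cudaError_t prefer_shared_memory()
{
    // Device-wide default for kernels that state no preference of their own.
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    if (err != cudaSuccess) return err;

    // Per-kernel preference, which takes priority over the device-wide one.
    return cudaFuncSetCacheConfig(uses_shared, cudaFuncCachePreferShared);
}
```

The other accepted values are cudaFuncCachePreferL1, cudaFuncCachePreferEqual and cudaFuncCachePreferNone.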

Writing output files from CUDA devices

I am a newbie in CUDA programming and in the process of rewriting a C code as parallelized CUDA code.
Is there a way to write output data files directly from the device without bothering to copy arrays from device to host? I assume that if cuPrintf exists, there must be a way to write a cuFprintf?
Sorry, if the answer has already been given in a previous topic, I can't seem to find it...
Thanks!
The short answer is, no there is not.
cuPrintf and the built-in printf support in the Fermi and Kepler runtime are implemented using device to host copies. The mechanism is no different to using cudaMemcpy to transfer a buffer to the host yourself.
Just about all CUDA compatible GPUs support so-called zero-copy (AKA "pinned, mapped") memory, which allows the GPU to map a host buffer into its address space and execute DMA transfers into that mapped host memory. Note, however, that setup and initialisation of mapped memory has considerably higher overhead than conventional memory allocation (so you really need a lot of transactions to amortise that overhead throughout the life of your application), and that the CUDA driver can't use zero-copy with anything other than addresses backed by physical memory. So you can't mmap a file and use zero-copy on it, i.e. you will still need explicit host side file IO code to get from a zero-copy buffer to disk.
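In practice that means the usual pattern is: run the kernel, copy the results back, and write them with ordinary host file IO. A minimal sketch, assuming a made-up kernel fill_squares and a made-up helper write_device_results:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Example kernel (hypothetical): out[i] = i * i.
__global__ void fill_squares(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * i;
}

// Hypothetical helper: compute n values on the GPU and write them to a
// text file, one value per line. Returns 0 on success, -1 on failure.
int write_device_results(const char *path, int n)
{
    int *dev = nullptr;
    if (cudaMalloc((void **)&dev, n * sizeof(int)) != cudaSuccess) return -1;

    fill_squares<<<(n + 255) / 256, 256>>>(dev, n);

    // The device cannot open files, so bring the data back to the host.
    // cudaMemcpy waits for the kernel on the default stream to finish.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    // Plain host-side file IO from here on.
    FILE *f = std::fopen(path, "w");
    if (f == nullptr) return -1;
    for (int i = 0; i < n; ++i) std::fprintf(f, "%d\n", host[i]);
    std::fclose(f);
    return 0;
}
```

A zero-copy buffer would remove the explicit cudaMemcpy, but the fopen/fprintf part on the host stays the same either way.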

Peer-to-Peer CUDA transfers

I heard about peer-to-peer memory transfers and read something about them, but could not really understand how much faster they are compared to standard PCI-E bus transfers.
I have a CUDA application which uses more than one gpu and I might be interested in P2P transfers. My question is: how fast is it compared to PCI-E? Can I use it often to have two devices communicate with each other?
A CUDA "peer" refers to another GPU that is capable of accessing data from the current GPU. GPUs with compute capability 2.0 and greater support this feature, although whether a particular pair of GPUs can act as peers also depends on the platform; cudaDeviceCanAccessPeer reports this.
Peer-to-peer memory copies use cudaMemcpy to copy memory over PCI-E, as shown below.
cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
Note that dst and src can be on different devices.
cudaDeviceEnablePeerAccess enables the user to launch a kernel that uses data from multiple devices. The memory accesses are still done over PCI-E and will have the same bottlenecks.
A good example of this would be simpleP2P from the CUDA samples.
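A minimal sketch of the steps above, using cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess and cudaMemcpyPeer from the runtime API. The helper name copy_between_gpus is hypothetical, and this is not taken from the simpleP2P sample.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: copy `bytes` from `src` (allocated on src_dev) to
// `dst` (allocated on dst_dev), enabling peer access first when possible.
// Side effect: may change the calling thread's current device.
cudaError_t copy_between_gpus(void *dst, int dst_dev,
                              const void *src, int src_dev, size_t bytes)
{
    // Ask whether dst_dev can address memory that lives on src_dev.
    int can_access = 0;
    cudaError_t err = cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (err != cudaSuccess) return err;

    if (can_access) {
        err = cudaSetDevice(dst_dev);
        if (err != cudaSuccess) return err;

        // Peer access is enabled from the current device towards src_dev.
        err = cudaDeviceEnablePeerAccess(src_dev, 0);
        if (err == cudaErrorPeerAccessAlreadyEnabled) {
            cudaGetLastError();  // clear the error; this case is harmless
            err = cudaSuccess;
        }
        if (err != cudaSuccess) return err;
    }

    // Works in both cases. Without peer access the runtime stages the
    // copy through host memory instead of going GPU to GPU.
    return cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```

Either way the data still crosses PCI-E, so the main gain from peer access is avoiding the extra hop through host memory rather than a faster bus.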

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI? Specifically:
1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here. In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine to be transferred across a network interface: With GPUDirect, GPU memory goes to Host memory then straight to the network interface card. Without GPUDirect, GPU memory goes to Host memory in one address space, then the CPU has to do a copy to get the memory into another Host memory address space, then it can go out to the network card.
2) Do AMD GPUs allow P2P memory transfers when two GPUs are shared on the same PCIe bus, as described here. In case the graphic is lost at some point, here is a description of the impact of GPUDirect on transferring data between GPUs on the same PCIe bus: With GPUDirect, data can move directly between GPUs on the same PCIe bus, without touching host memory. Without GPUDirect, data always has to go back to the host before it can get to another GPU, regardless of where that GPU is located.
Edit: BTW, I'm not entirely sure how much of GPUDirect is vaporware and how much of it is actually useful. I've never actually heard of a GPU programmer using it for something real. Thoughts on this are welcome too.
Although this question is pretty old, I would like to add my answer as I believe the current information here is incomplete.
As stated in the answer by @Ani, you could allocate host memory using CL_MEM_ALLOC_HOST_PTR and, depending on the implementation, you will most likely get pinned host memory that avoids the second copy. For instance, the NVIDIA OpenCL Best Practices Guide states:
OpenCL applications do not have direct control over whether memory objects are
allocated in pinned memory or not, but they can create objects using the
CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in
pinned memory by the driver for best performance
The thing I find missing from previous answers is the fact that AMD offers DirectGMA technology. This technology enables you to transfer data between the GPU and any other peripheral on the PCI bus (including other GPUs) directly, without having to go through system memory. It is closer to NVIDIA's GPUDirect RDMA (which is not available on all platforms).
In order to use this technology you must:
- Have a compatible AMD GPU (not all of them support DirectGMA). You can use the OpenCL, DirectX or OpenGL extensions provided by AMD.
- Have a peripheral driver (network card, video capture card, etc.) that either exposes a physical address which the GPU DMA engine can read from and write to, or lets you program the peripheral's DMA engine to transfer data to and from the memory exposed by the GPU.
I used this technology to transfer data directly from video capture devices to the GPU memory and from the GPU memory to a proprietary FPGA. Both cases were very efficient and did not involve any extra copying.
Interfacing OpenCL with PCIe devices
I think you may be looking for the CL_MEM_ALLOC_HOST_PTR flag in clCreateBuffer. While the OpenCL specification says of it, "This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory", it is uncertain what AMD's implementation (or other implementations) might do with it.
Here's an informative thread on the topic http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440
Hope this helps.
Edit: I do know that nVidia's OpenCL SDK implements this as allocation in pinned/page-locked memory. I am fairly certain this is what AMD's OpenCL SDK does when running on the GPU.
As pointed out by @ananthonline and @harrism, many of the features of GPUDirect have no direct equivalent in OpenCL. However, if you are trying to reduce memory transfer overhead, as mentioned in the first sentence of your question, zero copy memory might help. Normally, when an application creates a buffer on the GPU, the contents of the buffer are copied from CPU memory to GPU memory en masse. With zero copy memory, there is no upfront copy; instead, data is copied over as it is accessed by the GPU kernel.
Zero copy does not make sense for all applications. Here is advice from the AMD APP OpenCL Programming Guide on when to use it:
Zero copy host resident memory objects can boost performance when host
memory is accessed by the device in a sparse manner or when a large
host memory buffer is shared between multiple devices and the copies
are too expensive. When choosing this, the cost of the transfer must
be greater than the extra cost of the slower accesses.
Table 4.3 of the Programming Guide describes which flags to pass to clCreateBuffer to take advantage of zero copy (either CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_PERSISTENT_MEM_AMD, depending on whether you want device-accessible host memory or host-accessible device memory). Note that zero copy support is dependent on both the OS and the hardware; it appears to not be supported under Linux or older versions of Windows.
AMD APP OpenCL Programming Guide: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf