Pinned memory in OpenACC (using PGI compiler) - cuda

I have a simple CUDA code which I translated to OpenACC. All of my kernels were parallelized as expected and they perform similarly to my CUDA kernels. However, the device-to-host memory transfer kills my performance. In my CUDA code I use pinned memory and the performance is much better. Unfortunately, in OpenACC I don't know how to use pinned memory, and I couldn't find anything in the documentation. Can someone provide a simple OpenACC example that makes use of pinned memory?
PS: I am using the PGI 16.10-0 64-bit compiler.

Use the "pinned" sub-option for a "tesla" target, "-ta=tesla:pinned". Note that you can see all the available sub-options via the "-help -ta" flags.

Related

CUDA equivalent of OpenCL CL_MEM_USE_HOST_PTR

I'd like to know if there is something similar to CL_MEM_USE_HOST_PTR, but for CUDA. Reading the CUDA docs, it seems the only "zero-copy" functionality is implemented through the API function cudaHostAlloc. The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area, something that is straightforward in OpenCL using the specified flag with clCreateBuffer.
Maybe I am wrong, but it looks like CUDA doesn't implement such a thing at all.
The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area.
The API call that does that in CUDA is cudaHostRegister(); see the CUDA Runtime API documentation.
It takes a pointer returned by an ordinary host allocator such as malloc() or new and converts the memory region into pinned memory, which makes it suitable for "zero-copy" usage, among other things.
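A minimal sketch (not from the original answer) of how this typically looks in CUDA C, with error checking omitted. The cudaHostRegisterMapped flag additionally maps the region into the device address space so a kernel can access it directly.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t bytes = 1 << 20;
        float *h_buf = (float *)malloc(bytes);        /* ordinary host allocation */

        /* Pin (page-lock) the region and map it for device access. */
        cudaHostRegister(h_buf, bytes, cudaHostRegisterMapped);

        float *d_buf;
        cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);  /* zero-copy view */

        /* ... launch kernels that read/write d_buf ... */

        cudaHostUnregister(h_buf);
        free(h_buf);
        return 0;
    }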

CUDA Unified Memory and use of std::vector in device code

Back in the day, std::vector was not allowed in CUDA device code. Is that still true with the current CUDA 10.2 toolkit and Unified Memory?
I have a few public data members of type std::vector in a class that is passed by reference to a device kernel.
nvcc complains: calling a __host__ function ("std::vector...") from a __global__ function ("...") is not allowed.
What is the correct way, if at all possible, to use Unified Memory with an std::vector? If it is not possible, is there an efficient workaround?
Back in the day, std::vector was not allowed in CUDA device code. Is that still true with the current CUDA 10.2 toolkit and Unified Memory?
Yes.
What is the correct way, if at all possible, to use Unified Memory with an std::vector?
There is not one. It isn't possible. There is no C++ standard library support on the device.
If it is not possible, is there an efficient workaround?
No, there is not.
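For context (and not part of the original answer), the usual pattern is to avoid std::vector on the device entirely and pass a raw pointer to device-visible memory to the kernel, for example a cudaMallocManaged allocation. A minimal sketch, with error checking omitted:

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));   // unified (managed) memory

        for (int i = 0; i < n; ++i) data[i] = 1.0f;    // host fills it directly

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
        cudaDeviceSynchronize();

        cudaFree(data);
        return 0;
    }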

Variable in the shared and managed memory in cuda

With the Unified Memory feature now in CUDA, variables can be placed in managed memory, which makes the code a little simpler. Shared memory, by contrast, is shared between threads in a thread block (see Section 3.2 of the CUDA Fortran documentation).
My question is: can a variable be in both shared and managed memory? It would be managed on the host, but shared on the device. What type of behaviour may be expected of this kind of variable?
I am using CUDA Fortran. I ask because declaring a variable as managed makes it easier for me to code, whereas making it shared on the device makes it faster than global device memory.
I could not find anything that gave me a definitive answer in the documentation.
My question is: can a variable be in both shared and managed memory?
No, it's not possible.
Managed memory is created with either a static allocator (__managed__ in CUDA C, or the managed attribute in CUDA Fortran) or a dynamic allocator (cudaMallocManaged in CUDA C, or the managed, allocatable attributes in CUDA Fortran).
Both of these are associated with the logical global memory space in CUDA. The __shared__ (or shared) memory space is a separate logical space, and must be used (allocated, accessed) explicitly, independent of any global space usage.
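To make the separation concrete, here is a minimal sketch in CUDA C (the question uses CUDA Fortran, but the memory spaces are the same): a managed variable lives in the global space, and the kernel stages it into shared memory explicitly.

    #include <cuda_runtime.h>

    __managed__ float g_data[256];          // managed: host and device can access it

    __global__ void useShared(void)
    {
        __shared__ float tile[256];         // shared: per-block, device-only
        int i = threadIdx.x;
        tile[i] = g_data[i];                // explicit copy: global (managed) -> shared
        __syncthreads();
        g_data[i] = tile[i] * 2.0f;         // write back to the managed variable
    }

    int main()
    {
        for (int i = 0; i < 256; ++i) g_data[i] = (float)i;   // host initializes
        useShared<<<1, 256>>>();
        cudaDeviceSynchronize();
        return 0;
    }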

Could cublas do pinned memory allocation?

I understand that pinned memory allocated by "cudaHostAlloc" can be transferred to the device more efficiently than "malloc"'ed memory can. However, I think "cudaHostAlloc" can only be compiled by the CUDA compiler. My scenario is using the cuBLAS API without the CUDA compiler, and as far as I can tell from the handbook, cuBLAS doesn't provide a function for pinned memory allocation, or maybe I am missing something...
cudaHostAlloc() is implemented in the CUDA Runtime API. You don't need to compile with nvcc to use CUDA API calls; you can just include the appropriate header (e.g. cuda_runtime_api.h) and link with the runtime library (cudart).
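A minimal sketch (names, paths, and flags are illustrative): plain host code like the following can be built with an ordinary host compiler, e.g. gcc pinned_cublas.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart -lcublas, with no nvcc involved.

    #include <cuda_runtime_api.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 1 << 20;
        float *h_x, *d_x;
        cudaHostAlloc((void **)&h_x, n * sizeof(float), cudaHostAllocDefault);  /* pinned */
        cudaMalloc((void **)&d_x, n * sizeof(float));

        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  /* fast copy: pinned source */

        float result = 0.0f;
        cublasSasum(handle, n, d_x, 1, &result);

        cublasDestroy(handle);
        cudaFree(d_x);
        cudaFreeHost(h_x);
        return 0;
    }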

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is really more about using multiple devices and abstracting away explicit memory management. Think of it as a single virtual GPU; everything else is still valid.
Using host memory automatically is already possible with device-mapped (pinned) memory; see cudaMallocHost / cudaHostAlloc in the reference manual found on the NVIDIA CUDA website.
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, and this will slow down performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing a pointer to device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.
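As an illustration of what "map the main memory using the CUDA API" means, here is a hedged sketch (not from the original answer): allocate mapped pinned host memory and obtain a device pointer to it, which kernels can then dereference directly, at the cost of every access going over the PCI bus.

    #include <cuda_runtime.h>

    __global__ void increment(int *p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;                  /* reads/writes go to host RAM */
    }

    int main(void)
    {
        const int n = 1024;
        int *h_p, *d_p;
        cudaSetDeviceFlags(cudaDeviceMapHost);              /* enable mapping */
        cudaHostAlloc((void **)&h_p, n * sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_p, h_p, 0);    /* device-side alias */

        for (int i = 0; i < n; ++i) h_p[i] = i;
        increment<<<(n + 255) / 256, 256>>>(d_p, n);
        cudaDeviceSynchronize();                            /* host now sees updates */

        cudaFreeHost(h_p);
        return 0;
    }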