Can cublas do pinned memory allocation?

I understand that pinned memory allocated with cudaHostAlloc can be transferred to the device more efficiently than malloc'ed memory. However, I thought cudaHostAlloc could only be compiled with the CUDA compiler. My scenario is to use the cublas API without the CUDA compiler, and from the handbook it seems that cublas doesn't provide a function for pinned memory allocation, or maybe I'm missing something...

cudaHostAlloc() is part of the CUDA Runtime API. You don't need to compile with nvcc to use CUDA API calls; you can simply include the appropriate header (e.g. cuda_runtime_api.h) and link against the runtime library (cudart).
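
To make this concrete, here is a minimal sketch (not from the original post) of a plain C program that allocates pinned memory with cudaHostAlloc and uses it with cublas, built with an ordinary host compiler. The include/library paths in the comment are assumptions for a typical Linux install; error checking is omitted for brevity.

    /* Build with a host compiler, no nvcc needed, e.g. (paths are assumptions):
     *   gcc pinned_example.c -I/usr/local/cuda/include \
     *       -L/usr/local/cuda/lib64 -lcudart -lcublas -o pinned_example */
    #include <stdio.h>
    #include <cuda_runtime_api.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 1 << 20;
        float *h_x = NULL;   /* pinned host buffer */
        float *d_x = NULL;   /* device buffer */

        /* cudaHostAlloc comes from the runtime API; no device code involved */
        cudaHostAlloc((void **)&h_x, n * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void **)&d_x, n * sizeof(float));

        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

        cublasHandle_t handle;
        cublasCreate(&handle);

        /* host -> device copy; the pinned source buffer enables the faster DMA path */
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);

        float result = 0.0f;
        cublasSasum(handle, n, d_x, 1, &result);
        printf("sum = %f\n", result);

        cublasDestroy(handle);
        cudaFree(d_x);
        cudaFreeHost(h_x);
        return 0;
    }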

Related

CUDA equivalent of OpenCL CL_MEM_USE_HOST_PTR

I'd like to know if there is something similar to CL_MEM_USE_HOST_PTR, but for CUDA. Reading the CUDA docs, it seems the only "zero-copy" functionality is implemented through the API function cudaHostAlloc. The problem is that CUDA allocates the memory itself and there is no way for me to point it at some preallocated CPU memory area, which is straightforward in OpenCL using the specified flag with clCreateBuffer.
Maybe I am wrong, but it looks like CUDA doesn't implement such a thing at all.
The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area.
The API call that does that in CUDA is cudaHostRegister(), see here.
It takes a pointer returned by an ordinary host allocator such as malloc() or new and converts the memory region into pinned memory (which is suitable for "zero-copy" usage, among other things).
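
As an illustration (not part of the original answer), here is a minimal sketch of pinning an existing malloc'ed buffer with cudaHostRegister and obtaining a device alias for zero-copy use. Error handling is omitted, and the device is assumed to support mapped host memory.

    #include <stdlib.h>
    #include <cuda_runtime_api.h>

    int main(void)
    {
        size_t bytes = 1 << 20;
        float *h_buf = (float *)malloc(bytes);   /* ordinary host allocation */

        /* Page-lock (pin) the existing region and map it into device address space */
        cudaHostRegister(h_buf, bytes, cudaHostRegisterMapped);

        float *d_alias = NULL;
        cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);
        /* d_alias can now be passed to kernels; accesses go over the bus
           directly to the host buffer (zero-copy) */

        cudaHostUnregister(h_buf);
        free(h_buf);
        return 0;
    }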

CUDA Unified Memory and use of std::vector in device code

Back in the day, std::vector was not allowed in CUDA device code. Is that still true with the current CUDA 10.2 toolkit and Unified Memory?
I have a few public data members of type std::vector in a class that is passed by reference to a device kernel.
nvcc complains that calling a host function ("std::vector...") from a global function ("...") is not allowed.
What is the correct way to use unified memory, if at all possible, with an std::vector? If it is not possible, is there an efficient work-around?
Back in the day, std::vector was not allowed in CUDA device code. Is that still true with the current CUDA 10.2 toolkit and Unified Memory?
Yes.
What is the correct way to use unified memory, if at all possible, with an std::vector?
There is not one. It isn't possible. There is no C++ standard library support on the device.
If it is not possible, is there an efficient work-around?
No, there is not.
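
For context only (the answer above stands): device code has to work with raw pointers rather than std::vector, so data is typically carried as a pointer plus a length, e.g. allocated with cudaMallocManaged. A hedged sketch with made-up names, compiled with nvcc:

    #include <cstdio>
    #include <cuda_runtime.h>

    struct DeviceArray {      // hypothetical stand-in for a std::vector<float> member
        float  *data;
        size_t  size;
    };

    __global__ void scale(DeviceArray a, float factor)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < a.size) a.data[i] *= factor;
    }

    int main()
    {
        DeviceArray a;
        a.size = 1024;
        cudaMallocManaged(&a.data, a.size * sizeof(float));  // unified memory
        for (size_t i = 0; i < a.size; ++i) a.data[i] = 1.0f;

        scale<<<(unsigned)((a.size + 255) / 256), 256>>>(a, 2.0f);
        cudaDeviceSynchronize();

        printf("a.data[0] = %f\n", a.data[0]);
        cudaFree(a.data);
        return 0;
    }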

Pinned memory in OpenACC (using PGI compiler)

I have a simple CUDA code which I translated to OpenACC. All my kernels were parallelized as expected and they have similar performance to my CUDA kernels. However, the device-to-host memory transfer kills my performance. In my CUDA code I use pinned memory and the performance is much better. Unfortunately, in OpenACC I don't know how to utilize pinned memory. I couldn't find anything in the documentation. Can someone provide me a simple OpenACC example that makes use of pinned memory?
PS: I am using PGI 16.10-0 64-bit compiler
Use the "pinned" sub-option for a "tesla" target, "-ta=tesla:pinned". Note that you can see all the available sub-options via the "-help -ta" flags.

Alternatives to malloc for dynamic memory allocations in CUDA kernel functions

I'm trying to compile my CUDA C code for a GPU with sm_10 architecture which does not support invoking malloc from __global__ functions.
I need to keep a tree for which the nodes are created dynamically in the GPU memory. Unfortunately, without malloc apparently I can't do that.
Is there a way to copy an entire tree using cudaMalloc? I think such an approach would just copy the root of my tree.
Quoting the CUDA C Programming Guide:
Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher.
For compute capability earlier than 2.0, the only possibilities are:
- use cudaMalloc from the host side to allocate as much global memory as you need in your __global__ function (a sketch of this approach follows below);
- use static allocation if you know the required memory size at compile time.
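
A hedged sketch of the first option (not from the answer itself): pre-allocate a fixed pool of nodes with cudaMalloc on the host and hand out slots inside the kernel, which avoids in-kernel malloc entirely. All names are illustrative; nodes refer to each other by pool index rather than by pointer.

    #include <cstdio>
    #include <cuda_runtime.h>

    struct Node {
        int value;
        int left;    // index of left child in the pool, -1 if none
        int right;   // index of right child in the pool, -1 if none
    };

    __global__ void buildChain(Node *pool, int capacity, int n, int *usedOut)
    {
        // A single thread builds a degenerate tree (a chain) as a demonstration.
        // With several allocating threads an atomic counter would be needed,
        // which itself is unavailable on sm_10 hardware.
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            int used = 0;
            int prev = -1;
            for (int i = 0; i < n && used < capacity; ++i) {
                int cur = used++;              // "allocate" the next free slot
                pool[cur].value = i;
                pool[cur].left  = -1;
                pool[cur].right = -1;
                if (prev >= 0) pool[prev].right = cur;
                prev = cur;
            }
            *usedOut = used;
        }
    }

    int main()
    {
        const int capacity = 1024;
        Node *d_pool;
        int  *d_used;
        cudaMalloc(&d_pool, capacity * sizeof(Node));   // host-side allocation of the pool
        cudaMalloc(&d_used, sizeof(int));

        buildChain<<<1, 1>>>(d_pool, capacity, 16, d_used);
        cudaDeviceSynchronize();

        int used = 0;
        cudaMemcpy(&used, d_used, sizeof(int), cudaMemcpyDeviceToHost);
        printf("nodes used: %d\n", used);

        cudaFree(d_pool);
        cudaFree(d_used);
        return 0;
    }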

Is it possible to allocate the pinned memory from mex interface while using CUDA with MATLAB?

I read this pdf and I have been using CUDA with mex for quite a while. I was wondering whether newer-generation GPUs such as Fermi and Kepler allow allocating pinned memory from MATLAB?
I tried this and it worked out fine, so I believe it is possible to allocate pinned memory from the mex interface while using CUDA with MATLAB.
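
For illustration only, a hedged sketch of what a MEX entry point using cudaHostAlloc might look like; the file name, build setup, and the staging-buffer usage are assumptions, not the poster's actual code.

    /* pinned_mex.cu (name is an assumption); built via a CUDA-enabled mex setup */
    #include "mex.h"
    #include <cuda_runtime_api.h>
    #include <cstring>

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        if (nrhs < 1 || !mxIsDouble(prhs[0]))
            mexErrMsgTxt("expected one double input array");

        size_t n     = mxGetNumberOfElements(prhs[0]);
        size_t bytes = n * sizeof(double);

        /* Pinned staging buffer on the host side */
        double *h_pinned = NULL;
        if (cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault) != cudaSuccess)
            mexErrMsgTxt("cudaHostAlloc failed");

        memcpy(h_pinned, mxGetPr(prhs[0]), bytes);

        /* ... launch kernels / cudaMemcpyAsync using h_pinned here ... */

        plhs[0] = mxCreateDoubleMatrix((mwSize)n, 1, mxREAL);
        memcpy(mxGetPr(plhs[0]), h_pinned, bytes);

        cudaFreeHost(h_pinned);
    }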