Does gpuocelot support dynamic memory allocation in CUDA device code?

My algorithm (parallel multi-frontal Gaussian elimination) needs to dynamically allocate memory (tree building) inside CUDA kernel. Does anyone know if gpuocelot supports such things?
According to this: stackoverflow-link and the CUDA Programming Guide, I should be able to do such things, but with gpuocelot I get errors at runtime.
Errors:
When I call malloc() inside a kernel I get this error:

(2.000239) ExternalFunctionSet.cpp:371: Assertion message: LLVM required to call external host functions from PTX.
solver: ocelot/ir/implementation/ExternalFunctionSet.cpp:371: void ir::ExternalFunctionSet::ExternalFunction::call(void*, const ir::PTXKernel::Prototype&): Assertion `false' failed.

When I try to get or set the malloc heap size (in host code):

solver: ocelot/cuda/implementation/CudaRuntimeInterface.cpp:811: virtual cudaError_t cuda::CudaRuntimeInterface::cudaDeviceGetLimit(size_t*, cudaLimit): Assertion `0 && "unimplemented"' failed.
Maybe I have to tell the compiler (somehow) that I want to use device-side malloc()?
Any advice?

You can find the answer on the gpuocelot mailing list:
gpuocelot mailing list link
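For context, the standard device-side malloc pattern the question is about looks like this on real CUDA hardware (compute capability 2.0 or higher, compiled with e.g. nvcc -arch=sm_20); whether gpuocelot's emulation accepts it is exactly what the mailing-list thread addresses:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_alloc()
{
    // Device-side malloc draws from the device heap, which lives in global memory.
    int *p = (int *)malloc(4 * sizeof(int));
    if (p != NULL) {
        p[0] = threadIdx.x;
        free(p);   // free device-heap memory from the same kernel (or a later one)
    }
}

int main()
{
    // Set the device heap size before the first kernel launch that uses malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);
    kernel_alloc<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}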

Related

CUDA equivalent of OpenCL CL_MEM_USE_HOST_PTR

I'd like to know if there is something similar to CL_MEM_USE_HOST_PTR but for CUDA. Reading the CUDA docs, it seems the only "zero-copy" functionality is implemented through the API function cudaHostAlloc. The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area, something that is straightforward in OpenCL using the specified flag with clCreateBuffer.
Maybe I am wrong, but it looks like CUDA doesn't implement such a thing at all.
The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area.
The API call that does that in CUDA is cudaHostRegister(), see here.
It takes a pointer returned by an ordinary host allocator such as malloc() or new, and converts the memory region into pinned memory. (Which would be suitable for "zero-copy" usage, among other things.)
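A minimal sketch of that approach (the buffer size and the mapped flag are illustrative; zero-copy access additionally requires a device that supports mapped host memory):

#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t size = 1 << 20;        // 1 MB, illustrative
    void *buf = malloc(size);           // ordinary pageable host allocation

    // Pin the existing allocation and register it with the CUDA driver;
    // cudaHostRegisterMapped also maps it into the device address space.
    cudaHostRegister(buf, size, cudaHostRegisterMapped);

    void *dev_ptr = NULL;
    cudaHostGetDevicePointer(&dev_ptr, buf, 0);   // device-visible alias of buf

    // ... launch kernels that read/write dev_ptr (zero-copy) ...

    cudaHostUnregister(buf);
    free(buf);
    return 0;
}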

What's the equivalent system API implementation to cudaMallocHost

Hi, I want to allocate pinned memory but without using cudaMallocHost. I've read this post and tried to use a fixed mmap to emulate cudaMallocHost:
// Map the file once to get an address, then remap at that fixed address.
data_mapped_ = (void *)mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd_, 0);
if (munmap(data_mapped_, sb.st_size) == -1) {
    cerr << "munmap failed" << endl;
    exit(-1);
}
data_mapped_ = (void *)mmap(data_mapped_, sb.st_size, PROT_READ, MAP_SHARED|MAP_FIXED, fd_, 0);
But this is still not as fast as cudaMallocHost. So what is the correct C implementation of pinned memory?
CUDA pinned memory (e.g. those pointers returned by cudaMallocHost, cudaHostAlloc, or cudaHostRegister) has several characteristics. One characteristic is that it is non-pageable and this characteristic is largely provided by underlying system/OS calls.
Another characteristic is that it is registered with the CUDA driver. This registration means the driver keeps track of the starting address and size of the pinned allocation. It uses that information to decide exactly how it will process future API calls that touch that region, such as cudaMemcpy or cudaMemcpyAsync.
You could conceivably provide the non-pageable aspect by performing your own system calls. The only way to perform the CUDA driver registration function is to actually call one of the aforementioned CUDA API calls.
Therefore there is no sequence of purely C library or system library calls that can completely mimic the behavior of one of the aforementioned CUDA API calls that provide "pinned" memory.
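Applied to the questioner's snippet, the closest approximation is therefore to combine the mmap with cudaHostRegister, roughly as below (fd_, sb, and data_mapped_ are taken from the original code; whether the driver accepts a file-backed mapping can depend on platform and driver version):

// After the second mmap above succeeds (data_mapped_ != MAP_FAILED):
// mlock() alone would make the pages non-pageable, but only
// cudaHostRegister() also registers the region with the CUDA driver.
cudaError_t err = cudaHostRegister(data_mapped_, sb.st_size, cudaHostRegisterDefault);
if (err != cudaSuccess) {
    cerr << "cudaHostRegister failed: " << cudaGetErrorString(err) << endl;
}
// cudaMemcpy/cudaMemcpyAsync from data_mapped_ now take the pinned-memory path.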

Alternatives to malloc for dynamic memory allocations in CUDA kernel functions

I'm trying to compile my CUDA C code for a GPU with the sm_10 architecture, which does not support invoking malloc from __global__ functions.
I need to keep a tree whose nodes are created dynamically in GPU memory. Unfortunately, without malloc it apparently can't be done.
Is there a way to copy an entire tree using cudaMalloc? I think such an approach would just copy the root of my tree.
Quoting the CUDA C Programming Guide:

"Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher."
For compute capability earlier than 2.0, the only possibilities are:

Use cudaMalloc from the host side to allocate as much global memory as you need in your __global__ function (a pool sketch follows this list);
Use static allocation if you know the required memory size at compile time.
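As a sketch of the first option (the Node layout and all names here are hypothetical): pre-allocate a pool of nodes with cudaMalloc and, since sm_10 lacks global atomics, give each thread its own fixed slice of the pool to carve nodes from:

#include <cuda_runtime.h>

#define NODES_PER_THREAD 64    // worst-case nodes one thread may create

struct Node {
    int value;
    int left;    // index into the pool, -1 if absent (indices, not pointers,
    int right;   // so the tree layout is position-independent)
};

__global__ void build_tree(Node *pool)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * NODES_PER_THREAD;   // this thread's private slice
    int used = 0;

    // "Allocate" a root node from the slice.
    int root = base + used++;
    pool[root].value = tid;
    pool[root].left  = -1;
    pool[root].right = -1;

    // Further nodes come from base + used++, up to NODES_PER_THREAD.
}

int main()
{
    const int threads = 128, blocks = 4;
    Node *d_pool;
    // One cudaMalloc from the host covers every node any thread might need.
    cudaMalloc(&d_pool, (size_t)blocks * threads * NODES_PER_THREAD * sizeof(Node));
    build_tree<<<blocks, threads>>>(d_pool);
    cudaDeviceSynchronize();
    cudaFree(d_pool);
    return 0;
}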

CUDA new delete

Can someone give a clear explanation of how the new and delete keywords would behave if called from __device__ or __global__ code in CUDA 4.2?
Where does the memory get allocated, and if it's on the device, is it local or global?
In terms of the context of the problem: I am trying to create neural networks on the GPU with a linked representation (like a linked list, but each neuron stores a linked list of connections that hold weights and pointers to the other neurons). I know I could allocate using cudaMalloc before the kernel launch, but I want the kernel to control how and when the networks are created.
Thanks!
C++ new and delete operate on device heap memory. The device allows for a portion of the global (i.e. on-board) memory to be allocated in this fashion. new and delete work in a similar fashion to device malloc and free.
You can adjust the amount of device global memory available for the heap using the cudaDeviceSetLimit runtime API call (sketched below).
You may also be interested in the C++ new/delete sample code.
CC 2.0 or greater is required for these capabilities.
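A minimal sketch of both points, enlarging the device heap with cudaDeviceSetLimit and then using new/delete inside a kernel (the Neuron type is a placeholder for the question's linked structure):

#include <cuda_runtime.h>

struct Neuron {
    float   weight;
    Neuron *next;   // device-heap pointers refer to global memory
};

__global__ void build_node()
{
    // new allocates from the device heap; the storage is in global memory,
    // not in a thread's local memory, so other threads could use the pointer.
    Neuron *n = new Neuron;
    if (n != NULL) {
        n->weight = 1.0f;
        n->next   = NULL;
        delete n;   // returns the storage to the device heap
    }
}

int main()
{
    // Enlarge the device heap (default 8 MB) before the first kernel launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    build_node<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}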

cudaMalloc always gives out of memory

I'm facing a simple problem: all my calls to cudaMalloc fail with an out-of-memory error, even if it's just a single byte I'm allocating.
The CUDA device is available and there is also a lot of memory available (both checked with the corresponding calls).
Any idea what the problem could be?
Please try calling cudaSetDevice(), then cudaDeviceSynchronize(), and then cudaThreadSynchronize() at the very beginning of your code.
Use cudaSetDevice(0) if there is only one device; by default the CUDA runtime will initialize device 0.
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaThreadSynchronize();
Please report back what you observe. If it still fails, please specify the OS, architecture, CUDA SDK version, and CUDA driver version, and if possible provide the code snippet that fails.
Thank you everybody for your help.
The problem was not really with cudaMalloc itself; it masked the real problem, which was that CUDA initialisation failed.
Because the first CUDA call was made in a separate thread, no GL context was available there, which led to the failures. I had to make sure CUDA was initialised, via a dummy malloc in the main thread, after the context was created.
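A sketch of that workaround (create_gl_context() is a placeholder for the real context setup):

// In the main thread, after the OpenGL context exists:
// create_gl_context();              // placeholder for the real setup

void *dummy = NULL;
cudaMalloc(&dummy, 1);               // forces CUDA context initialisation here
cudaFree(dummy);

// With the runtime API (CUDA 4.0 and later) the device context created above
// is shared by threads started afterwards, so their cudaMalloc calls succeed.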