Can someone give a clear explanation of how the new and delete keywords would behave if called from __device__ or __global__ code in CUDA 4.2?
Where does the memory get allocated? If it's on the device, is it local or global memory?
In terms of the context of the problem: I am trying to create neural networks on the GPU, and I want a linked representation (like a linked list, but each neuron stores a linked list of connections that hold weights and pointers to the other neurons). I know I could allocate with cudaMalloc before the kernel launch, but I want the kernel to control how and when the networks are created.
Thanks!
C++ new and delete operate on device heap memory. The device allows a portion of global (i.e. on-board) memory to be allocated in this fashion, and new and delete work much like device malloc and free.
You can adjust the amount of device global memory available for the heap using a runtime API call.
You may also be interested in the C++ new/delete sample code.
CC 2.0 or greater is required for these capabilities.
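For example, something along these lines (a minimal sketch, not your actual network code; the Connection struct, kernel name, and 128 MB heap size are made up for illustration, and the file must be compiled for CC 2.0 or higher, e.g. with -arch=sm_20):

struct Connection {                  // hypothetical node type, for illustration only
    float weight;
    Connection *next;
};

__global__ void buildList()
{
    // new draws from the device heap, which resides in global memory
    Connection *head = new Connection;
    if (head != NULL) {              // in-kernel new returns NULL when the heap is exhausted
        head->weight = 1.0f;
        head->next = NULL;
        delete head;                 // delete returns the block to the device heap
    }
}

int main()
{
    // Enlarge the device heap (the default is 8 MB) before the first kernel launch
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
    buildList<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}

Anything allocated this way is visible to all threads and persists across kernel launches until it is deleted, which fits the "kernel builds the network" use case.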
I am a newbie in CUDA programming and in the process of rewriting a C code as a new, parallelized CUDA code.
Is there a way to write output data files directly from the device without bothering to copy arrays from device to host? I assume that if cuPrintf exists, there must be a way to write a cuFprintf?
Sorry if the answer has already been given in a previous topic; I can't seem to find it...
Thanks!
The short answer is: no, there is not.
cuPrintf and the built-in printf support in the Fermi and Kepler runtime are implemented using device-to-host copies. The mechanism is no different from using cudaMemcpy to transfer a buffer to the host yourself.
Just about all CUDA compatible GPUs support so-called zero-copy (AKA "pinned, mapped") memory, which allows the GPU to map a host buffer into its address space and execute DMA transfers into that mapped host memory. Note, however, that setup and initialisation of mapped memory has considerably higher overhead than conventional memory allocation (so you really need a lot of transactions to amortise that overhead over the life of your application), and that the CUDA driver can't use zero-copy with anything other than addresses backed by physical memory. So you can't mmap a file and use zero-copy on it, i.e. you will still need explicit host-side file I/O code to get from a zero-copy buffer to disk.
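If the goal is just to avoid explicit cudaMemcpy calls for result buffers, a mapped (zero-copy) allocation looks roughly like the sketch below; the kernel, sizes, and file name are made up for illustration, and the host-side fwrite is still what actually puts the data on disk:

#include <cstdio>

__global__ void fillResults(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i * 0.5f;                       // the store goes over PCIe into host memory
}

int main()
{
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);       // must be set before the context is created

    float *h_buf, *d_buf;
    cudaHostAlloc((void**)&h_buf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);   // device-side alias of h_buf

    fillResults<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();                     // make sure all device writes have landed

    FILE *f = fopen("results.bin", "wb");        // ordinary host file I/O, as noted above
    fwrite(h_buf, sizeof(float), n, f);
    fclose(f);

    cudaFreeHost(h_buf);
    return 0;
}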
In cuda kernel functions there's no automatic garbage collection. What's the better practice for temporary device pointers in Cuda? Reuse a fixed device pointer, or create and free device pointers?
For example, to write a CUDA kernel function for the sum of squared errors between two vectors, it is convenient to have a temporary device pointer for storing the difference of the two vectors and then sum the squares of the elements it points to. One option is to allocate a temporary device pointer and free it on every function call; another is to have a constantly reused temporary device pointer.
What's the better practice between these two options?
If you can use cudaMalloc and cudaFree and avoid repeated allocations, you should prefer that over dynamic memory allocation within the kernel, which carries an additional performance cost and is limited in size by the device heap:
The following API functions get and set the heap size:
cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
The actual memory allocation for the heap occurs when a module is loaded into the context, either explicitly via the CUDA driver API (see Module), or implicitly via the CUDA runtime API (see CUDA C Runtime).
See Dynamic global memory allocation in CUDA Documentation.
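As a concrete sketch of the reused-buffer option for the sum-of-squared-errors example (the function names and the separate reduction step are illustrative, not a specific API):

__global__ void squaredDiff(const float *a, const float *b, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float d = a[i] - b[i];
        tmp[i] = d * d;                     // the reduction over tmp is a separate step
    }
}

float *d_tmp = NULL;                        // workspace allocated once, up front

void init(int maxN)     { cudaMalloc((void**)&d_tmp, maxN * sizeof(float)); }
void shutdown()         { cudaFree(d_tmp); d_tmp = NULL; }

void sse(const float *d_a, const float *d_b, int n)
{
    squaredDiff<<<(n + 255) / 256, 256>>>(d_a, d_b, d_tmp, n);
    // ... reduce d_tmp here (e.g. with a reduction kernel or thrust::reduce) ...
}

The per-call cudaMalloc/cudaFree variant works too, but every allocation and free adds overhead and can synchronize with the device, so the reused buffer is usually the better choice inside a tight loop.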
I have some questions regarding CUDA register memory.
1) Is there any way to free registers in a CUDA kernel? I have variables and 1D and 2D arrays in registers (maximum array size 48).
2) If I use device functions, then what happens to the registers used in a device function after its execution? Will they be available to the calling kernel or to other device functions?
3) How does nvcc optimize register usage? Please share any points important for optimizing a memory-intensive kernel.
PS: I have a complex algorithm to port to CUDA which takes a lot of registers for its computation. I am trying to figure out whether to store intermediate data in registers and write one kernel, or to store it in global memory and break the algorithm into multiple kernels.
Only local variables are eligible to reside in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control over which variables (scalar or statically sized array) will reside in registers. The compiler makes its own choices, striving for performance while sparing registers.
Register usage can be limited using the maxrregcount option of the nvcc compiler.
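For example (a minimal sketch; the numbers are illustrative, not recommendations):

// Whole-file cap on registers per thread:
//     nvcc -maxrregcount=32 kernel.cu
// Per-kernel hint: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// gives the compiler a register budget to aim for.
__global__ void __launch_bounds__(256, 2) myKernel(float *data)
{
    float x = data[threadIdx.x];   // kept in a register if the budget allows,
    data[threadIdx.x] = x * x;     // otherwise spilled to local memory
}

Note that forcing the register count too low just trades registers for local-memory spills, so it is worth checking the effect with -Xptxas -v.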
You can also put most small 1D and 2D arrays in shared memory or, if you are accessing constant data, place that content in constant memory (which is cached very close to the registers, like L1 cache content).
Another way of reducing register usage when dealing with compute-bound kernels in CUDA is to process data in stages, using multiple global kernel launches and storing intermediate results in global memory. Each kernel will then use far fewer registers, so more active threads per SM will be available to hide load/store latency. This technique, in combination with proper use of streams and asynchronous data transfers, is very successful most of the time.
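A minimal sketch of that staging pattern (the kernels and the split point are made up for illustration):

// Each stage needs fewer registers than a single fused kernel would.
__global__ void stage1(const float *in, float *intermediate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        intermediate[i] = in[i] * in[i];          // first half of the computation
}

__global__ void stage2(const float *intermediate, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = intermediate[i] + 1.0f;          // second half of the computation
}

// Host side: the intermediate results sit in a global-memory buffer between launches.
// stage1<<<grid, block>>>(d_in, d_tmp, n);
// stage2<<<grid, block>>>(d_tmp, d_out, n);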
Regarding the use of device functions, I'm not sure, but I guess the calling function's register contents will be moved/stored into local memory (L1 cache or so), in the same way register spilling occurs when too many local variables are used (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This operation frees up some registers for the called device function. After the device function completes, its local variables no longer exist, and the registers can be used again by the caller and filled with the previously saved content.
Keep in mind that small device functions defined in the same source file as the global kernel may be inlined by the compiler for performance reasons; when this happens, the resulting kernel will in general require more registers.
Suppose I have code that looks like this:
cudaHostAlloc( (void**)&pagelocked_ptr, size, cudaHostAllocDefault );
#pragma omp parallel num_threads(num_streams)
{
...
cudaMemcpyAsync( pagelocked_ptr + offset_thisthread
, src
, count
, kind
, stream_thisthread );
...
}
Note that I explicitly avoided setting the flag cudaHostAllocPortable here. Each thread uses its own stream, and (I believe) implicitly selects the default Cuda device.
According to Cuda by Example Section 11.4,
pages can appear pinned to a single CPU thread only. That is, they will remain page-locked if any thread has allocated them as pinned memory, but they will only appear page-locked to the thread that allocated them.
They go on to say that setting cudaHostAllocPortable can fix this issue and allow all threads to recognize the allocation as a pinned buffer. Therefore, my cudaMemcpyAsync call above will fail unless I specify cudaHostAllocPortable instead of cudaHostAllocDefault.
The Cuda C Guide appears to conflict with this information. My assumption is that the Cuda context keeps track of which regions of host memory are page-locked and can be transferred to the device without an intermediate staging copy. According to the current Cuda C Guide, Sections 3.2.1 and 3.2.4.1,
the primary context for this device...is shared among all the host threads of the application.
and
by default, the benefits of using page-locked memory described above are only available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same unified address space, if any...)
These seem to imply that the page-locked nature of the allocation is known to Cuda calls across different threads, since they're all using device 0, and that calls to cudaMemcpyAsync() in all threads will succeed. In other words, if I'm interpreting correctly, setting cudaHostAllocPortable is only necessary when attempting to share page-locked memory between Cuda contexts (e.g. when switching between GPUs with cudaSetDevice and offloading a chunk of the page-locked allocation to each one).
Is the information in Cuda by Example simply out of date? Talonmies' reply to this question states
Prior to CUDA 4, contexts were not thread safe and needed to be explicitly migrated via the context migration API.
but I am not sure how this affected the visibility of page-locked status to Cuda calls from different threads.
Thanks in advance for your help!
The pagelocked status should be evident to all threads that are using the same context on a particular device. If you are using the runtime API (as you are here) then there is normally only one context per device per process, so all threads within that process should be sharing the same context on a particular device, and have the same view of any pointers in that context.
One of the functions of the cudaHostAllocPortable flag is described in the CUDA documentation:
The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
The implication is that in a multi-context setting or multi-device setting (a context is unique to a particular device), it is necessary to use this flag to get pinned behavior from that pointer in all contexts visible to the process.
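A sketch of the multi-device case where the flag actually matters (one OpenMP thread per GPU; the buffer size and names are illustrative):

#include <omp.h>

int main()
{
    const size_t N = 1 << 20;
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    // cudaHostAllocPortable: pinned for every context in the process,
    // not just the one that was current at allocation time.
    float *h_buf;
    cudaHostAlloc((void**)&h_buf, N * sizeof(float), cudaHostAllocPortable);

    const size_t chunk = N / num_gpus;

    #pragma omp parallel num_threads(num_gpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                        // each thread drives its own device/context

        float *d_chunk;
        cudaMalloc((void**)&d_chunk, chunk * sizeof(float));

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Stays a true async transfer because this device's context
        // also sees h_buf as pinned memory.
        cudaMemcpyAsync(d_chunk, h_buf + dev * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);

        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        cudaFree(d_chunk);
    }

    cudaFreeHost(h_buf);
    return 0;
}

In your single-device, runtime-API case, the cudaHostAllocDefault allocation should behave as pinned for all the OpenMP threads, since they all share device 0's context.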
To avoid really long and incoherent functions I am calling a number of device functions from a kernel. I allocate a shared buffer at the beginning of the kernel call (which is per thread block) and pass pointers to it to all the device functions that perform some processing steps in the kernel.
I was wondering about the following: if I allocate a shared memory buffer in a global function, how can the device functions that I pass a pointer to distinguish between the possible address types (global device memory or shared memory) that the pointer could refer to?
Note that it is invalid to decorate the formal parameters with a shared modifier according to the CUDA Programming Guide. The only ways it could be implemented, imho, are:
a) by putting markers on the allocated memory,
b) by passing invisible parameters with the call, or
c) by having a virtual unified address space with separate segments for global and shared memory, so that a threshold check on the pointer can be used.
So my question is: do I need to worry about this, and if so, how should one proceed without inlining all functions into the main kernel?
===========================================================================================
On the side, I was horrified today to find that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect that I have to declare all device functions inline, and the separation of header/source files is broken. This is of course quite ugly, but is there an alternative?
If I allocate a shared memory buffer in a global function, how can the device functions that I pass a pointer to distinguish between the possible address types (global device memory or shared memory) that the pointer could refer to?
Note that "shared" memory, in the context of CUDA, specifically means the on-chip memory that is shared between all threads in a block. So, if you mean an array declared with the __shared__ qualifier, it normally doesn't make sense to use it for passing information between device functions (as all the threads see the very same memory). Regular (non-__shared__) automatic arrays are placed in registers when possible and otherwise spilled to local memory, which physically resides in off-chip global memory; that would be an inefficient way of passing information between the device functions (especially on < 2.0 devices).
On the side, I was horrified today to find that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect that I have to declare all device functions inline, and the separation of header/source files is broken. This is of course quite ugly, but is there an alternative?
CUDA does not include a linker for device code so you must keep the kernel(s) and all related device functions in the same .cu file.
This depends on the compute capability of your CUDA device. For devices of compute capability <2.0, the compiler has to decide at compile time whether a pointer points to shared or global memory and issue separate instructions. This is not required for devices with compute capability >= 2.0.
By default, all function calls within a kernel are inlined, and the compiler can then, in most cases, use flow analysis to see if something is shared or global. If you're compiling for a device of compute capability < 2.0, you may have encountered the warning "Cannot tell what pointer points to, assuming global memory space". This is what you get when the compiler can't follow your pointers around correctly.
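On a >= 2.0 device the generic address space means a plain pointer parameter can refer to either space, so something like the following works without any decoration on the formal parameter (names are illustrative):

// The same device function operates on a pointer into shared or global memory;
// on CC >= 2.0 the hardware resolves the address space at run time.
__device__ void scaleInPlace(float *buf, int n, float factor)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf[i] *= factor;
}

__global__ void kernel(float *g_data, int n)
{
    __shared__ float s_buf[256];
    int m = (n < 256) ? n : 256;

    if (threadIdx.x < m)
        s_buf[threadIdx.x] = g_data[threadIdx.x];   // stage a tile into shared memory
    __syncthreads();

    scaleInPlace(s_buf, m, 2.0f);                   // pointer into shared memory
    __syncthreads();

    scaleInPlace(g_data, n, 0.5f);                  // pointer into global memory
}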