Programming CUDA architecture

While programming on the CUDA architecture I ran into a problem: device resources are too limited. In other words, the stack and heap are too small.
While researching the issue, I found the function
cudaDeviceSetLimit(cudaLimitStackSize, limit_stack)
which enlarges the stack size, and a similar one for the heap. However, their sizes are still too limited.
How can I store more information on the device?

The stack and heap are provided for convenience. However, you may allocate memory with cudaMalloc on the device if your GPU is recent enough. In that case, the limit is the GPU's on-board memory.
Should you need more, you would have to write a custom memory allocator that manages a large array of system memory and shares it with the GPU (see cudaHostRegister). Then the limit would be your system memory.
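A minimal sketch of that second approach, assuming you register an existing host buffer for mapped access (error handling trimmed to a single check; older, pre-UVA setups may also need cudaSetDeviceFlags(cudaDeviceMapHost) before any other CUDA call):

// Sketch: back a large buffer with system memory and expose it to the GPU.
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    size_t size = 4ULL * 1024 * 1024 * 1024;           // e.g. 4 GB of system RAM (arbitrary)
    void *hostBuf = malloc(size);
    if (hostBuf == NULL) return 1;

    // Pin and map the existing host allocation so the GPU can address it.
    cudaError_t err = cudaHostRegister(hostBuf, size, cudaHostRegisterMapped);
    if (err != cudaSuccess) return 1;

    void *devPtr = NULL;
    cudaHostGetDevicePointer(&devPtr, hostBuf, 0);      // device-visible alias

    // ... launch kernels that read/write devPtr over PCI-E ...

    cudaHostUnregister(hostBuf);
    free(hostBuf);
    return 0;
}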

Related

CUDA malloc, mmap/mremap

CUDA device memory can be allocated with cudaMalloc and freed with cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, and then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead, you could take a look at thrust device vectors, which hide some of the manual labor and let you resize an allocation with a single vector-style .resize() operation. There's no magic, however: thrust is just a template library built on top of CUDA C (for the CUDA device backend), so under the hood it performs a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
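For illustration, a minimal sketch of that grow-and-copy pattern (the helper name grow_device_buffer is hypothetical):

// Illustrative sketch of "growing" a device allocation by hand:
// allocate a larger buffer, copy the old contents, free the old buffer.
#include <cuda_runtime.h>

cudaError_t grow_device_buffer(void **d_buf, size_t oldBytes, size_t newBytes)
{
    void *d_new = NULL;
    cudaError_t err = cudaMalloc(&d_new, newBytes);
    if (err != cudaSuccess) return err;                  // old buffer untouched on failure

    err = cudaMemcpy(d_new, *d_buf, oldBytes, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) { cudaFree(d_new); return err; }

    cudaFree(*d_buf);                                    // release the old allocation
    *d_buf = d_new;                                      // hand back the larger buffer
    return cudaSuccess;
}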
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.
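For reference, a minimal sketch of allocating mapped pinned ("zero-copy") host memory and reading it from a kernel; the kernel and sizes are illustrative only, and on pre-UVA setups cudaSetDeviceFlags(cudaDeviceMapHost) must be called before any allocation:

// Sketch: mapped pinned host memory; every GPU access travels over PCI-E.
#include <cuda_runtime.h>

__global__ void sum_mapped(const float *data, int n, float *result)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += data[i];            // each read crosses the bus
    *result = s;
}

int main()
{
    const int n = 1 << 20;
    float *h_data = NULL, *d_data = NULL, *d_result = NULL;

    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);                    // device-visible alias
    cudaMalloc((void**)&d_result, sizeof(float));

    sum_mapped<<<1, 1>>>(d_data, n, d_result);
    cudaDeviceSynchronize();

    cudaFree(d_result);
    cudaFreeHost(h_data);
    return 0;
}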

GPU Memory Allocation under CUDA 8 and Pascal Architecture

The Pascal architecture has brought an amazing feature for CUDA developers by upgrading the unified memory behavior, allowing them to allocate managed buffers far bigger than the memory available on the GPU.
I am just curious about how this is implemented under the hood. I have tested it out by "cudaMallocManaging" a huge buffer and nvidia-smi isn't showing anything (unless the buffer size is under the available GDDR).
First of all I suggest you do proper CUDA error checking on all CUDA API calls. It would seem from your description that you are not.
Demand paging in unified memory (UM), which allows allocations to exceed the GPU's physical DRAM size, will only work with:
- Pascal (or future) GPUs
- a CUDA 8 (or future) toolkit
Other than that, your setup should probably work. If it's not working for you with CUDA 8 (not CUDA 8RC) and a Pascal GPU, make sure that you meet the requirements (e.g. OS) for UM and also do proper error checking. Rather than trying to infer what is happening from nvidia-smi, run an actual test on the allocated memory.
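A minimal sketch of such a test, assuming a Pascal GPU with CUDA 8; the allocation size is arbitrary and should simply exceed your GPU's DRAM:

// Illustrative test: oversubscribe GPU memory with cudaMallocManaged and
// actually touch the allocation, checking the API calls.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *p, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;                                 // pages migrate to the GPU on demand
}

int main()
{
    size_t bytes = 16ULL * 1024 * 1024 * 1024;           // e.g. 16 GB, larger than most GPUs
    char *buf = NULL;

    cudaError_t err = cudaMallocManaged((void**)&buf, bytes);
    if (err != cudaSuccess) { printf("alloc failed: %s\n", cudaGetErrorString(err)); return 1; }

    touch<<<(unsigned)((bytes + 255) / 256), 256>>>(buf, bytes);
    err = cudaDeviceSynchronize();
    printf("kernel: %s\n", cudaGetErrorString(err));

    cudaFree(buf);
    return 0;
}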
For a more general description of the feature I refer you to this blog article.

How does CUDA 4.0 support recursion

I'm wondering, does CUDA 4.0 support recursion using local memory or shared memory? I have to maintain a stack in global memory myself, because the system-level recursion can't support my program (probably too many levels of recursion). When the recursion gets deeper, the threads stop working.
So I really want to know how default recursion works in CUDA: does it use local memory or shared memory? Thanks!
Use of recursion requires the use of the ABI, which requires architecture >= sm_20. The ABI has a function calling convention that includes the use of a stack frame. The stack frame is allocated in local memory ("local" means "thread-local", that is, storage private to a thread). Please refer to the CUDA C Programming Guide for basic information on CUDA memory spaces. In addition, you may want to have a look at this previous question: Where does CUDA allocate the stack frame for kernels?
For deeply recursive functions it is possible to exceed the default stack size. For example, on my current system the default stack size is 1024 bytes. You can retrieve the current stack size via the CUDA API function cudaDeviceGetLimit(). You can adjust the stack size via the CUDA API function cudaDeviceSetLimit():
cudaError_t stat;
size_t myStackSize = [your preferred stack size];
stat = cudaDeviceSetLimit (cudaLimitStackSize, myStackSize);
Note that the total amount of memory needed for stack frames is at least the per-thread size multiplied by the number of threads specified in the kernel launch. Often it can be larger due to allocation granularity. So increasing the stack size can eat up memory pretty quickly, and you may find that a deeply recursive function requires more local memory than can be allocated on your GPU.
While recursion is supported on modern GPUs, its use can lead to code with fairly low performance due to function call overhead, so you may want to check whether there is an iterative version of the algorithm you are implementing that may be better suited to the GPU.
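For illustration, a minimal recursive device function together with the stack-size adjustment described above; the recursion depth and stack size here are arbitrary:

// Illustrative recursive device function (requires sm_20 or later and the ABI).
#include <cuda_runtime.h>

__device__ unsigned long long factorial(unsigned int n)
{
    if (n <= 1) return 1ULL;
    return n * factorial(n - 1);                         // each call uses a stack frame in local memory
}

__global__ void fact_kernel(unsigned int n, unsigned long long *out)
{
    *out = factorial(n);
}

int main()
{
    cudaDeviceSetLimit(cudaLimitStackSize, 4096);        // raise the per-thread stack (bytes)

    unsigned long long *d_out = NULL;
    cudaMalloc((void**)&d_out, sizeof(*d_out));
    fact_kernel<<<1, 1>>>(20, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}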

How to adjust the number of CUDA blocks and threads to get optimal performance

I've tested empirically with several block and thread counts, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure that threads in a block may have specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated to blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Would that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
If you use shared memory, you might want to consider it first, because it is a very limited resource and kernels often have very specific needs that constrain the many variables controlling parallelism: you either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (at constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for the larger (48 KB) L1 cache by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, the L1 cache can be disabled for non-local global accesses using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy: the higher the occupancy (warps per multiprocessor), the less likely the multiprocessor will have to wait (for memory transactions or data dependencies), but the more threads must share the same L1 cache, shared memory area, and register file (see the CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs; that is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics, and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to different devices and launch parameters can be nontrivial and will require recompilation, or different kernel variants to be deployed, for every target device architecture.
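As an aside not drawn from the answers above: newer toolkits (CUDA 6.5 and later) expose an occupancy API that can suggest a block size automatically, which can serve as a starting point before profiling. A minimal sketch, with my_kernel as a placeholder:

// Sketch: let the runtime suggest a block size that maximizes occupancy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Returns a block size with maximum theoretical occupancy and the minimum
    // grid size needed to fully load the device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);
    printf("suggested block size: %d (min grid size %d)\n", blockSize, minGridSize);

    int n = 1 << 20;
    float *d = NULL;
    cudaMalloc((void**)&d, n * sizeof(float));
    my_kernel<<<(n + blockSize - 1) / blockSize, blockSize>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}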
I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations you can consider:
- Register usage in your kernel.
- Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I have a fairly good answer here; in a word, computing the optimal distribution over blocks and threads is a difficult problem.

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is really more about using multiple devices, abstracting away explicit memory management. Think of it as a single virtual GPU; everything else is still valid.
Using host memory automatically is already possible with device-mapped memory. See cudaMalloc* in the reference manual found on the NVIDIA CUDA website.
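A minimal sketch of the "upload once, run kernels, download once" pattern with asynchronous transfers on a stream, as suggested above; the kernel and sizes are illustrative only:

// Sketch: pinned host memory plus a stream so transfers can be asynchronous.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 3.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h = NULL, *d = NULL;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void**)&h, n * sizeof(float));       // pinned host memory
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);                       // wait for the whole pipeline

    cudaFree(d);
    cudaFreeHost(h);
    cudaStreamDestroy(stream);
    return 0;
}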
CUDA 4.0 UVA (Unified Virtual Address) does not help you in accessing the main memory from the CUDA threads. As in the previous versions of CUDA, you still have to map the main memory using CUDA API for direct access from GPU threads, but it will slow down the performance as mentioned above. Similarly, you cannot access GPU device memory from CPU thread just by dereferencing the pointer to the device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory), and does not provide coherent accessibility.
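For illustration, one small thing the single address space does enable is letting the runtime infer the copy direction from the pointer values via cudaMemcpyDefault (64-bit, CUDA 4.0 and later); a minimal sketch:

// Sketch: with UVA, cudaMemcpy can deduce host vs. device from the pointers.
#include <cuda_runtime.h>

int main()
{
    const int n = 256;
    float h[n];
    float *d = NULL;

    cudaMalloc((void**)&d, n * sizeof(float));

    // The runtime recognizes h as a host pointer and d as a device pointer.
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyDefault);   // host -> device
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDefault);   // device -> host

    cudaFree(d);
    return 0;
}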