GPGPU CUDA memory usage statistics at runtime

How do I access and output memory statistics (used memory, available memory) from all types of memory at runtime in CUDA C++?
Global memory, Texture memory, Shared memory, Local memory, Registers, (Constant memory?)
Bonus question: Could you point me to documentation on how to do it with the Windows CUDA profiler tool? Is memory profiling supported on all cards, or is it just some specific models that can do it?

For a runtime check of overall memory usage on the device, use the cudaMemGetInfo API. Note that there is no such thing as dedicated texture memory on NVIDIA devices. Textures are stored in global memory and there is no way to separately account for them using any of the CUDA APIs that I am aware of. You can also programmatically inquire about the size of runtime components which consume global memory (runtime heap, printf buffer, stack) using the cudaDeviceGetLimit API.
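A minimal sketch of both API calls (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Overall device memory: free and total bytes visible to the CUDA context.
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("global memory: %zu bytes free of %zu bytes total\n", freeBytes, totalBytes);

    // Sizes of the runtime components that consume global memory.
    size_t heapSize = 0, printfSize = 0, stackSize = 0;
    cudaDeviceGetLimit(&heapSize,   cudaLimitMallocHeapSize);
    cudaDeviceGetLimit(&printfSize, cudaLimitPrintfFifoSize);
    cudaDeviceGetLimit(&stackSize,  cudaLimitStackSize);
    printf("device heap: %zu, printf buffer: %zu, stack per thread: %zu bytes\n",
           heapSize, printfSize, stackSize);
    return 0;
}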
Constant memory is statically assigned at compile time, and you can get the constant memory usage for a particular translation unit via compile time switches.
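For example, compiling with nvcc -Xptxas -v prints per-kernel register, shared, local, and constant memory (cmem) usage for the translation unit; the declaration below is just an illustration:

// Compile with: nvcc -Xptxas -v example.cu
// ptxas then reports lines such as "Used N registers, ... bytes cmem[...]".
__constant__ float coefficients[256];   // statically assigned constant memory

__global__ void scale(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= coefficients[i % 256];
}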
There is no way that I am aware of to check SM-level resource usage (registers, shared memory, local memory) dynamically at runtime. You can, however, query the per-thread and per-block resource requirements of a particular kernel function at runtime using the cudaFuncGetAttributes API.
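For example, a minimal sketch of querying a kernel's per-thread and per-block requirements with cudaFuncGetAttributes (the kernel here is just a placeholder):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out) { out[threadIdx.x] = (float)threadIdx.x; }

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread     : %d\n",  attr.numRegs);
    printf("static shared mem / block: %zu\n", attr.sharedSizeBytes);
    printf("local mem per thread     : %zu\n", attr.localSizeBytes);
    printf("constant mem for kernel  : %zu\n", attr.constSizeBytes);
    return 0;
}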
The Visual Profiler can show the same information, collected at runtime, in its detail view. I am not a big user of the Visual Profiler, so I am not sure whether it collects device-level memory usage dynamically during a run. I certainly don't recall seeing anything like that, but that doesn't mean it doesn't exist.

Related

Can Unified Memory Migration use NVLink?

I want to check whether unified memory migration (as previously discussed on this page) across different GPUs can now leverage NVLink with later versions of CUDA and GPU architectures.
Yes, unified memory migration can use NVLink.
For a unified memory allocation, this would typically occur when one GPU accesses that allocation, then another GPU accesses that allocation. If those 2 GPUs are in a direct NVLink relationship, the migration of pages from the first to the second will flow over NVLink.
In addition, although you didn't ask about it, NVLink also provides a path for peer-mapped pages, where they do not migrate, but instead a mapping is provided from one GPU to another. The pages may stay on the first GPU, and the 2nd GPU will access them using memory read or write cycles which take place over NVLink.
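A minimal sketch of the migration pattern described above, assuming two managed-memory-capable GPUs with device IDs 0 and 1 (the IDs and the kernel are illustrative):

#include <cuda_runtime.h>

__global__ void touch(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));

    // First GPU touches the allocation: pages migrate to device 0.
    cudaSetDevice(0);
    touch<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // Second GPU touches the same allocation: pages migrate from device 0
    // to device 1, over NVLink if the two GPUs are directly linked.
    cudaSetDevice(1);
    touch<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}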

How to access Managed memory simultaneously by CPU and GPU in compute capability 5.0?

Since simultaneous access to managed memory on devices of compute capability lower than 6.x is not possible (CUDA Toolkit Documentation), is there a way to simultaneously access managed memory by the CPU and GPU with compute capability 5.0, or any method that can make the CPU access managed memory while a GPU kernel is running?
is there a way to simultaneously access managed memory by the CPU and GPU with compute capability 5.0
No.
or any method that can make the CPU access managed memory while a GPU kernel is running.
Not on a compute capability 5.0 device.
You can have "simultaneous" CPU and GPU access to data using CUDA zero-copy techniques.
A full tutorial on both unified memory and pinned/mapped/zero-copy memory is well beyond the scope of what I can write in an answer here. Unified Memory has its own section in the programming guide. Both of these topics are extensively covered here under the cuda tag on SO as well as in many other places on the web. Any questions will likely be answerable with a google search.
In a nutshell, zero-copy memory on a 64-bit OS is allocated via a host pinning API such as cudaHostAlloc(). The memory so allocated is host memory, and always remains there, but it is accessible to the GPU. GPU access to this memory occurs across the PCIe bus, so it is much slower than normal global memory access. The pointer returned by the allocation (on a 64-bit OS) is usable in both host and device code. You can study CUDA sample codes that use zero-copy techniques, such as simpleZeroCopy.
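A minimal zero-copy sketch along the lines of the simpleZeroCopy sample, assuming a 64-bit OS with UVA so the host pointer can be passed straight to the kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped pinned allocations

    int *h_data = nullptr;
    cudaHostAlloc((void **)&h_data, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = i;

    // The GPU reads and writes the pinned host memory over PCIe.
    increment<<<(n + 255) / 256, 256>>>(h_data, n);
    cudaDeviceSynchronize();

    printf("h_data[0] = %d\n", h_data[0]);   // prints 1
    cudaFreeHost(h_data);
    return 0;
}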
By contrast, ordinary unified memory (UM) is data that will be migrated to the processor that is using it. In the pre-Pascal UM regime, this migration is triggered by kernel calls and synchronizing operations; simultaneous access by host and device in this regime is not possible. For Pascal and later devices in a proper post-Pascal UM regime (basically, 64-bit Linux only, CUDA 8+), the data is migrated on demand, even during kernel execution, thus allowing a limited form of "simultaneous" access. Unified memory has various behavior modes, and some of those will cause a unified memory allocation to "decay" into a pinned/zero-copy host allocation under some circumstances.
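One way to check at runtime which regime a given device falls into is to query the concurrent managed access attribute; a minimal sketch (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int concurrent = 0;
    // Non-zero only in the post-Pascal regime (Pascal+, 64-bit Linux, CUDA 8+).
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrent CPU/GPU managed access supported: %s\n",
           concurrent ? "yes" : "no");
    return 0;
}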

cudaMalloc vs __device__ variables [duplicate]

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
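As an illustration of the three methods and their differing host access paths (the names and sizes here are arbitrary):

#include <cuda_runtime.h>

// 1. Statically allocated global memory: the host accesses it via
//    cudaMemcpyToSymbol / cudaMemcpyFromSymbol.
__device__ float staticBuf[256];

// 3. Device-side dynamic allocation (malloc/new in device code):
//    not directly accessible from the host.
__global__ void deviceAlloc()
{
    float *p = (float *)malloc(16 * sizeof(float));
    if (p) {
        p[0] = 1.0f;
        free(p);
    }
}

int main()
{
    float h_data[256] = {0};

    // 1. Static: host access goes through the symbol APIs.
    cudaMemcpyToSymbol(staticBuf, h_data, sizeof(h_data));

    // 2. Runtime API allocation: ordinary cudaMalloc / cudaMemcpy.
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, sizeof(h_data));
    cudaMemcpy(d_buf, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    cudaFree(d_buf);

    // 3. In-kernel malloc: visible only to device code.
    deviceAlloc<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}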
First of all, you need to think about coalescing the memory accesses. You didn't mention which GPU you are using. On the latest GPUs, a coalesced memory read will give the same performance as constant memory. So always make your memory reads and writes as coalesced as possible.
Another option is texture memory (if the data size fits into it). Texture memory has its own caching mechanism; it was previously used when global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives better performance than dynamically allocated global memory, since the coalescing issue still exists either way. Note that variables declared at global (file) scope, such as __device__, constant memory, and texture variables, do not need to be passed to kernels as arguments.
For memory optimizations, please see the Memory Optimizations section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations
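To illustrate what "coalesced" means here, a pair of toy kernels (names are illustrative): consecutive threads reading consecutive addresses coalesce into a few wide transactions, while a strided pattern does not.

// Coalesced: consecutive threads in a warp access consecutive addresses,
// so the warp's loads/stores combine into a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: a stride between consecutive threads scatters the warp's
// accesses across many separate memory transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];
}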

Utilizing Register memory in CUDA

I have some questions regarding CUDA register memory.
1) Is there any way to free registers in a CUDA kernel? I have variables and 1D and 2D arrays in registers (max array size 48).
2) If I use device functions, what happens to the registers used in a device function after its execution? Will they be available for the calling kernel or for other device functions?
3) How does nvcc optimize register usage? Please share the points that are important for optimizing a memory-intensive kernel.
PS: I have a complex algorithm to port to CUDA which takes a lot of registers for computation. I am trying to figure out whether to store intermediate data in registers and write one kernel, or to store it in global memory and break the algorithm into multiple kernels.
Only local variables are eligible to reside in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control over which variables (scalar or static array) will reside in registers. The compiler will make its own choices, striving to balance performance against register usage.
Register usage can be limited using the maxrregcount option of the nvcc compiler.
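For example, register usage can be capped per compilation unit with the nvcc option, or per kernel with __launch_bounds__ (the latter is not mentioned above, but is a closely related mechanism); a sketch:

// Per compilation unit:  nvcc -maxrregcount=32 kernel.cu
// Per kernel: __launch_bounds__ tells the compiler the intended launch
// configuration so it can cap register usage to reach that occupancy.
__global__ void __launch_bounds__(256, 4)   // 256 threads/block, >= 4 blocks/SM
scaleKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}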
You can also put most small 1D and 2D arrays in shared memory or, if you are accessing constant data, put that content into constant memory (which is cached on-chip, very close to the registers, like L1 cache content).
Another way of reducing register usage when dealing with compute-bound kernels in CUDA is to process data in stages, using multiple __global__ kernel calls and storing intermediate results in global memory. Each kernel will use far fewer registers, so more active threads per SM will be able to hide load/store data movements. This technique, in combination with a proper usage of streams and asynchronous data transfers, is very successful most of the time.
Regarding the use of device functions, I'm not sure, but I guess the calling function's register contents will be moved/stored to local memory (L1 cache or so), in the same way register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This operation frees up some registers for the called device function. After the device function completes, its local variables no longer exist, and the registers can again be used by the caller function and refilled with the previously saved content.
Keep in mind that small device functions defined in the same source file as the global kernel may be inlined by the compiler for performance reasons: when this happens, the resulting kernel will in general require more registers.

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory: upload once, execute kernel(s), download once. If you need anything more complicated, try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is really more about using multiple devices and abstracting away explicit memory management. Think of it as a single virtual address space covering the host and all devices; everything else is still valid.
Using host memory automatically is already possible with device-mapped memory. See cudaMalloc* in the reference manual found on the NVIDIA CUDA website.
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, and this will slow down performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing a pointer to device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.
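One concrete consequence of the non-overlapping address spaces is that the runtime can identify, from a pointer value alone, which memory it refers to; a minimal sketch using cudaPointerGetAttributes and cudaMemcpyDefault (neither is mentioned above, but both are standard UVA-era APIs):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    float *d_ptr = nullptr;
    cudaMalloc(&d_ptr, 1024 * sizeof(float));

    // Under UVA, the runtime can tell from the pointer value alone
    // which device (or the host) an address belongs to.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_ptr);
    printf("pointer belongs to device %d\n", attr.device);

    // UVA also enables cudaMemcpyDefault, where the copy direction
    // is inferred from the pointer values.
    float h_buf[1024];
    cudaMemcpy(h_buf, d_ptr, sizeof(h_buf), cudaMemcpyDefault);

    cudaFree(d_ptr);
    return 0;
}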