Why do I have 200 MB of GPU usage even when I only created 1 byte of data?

I ran the following code on an RTX 3060 and an RTX 3080 Ti. Using nvidia-smi, I found the actual GPU usage is 105 MB and 247 MB for the RTX 3060 and RTX 3080 Ti, respectively. Yet I only have 1 byte of data on the GPU. Why is this? And why does the baseline GPU usage differ between the cards?
// compiled with nvcc -O3 show_basic_gpu_usage.cu -o show_basic_gpu_usage
#include <unistd.h>
#include <cstdio>
int main(){
    int run_count = 100;
    int * ddd;
    cudaMalloc(&ddd, 1); // 1 byte
    for (int i = 0; i < run_count; i++){
        sleep(1);
        printf("%d\n", i);
    }
}

Running a CUDA program on a GPU requires something like an operating system, not unlike the way running a typical program you might write on the host system CPU also requires an operating system.
In CUDA this GPU operating system is often referred to as the "CUDA runtime" or perhaps the "CUDA driver". The CUDA runtime does all sorts of administration and housekeeping for the GPU, and it requires (both CPU memory and) GPU memory to do that. Some of this requirement is independent of what your code actually does, some of it may vary based on what your code does.
The memory requirement for this "overhead" can vary based on a number of factors:
exact CUDA version and GPU driver version you are using
the GPU type/architecture
the host operating system
the total amount of GPU memory
which kernels and libraries your code links to or loads
whether or not multiple GPUs are visible to the CUDA runtime
(related) whether or not other consumers are using GPU memory, such as a display driver
and probably other factors
Hundreds of megabytes of utilization per GPU for this overhead is common. This overhead is in addition to whatever your program may allocate. It's also common to see variation from one GPU type to another. There isn't any way to exactly predict the amount of overhead, because of the variety of influencing factors.
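If you want to observe this baseline cost for yourself, here is a minimal sketch (a hypothetical measurement program, not part of the original question) that forces context creation and then queries cudaMemGetInfo. The numbers it reports will vary with the factors listed above, and the gap between total and free memory also includes anything other processes (e.g. a display) are using:
// sketch (hypothetical measure_overhead.cu) to observe the baseline overhead
#include <cstdio>
#include <cuda_runtime.h>
int main(){
    size_t free_mem = 0, total_mem = 0;
    cudaFree(0);                               // common idiom to force CUDA context creation
    cudaMemGetInfo(&free_mem, &total_mem);     // free/total device memory after context creation
    printf("total: %zu MB, free after context: %zu MB\n",
           total_mem / (1024*1024), free_mem / (1024*1024));
    int *ddd = NULL;
    cudaMalloc(&ddd, 1);                       // the 1-byte allocation from the question
    cudaMemGetInfo(&free_mem, &total_mem);
    printf("free after 1-byte cudaMalloc: %zu MB\n", free_mem / (1024*1024));
    // note: a 1-byte cudaMalloc typically consumes a whole allocation granule, not literally 1 byte
    cudaFree(ddd);
    return 0;
}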

Related

What does it mean to say a GPU is under-utilized due to low occupancy?

I am using Numba and CuPy to perform GPU coding. I have switched my code from a V100 NVIDIA card to an A100, and now I get the following warnings:
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
Does anyone know what the two warnings really suggest? How should I improve my code?
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
A GPU is subdivided into SMs. Each SM can hold a complement of threadblocks (which is like saying it can hold a complement of threads). In order to "fully utilize" the GPU, you would want each SM to be "full", which roughly means each SM has enough threadblocks to fill its complement of threads. An A100 GPU has 108 SMs. If your kernel launch (i.e. the grid) has fewer than 108 threadblocks, then your kernel will not be able to fully utilize the GPU: some SMs will be empty, and a threadblock cannot be resident on two or more SMs at the same time. Even 108 (one per SM) may not be enough. An A100 SM can hold 2048 threads, which is two threadblocks of 1024 threads each. Anything less than 2*108 threadblocks in your kernel launch may not fully utilize the GPU. When you don't fully utilize the GPU, your performance may not be as good as possible.
The solution is to expose enough parallelism (enough threads) in your kernel launch to fully "occupy" or "utilize" the GPU. 216 threadblocks of 1024 threads each is sufficient for an A100. Anything less may not be.
For additional understanding here, I recommend the first 4 sections of this course.
NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
One of the cool things about a numba kernel launch is that I can pass to it a host data array:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel[blocks, threads](a)
and numba will "do the right thing". In the above example it will:
create a device array in device memory for storage of a; let's call this d_a
copy the data from a to d_a (Host->Device)
launch your kernel, where the kernel is actually using d_a
when the kernel is finished, copy the contents of d_a back to a (Device->Host)
That's all very convenient. But what if I were doing something like this:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel1[blocks, threads](a)
my_kernel2[blocks, threads](a)
Numba will perform steps 1-4 above for the launch of my_kernel1, and then perform steps 1-4 again for the launch of my_kernel2. In most cases this is probably not what you want as a numba CUDA programmer.
The solution in this case is to "take control" of data movement:
a = numpy.ones(32, dtype=numpy.int64)
d_a = numba.cuda.to_device(a)
my_kernel1[blocks, threads](d_a)
my_kernel2[blocks, threads](d_a)
a = d_a.copy_to_host()
This eliminates unnecessary copying and will generally make your program run faster, in many cases. (For trivial examples involving a single kernel launch, there probably will be no difference.)
For additional understanding, probably any online tutorial such as this one, or just the numba cuda docs, will be useful.

Would it be possible to access GPU RAM from CPU cores with a simple pointer in the new CUDA 6?

If I use this code to try to access GPU RAM from CPU cores using CUDA 5.5 on a GeForce GTX 460 SE (CC 2.1), I get an "Access Violation" exception:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
int main()
{
unsigned char* gpu_ptr = NULL;
cudaMalloc((void **)&gpu_ptr, 1024*1024);
*gpu_ptr = 1;
int q; std::cin >> q;
return 0;
}
But we know that there is UVA (Unified Virtual Addressing). And there is some news:
25 October 2013 - 331.17 Beta Linux GPU Driver: The new NVIDIA Unified Kernel Memory module is a new kernel module for a Unified Memory feature to be exposed by an upcoming release of NVIDIA's CUDA. The new module is nvidia-uvm.ko and will allow for a unified memory space between the GPU and system RAM. http://www.phoronix.com/scan.php?page=news_item&px=MTQ5NDc
Key features of CUDA 6 include: Unified Memory -- Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages. http://www.techpowerup.com/194505/nvidia-dramatically-simplifies-parallel-programming-with-cuda-6.html
Would it be possible to access GPU RAM from CPU cores by using a simple pointer in the new CUDA 6?
Yes, the new unified memory feature in CUDA 6 will make it possible, on Kepler devices and beyond (so not on your Fermi GPU), to share pointers between host and device code.
In order to accomplish this, you will need to use a Kepler device (so cc 3.0 or 3.5) and the new cudaMallocManaged API. This will be further documented when CUDA 6.0 is officially available, but in the meantime you can read more about it at this blog, which includes examples.
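As a rough sketch of what that looks like (assuming a cc 3.0+ device and CUDA 6 or newer; error checking omitted), a single managed pointer can be written by the device and read by the host:
// sketch: one pointer shared between host and device via cudaMallocManaged
#include <cstdio>
#include <cuda_runtime.h>
__global__ void set_value(unsigned char *ptr){
    *ptr = 1;                             // the device writes through the same pointer
}
int main(){
    unsigned char *ptr = NULL;
    cudaMallocManaged(&ptr, 1024*1024);   // managed allocation, visible to host and device
    set_value<<<1, 1>>>(ptr);
    cudaDeviceSynchronize();              // host must not touch managed data until the kernel is done
    printf("%d\n", (int)*ptr);            // host dereference now works: prints 1
    cudaFree(ptr);
    return 0;
}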
This mechanism does not magically cause the effects of the PCI Express bus to disappear, so in effect what is happening is that two copies of the data are being made "behind the scenes" and cudaMemcpy operations are scheduled automatically by the CUDA runtime, as needed. There are a variety of other implementation issues to be aware of; for now I would suggest reading the blog.
Note that Unified Memory (UM) is distinct from Unified Virtual Addressing (UVA) which has been available since CUDA 4.0 and is documented.

Alternatives to malloc for dynamic memory allocations in CUDA kernel functions

I'm trying to compile my CUDA C code for a GPU with the sm_10 architecture, which does not support invoking malloc from __global__ functions.
I need to keep a tree for which the nodes are created dynamically in the GPU memory. Unfortunately, without malloc apparently I can't do that.
Is there a way to copy an entire tree using cudaMalloc? I think such an approach would just copy the root of my tree.
Quoting the CUDA C Programming Guide
Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher.
For compute capability earlier than 2.0, the only possibilities are:
Use cudaMalloc from the host side to allocate as much global memory as you need in your __global__ function (see the sketch after this list);
Use static allocation if you know the required memory size at compile time.
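As an illustration of the first option (a hedged sketch; the Node layout and sizes here are made up for the example), you can pre-allocate a pool of nodes with cudaMalloc on the host and have each thread fill a pre-reserved slot, instead of calling malloc per node inside the kernel:
// sketch: host-side allocation of a node pool, filled inside the kernel without malloc
#include <cstdio>
#include <cuda_runtime.h>
struct Node {            // hypothetical tree node
    int value;
    int left;            // indices into the pool, -1 = no child
    int right;
};
__global__ void build_nodes(Node *pool, int pool_capacity){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one pre-reserved slot per thread
    if (idx >= pool_capacity) return;
    pool[idx].value = idx;
    pool[idx].left  = -1;
    pool[idx].right = -1;
}
int main(){
    const int capacity = 1 << 20;                      // size the pool generously up front
    Node *d_pool = NULL;
    cudaMalloc(&d_pool, capacity * sizeof(Node));
    build_nodes<<<(capacity + 255)/256, 256>>>(d_pool, capacity);
    cudaDeviceSynchronize();
    printf("pool of %d nodes initialized\n", capacity);
    cudaFree(d_pool);
    return 0;
}
Linking the nodes into a tree then means storing pool indices rather than raw pointers, which also makes the structure easy to copy back to the host.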

Peer-to-Peer CUDA transfers

I heard about peer-to-peer memory transfers and read something about them, but could not really understand how fast they are compared to standard PCI-E bus transfers.
I have a CUDA application which uses more than one GPU, and I might be interested in P2P transfers. My question is: how fast is it compared to PCI-E? Can I use it often to have two devices communicate with each other?
A CUDA "peer" refers to another GPU that is capable of accessing data from the current GPU. All GPUs with compute 2.0 and greater have this feature enabled.
Peer to peer memory copies involve using cudaMemcpy to copy memory over PCI-E as shown below.
cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
Note that dst and src can be on different devices.
cudaDeviceEnablePeerAccess enables the user to launch a kernel that uses data from multiple devices. The memory accesses are still done over PCI-E and will have the same bottlenecks.
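A minimal sketch of that setup (assuming two devices, 0 and 1, are visible; error checking omitted) might look like:
// sketch: check and enable peer access between device 0 and device 1, then copy between them
#include <cstdio>
#include <cuda_runtime.h>
int main(){
    const size_t bytes = 1 << 20;
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);     // can device 0 access device 1's memory?
    printf("peer access 0 -> 1: %s\n", can_access ? "yes" : "no");
    float *d0 = NULL, *d1 = NULL;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (can_access) cudaDeviceEnablePeerAccess(1, 0);   // device 0 kernels may now dereference d1
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    cudaMemcpyPeer(d0, 0, d1, 1, bytes);            // explicit device-to-device copy across GPUs
    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}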
A good example of this is the simpleP2P sample from the CUDA samples.

Running parallel CUDA tasks

I am about to create a GPU-enabled program using CUDA technology. It will be either C# with Emgu or C++ with the CUDA toolkit (not yet decided).
I need to use all of the GPU's power (I have a card with 16 GPU cores). How do I run 16 tasks in parallel?
First off: 16 GPU cores is, on pre-6xx series cards, equal to 16*8 = 128 cores. On the 6xx series it is 16*32 = 512 cores. That does not mean you should limit yourself to 128/512 tasks.
Second: Emgu seems to be an OpenCV wrapper for .NET, and is related to image processing. It generally has nothing to do with GPU programming, although some of its algorithms may be GPU-accelerated. The alternative to CUDA here is OpenCL, not OpenCV. If you will be using CUDA technology like you say, you have no alternative to CUDA, as only CUDA is CUDA.
When it comes to starting tasks, you only tell the GPU how many threads you wish to run. Actually, you tell the GPU how many blocks, and how many threads per block, you wish to run. This is done when you launch the CUDA kernel itself. You don't want to limit yourself to 128/512 threads either; experiment.
I don't know your level of experience with GPGPU programming, but remember that you cannot run tasks the way you do on the CPU. You cannot run 128 different tasks; all threads have to run the exact same instructions (except when branching, which should generally be avoided).
Generally speaking, you want sufficient threads to fill all the streaming multiprocessors. At a minimum that is .25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR.
Specifically in CUDA, suppose you have some CUDA kernel __global__ void square_array(float *a, int N)...
Now when you launch the kernel you specify the number of blocks and the number of threads per block
square_array <<< n_blocks, n_threads_per_block >>> (a, N);
Note: you need to get more familiar with the CUDA parallel programming model, as you are not approaching this in a manner that will use all of your GPU's power. Consider reading Programming Massively Parallel Processors: A Hands-on Approach.
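Putting those pieces together, here is a hedged sketch of what a complete launch might look like (the kernel body and sizes are illustrative, not taken from the question):
// sketch: square_array with a launch sized from the device properties
#include <cstdio>
#include <cuda_runtime.h>
__global__ void square_array(float *a, int N){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}
int main(){
    const int N = 1 << 20;
    float *d_a = NULL;
    cudaMalloc(&d_a, N * sizeof(float));   // contents left uninitialized; this only shows launch sizing
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // the rough lower bound from above: .25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR
    int min_threads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor / 4;
    int n_threads_per_block = 256;
    int n_blocks = (N + n_threads_per_block - 1) / n_threads_per_block;   // cover all N elements
    printf("launching %d blocks of %d threads (suggested minimum %d threads)\n",
           n_blocks, n_threads_per_block, min_threads);
    square_array<<<n_blocks, n_threads_per_block>>>(d_a, N);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}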