How do I explain performance variability over the PCIe bus? - cuda

In my CUDA program I see large variability between different runs (up to 50%) in communication time, which includes host-to-device and device-to-host data transfer times over PCI Express for pinned memory. How can I explain this variability? Does it happen when the PCI controller and memory controller are busy performing other PCIe transfers? Any insight/reference is greatly appreciated. The GPU is a Tesla K20c, the host is an AMD Opteron 6168 with 12 cores running the Linux operating system. The PCI Express version is 2.0.

The system you are doing this on is a NUMA system, which means that each of the two discrete CPUs in your host (the Opteron 6168 has two 6-core CPUs in a single package) has its own memory controller, and there may be a different number of HyperTransport hops between each CPU's memory and the PCIe controller hosting your CUDA device.
This means that, depending on CPU affinity, the thread which runs your bandwidth tests may have different latency to both host memory and the GPU. This would explain the differences in timings which you are seeing.
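A minimal sketch of what that suggests in practice (not the original benchmark; the choice of core 0 is an assumption, on the idea that it sits on the socket closest to the GPU's PCIe root port, which you can check via the device's local_cpulist file in sysfs): pin the measuring thread before timing the transfer, so repeated runs see the same NUMA distance.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE               // for sched_setaffinity / CPU_SET
    #endif
    #include <sched.h>
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        // Pin the calling thread to one core so every run sees the same
        // NUMA distance to host memory and to the PCIe controller.
        // Assumption: core 0 is on the socket closest to the GPU.
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        const size_t bytes = 256u << 20;          // 256 MB test buffer
        void *h_buf, *d_buf;
        cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);   // pinned host memory
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1.0e3));

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

Running the same binary under numactl with different --cpunodebind values should show the effect directly, since it forces the process onto one socket or the other.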

Related

KVM or vSphere for shared access to GPU across virtual machines

Aside from opinions about which is liked better and a preference for Open Source, are there any good reasons to prefer vSphere (the free version) over KVM when setting up a server for Deep Learning experiments?
The machine is an 8-core Xeon with 64 GB RAM and two Nvidia GPU cards.
What I don't know is whether GPU virtualization or GPU passthrough works better with one hypervisor or the other, or even whether virtualization is possible with both of them. If one of the hypervisors allowed two virtual machines to share a GPU, that would be a killer feature.
Or is using a hypervisor just a bad idea? I hope not, but I will listen to reasons.

How to get maximum GPU memory usage for a process on Ubuntu? (For Nvidia GPU)

I have a server with Ubuntu 16.04 installed. It has a K80 GPU. Multiple processes are using the GPU.
Some processes have unpredictable GPU usage, and I want to reliably monitor their GPU usage.
I know that you can query GPU usage via nvidia-smi, but that only gives you the usage at the time of the query.
Currently I query the information every 100 ms, but that is just sampling the GPU usage and can miss peak GPU usage.
Is there a reliable way for me to get the maximum GPU memory usage for a given process (PID)?
Try using the NVIDIA Visual Profiler. I am not sure how accurate it is, but it gives you a graph of device memory usage over time while your program is running.
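If sampling is acceptable, the NVML library that ships with the driver can be polled directly instead of parsing nvidia-smi output. The sketch below records the peak per-process memory it observes for one PID; the PID value, the 100 ms interval, the 60-second duration, and the 64-entry process table are placeholder assumptions, and it can still miss allocations shorter than the sampling interval.

    #include <stdio.h>
    #include <unistd.h>
    #include <nvml.h>     // ships with the driver; compile with -lnvidia-ml

    int main(void)
    {
        // Placeholder assumptions: device 0, PID 12345, 100 ms sampling,
        // at most 64 compute processes on the GPU.
        const unsigned int target_pid = 12345;
        unsigned long long peak_bytes = 0;

        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        for (int i = 0; i < 600; ++i) {                 // ~60 s of sampling
            nvmlProcessInfo_t procs[64];
            unsigned int count = 64;
            if (nvmlDeviceGetComputeRunningProcesses(dev, &count, procs) == NVML_SUCCESS) {
                for (unsigned int p = 0; p < count; ++p)
                    if (procs[p].pid == target_pid && procs[p].usedGpuMemory > peak_bytes)
                        peak_bytes = procs[p].usedGpuMemory;
            }
            usleep(100 * 1000);                         // 100 ms between samples
        }

        printf("Peak GPU memory seen for PID %u: %llu MiB\n",
               target_pid, peak_bytes >> 20);
        nvmlShutdown();
        return 0;
    }

On reasonably recent drivers, nvidia-smi --query-compute-apps=pid,used_memory --format=csv -lms 100 should log the same information as CSV, which can then be post-processed for the maximum.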

What is the maximum size of a page-locked host memory allocation?

I'm running CUDA 5.0 on 64-bit Ubuntu 13.04 with an NVIDIA GTS 250 that has 1 GB of memory and NVIDIA driver 319.17. The data set I'm using in my computations is too large to fit on the card itself, so I'm trying to allocate page-locked memory on the host system using cudaHostAlloc with the cudaHostAllocMapped flag. The data I'm using is about 18 GB in size, and the host has 24 GB of RAM. My problem is that whenever I try to allocate more than 4 GB of page-locked memory, in any number of chunks, I am given the "out of memory" error. With the standard C malloc I can allocate the whole 18 GB in one shot, but if I try to map it with cudaHostRegister I am still limited to 4 GB.
What is the maximum size of a page-locked allocation in CUDA? Is this an issue with my system, or is this limit set by the hardware, the driver, or the CUDA version? Is there any way to allocate such a large array that can be mapped for the GPU?
SM 1.x class hardware only supports 32-bit addressing. You might be able to allocate more than 4 GB of pinned memory, provided you remove the cudaHostAllocMapped flag (and the cudaDeviceMapHost flag from cudaSetDeviceFlags()). That would enable you to use asynchronous memory copies to transfer data into and out of GPU memory.
But to map more than 4 GB of memory, you need SM 2.x or later hardware on a 64-bit platform.
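As an illustration of the non-mapped approach (a sketch only: the chunk size, the placeholder kernel, and the loop that fills the staging buffer are assumptions, not part of the original question), pinned staging buffers allocated without the mapped flag can be streamed through the card with asynchronous copies:

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for whatever processes each slice.
    // A grid-stride loop keeps the grid small enough for SM 1.x limits.
    __global__ void process_chunk(float *data, size_t n)
    {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += (size_t)gridDim.x * blockDim.x)
            data[i] += 1.0f;
    }

    int main(void)
    {
        // Stage the large data set through a modest pinned buffer instead of
        // mapping it; chunk size and count are illustrative for ~18 GB of data.
        const size_t chunk_bytes = 256u << 20;          // 256 MB staging buffer
        const int    num_chunks  = 72;                  // 72 x 256 MB ~= 18 GB

        float *h_stage, *d_chunk;
        cudaHostAlloc((void **)&h_stage, chunk_bytes, cudaHostAllocDefault); // pinned, NOT mapped
        cudaMalloc((void **)&d_chunk, chunk_bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        const size_t n = chunk_bytes / sizeof(float);
        for (int i = 0; i < num_chunks; ++i) {
            // ... fill h_stage with the i-th slice of the data set here ...
            cudaMemcpyAsync(d_chunk, h_stage, chunk_bytes,
                            cudaMemcpyHostToDevice, stream);
            process_chunk<<<4096, 256, 0, stream>>>(d_chunk, n);
            cudaStreamSynchronize(stream);              // wait before reusing h_stage
        }

        cudaStreamDestroy(stream);
        cudaFree(d_chunk);
        cudaFreeHost(h_stage);
        return 0;
    }

With two staging buffers and two streams the host-side fill could be overlapped with the copy, but the single-buffer version keeps the sketch short.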

What does the nVIDIA CUDA driver do exactly?

What does the Nvidia CUDA driver do exactly, from the perspective of using CUDA?
The driver passes the kernel code, along with the execution configuration (#threads, #blocks)...
and what else?
I saw some posts saying that the driver should be aware of the number of available SMs.
But isn't that unnecessary? Once the kernel is passed to the GPU, the GPU scheduler just needs to spread the work across the available SMs...
The GPU isn't a fully autonomous device; it requires a lot of help from the host driver to do even the simplest things. As I understand it, the driver contains at least:
JIT compiler/optimizer (PTX assembly code can be compiled by the driver at runtime; the driver will also recompile code to match the execution architecture of the device if required and possible; a minimal driver API sketch of this appears after the list)
Device memory management
Host memory management (DMA transfer buffers, pinned and mapped host memory, unified addressing model)
Context and runtime support (i.e. code/heap/stack/printf buffer memory management), dynamic symbol management, streams, etc.
Kernel "grid level" scheduler (includes managing multiple simultaneous kernels on architectures that support it)
Compute mode management
Display driver interop (for DirectX and OpenGL resource sharing)
That probably represents the bare minimum that is required to get some userland device code onto a GPU and running via the host side APIs.
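To make the JIT point concrete, here is a minimal driver API sketch (the hand-written PTX string and the device/context choices are illustrative assumptions) in which the driver compiles raw PTX for whatever GPU is present at module-load time:

    #include <stdio.h>
    #include <cuda.h>     // driver API; compile with -lcuda

    // Hand-written placeholder PTX for an empty kernel called "noop".
    static const char *ptx =
        ".version 3.1\n"
        ".target sm_30\n"
        ".address_size 64\n"
        ".visible .entry noop()\n"
        "{\n"
        "    ret;\n"
        "}\n";

    int main(void)
    {
        cuInit(0);

        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        // The driver JIT-compiles the PTX for the attached GPU at this call.
        CUmodule mod;
        CUresult r = cuModuleLoadData(&mod, ptx);
        printf("cuModuleLoadData returned %d\n", (int)r);

        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "noop");
        cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }

As far as I know, this is the same load path the runtime API falls back on when a fat binary only contains PTX for an architecture older than the installed GPU.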