Inconsistency of IDs between 'nvidia-smi -L' and cuDeviceGetName() - cuda

I run this command in a shell and get:
C:\Users\me>nvidia-smi -L
GPU 0: Quadro K2000 (UUID: GPU-b1ac50d1-019c-58e1-3598-4877fddd3f17)
GPU 1: Quadro 2000 (UUID: GPU-1f22a253-c329-dfb7-0db4-e005efb6a4c7)
But in my code, when I run cuDeviceGetName(.., ID) where ID is the ID given by the nvidia-smi output, the devices have been inverted: GPU 0 becomes Quadro 2000 and GPU 1 becomes Quadro K2000.
Is this expected behavior or a bug? Does anyone know a workaround to make nvidia-smi report the 'real' ID of the GPUs? I could use the UUID to get the proper device with nvmlDeviceGetUUID(), but using the NVML API seems a bit too complicated for what I'm trying to achieve.
This question discusses how CUDA assigns IDs to devices, without a clear conclusion.
I am using CUDA 6.5.
EDIT: I've had a look at the nvidia-smi manpage (should have done that earlier...). It states:
"It is recommended that users desiring consistency use either UUID or PCI bus ID, since device enumeration ordering is not guaranteed to be consistent"
Still looking for a kludge...

In your shell, you can set the CUDA device order to follow the PCI bus ID instead of the default (fastest card first). This requires CUDA 7 or later.
export CUDA_DEVICE_ORDER=PCI_BUS_ID

It's expected behavior.
nvidia-smi enumerates in PCI order.
By default, the CUDA driver and runtime APIs do not.
The question you linked clearly shows how to associate the two numbering/ordering schemes.
There is no way to cause nvidia-smi to modify its ordering scheme to match whatever will be generated by the CUDA runtime or driver APIs. However you can modify the CUDA runtime enumeration order through the use of an environment variable in CUDA 8.
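Since the manpage quoted in the question recommends the UUID as a stable identifier, one way to associate the two orderings is to key on it rather than on the index. The following is a minimal sketch, not an official tool: it only parses lines in the `nvidia-smi -L` format shown in the question, and the names SmiEntry and parse_smi_line are hypothetical. The line format is assumed from the sample output above and may differ between driver versions.

```cpp
#include <regex>
#include <string>

// One line of `nvidia-smi -L` output, split into its parts.
struct SmiEntry {
    int index;         // index as printed by nvidia-smi (not the CUDA index)
    std::string name;  // product name, e.g. "Quadro K2000"
    std::string uuid;  // e.g. "GPU-b1ac50d1-..."
};

// Hypothetical helper: parses a line such as
//   GPU 0: Quadro K2000 (UUID: GPU-b1ac50d1-019c-58e1-3598-4877fddd3f17)
// Returns false if the line does not match the assumed format.
bool parse_smi_line(const std::string& line, SmiEntry& out) {
    static const std::regex pattern(
        R"(GPU (\d+): (.+) \(UUID: (GPU-[0-9a-fA-F-]+)\))");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) {
        return false;
    }
    out.index = std::stoi(m[1].str());
    out.name = m[2].str();
    out.uuid = m[3].str();
    return true;
}
```

The resulting UUID string can then be compared with whatever UUID the CUDA side reports for each device, so the match no longer depends on enumeration order.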

It's the expected behaviour.
The nvidia-smi manpage describes its device index as
"the GPU/Unit's 0-based index in the natural enumeration returned by the driver",
whereas the CUDA API enumerates in descending order of compute capability, according to "Programming Guide" 3.2.6.1 Device enumeration.
I had this problem, so I wrote a program analogous to nvidia-smi that enumerates devices in an order consistent with the CUDA API. The program is available here:
https://github.com/smilart/nvidia-cdl
I wrote it because nvidia-smi cannot enumerate devices in an order consistent with the CUDA API.

Related

How do I make sure Vulkan is using the same GPU as CUDA?

I'm using an application that uses both Vulkan and CUDA (specifically PyTorch) on an HPC cluster (Univa Grid Engine).
When a job is submitted, the cluster scheduler sets an environment variable SGE_HGR_gpu which contains a GPU ID for the job to use (so that jobs run by other users do not use the same GPU).
The typical way to tell an application that uses CUDA to use a specific GPU is to set CUDA_VISIBLE_DEVICES=n.
As I'm also using Vulkan, I don't know how to make sure that I choose the same device from those listed by vkEnumeratePhysicalDevices.
I think that the order of the values that 'n' can take is the same as the order of the devices on the PCI bus. However, I don't know whether vkEnumeratePhysicalDevices returns devices in this order, and the documentation does not specify what the order is.
So how can I go about making sure I'm choosing the same physical GPU for both Vulkan and CUDA?
With VkPhysicalDeviceIDPropertiesKHR (Vulkan 1.1) or VkPhysicalDeviceVulkan11Properties (Vulkan 1.2) you can get the device UUID, which is one of the formats CUDA_VISIBLE_DEVICES seems to accept. You should also be able to convert an index to a UUID (or vice versa) with nvidia-smi -L (or with the NVML library).
Or, the other way around, cudaDeviceProp includes PCI info which could be compared to the output of the VK_EXT_pci_bus_info extension.
As for whether the order matches in Vulkan, it is best to ask NVIDIA directly; I cannot find any information on how NVIDIA orders the devices. IIRC from the Vulkan Loader implementation, the order should follow the order of the ICDs in the registry, and then the order in which the NVIDIA driver itself reports them. Even so, you would have to filter non-NVIDIA GPUs from the list in generic code, and you cannot know whether the NVIDIA Vulkan ICD implementation matches CUDA without asking NVIDIA.
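To make the UUID route concrete, here is a minimal host-side sketch that turns the 16 raw bytes Vulkan reports in the deviceUUID field into the "GPU-..." string form printed by nvidia-smi -L. The function name format_gpu_uuid is hypothetical. It is also an assumption, not something the Vulkan specification guarantees, that the bytes appear in the same order as the hex digits in the nvidia-smi string, so it is worth checking the result against nvidia-smi -L on the target system.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: formats 16 raw UUID bytes as
// "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" (byte groups of 4-2-2-2-6).
// Assumption: the byte order matches the digit order nvidia-smi prints.
std::string format_gpu_uuid(const std::uint8_t (&bytes)[16]) {
    static const char* hex = "0123456789abcdef";
    std::string out = "GPU-";
    for (std::size_t i = 0; i < 16; ++i) {
        // A dash separates the groups before bytes 4, 6, 8 and 10.
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            out += '-';
        }
        out += hex[bytes[i] >> 4];    // high nibble
        out += hex[bytes[i] & 0x0F];  // low nibble
    }
    return out;
}
```

In an application, the input would be the deviceUUID array of each physical device, and the string that matches the GPU assigned to the job identifies the Vulkan device to pick.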

Using atomic arithmetic operations in CUDA Unified Memory multi-GPU or multi-processor

I am trying to implement a CUDA program that uses Unified Memory. I have two unified arrays and sometimes they need to be updated atomically.
The question below has an answer for a single-GPU environment, but I am not sure how to extend that answer to multi-GPU platforms.
Question: cuda atomicAdd example fails to yield correct output
I have 4 Tesla K20s, in case you need this information, and each of them updates a part of those arrays; these updates must be done atomically.
I would appreciate any help/recommendations.
To summarize comments into an answer:
You can perform this sort of address-space-wide atomic operation using atomicAdd_system.
However, you can only do this on compute capability 6.x or newer devices (7.2 or newer if using Tegra).
Specifically, this means you have to compile for the correct compute capability, such as -arch=sm_60 or similar.
You state in the question that you are using Tesla K20 cards; these are compute capability 3.5 and do not support any of the system-wide atomic functions.
As always, this information is neatly summarized in the relevant section of the Programming Guide.
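The capability requirement stated above can be written down as a small host-side check. This is only a sketch: the function name supports_system_atomics and the is_tegra flag are hypothetical, and the thresholds are taken directly from the answer rather than from a separate source.

```cpp
// Hypothetical helper: returns true when a device with the given compute
// capability meets the requirement quoted above for system-wide atomics
// such as atomicAdd_system.
bool supports_system_atomics(int major, int minor, bool is_tegra) {
    if (is_tegra) {
        // Tegra parts need compute capability 7.2 or newer.
        return major > 7 || (major == 7 && minor >= 2);
    }
    // Other GPUs need compute capability 6.x or newer.
    return major >= 6;
}
```

The major and minor values would come from the cudaDeviceProp fields of the same names. For a Tesla K20 (compute capability 3.5) the check returns false.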

CUDA GPU selected by position, but how to set default to be something other than device 0?

I've recently installed a second GPU (Tesla K40) on my machine at home, and my searches have suggested that the GPU in the first PCI slot becomes the default chosen for CUDA jobs. A great link explaining it can be found here:
Default GPU Assignment
My original GPU is a TITAN X, also CUDA enabled, but it's really best for single precision calculations, while the Tesla is better for double precision. My question for the group is whether there is a way to make the second card always be my default CUDA device. Obviously I can specify in the code each time which device to use, but I'm hoping that I can configure my setup such that it will always default to using the Tesla card.
Or is the only way to open the box up and physically swap positions of the devices? Somehow that seems wrong to me....
Any advice or relevant links to follow up on would be greatly appreciated.
As you've already pointed out, the CUDA runtime has its own heuristic for ordering GPUs and assigning device indices to them.
The CUDA_VISIBLE_DEVICES environment variable will allow you to modify this ordering.
For example, suppose that in ordinary use, my display device is enumerated as device 0, and my preferred CUDA GPU is enumerated as device 1. Applications written without any usage of cudaSetDevice, for example, will default to using the device enumerated as 0. If I want to change this, under Linux I could use something like:
CUDA_VISIBLE_DEVICES="1" ./my_app
to cause the CUDA runtime to enumerate the device that would ordinarily be device 1 as device 0 for this application run (and the ordinary device 0 would be "hidden" from CUDA, in this case). You can make this "permanent" for the session simply by exporting that variable (e.g., bash):
export CUDA_VISIBLE_DEVICES="1"
./my_app
If I simply wanted to reverse the default CUDA runtime ordering, but still make both GPUs available to the application, I could do something like:
CUDA_VISIBLE_DEVICES="1,0" ./deviceQuery
There are other specification options, such as using GPU UUID identifiers (instead of device indices) as provided by nvidia-smi.
Refer to the documentation or this writeup as well.
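The remapping described above can be modeled in a few lines of host code. This is a simplified sketch of the behavior, not the runtime's actual implementation: the function name visible_devices is hypothetical, only integer indices are handled (not UUIDs), and enumeration is assumed to stop at the first entry that is not a valid index.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical model of CUDA_VISIBLE_DEVICES: element i of the result is the
// default device index that the application would see as device i.
std::vector<int> visible_devices(const std::string& spec, int device_count) {
    std::vector<int> mapping;
    std::istringstream in(spec);
    std::string token;
    while (std::getline(in, token, ',')) {
        int index = 0;
        try {
            std::size_t used = 0;
            index = std::stoi(token, &used);
            if (used != token.size()) break;  // trailing junk: treat as invalid
        } catch (const std::exception&) {
            break;  // not a number: treat as invalid
        }
        // Assumed rule: the first invalid index ends the visible list.
        if (index < 0 || index >= device_count) break;
        mapping.push_back(index);
    }
    return mapping;
}
```

With two GPUs, "1" gives a single visible device (the ordinary device 1, now seen as device 0), and "1,0" gives both devices in reversed order.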

What does 'compute capability' mean w.r.t. CUDA?

I am new to CUDA programming and don't know much about it. Can you please tell me what 'CUDA compute capability' means? When I ran the following code on my university server, it gave me the following result.
for (device = 0; device < deviceCount; ++device)
{
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("\nDevice %d has compute capability %d.%d.\n", device, deviceProp.major, deviceProp.minor);
}
RESULT:
Device 0 has compute capability 4199672.0.
Device 1 has compute capability 4199672.0.
Device 2 has compute capability 4199672.0.
.
.
cudaGetDeviceProperties returns two fields, major and minor. Can you please tell me what this 4199672.0 means?
The compute capability is the "feature set" (both hardware and software features) of the device. You may have heard the NVIDIA GPU architecture names "Tesla", "Fermi" or "Kepler". Each of those architectures has features that previous versions might not have.
In your CUDA toolkit installation folder on your hard drive, look for the file CUDA_C_Programming_Guide.pdf (or google it), and find Appendix F.1. It describes the differences in features between the different compute capabilities.
As @dialer mentioned, the compute capability is your CUDA device's set of computation-related features. As NVIDIA's CUDA API develops, the 'Compute Capability' number increases. At the time of writing, NVIDIA's newest GPUs are Compute Capability 3.5. You can get some details of what the differences mean by examining this table on Wikipedia.
As @aland suggests, your call probably failed, and what you're getting is the result of using an uninitialized variable. You should wrap your cudaGetDeviceProperties() call with some kind of error checking; see "What is the canonical way to check for errors using the CUDA runtime API?" for a discussion of the options for doing this.

How to choose device when running a CUDA executable?

I'm connecting to a GPU cluster from the outside and I have no idea how to select the device on which to run my CUDA programs.
I know there are two Tesla GPUs in the cluster, and I'd like to choose one of them.
Any ideas how? How do you choose the device you want to use when there are many connected to your computer?
The canonical way to select a device in the runtime API is using cudaSetDevice. That will configure the runtime to perform lazy context establishment on the nominated device. Prior to CUDA 4.0, this call didn't actually establish a context, it just told the runtime which GPU to try and use. Since CUDA 4.0, this call will establish a context on the nominated GPU at the time of calling. There is also cudaChooseDevice, which will select amongst available devices to find one which matches criteria supplied by the caller.
You can enumerate the available GPUs on a system with cudaGetDeviceCount, and retrieve their particulars using cudaGetDeviceProperties. The SDK deviceQuery example shows full details of how to do this.
You may need to be careful, however, about how you select GPUs in a multi-GPU system, depending on the host and driver configuration. In both the Linux and the Windows TCC drivers, there exists the option for GPUs to be marked "compute exclusive", meaning that the driver will limit each GPU to one active context at a time, or "compute prohibited", meaning that no CUDA program can establish a context on that device. If your code attempts to establish a context on a compute prohibited device, or on a compute exclusive device which is in use, the result will be an invalid device error. In a multiple GPU system where the policy is to use compute exclusivity, the correct approach is not to try and select a particular GPU, but simply to allow lazy context establishment to happen implicitly. The driver will automagically select a free GPU for your code to run on. The compute mode status of any device can be checked by reading the cudaDeviceProp.computeMode field using the cudaGetDeviceProperties call. Note that you are free to check unavailable or prohibited GPUs and query their properties, but any operation which would require context establishment will fail.
See the runtime API documentation on all of these calls.
You can set the environment variable CUDA_VISIBLE_DEVICES to a comma-separated list of device IDs to make only those devices visible to an application. Use this either to mask out devices or to change the visibility order of devices so that the CUDA runtime enumerates them in a specific order.