Do CUDA device IDs change when debugging? - cuda

I've noticed that, on a host with two working CUDA SM_2.x devices, the first of which is running the display, calling cudaSetDevice(1) in the debugger throws CUDA error 10 (invalid device). It works fine when executed outside of the debugger, however. I also note that the device which normally has ID 1 has device ID 0 inside the debugger.
Are my suspicions confirmed that device ID 0 is assigned only to the first available device, rather than the device installed in the first PCIe slot?
If so, is there a way of ensuring that e.g. cudaSetDevice(1) always selects the same device, irrespective of how CUDA assigns device IDs?

The really short answer is: no, there is no way to do this. Having said that, hardcoding a fixed device ID is never the correct thing to do. You want to either:
Select an ID from the list of available devices which the API returns for you (there are a number of very helpful APIs to let you get the device you want), or
Use no explicit device selection at all in your code, and rely on appropriate driver compute-mode settings and/or the CUDA_VISIBLE_DEVICES environment variable to have the driver automatically select a suitable valid device ID for you.
Which you choose will probably be dictated by the environment in which your code ends up being deployed.
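The first option above can be sketched as follows. This is an untested illustration, not a definitive implementation: it enumerates the available devices and picks one by its properties instead of hardcoding an index (the helper name and the SM-2.x criterion are assumptions for the example; error handling is minimal for brevity).

```cpp
// Sketch: pick a CUDA device by its properties instead of hardcoding an index.
// Requires the CUDA toolkit (links against cudart).
#include <cuda_runtime.h>
#include <cstdio>

// Return the index of the first device with at least the wanted SM major
// version, or -1 if none qualifies.
int pick_device_with_at_least_sm(int major_wanted)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        if (prop.major >= major_wanted) {
            printf("choosing device %d: %s\n", i, prop.name);
            return i;
        }
    }
    return -1;
}

int main()
{
    int dev = pick_device_with_at_least_sm(2); // SM 2.x or newer, as in the question
    if (dev >= 0)
        cudaSetDevice(dev);
    return 0;
}
```

This stays correct no matter how the runtime happens to order the devices on a given run.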

CUDA GPU selected by position, but how to set default to be something other than device 0?

I've recently installed a second GPU (Tesla K40) on my machine at home, and my searches have suggested that the first PCI slot becomes the default GPU chosen for CUDA jobs. A great link explaining it can be found here:
Default GPU Assignment
My original GPU is a TITAN X, also CUDA enabled, but it's really best for single-precision calculations and the Tesla better for double precision. My question for the group is whether there is a way to set up my default CUDA programming device to always be the second one? Obviously I can specify in the code each time which device to use, but I'm hoping I can configure my setup such that it will always default to using the Tesla card.
Or is the only way to open the box up and physically swap positions of the devices? Somehow that seems wrong to me....
Any advice or relevant links to follow up on would be greatly appreciated.
As you've already pointed out, the CUDA runtime has its own heuristic for ordering GPUs and assigning device indices to them.
The CUDA_VISIBLE_DEVICES environment variable will allow you to modify this ordering.
For example, suppose that in ordinary use, my display device is enumerated as device 0, and my preferred CUDA GPU is enumerated as device 1. Applications written without any usage of cudaSetDevice, for example, will default to using the device enumerated as 0. If I want to change this, under linux I could use something like:
CUDA_VISIBLE_DEVICES="1" ./my_app
to cause the CUDA runtime to enumerate the device that would ordinarily be device 1 as device 0 for this application run (and the ordinary device 0 would be "hidden" from CUDA in this case). You can make this "permanent" for the session simply by exporting that variable (e.g., in bash):
export CUDA_VISIBLE_DEVICES="1"
./my_app
If I simply wanted to reverse the default CUDA runtime ordering, but still make both GPUs available to the application, I could do something like:
CUDA_VISIBLE_DEVICES="1,0" ./deviceQuery
There are other specification options, such as using GPU UUID identifiers (instead of device indices) as provided by nvidia-smi.
Refer to the documentation or this writeup as well.
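The UUID form mentioned above can be sketched like this (the UUID shown is a placeholder, not a real device; `my_app` is likewise a stand-in for your program):

```shell
# List GPUs with their UUIDs (nvidia-smi -L prints one line per device)
nvidia-smi -L
# e.g.  GPU 0: Tesla K40c (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

# Select a GPU by UUID instead of by index -- robust against
# enumeration-order changes across driver or BIOS updates
CUDA_VISIBLE_DEVICES="GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" ./my_app
```

Selecting by UUID avoids the fragility of numeric indices entirely, since a device's UUID does not change when enumeration order does.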

Are CUDA streams device-associated? And how do I get a stream's device?

I have a CUDA stream which someone handed to me - a cudaStream_t value. The CUDA Runtime API does not seem to indicate how I can obtain the index of the device with which this stream is associated.
Now, I know that cudaStream_t is just a pointer to a driver-level stream structure, but I'm hesitant to delve into the driver too much. Is there an idiomatic way to do this? Or some good reason not to want to do it?
Edit: Another aspect to this question is whether the stream really is associated with a device in a way in which the CUDA driver itself can determine that device's identity given the pointed-to structure.
Yes, streams are device-specific.
In CUDA, streams are specific to a context, and contexts are specific to a device.
Now, with the runtime API, you don't "see" contexts - you use just one context per device. But if you consider the driver API - you have:
CUresult cuStreamGetCtx ( CUstream hStream, CUcontext* pctx );
CUstream and cudaStream_t are the same thing - a pointer. So, you can get the context. Then, you set or push that context to be the current context (read about doing that elsewhere), and finally, you use:
CUresult cuCtxGetDevice ( CUdevice* device )
to get the current context's device.
So, a bit of a hassle, but quite doable.
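The steps above can be sketched as a small helper. This is an untested illustration (the function name is an assumption); it requires the driver API and, for cuStreamGetCtx, CUDA 9.2 or later:

```cpp
// Sketch: recover the device behind an arbitrary cudaStream_t via the
// driver API. Links against both the driver (cuda) and runtime (cudart).
#include <cuda.h>          // driver API: cuStreamGetCtx, cuCtxGetDevice, ...
#include <cuda_runtime.h>  // cudaStream_t

// Returns the device ordinal of the stream's context, or -1 on failure.
int device_of_stream(cudaStream_t stream)
{
    CUcontext ctx;
    CUdevice dev = -1;
    // cudaStream_t and CUstream are the same handle type (a pointer)
    if (cuStreamGetCtx((CUstream)stream, &ctx) != CUDA_SUCCESS) return -1;
    cuCtxPushCurrent(ctx);   // make the stream's context current...
    cuCtxGetDevice(&dev);    // ...so we can ask which device it belongs to
    cuCtxPopCurrent(&ctx);   // restore the previously current context
    return (int)dev;
}
```

Pushing and popping the context keeps the caller's current context undisturbed, which matters if this runs inside code that mixes runtime- and driver-API usage.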
My approach to easily determining a stream's device
My workaround for this issue is to have the (C++'ish) stream wrapper class keep (the context and) the device among the member variables, which means that you can write:
auto my_device = cuda::device::get(1);
auto my_stream = my_device.create_stream(); /* using some default param values here */
assert(my_stream.device() == my_device);
and not have to worry about it (+ it won't trigger the extra API calls since, at construction, we know what the current context is and what its device is).
Note: The above snippet is for a system with at least two CUDA devices, otherwise there is no device with index 1...
Regarding explicit streams, this is up to the implementation; to the best of my knowledge there is no API exposing this query capability to users. I don't know what the driver can provide for you on this front; however, you can always probe the stream.
Using cudaStreamQuery, you can query the targeted stream on a selected device: if it returns cudaSuccess or cudaErrorNotReady, the stream exists on that device, and if it returns cudaErrorInvalidResourceHandle, it does not.
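The probing approach just described might be sketched like this (untested; the helper name is an assumption, and the behavior relies on cudaStreamQuery rejecting a handle from another device, as the answer claims):

```cpp
// Sketch: find which device a stream belongs to by trying it on each
// device in turn and seeing where cudaStreamQuery recognizes the handle.
#include <cuda_runtime.h>

// Returns the device index the stream appears to belong to, or -1.
int find_stream_device(cudaStream_t stream)
{
    int count = 0, found = -1, prev = 0;
    cudaGetDevice(&prev);           // remember the caller's device
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaSetDevice(i);
        cudaError_t err = cudaStreamQuery(stream);
        if (err == cudaSuccess || err == cudaErrorNotReady) {
            found = i;
            break;
        }
        cudaGetLastError();          // clear the error before the next probe
    }
    cudaSetDevice(prev);             // restore the caller's device
    return found;
}
```

Compared with the driver-API route above, this needs only the runtime API, at the cost of O(number of devices) probes.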

A single program appear on two GPU card

I have multiple GPU cards (NO.0, NO.1, ...), and every time I run a Caffe process on card NO.1 or NO.2 ... (any card except NO.0), it also uses up 73 MiB on the NO.0 card.
For example, in the nvidia-smi output, process 11899 shows 73 MiB used on the NO.0 card even though it actually runs on the NO.1 card.
Why? Can I disable this feature?
The CUDA driver is like an operating system. It will reserve memory for various purposes when it is active. Certain features, such as managed memory, may cause substantial side-effect allocations to occur (although I don't think this is the case with Caffe). And it's even possible that the application itself is doing some explicit allocations on those devices, for some reason.
If you want to prevent this, one option is to use the CUDA_VISIBLE_DEVICES environment variable when you launch your process.
For example, if you want to prevent CUDA from doing anything with card "0", you could do something like this (on linux):
CUDA_VISIBLE_DEVICES="1,2" ./my_application ...
Note that the enumeration used above (the CUDA enumeration) is the same enumeration that would be reported by the deviceQuery sample app, but not necessarily the same enumeration reported by nvidia-smi (the NVML enumeration). You may need to experiment or else run deviceQuery to determine which GPUs you want to use, and which you want to exclude.
Also note that using this option actually affects the devices that are visible to an application, and will cause a re-ordering of device enumeration (the device that was previously "1" will appear to be enumerated as device "0", for example). So if your application is multi-GPU aware, and you are selecting specific devices for use, you may need to change the specific devices you (or the application) are selecting, when you use this environment variable.
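To see the re-ordering described above concretely, a small enumeration program (an untested sketch, similar in spirit to the deviceQuery sample) can print each visible device's PCI address; run it with and without CUDA_VISIBLE_DEVICES set and compare:

```cpp
// Sketch: print each visible device's index, name, and PCI address.
// With CUDA_VISIBLE_DEVICES="1,2", the device formerly at index 1 will
// show up here as index 0, and so on.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        printf("device %d: %s (PCI %04x:%02x:%02x)\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

The PCI address is the stable identifier here: it lets you match the CUDA enumeration against what nvidia-smi reports, regardless of index shuffling.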

How do the nVIDIA drivers assign device indices to GPUs?

Assume that on a single node there are several devices with different compute capabilities. How does NVIDIA rank them (by rank I mean the number assigned by cudaSetDevice)?
Are there any general guidelines about this? Thanks.
I believe the ordering of devices corresponding to cudaGetDevice and cudaSetDevice (i.e. the CUDA runtime enumeration order) is either based on a heuristic that determines the fastest device and makes it first, or else based on PCI enumeration order. You can confirm this using the deviceQuery sample, which prints the properties of devices (including PCI ID) in the order they are enumerated for cudaSetDevice.
However I would recommend not to base any decisions on this. There's nothing magical about PCI enumeration order, and even things like a system BIOS upgrade can change the device enumeration order (as can swapping devices, moving to another system, etc.)
It's usually best to query devices (see the deviceQuery sample) and then make decisions based on the specific devices returned and/or their properties. You can also use cudaChooseDevice to select a device heuristically.
In CUDA 8 and later, you can cause the CUDA runtime to choose either "Fastest First" or "PCI Enumeration Order" based on the setting (or absence) of the CUDA_DEVICE_ORDER environment variable (FASTEST_FIRST, the default, or PCI_BUS_ID).
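A quick way to compare the two orderings, assuming the deviceQuery sample has been built in the current directory:

```shell
# Default ordering: the heuristically fastest device gets index 0
./deviceQuery

# Force PCI enumeration order instead (CUDA 8 and later)
CUDA_DEVICE_ORDER=PCI_BUS_ID ./deviceQuery
```

With PCI_BUS_ID set, runtime indices follow slot order, which is more stable across driver versions but will still change if cards are physically moved.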