A single program appears on two GPU cards - CUDA

I have multiple GPU cards (No. 0, No. 1, ...), and every time I run a Caffe process on card No. 1, No. 2, etc. (any card except No. 0), it also uses up 73 MiB on card No. 0.
For example, in the figure below, process 11899 uses 73 MiB on card No. 0 even though it actually runs on card No. 1.
Why? Can I disable this feature?

The CUDA driver is like an operating system. It will reserve memory for various purposes when it is active. Certain features, such as managed memory, may cause substantial side-effect allocations to occur (although I don't think this is the case with Caffe). And it's even possible that the application itself is doing some explicit allocations on those devices, for some reason.
If you want to prevent this, one option is to use the CUDA_VISIBLE_DEVICES environment variable when you launch your process.
For example, if you want to prevent CUDA from doing anything with card "0", you could do something like this (on Linux):
CUDA_VISIBLE_DEVICES="1,2" ./my_application ...
Note that the enumeration used above (the CUDA enumeration) is the same enumeration that would be reported by the deviceQuery sample app, but not necessarily the same enumeration reported by nvidia-smi (the NVML enumeration). You may need to experiment or else run deviceQuery to determine which GPUs you want to use, and which you want to exclude.
Also note that using this option actually affects the devices that are visible to an application, and will cause a re-ordering of device enumeration (the device that was previously "1" will appear to be enumerated as device "0", for example). So if your application is multi-GPU aware, and you are selecting specific devices for use, you may need to change the specific devices you (or the application) are selecting, when you use this environment variable.
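To see both the visibility restriction and the re-ordering concretely, a small enumeration program like the sketch below can help (this is just an illustrative example I'm adding, not something Caffe provides; compile with nvcc). It prints each runtime device index together with the device name and PCI IDs, so you can match runtime indices against the nvidia-smi/NVML ordering with and without CUDA_VISIBLE_DEVICES set.
// enum_devices.cu -- list the devices the CUDA runtime can see, with their PCI IDs
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // The PCI domain/bus/device IDs identify the physical card regardless of
        // how the CUDA runtime numbers it for this particular run.
        printf("runtime device %d: %s (PCI %04x:%02x:%02x)\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
Running it as ./enum_devices and then as CUDA_VISIBLE_DEVICES="1,2" ./enum_devices should show the previously-second card now reported as runtime device 0, with the first card hidden from CUDA entirely.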

Related

Can I fix my GPU clock rate to ensure consistent profiling results?

I want to do some comparative profiling of a couple of CUDA kernels. However, one of them runs within a program which loads the GPU with more work, while the other is only running in a test harness.
For some GPUs, these circumstances mean the clock rates change (perhaps more than one kind of clock rate, because there are several). This effect is particularly severe in devices like Tesla T4s (which aren't actively cooled).
Is it possible to prevent clock rates from changing due to load (or thermal conditions)?
I've looked into doing this with the nvidia-smi utility, which has a sub-command named clocks - but all that does is the following:
clocks -- Control and query clock information.
Usage: nvidia-smi clocks [options]
options include:
[-i | --id]: Enumeration index, PCI bus ID or UUID. Provide comma
separated values for more than one device
[ | --sync-boost-list]: List all synchronous boost groups
[ | --sync-boost-add]: Add a synchronous boost group
[ | --sync-boost-remove]: Remove a synchronous boost group. Provide the group id
returned from --sync-boost-list
... and it doesn't look like that's what I need. Of course, non-nvidia-smi-based solutions are welcome.
Notes:
I'm particularly interested in fixing clock rates for Quadro and Tesla cards, in case that matters.
I can be root if necessary.
Using CUDA 10.2 with its bundled driver. If absolutely necessary, I might be able to switch to a newer version.
TL;DR
first, set persistence mode, e.g. nvidia-smi -i 0 -pm 1 (sets persistence mode for the GPU with index 0)
use an nvidia-smi command like -ac or -lgc (application clocks, lock GPU clock)
there is nvidia-smi command-line help for all of this: nvidia-smi --help
this functionality may not work on your GPU: install the latest driver, and note that some of it is simply not available on certain products
these settings often require root privilege, or admin privilege on Windows
all of this is subject to change; with some care, the command-line help for the version you are using should be instructive
LONGER:
I'm using driver 455.23.05 for this description. Some features (e.g. -lgc) may not be available in older drivers. Setting persistence mode may be necessary for some of these features, and will also help to reduce variability on application start-up. This is not intended to be an exhaustive description of the nvidia-smi tool.
SETTING APPLICATION CLOCKS:
The application clocks feature should generally be useful for the testing described. It will not force the GPU clocks to remain at the specified setting when there is no application running (AFAIK), but the clocks should attain those values "as soon as" the application starts running. It allows you to specify both gpu clock (i.e. core clock) as well as memory clock. Let's start by excerpting the command line help text for some of the important switches:
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
To get started setting application clocks, you may need to use sudo or similar on Linux for some or all of these commands. Also note from the help above that the requirement for elevated privilege can be turned on and off. Another important point is that you cannot pick arbitrary values for the <memory,graphics> settings pair. You must specify a pair, and the pair can only come from a list of permissible options; other choices will result in unspecified behavior. These choices can be determined with the --query-supported-clocks switch to nvidia-smi (use --help-query-supported-clocks to get command-line help on that switch), which itself requires some formatting. For example, the following command will give an exhaustive list of the valid pairs that can be passed to the -ac command:
nvidia-smi -i 0 --query-supported-clocks=mem,gr --format=csv
Once you have that list of valid pairs, you can specify one of those pairs to the application clocks command:
nvidia-smi -i 0 -ac 877,1215
(The above command, if run as root or with permission granted via -acp, would set the memory clock to 877MHz and the core clock to 1215MHz on my Tesla V100, for example. Note the -i switch to select the GPU to target with this command. The 877,1215 pair may not be valid on your GPU. Also note that the -acp feature is removed from drivers 465.xx and newer.)
When you are done with whatever you are doing, you may wish to reset the application clock behavior to the default behavior (GPU selects clock freqs according to its own heuristics) using -rac.
Also, a number of the pairs offered may involve "boosting" behavior. The GPU is not guaranteed to maintain all clocks exactly as you specify, if a throttling event occurs. Typical throttling events are:
GPU is consuming too much electrical power
GPU temperature is too high
The existence of an actual throttling event can be discovered using the "full" output from nvidia-smi (nvidia-smi -a); look for "clocks throttle reasons". Other useful information is available in this output, such as the default application clocks. When N/A appears in your output, it means that your GPU does not support the feature in question. There is a great variety of supported features across various GPU families; I won't be able to respond to questions about this.
In the absence of a throttling event, and assuming your GPU supports the feature, I would expect application clocks to remain in effect throughout your application runtime. Note that if this command is specified while an application is currently running, the change in clocks may not take effect until the GPU becomes idle. You may wish to monitor GPU clocks in this case (again, using nvidia-smi). Therefore I would generally recommend using these commands when the GPU is idle. Then begin your work on the GPU after that.
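If you'd rather monitor clocks and throttle reasons from code than by polling nvidia-smi, the same information is exposed through NVML, the library nvidia-smi is built on. Below is a minimal sketch of my own (not part of nvidia-smi or the CUDA samples; it assumes Linux, linking with -lnvidia-ml, and note that the device index here is the NVML/nvidia-smi ordering, not necessarily the CUDA ordering):
// clock_watch.cpp -- query current clocks and throttle reasons for NVML device 0
// build (Linux): g++ clock_watch.cpp -o clock_watch -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        printf("nvmlInit failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);            // NVML (nvidia-smi) device 0

    unsigned int smClk = 0, memClk = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClk);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClk);

    unsigned long long reasons = 0;
    nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

    printf("SM clock: %u MHz, memory clock: %u MHz\n", smClk, memClk);
    printf("throttled by power cap: %s\n",
           (reasons & nvmlClocksThrottleReasonSwPowerCap) ? "yes" : "no");
    printf("throttled by HW slowdown (e.g. thermal): %s\n",
           (reasons & nvmlClocksThrottleReasonHwSlowdown) ? "yes" : "no");

    nvmlShutdown();
    return 0;
}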
LOCK GPU (CORE) CLOCK:
In many cases, the GPU core clock (core, gpu, and graphics are all synonyms in this context) exhibits the most variability (for example, the application clocks offered on my Tesla V100 only include a single value, 877MHz, for the memory clock; no other choices are possible). There is a separate switch that can be used to "lock" the GPU core clock to a range of values.
-lgc --lock-gpu-clocks= Specifies <minGpuClock,maxGpuClock> clocks as a
pair (e.g. 1500,1500) that defines the range
of desired locked GPU clock speed in MHz.
Setting this will supercede application clocks
and take effect regardless if an app is running.
Input can also be a singular desired clock value
(e.g. <GpuClockValue>).
-rgc --reset-gpu-clocks
Resets the Gpu clocks to the default values.
This range is specified using a lower and upper endpoint for the range. If you wish to select a specific value only, you can specify the lower and upper endpoints both to be that value. As far as I know the range endpoints are inclusive.
For example, the following command:
nvidia-smi -i 0 -lgc 1215,1215
will "lock" the GPU core clock to 1215 MHz on my Tesla V100 GPU. As far as I know, this effect takes place immediately, even if an application is running. Most other caveats I can think of should be similar for application clocks:
choose a valid GPU core clock, as output from the --query-supported-clocks command
GPU is not guaranteed to maintain the request in the event of throttling
elevated privilege is required
reset the behavior with -rgc
As indicated in the help, this switch "overrides" previous application clocks settings with respect to core clock. Also, note that many switches come in 2 flavors, a "long" form and a "short" form. Where additional switch parameters are required, the long form often requires an = separator, the short form often requires a space separator:
nvidia-smi -i 0 -lgc 1215,1215
or
nvidia-smi -i 0 --lock-gpu-clocks=1215,1215
you generally cannot intermix this formatting:
nvidia-smi -i 0 -lgc=1215,1215
will probably report an error.
A FINAL NOTE:
This effect is particularly severe in devices like Tesla T4s (which aren't actively cooled).
In my experience with the T4, what you may be observing is throttling. The T4 is one of the lowest-power datacenter-grade GPUs, and it's certainly possible for the GPU compute demands to exceed what its power limit (70W) can support. In this case, the GPU clocks will throttle, and none of the above commands will allow you to override this behavior. By design, you cannot force the GPU to operate at elevated clocks when the GPU is trying to protect itself, or protect the system it is running in.
Also, the fact that a T4 is not actively cooled really should not matter. The only approved/supported usage setting for a T4 is in a server that is designed to handle the T4. (A similar statement is true for any NVIDIA Datacenter GPU). Such servers monitor the T4 GPU temperature and provide server-delivered forced flow-through cooling to the GPU. This is by design. The server is responsible for keeping the GPU in a proper temperature operating range. If the server is not doing that, you should address that with your server vendor. If you are operating the T4 GPU in a non-approved setting (such as a non-qualified server, or a desktop/workstation) then I would generally expect the experience with that device to be dismal.
MORE RECENTLY: NVIDIA has published this blog which covers many of the same topics. If there are discrepancies between what I have stated above, and the blog, the blog should be considered the best source.

CUDA GPU selected by position, but how to set default to be something other than device 0?

I've recently installed a second GPU (Tesla K40) on my machine at home, and my searches have suggested that the first PCI slot becomes the default GPU chosen for CUDA jobs. A great link explaining it can be found here:
Default GPU Assignment
My original GPU is a TITAN X, also CUDA enabled, but it's really best for single precision calculations, while the Tesla is better for double precision. My question for the group is whether there is a way to set my default CUDA device to always be the second one. Obviously I can specify in the code each time which device to use, but I'm hoping I can configure my setup so that it always defaults to using the Tesla card.
Or is the only way to open the box up and physically swap positions of the devices? Somehow that seems wrong to me....
Any advice or relevant links to follow up on would be greatly appreciated.
As you've already pointed out, the cuda runtime has its own heuristic for ordering GPUs and assigning devices indices to them.
The CUDA_VISIBLE_DEVICES environment variable will allow you to modify this ordering.
For example, suppose that in ordinary use, my display device is enumerated as device 0, and my preferred CUDA GPU is enumerated as device 1. Applications written without any usage of cudaSetDevice, for example, will default to using the device enumerated as 0. If I want to change this, under linux I could use something like:
CUDA_VISIBLE_DEVICES="1" ./my_app
to cause the cuda runtime to enumerate the device that would ordinarily be device 1 as device 0 for this application run (and the ordinary device 0 would be "hidden" from CUDA, in this case). You can make this "permanent" for the session simply by exporting that variable (e.g., bash):
export CUDA_VISIBLE_DEVICES="1"
./my_app
If I simply wanted to reverse the default CUDA runtime ordering, but still make both GPUs available to the application, I could do something like:
CUDA_VISIBLE_DEVICES="1,0" ./deviceQuery
There are other specification options, such as using GPU UUID identifiers (instead of device indices) as provided by nvidia-smi.
Refer to the documentation or this writeup as well.
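If you'd prefer not to depend on the environment variable, another option is to have the application itself choose the device based on its properties and call cudaSetDevice once at startup. The following is only a sketch of that idea (my own illustration, not from the linked writeup); it picks the first device whose name contains "K40", falling back to device 0 otherwise:
// pick_device.cu -- select the CUDA device by name instead of relying on default index 0
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    int n = 0, chosen = 0;                          // fall back to device 0 if no match
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (strstr(prop.name, "K40") != nullptr) {  // e.g. prefer the Tesla K40
            chosen = i;
            break;
        }
    }
    cudaSetDevice(chosen);                          // subsequent CUDA work in this thread targets this device
    printf("using device %d\n", chosen);
    return 0;
}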

How do the nVIDIA drivers assign device indices to GPUs?

Assume that on a single node there are several devices with different compute capabilities. How does NVIDIA rank them (by rank I mean the number assigned by cudaSetDevice)?
Are there any general guidelines about this? Thanks.
I believe the ordering of devices corresponding to cudaGetDevice and cudaSetDevice (i.e. the CUDA runtime enumeration order) is either based on a heuristic that determines the fastest device and makes it first, or else based on PCI enumeration order. You can confirm this using the deviceQuery sample, which prints the properties of devices (including PCI ID) in the order they get enumerated in for cudaSetDevice.
However, I would recommend not basing any decisions on this. There's nothing magical about PCI enumeration order, and even things like a system BIOS upgrade can change the device enumeration order (as can swapping devices, moving to another system, etc.)
It's usually best to query devices (see the deviceQuery sample) and then make decisions based on the specific devices returned and/or their properties. You can also use cudaChooseDevice to select a device heuristically.
You can cause the CUDA runtime to choose either "fastest first" or PCI enumeration order based on the setting (or absence) of the CUDA_DEVICE_ORDER environment variable (FASTEST_FIRST or PCI_BUS_ID), available in CUDA 8 and later.

Dynamically detecting a CUDA-enabled NVIDIA card and only then initializing the CUDA runtime: how to do it?

I have an application which has an algorithm accelerated with CUDA. There is also a standard CPU implementation of it. We plan to release this application for various platforms, so most of the time there won't be an NVIDIA card to run the accelerated CUDA code. What I want is to first check whether the user's system has a CUDA-enabled NVIDIA card and, only if it does, initialize the CUDA runtime. If the system does not support CUDA, then I want to execute the CPU path. This question is very similar to mine, but I don't want to use any libraries other than the plain CUDA runtime. OpenCL is an alternative, but there isn't enough time to implement an OpenCL version of the algorithm for the first release. Without any CUDA existence check, the program will surely crash since it can't find the needed .dll's for the CUDA runtime, and we surely don't want that. So, I need advice on how to handle this initialization step.
Use the calls cudaGetDeviceCount and cudaGetDeviceProperties to find CUDA devices on the running system. First find out how many there are, then loop through all the available devices and inspect the properties to decide which ones qualify. What I mean by "qualify" depends on your application: do you want to require a certain compute capability? Or need a certain amount of memory? If there's more than one suitable device, you might want to sort on some criteria, then set the device with cudaSetDevice. If there are no devices, or none that are sufficient, then fall back on the CPU code path.
I'd also suggest having some mechanism to force CUDA mode off, in case some user's environment just doesn't work due to driver issues, or an old board, or something else. You can use a command-line option, or an environment variable, or whatever...
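Putting those two suggestions together, a detection routine could look something like the sketch below. This is only an illustration under my own assumptions: the MYAPP_FORCE_CPU environment variable name and the compute-capability and memory thresholds are hypothetical placeholders you would replace with your application's real requirements.
// cuda_probe.cpp -- decide at run time whether to take the CUDA path or the CPU path
// build: nvcc cuda_probe.cpp -o cuda_probe
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Returns the index of a usable CUDA device, or -1 to fall back to the CPU path.
int pick_cuda_device()
{
    // Hypothetical escape hatch: let the user force CPU mode from the environment.
    if (std::getenv("MYAPP_FORCE_CPU") != nullptr)
        return -1;

    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0)
        return -1;                                   // no usable driver or no CUDA device

    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        // Example qualification criteria -- adjust to your application's needs.
        bool ccOk  = prop.major >= 3;                            // minimum compute capability 3.0
        bool memOk = prop.totalGlobalMem >= ((size_t)1 << 30);   // at least 1 GiB of device memory
        if (ccOk && memOk)
            return i;
    }
    return -1;                                       // nothing qualified
}

int main()
{
    int dev = pick_cuda_device();
    if (dev >= 0) {
        cudaSetDevice(dev);
        printf("running CUDA path on device %d\n", dev);
        // ... launch kernels ...
    } else {
        printf("running CPU path\n");
        // ... run the CPU implementation ...
    }
    return 0;
}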
EDIT:
Regarding DLLs, you should package cudart[whatever].dll with your application. That will ensure that the program starts, and at least the CUDA query functions will operate.