nvidia-smi -ac equivalent in NVML - cuda

I learned that nvidia-smi -ac can be used to change the clock
rate of GPU cores and memory. Is nvidia-smi built upon the NVML library?
What is the equivalent in NVML? I checked the document
http://cyber.sibsutis.ru:82/GPGPU/sdk/CUDA_TOOLKIT/nvml.pdf
but could only find APIs that get the clock rates rather than set them.
Thanks

Yes, nvidia-smi is built on the NVML library.
According to the latest NVML API documentation available here (which is linked from the site I previously suggested to you here), the "Set Application Clocks" command is supported on Tesla K10 and K20 GPUs (page 6). I believe it is also supported on "Kepler" members of the Quadro family, such as the Quadro K5000.
If you have a Tesla K10, K20, or K20X GPU, the Set Application Clocks command is described on p. 68, which I am reproducing here for convenience:
7.12.2.2 nvmlReturn_t DECLDIR nvmlDeviceSetApplicationsClocks (nvmlDevice_t device, unsigned int memClockMHz, unsigned int graphicsClockMHz)
Set clocks that applications will lock to.
Sets the clocks that compute and graphics applications will be running at. e.g. CUDA driver requests these clocks during context creation which means this property defines clocks at which CUDA applications will be running unless some overspec event occurs (e.g. over power, over thermal or external HW brake).
Can be used as a setting to request constant performance.
For Tesla™ products, and Quadro® products from the Kepler family. Requires root/admin permissions.
See nvmlDeviceGetSupportedMemoryClocks and nvmlDeviceGetSupportedGraphicsClocks for details on how to list available clocks combinations.
After system reboot or driver reload applications clocks go back to their default value.
Parameters:
device The identifier of the target device
memClockMHz Requested memory clock in MHz
graphicsClockMHz Requested graphics clock in MHz
Returns:
• NVML_SUCCESS if new settings were successfully set
• NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
• NVML_ERROR_INVALID_ARGUMENT if device is invalid or memClockMHz and graphicsClockMHz is
not a valid clock combination
• NVML_ERROR_NO_PERMISSION if the user doesn’t have permission to perform this operation
• NVML_ERROR_NOT_SUPPORTED if the device doesn’t support this feature
• NVML_ERROR_UNKNOWN on any unexpected error
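For completeness, here is a minimal NVML sketch of the same operation (my own example, not from the documentation). It assumes device index 0, a GPU that supports application clocks, and root/admin privileges; it simply requests the first supported <memory, graphics> pair, and the buffer sizes are arbitrary. Build with something like: gcc set_app_clocks.c -o set_app_clocks -lnvidia-ml

/* Sketch: query supported clocks, then set application clocks via NVML.
 * Assumes device 0 supports application clocks and we run with root/admin rights. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t st = nvmlInit();
    if (st != NVML_SUCCESS) { printf("nvmlInit: %s\n", nvmlErrorString(st)); return 1; }

    nvmlDevice_t dev;
    st = nvmlDeviceGetHandleByIndex(0, &dev);
    if (st != NVML_SUCCESS) { printf("GetHandle: %s\n", nvmlErrorString(st)); nvmlShutdown(); return 1; }

    /* List valid memory clocks, then the graphics clocks valid for the first memory clock. */
    unsigned int memClocks[32], grClocks[128];
    unsigned int nMem = 32, nGr = 128;
    st = nvmlDeviceGetSupportedMemoryClocks(dev, &nMem, memClocks);
    if (st == NVML_SUCCESS && nMem > 0) {
        st = nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[0], &nGr, grClocks);
        if (st == NVML_SUCCESS && nGr > 0) {
            /* Request the first listed pair (ordering is whatever the driver reports). */
            st = nvmlDeviceSetApplicationsClocks(dev, memClocks[0], grClocks[0]);
            printf("SetApplicationsClocks(%u,%u): %s\n",
                   memClocks[0], grClocks[0], nvmlErrorString(st));
        }
    }
    nvmlShutdown();
    return 0;
}

As far as I know, nvidia-smi -ac is essentially a front end over this same call.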

Related

How to control the resource of each client in NVIDIA-MPS

With NVIDIA MPS, we launch the MPS server by running sudo nvidia-cuda-mps-control -d. I have two questions.
How do I specify which GPU the MPS server runs on when I have multiple GPUs on the same server?
How do I control the resources (such as computation and memory) allocated to each MPS client when I have multiple concurrent processes?
The CUDA MPS doc will answer many questions like this.
How do I specify which GPU the MPS server runs on when I have multiple GPUs on the same server?
From the CUDA MPS doc, section 2.3.4, the GPUs that are visible (via CUDA_VISIBLE_DEVICES) when the MPS server is started will determine which GPUs it will use:
2.3.4. MPS on Multi-GPU Systems
The MPS server supports using multiple GPUs. On systems with more than one GPU,
you can use CUDA_VISIBLE_DEVICES to enumerate the GPUs you would like to use.
See section 4.2 for more details.
How do I control the resources (such as computation and memory) allocated to each MPS client when I have multiple concurrent processes?
From the same doc, section 2.3.5.2, the primary method for per-process computation allocation is to set the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE. The value of this variable when the process begins and initializes the CUDA runtime or driver API determines its available compute resources (SMs), expressed as a percentage. If you have multiple GPUs, it is the percentage of the SM resources on the GPU selected by your application using cudaSetDevice() or similar.
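The variable is normally set in the shell before launching the client, but since (per the doc) it is read when the client's CUDA context is created, a client can also set it itself before its first CUDA call. A rough sketch under that assumption (my own example; the 50% value is arbitrary, and an MPS control daemon/server is assumed to be running already):

/* Sketch: limit this MPS client to ~50% of the SMs on its selected device.
 * CUDA_MPS_ACTIVE_THREAD_PERCENTAGE must be in the environment before the
 * CUDA context is created, so set it before any CUDA call in this process. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "50", 1); /* arbitrary example value */

    cudaSetDevice(0);
    cudaError_t err = cudaFree(0);   /* force context creation through the MPS server */
    if (err != cudaSuccess) {
        printf("context creation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    /* ... kernels launched from here on are limited to ~50% of the SM resources ... */
    return 0;
}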
MPS doesn't provide a mechanism for per-process memory allocation/partitioning, at this time.
EDIT: CUDA 11.5, made publicly available on October 20th, 2021, adds a new feature allowing per-client memory limits in MPS.

Can I fix my GPU clock rate to ensure consistent profiling results?

I want to do some comparative profiling of a couple of CUDA kernels. However, one of them runs within a program which loads the GPU with more work, while the other is only running in a test harness.
For some GPUs, these circumstances mean the clock rates change (perhaps more than one kind of clock rate, because there are several). This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
Is it possible to prevent clock rates from changing due to load (or thermal conditions)?
I've looked into doing this with the nvidia-smi utility, which has a sub-command named clocks - but all that does is the following:
clocks -- Control and query clock information.
Usage: nvidia-smi clocks [options]
options include:
[-i | --id]: Enumeration index, PCI bus ID or UUID. Provide comma
separated values for more than one device
[ | --sync-boost-list]: List all synchronous boost groups
[ | --sync-boost-add]: Add a synchronous boost group
[ | --sync-boost-remove]: Remove a synchronous boost group. Provide the group id
returned from --sync-boost-list
... and it doesn't look like that's what I need. Of course, non-nvidia-smi-based solutions are welcome.
Notes:
I'm particularly interested in fixing clock rates for Quadro and Tesla cards, in case that matters.
I can be root if necessary.
Using CUDA 10.2 with its bundled driver. If absolutely necessary, I might be able to switch to a new version.
TL;DR
first, set persistence mode e.g. nvidia-smi -i 0 -pm 1 (sets persistence mode for the GPU index 0)
use a nvidia-smi command like -ac or -lgc (application clocks, lock gpu clock)
there is nvidia-smi command line help for all of this nvidia-smi --help
this functionality may not work on your GPU: install the latest driver, and note that some of this functionality is simply not available on certain products
these settings often require root privilege, or admin privilege on windows
any of this description is subject to change. With some care, the command-line help for the version you are using should be instructive
LONGER:
I'm using driver 455.23.05 for this description. Some features (e.g. -lgc) may not be available in older drivers. Setting persistence mode may be necessary for some of these features, and will also help to reduce variability on application start-up. This is not intended to be an exhaustive description of the nvidia-smi tool.
SETTING APPLICATION CLOCKS:
The application clocks feature should generally be useful for the testing described. It will not force the GPU clocks to remain at the specified setting when there is no application running (AFAIK), but the clocks should attain those values "as soon as" the application starts running. It allows you to specify both gpu clock (i.e. core clock) as well as memory clock. Let's start by excerpting the command line help text for some of the important switches:
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
To get started setting application clocks, you may need to use sudo or similar on Linux for some or all of these commands. Also note above that the requirement for elevated privilege can be turned on/off. Also important is that you cannot pick any values you like for the <memory,graphics> settings pair. You must specify a pair, and furthermore the pair can only come from a list of permissible options. Other choices will result in unspecified behavior. These choices can be determined from the --query-supported-clocks switch (use --help-query-supported-clocks to get command-line help on that switch) to nvidia-smi, which itself requires some formatting. For example, the following command will give an exhaustive list of the valid pairs that can be passed to the -ac command:
nvidia-smi -i 0 --query-supported-clocks=mem,gr --format=csv
Once you have that list of valid pairs, you can specify one of those pairs to the application clocks command:
nvidia-smi -i 0 -ac 877,1215
(The above command, if run with root or enabled via -acp, would set the memory clock to 877MHz and the core clock to 1215MHz on my Tesla V100, for example. Note the -i switch to select the GPU to target with this command. The 877,1215 pair may not be valid on your GPU. Also note that the -acp feature is removed from drivers 465.xx and newer.)
When you are done with whatever you are doing, you may wish to reset the application clock behavior to the default behavior (GPU selects clock freqs according to its own heuristics) using -rac.
Also, a number of the pairs offered may involve "boosting" behavior. The GPU is not guaranteed to maintain all clocks exactly as you specify, if a throttling event occurs. Typical throttling events are:
GPU is consuming too much electrical power
GPU temperature is too high
The existence of an actual throttling event can be discovered using the "full" output from nvidia-smi (nvidia-smi -a); look for "clocks throttle reasons". Other useful information is available in this output, such as the default application clocks. When N/A appears in your output, it means that your GPU does not support this feature. There is a great variety of supported features across various GPU families; I won't be able to respond to questions about this.
In the absence of a throttling event, and assuming your GPU supports the feature, I would expect application clocks to remain in effect throughout your application runtime. Note that if this command is specified while an application is currently running, the change in clocks may not take effect until the GPU becomes idle. You may wish to monitor GPU clocks in this case (again, using nvidia-smi). Therefore I would generally recommend using these commands when the GPU is idle. Then begin your work on the GPU after that.
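As an aside, if you would rather detect throttling programmatically than scan the nvidia-smi -a output, NVML exposes the same information through nvmlDeviceGetCurrentClocksThrottleReasons. A minimal sketch (my own example, assuming device index 0 and a GPU/driver that supports this query; only two of the reason bits are checked here):

/* Sketch: report why the GPU is currently throttling, via NVML. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned long long reasons = 0;
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons) == NVML_SUCCESS) {
        if (reasons == nvmlClocksThrottleReasonNone)
            printf("no throttling\n");
        if (reasons & nvmlClocksThrottleReasonSwPowerCap)
            printf("throttling: software power cap\n");
        if (reasons & nvmlClocksThrottleReasonHwSlowdown)
            printf("throttling: hardware slowdown (power brake or thermal)\n");
    }
    nvmlShutdown();
    return 0;
}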
LOCK GPU (CORE) CLOCK:
In many cases, the gpu core clock (core, gpu, graphics are all synonyms in this context) exhibits the most variability (for example the application clocks offered on my Tesla V100 only include a value of 877MHz for memory clock; no other choices are possible). There is a separate switch that can be used to "lock" the GPU core clock to a range of values.
-lgc --lock-gpu-clocks= Specifies <minGpuClock,maxGpuClock> clocks as a
pair (e.g. 1500,1500) that defines the range
of desired locked GPU clock speed in MHz.
Setting this will supercede application clocks
and take effect regardless if an app is running.
Input can also be a singular desired clock value
(e.g. <GpuClockValue>).
-rgc --reset-gpu-clocks
Resets the Gpu clocks to the default values.
This range is specified using a lower and upper endpoint for the range. If you wish to select a specific value only, you can specify the lower and upper endpoints both to be that value. As far as I know the range endpoints are inclusive.
For example, the following command:
nvidia-smi -i 0 -lgc 1215,1215
will "lock" the GPU core clock to 1215 MHz on my Tesla V100 GPU. As far as I know, this effect takes place immediately, even if an application is running. Most other caveats I can think of should be similar for application clocks:
choose a valid GPU core clock, as output from the --query-supported-clocks command
GPU is not guaranteed to maintain the request in the event of throttling
elevated privilege is required
reset the behavior with -rgc
As indicated in the help, this switch "overrides" previous application clocks settings with respect to the core clock. Also, note that many switches come in two flavors, a "long" form and a "short" form. Where additional switch parameters are required, the long form often requires an = separator, while the short form often requires a space separator:
nvidia-smi -i 0 -lgc 1215,1215
or
nvidia-smi -i 0 --lock-gpu-clocks=1215,1215
you generally cannot intermix this formatting:
nvidia-smi -i 0 -lgc=1215,1215
will probably report an error.
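For completeness, recent NVML versions expose the same lock/reset operations programmatically via nvmlDeviceSetGpuLockedClocks and nvmlDeviceResetGpuLockedClocks. A minimal sketch (my own example, assuming device index 0, root privilege, a sufficiently new driver, a GPU generation that supports locked clocks, and that 1215 MHz is a valid core clock on your particular GPU):

/* Sketch: NVML equivalent of "nvidia-smi -lgc 1215,1215" followed by "-rgc". */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlReturn_t st = nvmlDeviceSetGpuLockedClocks(dev, 1215, 1215); /* min,max in MHz */
        printf("SetGpuLockedClocks: %s\n", nvmlErrorString(st));
        /* ... run your measurement here ... then restore default clock behavior: */
        nvmlDeviceResetGpuLockedClocks(dev);
    }
    nvmlShutdown();
    return 0;
}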
A FINAL NOTE:
This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
In my experience with T4, a possible observation is throttling. The T4 GPU is one of the lowest-power datacenter-grade GPUs, and it's certainly possible for the GPU compute demands to exceed what the power limit (70W) can support. In this case, the GPU clocks will throttle, and none of the above commands will allow you to override this behavior. By design, you cannot force the GPU to operate at elevated clocks when the GPU is trying to protect itself, or protect the system it is running in.
Also, the fact that a T4 is not actively cooled really should not matter. The only approved/supported usage setting for a T4 is in a server that is designed to handle the T4. (A similar statement is true for any NVIDIA Datacenter GPU). Such servers monitor the T4 GPU temperature and provide server-delivered forced flow-through cooling to the GPU. This is by design. The server is responsible for keeping the GPU in a proper temperature operating range. If the server is not doing that, you should address that with your server vendor. If you are operating the T4 GPU in a non-approved setting (such as a non-qualified server, or a desktop/workstation) then I would generally expect the experience with that device to be dismal.
MORE RECENTLY: NVIDIA has published this blog, which covers many of the same topics. If there are discrepancies between what I have stated above and the blog, the blog should be considered the best source.

NvLink or PCIe, how to specify the interconnect?

My cluster is equipped with both NVLink and PCIe. All the GPUs (V100) can communicate directly through either PCIe or NVLink. To my knowledge, both the PCIe switch and NVLink can support direct peer-to-peer links using CUDA.
Now, I want to compare the peer-to-peer communication performance of PCIe and NVLink. However, I don't know how to specify which one is used; it seems CUDA will always choose automatically. Could anyone help me?
If two GPUs in CUDA have a direct NVLink connection between them, and you enable Peer-to-Peer transfers, those transfers will flow over NVLink. There is no method of any kind in CUDA to alter this behavior.
If you do not enable Peer-to-Peer transfers, then data transfers (e.g. cudaMemcpy, cudaMemcpyAsync, cudaMemcpyPeerAsync) between those two devices will flow from the source GPU over PCIE to the CPU socket, (perhaps traversing intermediate PCIE switches, perhaps also flowing over a socket-level link such as QPI) and then over PCIE from the CPU socket to the other GPU. At least one CPU socket will always be involved, even if a shorter direct path exists across the PCIE fabric. This behavior is also not modifiable in any fashion available to the programmer.
Both methodologies are demonstrated using the p2pBandwidthLatencyTest CUDA sample code.
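For reference, a minimal sketch of the enable-peer-access path described above (my own example, assuming devices 0 and 1 are P2P-capable and directly connected; error checking is trimmed):

/* Sketch: enable peer access between device 0 and device 1, then copy a
 * buffer directly between them (over NVLink when a direct NVLink connection
 * exists, as described above). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { printf("P2P not supported between devices 0 and 1\n"); return 1; }

    size_t bytes = 64 << 20;              /* 64 MiB test buffer */
    void *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);     /* allow device 0 to access device 1 */
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);     /* allow device 1 to access device 0 */
    cudaMalloc(&buf1, bytes);

    /* With peer access enabled, this copy goes directly GPU-to-GPU. */
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    printf("peer copy: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

Skipping the two cudaDeviceEnablePeerAccess calls gives the staged-through-host path described above; the p2pBandwidthLatencyTest sample times both variants.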
The accepted answer -- from an NVIDIA employee -- was correct in 2018. But at some point, NVIDIA added an (undocumented?) option to the driver.
On Linux, you can now put this in /etc/modprobe.d/disable-nvlink.conf:
options nvidia NVreg_NvLinkDisable=1
This will disable NVLink when the driver is next loaded, forcing GPU peer-to-peer communication to use the PCIe interconnect. This gadget exists in driver 515.65.01 (CUDA 11.7.1). I am not sure when it was added.
As for "there is no reason to allow the end-user to choose the slower path", the very existence of this SO question suggests otherwise. In my case, we buy not one server, but dozens... And in the process of choosing our configuration, it is nice to use a single prototype system to benchmark our application using either NVLink or PCIe.

NVML Power readings with nvmlDeviceGetPowerUsage

I'm running an application using the NVML function nvmlDeviceGetPowerUsage().
The problem is that I always get the same number for different applications I'm running on a Tesla M2050.
Any suggestions?
If you read the documentation, you'll discover that there are some qualifiers on whether this function is available:
For "GF11x" Tesla ™and Quadro ®products from the Fermi family.
• Requires NVML_INFOROM_POWER version 3.0 or higher.
For Tesla ™and Quadro ®products from the Kepler family.
• Does not require NVML_INFOROM_POWER object.
And:
It is only available if power management mode is supported. See nvmlDeviceGetPowerManagementMode.
I think you'll find that power management mode is not supported on the M2050, and if you run that nvmlDeviceGetPowerManagementMode API call on your M2050 device, you'll get confirmation of that.
The M2050 is neither a Kepler GPU nor a GF11x Fermi GPU. It uses the GF100 Fermi GPU, so it is not covered by this API capability (and the GetPowerManagementMode API call would confirm that).
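As a quick check, here is a minimal sketch (my own example, assuming device index 0) that queries the power management mode before attempting a power reading; on a GF100-based M2050 I would expect the mode query to report the feature as disabled or unsupported:

/* Sketch: verify power management support before trusting nvmlDeviceGetPowerUsage. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlEnableState_t mode;
    unsigned int mw = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlReturn_t st = nvmlDeviceGetPowerManagementMode(dev, &mode);
        if (st != NVML_SUCCESS)
            printf("GetPowerManagementMode: %s\n", nvmlErrorString(st));
        else if (mode != NVML_FEATURE_ENABLED)
            printf("power management mode is disabled on this device\n");
        else if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            printf("current board power: %u mW\n", mw);  /* reported in milliwatts */
    }
    nvmlShutdown();
    return 0;
}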

cublas failed to synchronize stop event?

I'm playing with the matrixMulCUBLAS sample code and tried changing the default matrix sizes to something slightly more fun (rows = 5k x cols = 2.5k). The example then fails with the error Failed to synchronize on the stop event (error code unknown error)! at line #377, after all the computation is done and it is apparently cleaning up cublas. What does this mean, and how can it be fixed?
I've got CUDA 5.0 installed with an EVGA FTW NVIDIA GeForce GTX 670 with 2GB memory. The driver version is 314.22, the latest one as of today.
In general, when using CUDA on Windows, it's necessary to make sure the execution time of a single kernel is not longer than about 2 seconds. If the execution time becomes longer, you may hit a Windows TDR event. This is a Windows watchdog timer that will reset the GPU driver if it does not respond within a certain period of time. Such a reset halts the execution of your kernel and generates bogus results, as well as usually a briefly black display and a brief message in the system tray. If your kernel execution is triggering the Windows watchdog timer, you have a few options:
If you have the possibility to use more than one GPU in your system (i.e. usually not talking about a laptop here) and one of your GPUs is a Quadro or Tesla device, that Quadro or Tesla device can usually be placed in TCC mode. This means the GPU can no longer drive a physical display (if it was driving one) and that it is removed from the WDDM subsystem, so it is no longer subject to the watchdog timer. You can use the nvidia-smi.exe tool that ships with the NVIDIA GPU driver to modify the setting from WDDM to TCC for a given GPU. Use your Windows file search function to find nvidia-smi.exe, and then use nvidia-smi --help to get command-line help on how to switch from WDDM to TCC mode.
If the above method is not available to you (don't have 2 GPUs, don't have a Quadro or Tesla GPU...) then you may want to investigate changing the watchdog timer setting. Unfortunately this requires modifying the system registry, and the process and specific keys vary by OS. There are a number of resources on the web, such as here from Microsoft, as well as other questions on Stack Overflow, such as here, which may help with this.
A third option is simply to limit the execution time of your kernel(s). Successive operations might be broken into multiple kernel calls. The "gap" between kernel calls will allow the display driver to respond to the OS, and prevent the watchdog timeout.
The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.
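To illustrate the third option above, here is a rough sketch (my own example; the kernel, sizes, and chunking are illustrative only) of splitting one long-running operation into many short kernel launches, each of which should finish well inside the roughly 2-second watchdog window:

/* Sketch: process a large buffer in chunks so no single kernel launch runs
 * long enough to trigger the WDDM watchdog timer. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void process_chunk(float *data, size_t offset, size_t count)
{
    size_t i = offset + blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < offset + count)
        data[i] = data[i] * 2.0f + 1.0f;   /* placeholder work */
}

int main(void)
{
    const size_t n = 1 << 26;              /* total elements */
    const size_t chunk = 1 << 22;          /* per-launch work, sized to finish quickly */
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    for (size_t off = 0; off < n; off += chunk) {
        size_t cnt = (off + chunk <= n) ? chunk : (n - off);
        process_chunk<<<(unsigned)((cnt + 255) / 256), 256>>>(d, off, cnt);
        /* Each short launch returns control to the driver quickly, giving it
         * a chance to service the OS between launches. */
        cudaDeviceSynchronize();
    }
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d);
    return 0;
}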