I am using an NVIDIA GTX Titan X for deep learning experiments.
I use nvidia-smi to monitor the GPU's running state, but the perf(ormance) state the tool reports does not make sense.
I have checked the nvidia-smi manual, which says the following:
Performance State
The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
Without any process running on the GPU (idle state), the GPU performance state is P0.
However, when running a computation-heavy process, the state becomes P2.
My question is: why is my GPU in the P0 state at idle, but switching to P2 when running a heavy computation task? Shouldn't it be the opposite?
Also, is there a way to make my GPU always run in the P0 state (maximum performance)?
It is confusing.
The nvidia-smi manual is correct, however.
When a GPU or set of GPUs are idle, the process of running nvidia-smi on a machine will usually bring one of those GPUs out of the idle state. This is due to the information that the tool is collecting - it needs to wake up one of the GPUs.
This wake up process will initially bring the GPU to P0 state (highest perf state), but the GPU driver will monitor that GPU, and eventually start to reduce the performance state to save power, if the GPU is idle or not particularly busy.
On the other hand, when the GPUs are active with a workload, the GPU driver will, according to its own heuristic, continuously adjust the performance state to deliver best performance while matching the performance state to the actual workload. If no thermal or power limits are reached, the perf state should reach its highest level (P0) for the most active and heaviest, continuous workloads.
Workloads that are periodically heavy, but not continuous, may see the GPU power state fluctuate around levels P0-P2. GPUs that are "throttled" due to thermal (temperature) or power issues may also see reduced P-states. This type of throttling is evident and reported separately in nvidia-smi, but this type of reporting may not be enabled for all GPU types.
If you want to see the P0 state on your GPU, the best advice I can offer is to run a short, heavy, continuous workload (something that does a large sgemm operation, for example), and then monitor the GPU during that workload. It should be possible to see P0 state in that situation.
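For example, a simple way to watch this while your workload runs is a periodic nvidia-smi query (the exact set of supported query fields can vary by GPU and driver version):
nvidia-smi -i 0 --query-gpu=pstate,utilization.gpu,clocks.sm --format=csv -l 1
While a heavy, continuous kernel is running, the pstate column should report P0 (or P2, for GPUs that cap compute workloads at P2, as discussed below), assuming no throttling is occurring.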
If you are using a machine learning application (e.g. Caffe) that uses the cuDNN library, and you are training a large network, it should be possible to see P0 from time to time, because in this scenario cuDNN typically performs sgemm-like operations.
But for a sporadic workload, it's quite possible that the most commonly observed state would be P2.
To "force" a P0 power state always, you can try experimenting with the persistence mode and applications clocks via the nvidia-smi tool. Use nvidia-smi --help or the man page for nvidia-smi to understand the options.
Although I don't think this will typically apply to Tesla GPUs, some NVIDIA GPUs may limit themselves to a P2 power state under compute load unless application clocks are specifically set higher. Use the nvidia-smi -a command to see the current Application Clocks, the Default Application Clocks, and the Max Clocks available for your GPU. (Some GPUs, including older GPUs, may display N/A for some of these fields. That generally indicates the application clocks are not modifiable via nvidia-smi.)
If a card seems to run at the P2 state during compute load, you may be able to increase it to the P0 state by increasing the application clocks to the maximum available (i.e. Max Clocks). Use nvidia-smi --help to learn how to format the command to change the application clocks on your GPU. Modifying application clocks, or enabling modifiable application clocks, may require root/admin privilege.
It may also be desirable or necessary to set the GPU Persistence mode. This will prevent the driver from "unloading" during periods of GPU inactivity, which may otherwise cause the application clocks to be reset when the driver re-loads.
This default behavior, for the affected cards in this situation, of limiting to P2 under compute load, is by design of the GPU driver.
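As a rough sketch of that sequence (the clock values must come from your own GPU's supported list; the placeholders below are not literal values to type):
nvidia-smi -i 0 -pm 1
nvidia-smi -i 0 --query-supported-clocks=mem,gr --format=csv
nvidia-smi -i 0 -ac <memclk>,<graphicsclk>
The first command enables persistence mode, the second lists the valid <memory,graphics> pairs, and the third applies one of those pairs as the application clocks. Root/admin privilege is typically required.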
This somewhat related question/answer may also be of interest.
Related
I want to do some comparative profiling of a couple of CUDA kernels. However, one of them runs within a program which loads the GPU with more work, while the other is only running in a test harness.
For some GPUs, these circumstances mean the clock rates change (perhaps more than one kind of clock rate, because there are several). This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
Is it possible to prevent clock rates from changing due to load (or thermal conditions)?
I've looked into doing this with the nvidia-smi utility, which has a sub-command named clocks - but all that does is the following:
clocks -- Control and query clock information.
Usage: nvidia-smi clocks [options]
options include:
[-i | --id]: Enumeration index, PCI bus ID or UUID. Provide comma
separated values for more than one device
[ | --sync-boost-list]: List all synchronous boost groups
[ | --sync-boost-add]: Add a synchronous boost group
[ | --sync-boost-remove]: Remove a synchronous boost group. Provide the group id
returned from --sync-boost-list
... and it doesn't look like that's what I need. Of course, non-nvidia-smi-based solutions are welcome.
Notes:
I'm particularly interested in fixing clock rates for Quadro and Tesla cards, in case that matters.
I can be root if necessary.
Using CUDA 10.2 with its bundled driver. If absolutely necessary, I might be able to switch to a new version.
TL;DR
first, set persistence mode, e.g. nvidia-smi -i 0 -pm 1 (sets persistence mode for GPU index 0)
then use an nvidia-smi command like -ac or -lgc (application clocks, lock GPU clock)
there is nvidia-smi command-line help for all of this: nvidia-smi --help
this functionality may not work on your GPU; install the latest driver, and note that some of it is simply not available on certain products
these settings often require root privilege, or admin privilege on Windows
all of this is subject to change; with some care, the command-line help for the version you are using should be instructive
LONGER:
I'm using driver 455.23.05 for this description. Some features (e.g. -lgc) may not be available in older drivers. Setting persistence mode may be necessary for some of these features, and will also help to reduce variability on application start-up. This is not intended to be an exhaustive description of the nvidia-smi tool.
SETTING APPLICATION CLOCKS:
The application clocks feature should generally be useful for the testing described. It will not force the GPU clocks to remain at the specified setting when there is no application running (AFAIK), but the clocks should attain those values "as soon as" the application starts running. It allows you to specify both gpu clock (i.e. core clock) as well as memory clock. Let's start by excerpting the command line help text for some of the important switches:
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
To get started setting application clocks, you may need to use sudo or similar on linux for some or all of these commands. Also note above that the requirement for elevated privilege can be turned on/off. Also important is that you cannot pick any values you like for the <memory,graphics> settings pair. You must specify a pair, and furthermore the pair can only come from a list of permissible options. Other choices will result in unspecified behavior. These choices can be determined from the --query-supported-clocks switch to nvidia-smi (use --help-query-supported-clocks to get command-line help on that switch), which itself requires some formatting. For example, the following command will give an exhaustive list of the valid pairs that can be passed to the -ac command:
nvidia-smi -i 0 --query-supported-clocks=mem,gr --format=csv
Once you have that list of valid pairs, you can specify one of those pairs to the application clocks command:
nvidia-smi -i 0 -ac 877,1215
(The above command, if run with root or enabled via -acp would set the memory clock to 877MHz and the core clock to 1215MHz on my Tesla V100, for example. Note the -i switch to select the GPU to target with this command. The 877,1215 pair may not be valid on your GPU. Also note that the -acp feature is removed from drivers 465.xx and newer.)
When you are done with whatever you are doing, you may wish to reset the application clock behavior to the default behavior (GPU selects clock freqs according to its own heuristics) using -rac.
Also, a number of the pairs offered may involve "boosting" behavior. The GPU is not guaranteed to maintain all clocks exactly as you specify, if a throttling event occurs. Typical throttling events are:
GPU is consuming too much electrical power
GPU temperature is too high
The existence of an actual throttling event can be discovered using the "full" output from nvidia-smi (nvidia-smi -a); look for "clocks throttle reasons". Other useful information is available in this output, such as the default application clocks. When N/A appears in your output, it means that your GPU does not support this feature. There is a great variety of supported features across various GPU families, so I won't be able to respond to questions about this.
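For instance, the throttling-related section of that output can be displayed by itself with a display filter (the section and field names are those used by recent drivers and may differ on older ones):
nvidia-smi -i 0 -q -d PERFORMANCE
This shows the current performance state along with the individual "Clocks Throttle Reasons" (idle, applications clocks setting, SW power cap, HW slowdown, and so on).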
In the absence of a throttling event, and assuming your GPU supports the feature, I would expect application clocks to remain in effect throughout your application runtime. Note that if this command is specified while an application is currently running, the change in clocks may not take effect until the GPU becomes idle. You may wish to monitor GPU clocks in this case (again, using nvidia-smi). Therefore I would generally recommend using these commands when the GPU is idle. Then begin your work on the GPU after that.
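To monitor the clocks themselves, a simple polling query such as the following can be used (again, field support varies by GPU):
nvidia-smi -i 0 --query-gpu=pstate,clocks.applications.graphics,clocks.sm,clocks.applications.memory,clocks.mem --format=csv -l 1
This lets you compare the requested application clocks against the clocks the GPU is actually running at, once your work begins.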
LOCK GPU (CORE) CLOCK:
In many cases, the gpu core clock (core, gpu, graphics are all synonyms in this context) exhibits the most variability (for example the application clocks offered on my Tesla V100 only include a value of 877MHz for memory clock; no other choices are possible). There is a separate switch that can be used to "lock" the GPU core clock to a range of values.
-lgc --lock-gpu-clocks= Specifies <minGpuClock,maxGpuClock> clocks as a
pair (e.g. 1500,1500) that defines the range
of desired locked GPU clock speed in MHz.
Setting this will supercede application clocks
and take effect regardless if an app is running.
Input can also be a singular desired clock value
(e.g. <GpuClockValue>).
-rgc --reset-gpu-clocks
Resets the Gpu clocks to the default values.
This range is specified using a lower and upper endpoint for the range. If you wish to select a specific value only, you can specify the lower and upper endpoints both to be that value. As far as I know the range endpoints are inclusive.
For example, the following command:
nvidia-smi -i 0 -lgc 1215,1215
will "lock" the GPU core clock to 1215 MHz on my Tesla V100 GPU. As far as I know, this effect takes place immediately, even if an application is running. Most other caveats I can think of should be similar for application clocks:
choose a valid GPU core clock, as output from the --query-supported-clocks command
GPU is not guaranteed to maintain the request in the event of throttling
elevated privilege is required
reset the behavior with -rgc
As indicated in the help, this switch "overrides" previous application clocks settings with respect to core clock. Also, note that many switches come in 2 flavors, a "long" form and a "short" form. Where additional switch parameters are required, the long form often requires an = separator, the short form often requires a space separator:
nvidia-smi -i 0 -lgc 1215,1215
or
nvidia-smi -i 0 --lock-gpu-clocks=1215,1215
you generally cannot intermix this formatting:
nvidia-smi -i 0 -lgc=1215,1215
will probably report an error.
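Finally, after issuing either a -lgc or -ac command, you can confirm what the driver is reporting, for example with:
nvidia-smi -i 0 -q -d CLOCK
which displays the current clocks alongside the applications clocks, default applications clocks, and max clocks (fields that are not supported on your GPU will show N/A).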
A FINAL NOTE:
This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
In my experience with the T4, a possible observation is throttling. The T4 GPU is one of the lowest-power datacenter-grade GPUs, and it's certainly possible for the GPU compute demands to exceed what the power limit (70W) can support. In this case, the GPU clocks will throttle, and none of the above commands will allow you to override this behavior. By design, you cannot force the GPU to operate at elevated clocks when the GPU is trying to protect itself, or protect the system it is running in.
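A simple way to check this on a T4 (or any power-limited GPU) is to watch power draw against the power limit, together with the reported throttle reasons, while the workload runs; these query fields are from recent drivers and may not all be available on every product:
nvidia-smi -i 0 --query-gpu=power.draw,power.limit,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_thermal_slowdown --format=csv -l 1
If the power-cap or thermal-slowdown columns report "Active" while the clocks drop, the reduced clocks are throttling, not a settings problem.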
Also, the fact that a T4 is not actively cooled really should not matter. The only approved/supported usage setting for a T4 is in a server that is designed to handle the T4. (A similar statement is true for any NVIDIA Datacenter GPU). Such servers monitor the T4 GPU temperature and provide server-delivered forced flow-through cooling to the GPU. This is by design. The server is responsible for keeping the GPU in a proper temperature operating range. If the server is not doing that, you should address that with your server vendor. If you are operating the T4 GPU in a non-approved setting (such as a non-qualified server, or a desktop/workstation) then I would generally expect the experience with that device to be dismal.
MORE RECENTLY: NVIDIA has published this blog, which covers many of the same topics. If there are discrepancies between what I have stated above and the blog, the blog should be considered the best source.
Starting up the CUDA runtime takes a certain amount of time to harmonize the UVM memory maps of the device and the host; see:
cudaGetCacheConfig takes 0.5 seconds - how/why?
slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice
Now, it's been suggested to me that using Persistence Mode would mitigate this phenomenon significantly. In what way? I mean, what will happen, or fail to happen, when persistence mode is on and a process using CUDA exits?
The documentation says:
Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it.
but - what does "keeping initialized" mean? Later, the section about the persistence daemon (which is not the same thing as persistence mode) says:
The GPU state remains loaded in the driver whenever one or more clients have the device file open. Once all clients have closed the device file, the GPU state will be unloaded unless persistence mode is enabled.
So what exactly is unloaded? Where is it unloaded to? How does it relate to the memory size? And why would it take so much time to load it back if nothing significant has happened on the system?
There are 2 major pieces to the GPU/CUDA start-up sequence:
Device initialization time
CUDA context "lazy" initialization
A modern CUDA GPU can exist in one of several power states. The current power state is observable via nvidia-smi or via NVML (although note that the effect of running a tool like nvidia-smi may modify the power state of a GPU.) When the GPU is not being used for any purpose (i.e. it is idle, technically: no contexts of any kind are instantiated on the GPU) and persistence mode is not enabled, the GPU, in concert with the GPU driver, will automatically reduce its power state to a very low level, sometimes including a complete power-off scenario.
The process of moving a GPU to a lower power state will involve shutting off or modifying the behavior of various pieces of hardware. For example, reducing memory clocks, reducing core clocks, shutting off display output, shutting off the memory subsystem, shutting off various internal subsystems such as clock generators, and even major parts of the chip, such as the compute cores, caches, etc. and potentially even a "complete" power-down of the chip. A modern GPU has a controllable power delivery system, both on-chip and off-chip, to enable this behavior.
To reverse this process, the GPU driver software must carefully (in a prescribed sequence) power up modules, wait for a hardware settling time, then apply a module-level reset, then begin initializing controlling registers in the module. For example, powering up memory would involve, amongst other things, turning on the on-chip DRAM control module, turning on DRAM power, turning on the memory pin drivers, setting slew rates, turning on the memory clock, initializing the memory clock generator PLL for desired operation, and in many cases, initializing memory to some known state. For example, proper ECC usage requires that memory be initialized to a known state, which may not be simply all zeroes, but involves ECC tags which must be computed and stored. This "ECC Scrub" is one example of a "time-consuming" process mentioned in the documentation.
Depending on the exact power state, there may be any number of things that the driver must do to bring the GPU to the next higher power state (or "performance state"), P0 being the highest state. Once the perf state is above a certain level (say, P8) then the GPU may be capable of supporting certain types of contexts (e.g. a compute context) but perhaps at a reduced performance level (unless you are at P0).
These operations take time, and persistence mode will generally keep the GPU at power/perf state P2 or P0, meaning that essentially none of the above steps must be performed if it is desired that a context be opened on the GPU.
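A minimal sketch of enabling and checking persistence mode (root is typically required; on newer Linux drivers the nvidia-persistenced daemon is the recommended mechanism, but the nvidia-smi switch still works on many systems):
sudo nvidia-smi -i 0 -pm 1
nvidia-smi -i 0 --query-gpu=persistence_mode,pstate --format=csv
With persistence mode enabled and no clients running, you would expect to see an idle-but-initialized state (e.g. P2 or P8, depending on GPU and driver) rather than a fully powered-down device.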
However, opening a GPU context may involve start-up costs of its own, that the GPU cannot or does not keep track of. For example, opening a compute context in a UVA regime requires, among other things, that "virtual allocations" be requested of the host OS, and that the memory maps of all processors in the system (all "visible" GPUs, plus the CPU) be "harmonized" so that everyone has a unique space to work in, and the numerical value of a 64-bit pointer in the space can be used to uniquely determine "ownership" or "meaning/introspection" of that pointer.
For the most part, activities related to opening a CUDA context (other than the process of bringing the device to a state where it can support a context) will not be impacted or benefitted by having the GPU in persistence mode.
Since both device initialization, and CUDA context creation may impact perceived "CUDA startup time", then persistence mode may improve/mitigate the overall perceived start-up time, but it cannot reduce it to zero, since some activities associated with context creation are outside of its purview.
The exact behavior of persistence mode may vary over time and by GPU type. Recently, it seems that persistence mode may still allow GPUs to move down to a power state of P8.
Suppose I have 4 GPUs and would like to run 50 CUDA programs in parallel. My question is: is the NVIDIA driver smart enough to run the 50 CUDA programs on the different GPUs or do I have to set the CUDA device for each program?
thank you
The first point to make is that you cannot run 50 applications in parallel on 4 GPUs on just about any CUDA platform. If you have a Hyper-Q capable GPU, there is the possibility of up to 32 threads or MPI processes queuing work to the GPU. Otherwise there is a single command queue.
For anything other than the latest Kepler Tesla cards, the CUDA driver only supports a single active context at a time. If you run more than one application on a GPU, the processes will each have contexts which just contend with one another on a "first come, first served" basis. If one application blocks the other with a long running kernel or similar, there is no pre-emption or anything else which makes the process yield to another process. When the GPU is shared with a display manager, there is a watchdog timer that will impose an upper limit of a few seconds before the application will get its context killed. The result is that only one context ever runs on the hardware at a time. Context switching isn't free, and there is a performance penalty to having multiple processes contending for a single device.
Furthermore, every context present on a GPU requires device memory. On the platform you are asking about, linux, there is no memory paging, so every context's resources must coexist in GPU memory. I don't believe it would be possible to have 12 non-trivial contexts running on any current GPU simultaneously - you would run out of available memory well before that number. Trying to run more applications would result in a context establishment failure.
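If you want to see how much device memory each running context is consuming, nvidia-smi can report it per compute process (assuming your GPU and driver expose this accounting):
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
Comparing the sum of those values against the total reported by nvidia-smi --query-gpu=memory.total --format=csv gives a rough idea of how many contexts could fit.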
As for the behaviour of the driver distributing multiple applications on multiple GPUs, AFAIK the linux driver doesn't do any intelligent distribution of processes amongst GPUs, except when one or more of the GPUs are in a non-default compute mode. If no device is specifically requested, the driver will always try and find the first valid, free GPU it can run a process or thread on. If a GPU is busy and marked compute exclusive (either thread or process) or marked prohibited, then the driver will skip over it when trying to find a GPU to run on. If all GPUs are exclusive and occupied or prohibited, then the application will fail with a no valid device available error.
So in summary, for everything other than Hyper-Q devices, there is no performance gain in doing what you are asking about (quite the opposite), and I would expect it to break if you tried. A much saner approach would be to use compute exclusivity in combination with a resource-managing task scheduler like Torque or one of the (former) Sun Grid Engine versions, which could schedule your processes to run in an orderly fashion according to the availability of GPUs. This is how most general-purpose HPC clusters deal with scheduling in multi-GPU environments.
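As a sketch of the compute-exclusivity part (the mode names shown are from current drivers; older drivers also offered an EXCLUSIVE_THREAD mode), combined with explicitly pinning a job to one GPU via the environment, an approach many schedulers use:
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
CUDA_VISIBLE_DEVICES=1 ./my_cuda_app
Here my_cuda_app is just a placeholder for your own executable; the CUDA_VISIBLE_DEVICES setting makes it enumerate only GPU 1.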
I am trying to run my code on NVIDIA's K10 GPU. I am using the 5.0 CUDA driver and the 4.2 CUDA runtime. The problem is that the time taken by the kernel increases with iterations, where each iteration uses the same number of sources and targets (or particles). Because of this, the kernel eventually takes a very long time, and the code crashes with a runtime error that says something like "GPU fallen off the bus".
The plot showing the behavior of increasing kernel run time with number of iterations can be seen here:
https://docs.google.com/open?id=0B5QLL4ig3LVqODdmVjNBTlp5UFU
I tried to run the NVIDIA "nbody" example to understand if the same thing happens here too, and yes it does. For the number of particles/bodies (Np) = 1e5 and 10 iterations, code runs fine. For Np=1e5 and iterations= 100, OR Np=1e6 and iterations = 10, code goes into a mode where it hangs the entire system.
When I run my own kernel as well as NVIDIA's nbody example on a different machine with Tesla C2050 NVIDIA card (CUDA Driver version: 3.2, and runtime version: 3.2), there is no problem, and kernel takes the same amount of time for every iteration.
I am trying to understand what's going on in the machine with the K10 GPU. I have tried different combinations of CUDA driver and runtime versions on this machine, and here is what I get:
For 5.0 CUDA Driver, 4.2 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
For 4.2 CUDA Driver, 4.2 Runtime, the codes (nbody as well as my code) crash with error: "CUDA Runtime API error 39: uncorrectable ECC error encountered."
For 5.0 CUDA Driver, 5.0 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
This is a 64-bit linux machine, which we have recently assembled with NVIDIA K10 GPU card. I am using gfortran44 and gcc44.
Please let me know if any other info. is required to track the problem.
Thanks in advance for the help!
M
I'm mostly just creating an answer so we can call this question closed, but I'll try to add a few details.
Tesla GPUs come in 2 distinct categories: those with a fan, and those without. Those with a fan carry (at this time) the "C" designation, although the K20 product family naming will be slightly different:
These are not exhaustive lists:
Tesla GPUs with a Fan: C870, C1060, C2050, C2070, C2075, K20c ("C Class")
Tesla GPUs without a Fan: M1060, M2050, M2070, M2075, M2090, K10, K20, K20X ("M class")
(note that there is currently no K10 type product with a fan or "C" designation)
Tesla GPUs with a fan are designed to be plugged into a wide variety of PC boxes and chassis, including various workstation and server variants. Since they have their own fan, they require a supply of inlet air that is below a certain temperature level, but given that, they will keep themselves cool. As the workload increases, and the generated heat increases, they will spin up their own fan to keep themselves cool. The main ways you can screw up this process are by either restricting the inlet air flow or by putting it in an ambient air environment that is hotter than its max inlet spec.
Tesla GPUs without a fan have something called a passive heatsink; they cannot keep themselves cool independently and take a passive role in the cooling process. They still have a temperature sensor, but it becomes the responsibility of the server BMC (baseboard management controller) to monitor this temperature sensor (this is done directly at the hardware/firmware level, independent of any OS or any activity being directed at the GPU), and to direct a level of airflow over the card that is sufficient to keep the card cool based on its indicated temperature. The BMC does this by ramping up whatever fans are designed into the server chassis that control airflow over the GPU. Normally there will be shrouding/ducting within the chassis to aid in this process. Server manufacturers integrating these cards have a variety of responsibilities and must follow various technical specifications from NVIDIA in order to make this work.
If you happen to get your hands on a Tesla GPU without a fan and just slap it in some random chassis, you're pretty much guaranteed to have the behavior as described in this question. For this reason, Tesla "M" series and "K" series GPUs are normally only sold to OEMs who have undergone the qualification process.
Since the average sysadmin/system assembler is not likely to devise a suitable closed-loop fan control system, and normally does not have easy access to the necessary specifications defining the temperature sensor and access method, the only (kludgey) workaround, if you have one of these that you simply must play with, is to direct a high level of continuous airflow over the card in whatever setting you put it. Be advised that this will most likely be noisy. If you don't have a noisy level of airflow, you probably do not have enough airflow to keep a card cool that is in a high-workload situation.
In addition, you should probably keep an eye on GPU temps. Note that the nvidia-smi method for monitoring GPU temps does not work for all M class GPUs (i.e. GPUs without a fan). Unfortunately, the method of temperature sensor access in Fermi and prior generations for the M class GPUs (different from the C class GPUs) was such that it could not be readily monitored in-system via the nvidia-smi command, so in these cases you will get no temperature reading from nvidia-smi, making this approach even harder to manage. Things changed with the Kepler generation, so now the temperature can be monitored both by the nvidia-smi method and by the server BMC at the hardware/firmware level.
C class products with a fan have a temperature that can be monitored with nvidia-smi, regardless of generation. But this is normally not necessary, since the card has its own control system to keep itself cool.
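Where the temperature is readable by nvidia-smi, a simple polling query looks like this (fan.speed will report N/A on passively cooled boards, and as noted above, the temperature itself may be N/A on some older M class GPUs):
nvidia-smi --query-gpu=index,name,temperature.gpu,fan.speed --format=csv -l 5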
As mentioned in the comments, all GPUs also have a variety of protection mechanisms, none of which are guaranteed to prevent damage. (If you throw the card in a fire, there's nothing to be done about that.) But the first typical mechanism is thermal throttling. At some predefined high temperature near the maximum safe operating range of the GPU, the GPU firmware will independently reduce its clocks to attempt to prevent further temperature rise. (If the card is clocked slower, then generally its ability to generate heat is also somewhat reduced.) This is a crude mechanism, and when this thermal throttling occurs, something in the cooling arena is already wrong. The card is designed never to enter thermal throttling under normal operating conditions. If temperatures continue to rise (and there is not much headroom at this point), the card will enter its final protection mode, which is to halt itself. At this point the GPU has become unresponsive to the system, and at the OS level, messages like "GPU has fallen off the bus" are typical. This means cooling has failed and the protection mechanisms have failed.
Is there any power mode selection feature for NVIDIA GPUs? For example, normal, high-performance, or power-save mode? If so, is it possible to select a power mode when I compile my CUDA program with nvcc? And is it possible to check the current power mode status of my GPU?
I could not find any information about this, though I have searched the web for quite a while.
Depending on your GPU, you may be able to use nvidia-smi or NVML (check out the documentation for more information) to read the current power state. The GPU will dynamically change the power state to conserve power when idle and to provide performance when under load.
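For example, a query like the following (assuming a reasonably recent driver; unsupported fields will show N/A) reports the current performance state along with clocks and power draw:
nvidia-smi --query-gpu=name,pstate,clocks.sm,clocks.mem,power.draw --format=csv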
As a user it is not possible to set the power state of the GPU - the Tesla product line does have power-capping for the server products but that's not under user control obviously.