Starting up the CUDA runtime takes a certain amount of time to harmonize the UVM memory maps of the device and the host; see:
cudaGetCacheConfig takes 0.5 seconds - how/why?
slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice
Now, it's been suggested to me that using Persistence Mode would mitigate this phenomenon significantly. In what way? I mean, what will happen, or fail to happen, when persistence mode is on and a process using CUDA exits?
The documentation says:
Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it.
but - what does "keeping initialized" mean? Later, the section about the persistence daemon (which is not the same thing as persistence mode) says:
The GPU state remains loaded in the driver whenever one or more clients have the device file open. Once all clients have closed the device file, the GPU state will be unloaded unless persistence mode is enabled.
So what exactly is unloaded? To where is it unloaded? How does it relate to the memory size? And why would it take so much time to load it back if nothing significant has happened on the system?
There are 2 major pieces to the GPU/CUDA start-up sequence:
Device initialization time
CUDA context "lazy" initialization
A modern CUDA GPU can exist in one of several power states. The current power state is observable via nvidia-smi or via NVML (although note that running a tool like nvidia-smi may itself modify the power state of a GPU). When the GPU is not being used for any purpose (i.e. it is idle; technically: no contexts of any kind are instantiated on the GPU) and persistence mode is not enabled, the GPU, in concert with the GPU driver, will automatically reduce its power state to a very low level, sometimes including a complete power-off scenario.
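(As an aside, the performance state can also be read programmatically through NVML rather than by parsing nvidia-smi output. The following is a minimal sketch, assuming GPU index 0 and linking with -lnvidia-ml; note that, just like running nvidia-smi, merely querying the device may wake it out of its lowest power state.)

#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);          // GPU index 0, adjust as needed

    nvmlPstates_t pstate;
    nvmlDeviceGetPerformanceState(dev, &pstate);  // 0 == P0 (highest); larger numbers == lower perf states
    printf("current performance state: P%d\n", (int)pstate);

    nvmlShutdown();
    return 0;
}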
The process of moving a GPU to a lower power state will involve shutting off or modifying the behavior of various pieces of hardware. For example, reducing memory clocks, reducing core clocks, shutting off display output, shutting off the memory subsystem, shutting off various internal subsystems such as clock generators, and even major parts of the chip, such as the compute cores, caches, etc. and potentially even a "complete" power-down of the chip. A modern GPU has a controllable power delivery system, both on-chip and off-chip, to enable this behavior.
To reverse this process, the GPU driver software must carefully (in a prescribed sequence) power up modules, wait for a hardware settling time, then apply a module-level reset, then begin initializing controlling registers in the module. For example, powering up memory would involve, amongst other things, turning on the on-chip DRAM control module, turning on DRAM power, turning on the memory pin drivers, setting slew rates, turning on the memory clock, initializing the memory clock generator PLL for desired operation, and in many cases, initializing memory to some known state. For example, proper ECC usage requires that memory be initialized to a known state, which may not be simply all zeroes, but involves ECC tags which must be computed and stored. This "ECC Scrub" is one example of a "time-consuming" process mentioned in the documentation.
Depending on the exact power state, there may be any number of things that the driver must do to bring the GPU to the next higher power state (or "performance state"), P0 being the highest state. Once the perf state is above a certain level (say, P8) then the GPU may be capable of supporting certain types of contexts (e.g. a compute context) but perhaps at a reduced performance level (unless you are at P0).
These operations take time, and persistence mode will generally keep the GPU at power/perf state P2 or P0, meaning that essentially none of the above steps must be performed if it is desired that a context be opened on the GPU.
However, opening a GPU context may involve start-up costs of its own that the GPU cannot or does not keep track of. For example, opening a compute context in a UVA regime requires, among other things, that "virtual allocations" be requested of the host OS, and that the memory maps of all processors in the system (all "visible" GPUs, plus the CPU) be "harmonized" so that everyone has a unique space to work in, and the numerical value of a 64-bit pointer in that space can be used to uniquely determine "ownership" or "meaning/introspection" of that pointer.
For the most part, activities related to opening a CUDA context (other than the process of bringing the device to a state where it can support a context) will not be impacted or benefitted by having the GPU in persistence mode.
Since both device initialization and CUDA context creation may impact perceived "CUDA startup time", persistence mode may improve/mitigate the overall perceived start-up time, but it cannot reduce it to zero, since some activities associated with context creation are outside of its purview.
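To get a feel for how much of the delay persistence mode can actually hide on a given system, one simple experiment is to time the first CUDA calls of a process with persistence mode off and then again with it on. A minimal sketch (the cudaFree(0) call is just a common way to force lazy context creation to happen at a known point):

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    cudaSetDevice(0);   // select the device
    cudaFree(0);        // force lazy context creation to happen now
    auto t1 = std::chrono::steady_clock::now();
    printf("device init + context creation: %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}

Run it once after nvidia-smi -i 0 -pm 0 and once after -pm 1; roughly speaking, the difference is the device-initialization portion, while the remainder is context-creation work that persistence mode cannot remove.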
The exact behavior of persistence mode may vary over time and by GPU type. Recently, it seems that persistence mode may still allow GPUs to move down to a power state of P8.
Related
I want to do some comparative profiling of a couple of CUDA kernels. However, one of them runs within a program which loads the GPU with more work, while the other is only running in a test harness.
For some GPUs, these circumstances mean the clock rates change (perhaps more than one kind of clock rate, because there are several). This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
Is it possible to prevent clock rates from changing due to load (or thermal conditions)?
I've looked into doing this with the nvidia-smi utility, which has a sub-command named clocks - but all that does is the following:
clocks -- Control and query clock information.
Usage: nvidia-smi clocks [options]
options include:
[-i | --id]: Enumeration index, PCI bus ID or UUID. Provide comma
separated values for more than one device
[ | --sync-boost-list]: List all synchronous boost groups
[ | --sync-boost-add]: Add a synchronous boost group
[ | --sync-boost-remove]: Remove a synchronous boost group. Provide the group id
returned from --sync-boost-list
... and it doesn't look like that's what I need. Of course, non-nvidia-smi-based solutions are welcome.
Notes:
I'm particularly interested in fixing clock rates for Quadro and Tesla cards, in case that matters.
I can be root if necessary.
Using CUDA 10.2 with its bundled driver. If absolutely necessary, I might be able to switch to a newer version.
TL;DR
first, set persistence mode e.g. nvidia-smi -i 0 -pm 1 (sets persistence mode for the GPU index 0)
use an nvidia-smi command like -ac or -lgc (application clocks, lock gpu clock)
there is nvidia-smi command-line help for all of this: nvidia-smi --help
this functionality may not work on your GPU: install the latest driver, and note that some of this functionality is simply not available on certain products
these settings often require root privilege on Linux, or admin privilege on Windows
all of this description is subject to change. With some care, the command-line help for the version you are using should be instructive
LONGER:
I'm using driver 455.23.05 for this description. Some features (e.g. -lgc) may not be available in older drivers. Setting persistence mode may be necessary for some of these features, and will also help to reduce variability on application start-up. This is not intended to be an exhaustive description of the nvidia-smi tool.
SETTING APPLICATION CLOCKS:
The application clocks feature should generally be useful for the testing described. It will not force the GPU clocks to remain at the specified setting when there is no application running (AFAIK), but the clocks should attain those values "as soon as" the application starts running. It allows you to specify both gpu clock (i.e. core clock) as well as memory clock. Let's start by excerpting the command line help text for some of the important switches:
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
To get started setting application clocks, you may need to use sudo or similar on Linux for some or all of these commands. Also note above that the requirement for elevated privilege can be turned on/off. Also important is that you cannot pick any values you like for the <memory,graphics> settings pair. You must specify a pair, and furthermore the pair can only come from a list of permissible options. Other choices will result in unspecified behavior. These choices can be determined from the --query-supported-clocks switch (use --help-query-supported-clocks to get command-line help on that switch) to nvidia-smi, which itself requires some formatting. For example, the following command will give an exhaustive list of the valid pairs that can be passed to the -ac command:
nvidia-smi -i 0 --query-supported-clocks=mem,gr --format=csv
Once you have that list of valid pairs, you can specify one of those pairs to the application clocks command:
nvidia-smi -i 0 -ac 877,1215
(The above command, if run with root or enabled via -acp would set the memory clock to 877MHz and the core clock to 1215MHz on my Tesla V100, for example. Note the -i switch to select the GPU to target with this command. The 877,1215 pair may not be valid on your GPU. Also note that the -acp feature is removed from drivers 465.xx and newer.)
When you are done with whatever you are doing, you may wish to reset the application clock behavior to the default behavior (GPU selects clock freqs according to its own heuristics) using -rac.
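Since non-nvidia-smi-based solutions were also welcomed: the same application-clocks controls are exposed through NVML. A rough sketch (assuming GPU index 0, linking with -lnvidia-ml, and root privilege; the 877,1215 pair is only valid on my V100, substitute a pair reported as supported on your own GPU):

#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Enumerate supported memory clocks, and the graphics clocks valid for each
    // (the NVML equivalent of --query-supported-clocks). Enlarge the arrays if needed.
    unsigned int memCount = 32, memClocks[32];
    nvmlDeviceGetSupportedMemoryClocks(dev, &memCount, memClocks);
    for (unsigned int i = 0; i < memCount; ++i) {
        unsigned int grCount = 128, grClocks[128];
        nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[i], &grCount, grClocks);
        printf("mem %u MHz: %u graphics clock options\n", memClocks[i], grCount);
    }

    // NVML equivalent of `nvidia-smi -ac 877,1215`
    nvmlReturn_t rc = nvmlDeviceSetApplicationsClocks(dev, 877, 1215);
    printf("set application clocks: %s\n", nvmlErrorString(rc));

    // nvmlDeviceResetApplicationsClocks(dev);  // equivalent of `nvidia-smi -rac`
    nvmlShutdown();
    return 0;
}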
Also, a number of the pairs offered may involve "boosting" behavior. The GPU is not guaranteed to maintain all clocks exactly as you specify, if a throttling event occurs. Typical throttling events are:
GPU is consuming too much electrical power
GPU temperature is too high
The existence of an actual throttling event can be discovered using the "full" output from nvidia-smi (nvidia-smi -a); look for "clocks throttle reasons". Other useful information is available in this output, such as the default application clocks. When N/A appears in your output, it means that your GPU does not support this feature. There is a great variety of supported features across various GPU families, so I won't be able to respond to questions about this.
In the absence of a throttling event, and assuming your GPU supports the feature, I would expect application clocks to remain in effect throughout your application runtime. Note that if this command is specified while an application is currently running, the change in clocks may not take effect until the GPU becomes idle. You may wish to monitor GPU clocks in this case (again, using nvidia-smi). Therefore I would generally recommend using these commands when the GPU is idle. Then begin your work on the GPU after that.
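If you want to monitor the clocks programmatically rather than by eyeballing nvidia-smi, a small NVML polling loop along these lines (again an illustrative sketch, GPU index 0 assumed, Linux sleep() used for pacing) will show the current SM/memory clocks and whether a common throttle reason is active:

#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int i = 0; i < 10; ++i) {           // sample once per second for 10 seconds
        unsigned int sm = 0, mem = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem);

        unsigned long long reasons = 0;
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

        printf("SM %4u MHz  MEM %4u MHz  %s%s\n", sm, mem,
               (reasons & nvmlClocksThrottleReasonSwPowerCap) ? "[power cap] " : "",
               (reasons & nvmlClocksThrottleReasonHwSlowdown) ? "[hw slowdown/thermal] " : "");
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}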
LOCK GPU (CORE) CLOCK:
In many cases, the gpu core clock (core, gpu, graphics are all synonyms in this context) exhibits the most variability (for example the application clocks offered on my Tesla V100 only include a value of 877MHz for memory clock; no other choices are possible). There is a separate switch that can be used to "lock" the GPU core clock to a range of values.
-lgc --lock-gpu-clocks= Specifies <minGpuClock,maxGpuClock> clocks as a
pair (e.g. 1500,1500) that defines the range
of desired locked GPU clock speed in MHz.
Setting this will supercede application clocks
and take effect regardless if an app is running.
Input can also be a singular desired clock value
(e.g. <GpuClockValue>).
-rgc --reset-gpu-clocks
Resets the Gpu clocks to the default values.
This range is specified using a lower and upper endpoint for the range. If you wish to select a specific value only, you can specify the lower and upper endpoints both to be that value. As far as I know the range endpoints are inclusive.
For example, the following command:
nvidia-smi -i 0 -lgc 1215,1215
will "lock" the GPU core clock to 1215 MHz on my Tesla V100 GPU. As far as I know, this effect takes place immediately, even if an application is running. Most other caveats I can think of should be similar for application clocks:
choose a valid GPU core clock, as output from the --query-supported-clocks command
GPU is not guaranteed to maintain the request in the event of throttling
elevated privilege is required
reset the behavior with -rgc
As indicated in the help, this switch "overrides" previous application-clocks settings with respect to the core clock. Also, note that many switches come in two flavors, a "long" form and a "short" form. Where additional switch parameters are required, the long form generally requires an = separator, while the short form requires a space separator:
nvidia-smi -i 0 -lgc 1215,1215
or
nvidia-smi -i 0 --lock-gpu-clocks=1215,1215
you generally cannot intermix this formatting:
nvidia-smi -i 0 -lgc=1215,1215
will probably report an error.
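For completeness, the clock-locking switch also has an NVML counterpart, which can be handy if you want your test harness to lock and restore the clocks around the measured region itself. A sketch under the same assumptions as before (GPU 0, root privilege, a recent driver, and a Volta-or-newer GPU; the 1215 MHz value is from my V100, substitute your own supported value):

#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    // NVML equivalent of `nvidia-smi -i 0 -lgc 1215,1215`
    nvmlReturn_t rc = nvmlDeviceSetGpuLockedClocks(dev, 1215, 1215);
    printf("lock gpu clocks: %s\n", nvmlErrorString(rc));

    // ... run the work you want to profile here ...

    nvmlDeviceResetGpuLockedClocks(dev);     // equivalent of `nvidia-smi -rgc`
    nvmlShutdown();
    return 0;
}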
A FINAL NOTE:
This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
In my experience with the T4, a possible observation is throttling. The T4 GPU is one of the lowest-power datacenter-grade GPUs, and it's certainly possible for the GPU compute demands to exceed what the power limit (70W) can support. In this case, the GPU clocks will throttle, and none of the above commands will allow you to override this behavior. By design, you cannot force the GPU to operate at elevated clocks when the GPU is trying to protect itself, or protect the system it is running in.
Also, the fact that a T4 is not actively cooled really should not matter. The only approved/supported usage setting for a T4 is in a server that is designed to handle the T4. (A similar statement is true for any NVIDIA Datacenter GPU). Such servers monitor the T4 GPU temperature and provide server-delivered forced flow-through cooling to the GPU. This is by design. The server is responsible for keeping the GPU in a proper temperature operating range. If the server is not doing that, you should address that with your server vendor. If you are operating the T4 GPU in a non-approved setting (such as a non-qualified server, or a desktop/workstation) then I would generally expect the experience with that device to be dismal.
MORE RECENTLY: NVIDIA has published this blog which covers many of the same topics. If there are discrepancies between what I have stated above and the blog, the blog should be considered the better source.
I always thought that Hyper-Q technology is nothing but streams on the GPU. Later I found I was wrong (am I?). So I was doing some reading about Hyper-Q and got more confused.
I was going through one article and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
Of the points above, Point B says that there can be multiple connections created to a single GPU from the host. Does it mean I can create multiple contexts on a single GPU through different applications? Does it mean that I will have to execute all applications on different streams? What if all my connections are memory- and compute-resource hungry: who manages the resource (memory/cores) scheduling?
Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues only 8 are used by default to save resources. Set the CUDA_DEVICE_MAX_CONNECTIONS environment variable to a higher value if you need more.
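As an illustration, here is a small sketch that launches many single-block kernels into separate streams; with HyperQ (and CUDA_DEVICE_MAX_CONNECTIONS raised if you use more than 8 streams), the launches can be fed through independent hardware queues and overlap regardless of host-side issue order. The spin kernel and cycle count are arbitrary, just enough to make overlap visible in a profiler.

// build: nvcc hyperq.cu -o hyperq
// run:   CUDA_DEVICE_MAX_CONNECTIONS=16 ./hyperq
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }   // keep one block busy for roughly `cycles` clocks
}

int main() {
    const int nStreams = 16;
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

    // Each kernel uses a single block, so many of them can execute concurrently
    // when they are fed to the device through separate work queues.
    for (int i = 0; i < nStreams; ++i)
        spin<<<1, 1, 0, streams[i]>>>(100000000LL);

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    printf("done\n");
    return 0;
}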
I am using Nvidia GTX Titan X to do deep learning experiment.
I am using nvidia-smi to monitor the GPU running state, but the perf(ormance) state the tool reports does not make sense.
I have checked the nvidia-smi manual, and it says the following:
Performance State
The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
Without running any process on the GPU (idle state), the GPU performance state is P0.
However, when running some computation-heavy process, the state becomes P2.
My question is: why is my GPU at the P0 state when idle, but switches to P2 when running a heavy computation task? Shouldn't it be the opposite?
Also, is there a way to make my GPU always run at the P0 state (maximum performance)?
It is confusing.
The nvidia-smi manual is correct, however.
When a GPU or set of GPUs are idle, the process of running nvidia-smi on a machine will usually bring one of those GPUs out of the idle state. This is due to the information that the tool is collecting - it needs to wake up one of the GPUs.
This wake up process will initially bring the GPU to P0 state (highest perf state), but the GPU driver will monitor that GPU, and eventually start to reduce the performance state to save power, if the GPU is idle or not particularly busy.
On the other hand, when the GPUs are active with a workload, the GPU driver will, according to its own heuristic, continuously adjust the performance state to deliver best performance while matching the performance state to the actual workload. If no thermal or power limits are reached, the perf state should reach its highest level (P0) for the most active and heaviest, continuous workloads.
Workloads that are periodically heavy, but not continuous, may see the GPU power state fluctuate around levels P0-P2. GPUs that are "throttled" due to thermal (temperature) or power issues may also see reduced P-states. This type of throttling is evident and reported separately in nvidia-smi, but this type of reporting may not be enabled for all GPU types.
If you want to see the P0 state on your GPU, the best advice I can offer is to run a short, heavy, continuous workload (something that does a large sgemm operation, for example), and then monitor the GPU during that workload. It should be possible to see P0 state in that situation.
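If you don't have such a workload handy, something along these lines should work: an illustrative sketch using cuBLAS, where the matrix size and iteration count are arbitrary and the numerical results are irrelevant; the point is just to keep the GPU busy long enough to watch the perf state in nvidia-smi from another terminal.

// build: nvcc -lcublas busy.cu -o busy
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192;                       // adjust to fit your GPU memory
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * n * n);
    cudaMalloc(&B, sizeof(float) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);    // contents are uninitialized; that's fine for a load generator

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // While this loop runs, watch `nvidia-smi` in another terminal;
    // the perf state should rise (P0 on most GPUs, P2 on some under compute load).
    for (int i = 0; i < 200; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}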
If you are using a machine learning application (e.g. Caffe) that is using the cuDNN library, and you are training a large network, it should be possible to see P0 from time to time, because cuDNN does operations that are something like sgemm in this scenario, typically.
But for a sporadic workload, it's quite possible that the most commonly observed state would be P2.
To "force" a P0 power state always, you can try experimenting with the persistence mode and applications clocks via the nvidia-smi tool. Use nvidia-smi --help or the man page for nvidia-smi to understand the options.
Although I don't think this will typically apply to Tesla GPUs, some NVIDIA GPUs may limit themselves to a P2 power state under compute load unless application clocks are specifically set higher. Use the nvidia-smi -a command to see the current Application clocks, the Default Application Clocks, and the Max Clocks available for your GPU. (Some GPUs, including older GPUs, may display N/A for some of these fields. That generally indicates the applications clocks are not modifiable via nvidia-smi.) If a card seems to run at the P2 state during compute load, you may be able to increase it to P0 state by increasing the application clocks to the maximum available (i.e. Max Clocks). Use nvidia-smi --help to learn how to format the command to change the application clocks on your GPU. Modifying application clocks, or enabling modifiable application clocks, may require root/admin privilege. It may also be desirable or necessary to set the GPU Persistence mode. This will prevent the driver from "unloading" during periods of GPU activity, which may cause the application clocks to be reset when the driver re-loads.
This default behavior, for the affected cards in this situation, of limiting to P2 under compute load, is by design of the GPU driver.
This somewhat related question/answer may also be of interest.
I know that from CUDA 5.5 it is possible to have high-priority kernels, but I understand that this only applies to calls issued by the same context on the GPU, i.e. it does not affect the priority of another process's kernel launches, as long as they have enough CPU time to be issued.
Is it possible to have high priority applications on the GPU, similarly to how you can set the OS to give priority to specific threads?
Thanks
Henrik Andresen
Technically, CUDA 5.5 added the ability to set individual stream priorities in the CUDA work scheduling system for compute capability 3.5 devices. This means that the CUDA driver will allow operations in a higher-priority stream to preempt the driver-level scheduling of operations in competing lower-priority streams. This can include launching of kernels, copy operations, and events to the device. I don't, however, believe that this extends to any difference in execution priority once a kernel has been launched on the device itself; it is purely a driver-side stream-scheduling heuristic, and it does not extend to preemption of kernels from lower-priority streams that are already running on the device, even when conditions for concurrent kernel execution exist.
To the best of my knowledge, there is currently no ability to extend that mechanism beyond streams within a single context, and there is no way to influence how processes competing for access to the GPU will be prioritised at the driver level. The only caveat I add to this is that CUDA 6 might have changed this, and I haven't yet had the opportunity to look at everything that is new in that release.
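For reference, within a single process/context the stream-priority mechanism being discussed is used roughly like this (a minimal sketch; numerically lower values mean higher priority, and on devices without support the queried range is simply 0,0):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work() { /* placeholder kernel */ }

int main() {
    // Query the priority range supported by the device.
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    printf("priority range: least=%d greatest=%d\n", leastPriority, greatestPriority);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPriority);

    // Pending work in highPrio is preferred by the scheduler over pending work
    // in lowPrio, but only within this process/context.
    work<<<1, 1, 0, lowPrio>>>();
    work<<<1, 1, 0, highPrio>>>();
    cudaDeviceSynchronize();

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}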
With CPU and memory it's simple.
A process has a large virtual address space, which is partially mapped into physical memory. When the current process attempts to access a page that is not in physical memory, the OS steps in, chooses a page to evict (e.g. round-robin), swaps it out to disk, then reads the required page back from swap, and control is returned to the process. This is straightforward, because the process cannot continue without that page.
GPU kernels is a different story.
Let's consider a usecase:
A high-priority [CPU] process, namely X, makes a call to a kernel (which is a blocking call). At this moment, it is reasonable for the OS to switch contexts and give the CPU to a different process, namely Z. For the sake of example, let process Z also do something heavy with the GPU.
Now, what does the GPU driver do? Does it stop the kernel that belongs to the [higher-priority] X? Does it inform the OS that Z isn't prioritized highly enough to displace X's kernels? In general, what happens when two processes need GPU resources, but the available GPU memory is sufficient to serve only one of them at a time?
CUDA GPUs context-switch cooperatively at a coarse granularity (think "memcpy" or "kernel launch"). If there is enough memory for both contexts, the hardware is happy to cooperatively context switch between them at a slight performance cost. (But because it's cooperative, long-running kernels will interfere with other kernels' execution.)
Modern GPUs do support virtual memory (i.e. memory protection through address translation), but they do NOT support demand paging. That means every piece of memory accessible to the GPU (device memory and mapped pinned memory) must be physically present and mapped after allocation.
The Windows Display Driver Model (WDDM) introduced in Windows Vista does paging at a very coarse granularity. The driver is required to track which "memory objects" are needed to execute a given command buffer, and the OS ensures that they are present. The OS can swap them out when not needed. The wrinkle with CUDA is that since pointers can be stored, all memory objects associated with the CUDA address space must be resident in order to run a CUDA kernel. So the paging doesn't work as well for CUDA as it does for graphics applications, which WDDM was designed to run.