Find the devices of an IOMMU group

I am using the IOMMU API for Linux and I would like to get a specific device that belongs to a group with a known group ID.
The iommu_group structure has a field for the device list, but it is not accessible. Is there a way to get it?

Please try:
find /sys/kernel/iommu_groups/ -type l
Rami Rosen
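Each group shows up in sysfs as /sys/kernel/iommu_groups/<id>/devices/, with one symlink per member device, so if you need this from C for a known group ID (rather than via find), a minimal userspace sketch could look like the following (the group ID 1 and the helper name list_iommu_group are only for illustration):
#include <dirent.h>
#include <stdio.h>

/* List the devices that belong to one IOMMU group by reading sysfs:
 * every member device appears as a symlink under
 * /sys/kernel/iommu_groups/<id>/devices/ (e.g. "0000:01:00.0"). */
static int list_iommu_group(int group_id)
{
    char path[128];
    DIR *dir;
    struct dirent *entry;

    snprintf(path, sizeof(path), "/sys/kernel/iommu_groups/%d/devices", group_id);
    dir = opendir(path);
    if (!dir) {
        perror(path);
        return -1;
    }
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')        /* skip "." and ".." */
            continue;
        printf("group %d: %s\n", group_id, entry->d_name);
    }
    closedir(dir);
    return 0;
}

int main(void)
{
    return list_iommu_group(1);             /* group ID 1 chosen only for illustration */
}
Inside the kernel itself, iommu_group_for_each_dev() appears to be the exported way to iterate over a group's devices without reaching into the private fields of struct iommu_group.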

The question is quite short: it gives neither a description of the computer nor your level of knowledge, and some time has passed since it was asked.
Well, IOMMU groups are mainly used for passing a device through from the host computer to a hosted virtual machine.
Your motherboard, BIOS, CPU, and kernel must all support virtualization, with the necessary switches enabled and modules loaded. Then you can list the PCI devices and their IOMMU grouping. With a Linux kernel > 4.2 (I use kernel 4.8 in Debian 9) you can simply type:
# dmesg |egrep group |awk '{print $NF" "$0}' |sort -n
as root to obtain a listing of PCI devices sorted by group.
There are standard and shortcut methods for unbinding the group's member devices from their kernel drivers and rebinding them to the dummy pci-stub or vfio-pci driver.
If I have told you something you already know, I am sorry; you did not tell me enough for me to know. :-)
J.

Related

Can I fix my GPU clock rate to ensure consistent profiling results?

I want to do some comparative profiling of a couple of CUDA kernels. However, one of them runs within a program which loads the GPU with more work, while the other is only running in a test harness.
For some GPUs, these circumstances mean the clock rates change (perhaps more than one kind of clock rate, because there are several). This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
Is it possible to prevent clock rates from changing due to load (or thermal conditions)?
I've looked into doing this with the nvidia-smi utility, which has a sub-command named clocks - but all that does is the following:
clocks -- Control and query clock information.
Usage: nvidia-smi clocks [options]
options include:
[-i | --id]: Enumeration index, PCI bus ID or UUID. Provide comma
separated values for more than one device
[ | --sync-boost-list]: List all synchronous boost groups
[ | --sync-boost-add]: Add a synchronous boost group
[ | --sync-boost-remove]: Remove a synchronous boost group. Provide the group id
returned from --sync-boost-list
... and it doesn't look like that's what I need. Of course, non-nvidia-smi-based solutions are welcome.
Notes:
I'm particularly interested in fixing clock rates for Quadro and Tesla cards, in case that matters.
I can be root if necessary.
Using CUDA 10.2 with its bundled driver. If absolutely necessary, I might be able to switch to a new version.
TL;DR
first, set persistence mode, e.g. nvidia-smi -i 0 -pm 1 (sets persistence mode for GPU index 0)
use an nvidia-smi command like -ac or -lgc (application clocks, lock GPU clock)
there is nvidia-smi command-line help for all of this: nvidia-smi --help
this functionality may not work on your GPU; install the latest driver, and note that some of this functionality is simply not available on certain products
these settings often require root privilege, or admin privilege on Windows
all of this description is subject to change; with some care, the command-line help for the version you are using should be instructive
LONGER:
I'm using driver 455.23.05 for this description. Some features (e.g. -lgc) may not be available in older drivers. Setting persistence mode may be necessary for some of these features, and will also help to reduce variability on application start-up. This is not intended to be an exhaustive description of the nvidia-smi tool.
SETTING APPLICATION CLOCKS:
The application clocks feature should generally be useful for the testing described. It will not force the GPU clocks to remain at the specified setting when there is no application running (AFAIK), but the clocks should attain those values "as soon as" the application starts running. It allows you to specify both gpu clock (i.e. core clock) as well as memory clock. Let's start by excerpting the command line help text for some of the important switches:
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
To get started setting application clocks, you may need to use sudo or similar on Linux for some or all of these commands. Also note, as shown above, that the requirement for elevated privilege can be turned on/off. Also important: you cannot pick any values you like for the <memory,graphics> settings pair. You must specify a pair, and furthermore the pair can only come from a list of permissible options; other choices will result in unspecified behavior. These choices can be determined with the --query-supported-clocks switch to nvidia-smi (use --help-query-supported-clocks to get command-line help on that switch), which itself requires some formatting. For example, the following command will give an exhaustive list of the valid pairs that can be passed to the -ac command:
nvidia-smi -i 0 --query-supported-clocks=mem,gr --format=csv
Once you have that list of valid pairs, you can specify one of those pairs to the application clocks command:
nvidia-smi -i 0 -ac 877,1215
(The above command, if run as root or enabled via -acp, would set the memory clock to 877MHz and the core clock to 1215MHz on my Tesla V100, for example. Note the -i switch to select the GPU to target with this command. The 877,1215 pair may not be valid on your GPU. Also note that the -acp feature is removed from drivers 465.xx and newer.)
When you are done with whatever you are doing, you may wish to reset the application clock behavior to the default behavior (GPU selects clock freqs according to its own heuristics) using -rac.
Also, a number of the pairs offered may involve "boosting" behavior. The GPU is not guaranteed to maintain all clocks exactly as you specify, if a throttling event occurs. Typical throttling events are:
GPU is consuming too much electrical power
GPU temperature is too high
The existence of an actual throttling event can be discovered using the "full" output from nvidia-smi (nvidia-smi -a); look for "clocks throttle reasons". Other useful information is available in this output, such as the default application clocks. When N/A appears in your output, it means that your GPU does not support this feature. There is a great variety of supported features across various GPU families; I won't be able to respond to questions about this.
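If you prefer to check for throttling programmatically rather than parsing nvidia-smi output, the NVML library exposes the same information; a minimal sketch (assuming a reasonably recent NVML, and checking only a couple of the reason bits) might look like:
#include <stdio.h>
#include <nvml.h>   /* link with -lnvidia-ml */

int main(void)
{
    nvmlDevice_t dev;
    unsigned long long reasons = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);                    /* GPU index 0 */
    nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);

    /* Each set bit corresponds to one throttle reason. */
    if (reasons & nvmlClocksThrottleReasonSwPowerCap)
        printf("throttled: software power cap\n");
    if (reasons & nvmlClocksThrottleReasonHwSlowdown)
        printf("throttled: hardware slowdown (power brake or thermal)\n");
    if (reasons == nvmlClocksThrottleReasonNone)
        printf("no throttling\n");

    nvmlShutdown();
    return 0;
}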
In the absence of a throttling event, and assuming your GPU supports the feature, I would expect application clocks to remain in effect throughout your application runtime. Note that if this command is specified while an application is currently running, the change in clocks may not take effect until the GPU becomes idle. You may wish to monitor GPU clocks in this case (again, using nvidia-smi). Therefore I would generally recommend using these commands when the GPU is idle. Then begin your work on the GPU after that.
LOCK GPU (CORE) CLOCK:
In many cases, the gpu core clock (core, gpu, graphics are all synonyms in this context) exhibits the most variability (for example the application clocks offered on my Tesla V100 only include a value of 877MHz for memory clock; no other choices are possible). There is a separate switch that can be used to "lock" the GPU core clock to a range of values.
-lgc --lock-gpu-clocks= Specifies <minGpuClock,maxGpuClock> clocks as a
pair (e.g. 1500,1500) that defines the range
of desired locked GPU clock speed in MHz.
Setting this will supercede application clocks
and take effect regardless if an app is running.
Input can also be a singular desired clock value
(e.g. <GpuClockValue>).
-rgc --reset-gpu-clocks
Resets the Gpu clocks to the default values.
This range is specified using a lower and upper endpoint for the range. If you wish to select a specific value only, you can specify the lower and upper endpoints both to be that value. As far as I know the range endpoints are inclusive.
For example, the following command:
nvidia-smi -i 0 -lgc 1215,1215
will "lock" the GPU core clock to 1215 MHz on my Tesla V100 GPU. As far as I know, this effect takes place immediately, even if an application is running. Most other caveats I can think of should be similar for application clocks:
choose a valid GPU core clock, as output from the --query-supported-clocks command
GPU is not guaranteed to maintain the request in the event of throttling
elevated privilege is required
reset the behavior with -rgc
As indicated in the help, this switch "overrides" previous application clocks settings with respect to core clock. Also, note that many switches come in 2 flavors, a "long" form and a "short" form. Where additional switch parameters are required, the long form often requires an = separator, the short form often requires a space separator:
nvidia-smi -i 0 -lgc 1215,1215
or
nvidia-smi -i 0 --lock-gpu-clocks=1215,1215
you generally cannot intermix this formatting:
nvidia-smi -i 0 -lgc=1215,1215
will probably report an error.
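As an aside, the same locked-clock request can apparently be issued programmatically through NVML; a hedged sketch (assuming a GPU and driver new enough to support nvmlDeviceSetGpuLockedClocks, with the helper name lock_core_clock made up here) could be:
#include <nvml.h>   /* link with -lnvidia-ml; elevated privilege required as above */

int lock_core_clock(unsigned int mhz)
{
    nvmlDevice_t dev;
    nvmlReturn_t r;

    if (nvmlInit() != NVML_SUCCESS) return -1;
    nvmlDeviceGetHandleByIndex(0, &dev);              /* GPU index 0 */

    /* Lock the core clock to a single value by using it as both the
       minimum and maximum of the allowed range. */
    r = nvmlDeviceSetGpuLockedClocks(dev, mhz, mhz);

    /* ... run your profiling work here ... */

    nvmlDeviceResetGpuLockedClocks(dev);              /* restore default behavior */
    nvmlShutdown();
    return (r == NVML_SUCCESS) ? 0 : -1;
}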
A FINAL NOTE:
This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
In my experience with T4, a possible observation is throttling. The T4 GPU is one of the lowest power datacenter-grade GPUs, and it's certainly possible for the GPU compute demands to exceed what the power limit (70W) can support. In this case, the GPU clocks will throttle, and none of the above commands will allow you to override this behavior. By design, you cannot force the GPU to operate at elevated clocks when the GPU is trying to protect itself, or protect the system it is running in.
Also, the fact that a T4 is not actively cooled really should not matter. The only approved/supported usage setting for a T4 is in a server that is designed to handle the T4. (A similar statement is true for any NVIDIA Datacenter GPU). Such servers monitor the T4 GPU temperature and provide server-delivered forced flow-through cooling to the GPU. This is by design. The server is responsible for keeping the GPU in a proper temperature operating range. If the server is not doing that, you should address that with your server vendor. If you are operating the T4 GPU in a non-approved setting (such as a non-qualified server, or a desktop/workstation) then I would generally expect the experience with that device to be dismal.
MORE RECENTLY: NVIDIA has published this blog, which covers many of the same topics. If there are discrepancies between what I have stated above and the blog, the blog should be considered the better source.

How do I make sure Vulkan is using the same GPU as CUDA?

I'm using an application that uses both vulkan and cuda (specifically pytorch) on an HPC cluster (univa grid engine).
When a job is submitted, the cluster scheduler sets an environment variable SGE_HGR_gpu which contains a GPU ID for the job to use (so other jobs run by other users do not use the same GPU)
The typical way to tell an application that uses CUDA to use a specific GPU is to set CUDA_VISIBLE_DEVICES=n
As I'm also using Vulkan, I don't know how to make sure that I choose the same device from those that are listed with vkEnumeratePhysicalDevices.
I think the order of the values that 'n' can take is the same as the order of the devices on the PCI bus; however, I don't know whether the devices returned by vkEnumeratePhysicalDevices are in this order, and the documentation does not specify what this order is.
So how can I go about making sure I'm choosing the same physical GPU for both Vulkan and CUDA?
With VkPhysicalDeviceIDPropertiesKHR (Vulkan 1.1) or VkPhysicalDeviceVulkan11Properties (Vulkan 1.2) you can get the device UUID, which is one of the formats CUDA_VISIBLE_DEVICES seems to use. You should also be able to convert an index to a UUID (or vice versa) with nvidia-smi -L (or with the NVML library).
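For example, a minimal sketch of that UUID matching (assuming Vulkan 1.1 and a CUDA runtime new enough to provide cudaDeviceGetUuid; the helper name same_device is made up for illustration) might look like:
#include <string.h>
#include <vulkan/vulkan.h>
#include <cuda_runtime_api.h>

/* Returns 1 if the given Vulkan physical device is the same hardware
   as CUDA device 'cudaDev', by comparing the 16-byte UUIDs. */
int same_device(VkPhysicalDevice phys, int cudaDev)
{
    VkPhysicalDeviceIDProperties idProps;
    VkPhysicalDeviceProperties2 props2;
    cudaUUID_t cudaUuid;

    memset(&idProps, 0, sizeof(idProps));
    idProps.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;

    memset(&props2, 0, sizeof(props2));
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &idProps;
    vkGetPhysicalDeviceProperties2(phys, &props2);   /* fills idProps.deviceUUID */

    if (cudaDeviceGetUuid(&cudaUuid, cudaDev) != cudaSuccess)
        return 0;

    return memcmp(idProps.deviceUUID, cudaUuid.bytes, VK_UUID_SIZE) == 0;
}
You would then loop over the handles returned by vkEnumeratePhysicalDevices and keep the one for which this returns true for your assigned CUDA device.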
Or the other way around: cudaDeviceProp includes PCI info, which could be compared against the output of the VK_EXT_pci_bus_info extension.
Whether the Vulkan order matches is best asked of NVIDIA directly; I cannot find information on how NV orders them. IIRC from the Vulkan Loader implementation, the order should match the order from the registry, and then the order in which the NV driver itself reports them. Even so, you would have to filter non-NV GPUs from the list in generic code, and you do not know whether the NV Vulkan ICD implementation matches CUDA without asking NV.

Cuda Compute Mode and 'CUBLAS_STATUS_ALLOC_FAILED'

I have a host in our cluster with 8 Nvidia K80s and I would like to set it up so that each device can run at most 1 process. Before, if I ran multiple jobs on the host and each use a large amount of memory, they would all attempt to hit the same device and fail.
I set all the devices to compute mode 3 (E. Process) via nvidia-smi -c 3 which I believe makes it so that each device can accept a job from only one CPU process. I then run 2 jobs (each of which only takes about ~150 MB out of 12 GB of memory on the device) without specifying cudaSetDevice, but the second job fails with ERROR: CUBLAS_STATUS_ALLOC_FAILED, rather than going to the second available device.
I am modeling my assumptions off of this site's explanation and was expecting each job to cascade onto the next device, but it is not working. Is there something I am missing?
UPDATE: I ran Matlab using gpuArray in multiple different instances, and it is correctly cascading the Matlab jobs onto different devices. Because of this, I believe I am correctly setting up the compute modes at the OS level. Aside from cudaSetDevice, what could be forcing my CUDA code to lock into device 0?
This relies on an officially undocumented behavior (or else prove me wrong and point out the official documentation, please) of the CUDA runtime that would, when a device is set to an Exclusive compute mode, automatically select another available device when one is in use.
The CUDA runtime apparently enforced this behavior but it was "broken" in CUDA 7.0.
My understanding is that it should have been "fixed" again in CUDA 7.5.
My guess is you are running CUDA 7.0 on those nodes. If so, I would try updating to CUDA 7.5, or else revert to CUDA 6.5 if you really need this behavior.
Rather than relying on this, it is suggested that you use an external means, such as a job scheduler (e.g. Torque), to manage resources in a situation like this.
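If you would rather not depend on that undocumented fallback at all, one alternative (a rough sketch, not an official recipe; the helper name select_free_device is made up here) is to walk the devices yourself and keep the first one on which a context can actually be created; on an Exclusive Process device that is already occupied, the attempt fails and you move on:
#include <cuda_runtime.h>

/* Try each visible device in turn; cudaFree(0) forces context creation,
   which fails on an Exclusive Process device already owned by another
   process. Returns the selected device ordinal, or -1 if none is free. */
int select_free_device(void)
{
    int count = 0;
    int dev;

    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        if (cudaFree(0) == cudaSuccess)       /* context established? */
            return dev;
        cudaGetLastError();                   /* clear the error and try the next device */
    }
    return -1;
}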

How to use GPUDirect RDMA with Infiniband

I have two machines. There are multiple Tesla cards on each machine. There is also an InfiniBand card on each machine. I want to communicate between GPU cards on different machines through InfiniBand. Just point to point unicast would be fine. I surely want to use GPUDirect RDMA so I could spare myself of extra copy operations.
I am aware that there is a driver available now from Mellanox for its InfiniBand cards, but it doesn't offer a detailed development guide. I am also aware that OpenMPI has support for the feature I am asking about, but OpenMPI is too heavyweight for this simple task and it does not support multiple GPUs in a single process.
I wonder if I could get any help with directly using the driver to do the communication. Code sample, tutorial, anything would be good. Also, I would appreciate it if anyone could help me find the code dealing with this in OpenMPI.
For GPUDirect RDMA to work, you need the following installed:
Mellanox OFED installed (from http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers )
Recent NVIDIA CUDA suite installed
Mellanox-NVIDIA GPUDirect plugin (from the link you gave above - posting as guest prevents me from posting links :( )
All of the above should be installed (in the order listed above), and the relevant modules loaded.
After that, you should be able to register memory allocated on the GPU video memory for RDMA transactions. Sample code will look like:
// 'pd' is a struct ibv_pd * protection domain obtained earlier via ibv_alloc_pd().
void *gpu_buffer;
struct ibv_mr *mr;
const int size = 64*1024;
cudaMalloc(&gpu_buffer, size);   // TODO: check errors
// Register the GPU allocation with the HCA; on a GPUDirect RDMA enabled system
// this pins the GPU memory and returns a memory region usable in verbs operations.
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
This will create (on a GPUDirect RDMA enabled system) a memory region, with a valid memory key that you can use for RDMA transactions with our HCA.
For more details about using RDMA and InfiniBand verbs in your code, you can refer to this document.

restrict OpenCL access to Intel CPU?

It is currently possible to restrict OpenCL access to an NVIDIA GPU on Linux using the CUDA_VISIBLE_DEVICES env variable. Is anyone aware of a similar way to restrict OpenCL access to Intel CPU devices? (Motivation: I'm trying to force users of a compute server to run their OpenCL programs through SLURM exclusively.)
One possibility is to link directly to the Intel OpenCL library (libintelocl.so on my system) instead of going through the OpenCL ICD loader.
In pure OpenCL, the way to avoid assigning tasks to the CPU is to not select it (as platform or device). clGetDeviceIDs can do that using the device_type argument (don't set the CL_DEVICE_TYPE_CPU bit).
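As a minimal sketch of that selection (error handling mostly omitted): asking each platform only for GPU-type devices means the Intel CPU device is simply never returned.
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    cl_uint p;

    clGetPlatformIDs(8, platforms, &numPlatforms);

    for (p = 0; p < numPlatforms; ++p) {
        cl_device_id devices[8];
        cl_uint numDevices = 0;
        /* Request only GPU devices: CL_DEVICE_TYPE_CPU is not in the mask,
           so the Intel CPU device can never be picked here. */
        cl_int err = clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU,
                                    8, devices, &numDevices);
        if (err != CL_SUCCESS)   /* e.g. CL_DEVICE_NOT_FOUND on a CPU-only platform */
            continue;
        printf("platform %u: %u GPU device(s)\n", p, numDevices);
    }
    return 0;
}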
At the ICD level, I guess you could exclude the CPU driver if it's Intel's implementation; for AMD, it gets a little trickier since they have one driver for both platforms (it seems the CPU_MAX_COMPUTE_UNITS environment variable can restrict it to one core, but not disable it).
If the goal is to restrict OpenCL programs to running through a specific launcher, such as slurm, one way might be to add a group for that launcher and just make the OpenCL ICD vendor files in /etc/OpenCL (and possibly driver device nodes) usable only by that group.
None of this would prevent a user from having their own OpenCL implementation in place to run on CPU, but it could be enough to guide them to not run there by mistake.