Setting up GPUDirect for InfiniBand - CUDA

I am trying to set up GPUDirect so that InfiniBand verbs RDMA calls can operate directly on device memory, without needing cudaMemcpy.
I have two machines, each with an NVIDIA K80 GPU and driver version 367.27. CUDA 8 and Mellanox OFED 3.4 are installed.
The Mellanox-NVIDIA GPUDirect plugin (nv_peer_mem) is also installed:
-bash-4.2$ service nv_peer_mem status
nv_peer_mem module is loaded.
According to the thread "How to use GPUDirect RDMA with Infiniband",
I have all the requirements for GPUDirect, and the following code should run successfully. But it does not: ibv_reg_mr fails with the error "Bad address", as if GPUDirect were not properly installed.
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

void *gpu_buffer;
struct ibv_mr *mr; // pd below is an already-created ibv_pd (protection domain)
const int size = 64*1024;
cudaMalloc(&gpu_buffer, size); // TODO: check the returned cudaError_t
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE|IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_REMOTE_READ);
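(For reference: ibv_reg_mr returns NULL on failure and sets errno, and "Bad address" is strerror(EFAULT); the -14 in the kernel log below is -EFAULT, the same error. A minimal check after the call above, assuming <stdio.h>, <string.h> and <errno.h> are included:)
if (mr == NULL) {
    // EFAULT (-14) prints as "Bad address"
    fprintf(stderr, "ibv_reg_mr failed: %s\n", strerror(errno));
}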
Requested Info:
mlx5 is used.
Last kernel log message:
[Nov14 09:49] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 4430): umem get failed (-14)
Am I missing something? Do I need some other packages, or do I have to activate GPUDirect in my code somehow?

A common reason for the nv_peer_mem module failing is interaction with Unified Memory (UVM). Could you try disabling UVM with:
export CUDA_DISABLE_UNIFIED_MEMORY=1
If this does not fix your problem, try running the validation and copybw tests from https://github.com/NVIDIA/gdrcopy to check GPUDirect RDMA itself. If those work, then your Mellanox stack is misconfigured.
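That check might look like the following (a sketch; the test binary names here come from older gdrcopy releases and may differ in current ones, so see the repo's README):
git clone https://github.com/NVIDIA/gdrcopy
cd gdrcopy
make
./insmod.sh   # load the gdrdrv kernel module
./validate    # correctness test
./copybw      # bandwidth test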


Errors thrown when trying to run basic.sh in sosumi

I was hoping you could help me; I've been stuck on this problem for quite a while.
When I try to start the Clover boot loader or run the basic.sh file, I get these errors in the terminal:
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.1 [bit 19]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.2 [bit 20]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.movbe [bit 22]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.aes [bit 25]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.xsave [bit 26]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.avx [bit 28]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.bmi1 [bit 3]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.avx2 [bit 5]
etc.
I have no idea what they mean. Could you please tell me a solution? I've tried uninstalling and reinstalling manually; it didn't work and threw these errors at me again. I followed the instructions in the readme: https://github.com/foxlet/macOS-Simple-KVM
QEMU and all of its dependencies are installed on my computer.
When I run the Clover boot loader, it just shows a bunch of text and then brings me back to the menu. I hit enter again; last time I kept ending up in the shell, and I don't know why.
Why does it keep crashing? Could you please tell me how to fix it?
This is the second time I'm struggling with this; please help.
UPDATE: I tried using this repo: https://github.com/kholia/OSX-KVM and got the same errors. It's still not working.
The shell script you're running starts QEMU asking it to provide a guest CPU with various features (including SSE4, AVX and AVX2). With KVM, the only way we can give the guest a CPU with a feature like AVX is if the host CPU has it, because we run guest code directly on the host CPU. QEMU is warning you that you asked for something it can't do, because the host CPU you're running it on doesn't have those features. QEMU removes the features it can't provide from the set of things it tells the guest about via the CPUID registers.
If the guest OS really needs a CPU with AVX2 and all the rest of it, you need to run on a newer host CPU.
If the guest OS is happy to read the CPUID registers and adjust itself to avoid using features that aren't there, then you could adjust the -cpu options the script passes so that it requests something with fewer features; but all that will do is silence the warnings, not change how the guest runs on that kind of CPU.
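A quick way to see which of the warned-about features your host CPU actually has (a sketch using /proc/cpuinfo; note that -cpu host asks QEMU for exactly the host's feature set, though macOS guests can be picky about the CPU model):
# list which of the requested features the host CPU reports
grep -o -w -E 'sse4_1|sse4_2|movbe|aes|xsave|avx|avx2|bmi1' /proc/cpuinfo | sort -u
# requesting only what the host has silences the warnings, e.g.:
#   qemu-system-x86_64 -enable-kvm -cpu host ...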

CUDA: How to detect shared memory bank conflicts on devices with compute capability > 7.2?

On devices with compute capability <= 7.2, I always use
nvprof --events shared_st_bank_conflict
but when I run it on an RTX 2080 Ti with CUDA 10, it returns
Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability greater than 7.2
So how can I detect whether there are shared memory bank conflicts on these devices?
I've installed NVIDIA Nsight Systems and Nsight Compute, but found no such profiling report...
Thanks.
You can use --metrics:
Either
nv-nsight-cu-cli --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum
for conflicts when reading (loading) from shared memory, or
nv-nsight-cu-cli --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum
for conflicts when writing (storing) to shared memory.
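Both metrics can also be collected in a single run; a sketch, with ./my_app standing in for your application (newer Nsight Compute versions install the same CLI as ncu):
nv-nsight-cu-cli --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum ./my_app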
It seems this is a known issue, and it is addressed in this post on the NVIDIA forums. Apparently it should be supported using one of the Nsight tools (either the CLI or the UI).

Nsight VSE debugger error "code patching failed due to lack of code patching memory"

I am getting the following Nsight debug error when debugging a CUDA kernel with Nsight. I have no idea what it means; it looks like something to do with cuFFT. Can anyone give me some pointers? Thanks.
As the error message reported by Nsight clearly indicates, the error is caused by Nsight having insufficient available memory on the device to interactively debug the code you are running. Quoting from the Nsight documentation:
When the CUDA Memory Checker is enabled, it will consume extra memory
on the GPU. If there is not enough patch RAM for the CUDA Debugger, it
will give the following error:
Internal debugger error occurred while attempting to launch "KernelName - CUmodule 0x04e67f10": code patching failed due to lack of code patching memory.
If this happens, increase the patch RAM factor by going to Nsight >
Options > CUDA > Code Patching Memory Factor.
This is a multiplier of the kernel's instruction size, which is added
to a base patch RAM size of 64k.
Another option is to disable the shared or global memory checking, in
order to use less patch RAM.
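As a rough worked example of what the factor means (assuming it applies exactly as described above): for a kernel with 100 KB of instructions, a factor of 2 gives roughly 64 KB + 2 × 100 KB = 264 KB of patch RAM, while a factor of 16 gives roughly 64 KB + 16 × 100 KB = 1664 KB.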
The original poster noted that increasing the code patching memory factor from 2 to 16 solved the problem.

How to prevent syslogging "Error inserting nvidia" on cudaGetDeviceCount()?

I have a tool that can run on both GPU and CPU. In an init step I call cudaGetDeviceCount() to check for available GPUs. If the tool is executed on a node without video cards, this results in the following syslog message:
Sep 13 00:21:10 [...] NVRM: No NVIDIA graphics adapter found!
How can I prevent the NVIDIA driver from flooding my syslog server with this message? It's OK that the node doesn't have a video card; it's not critical, so I just want to get rid of the message.
That message gets inserted into the syslog by the NVIDIA driver. So the most direct solution would be to not install the NVIDIA driver on a node that does not have a GPU.
If you need some NVIDIA driver components on that node, for example to build CUDA driver API codes on a GPU-less login node, then you will need to use some special switches during driver installation.
You can find out more about driver install switches by using the --help switch on the driver installer package.
A sequence of switches like this may do the trick:
sudo sh NVIDIA-Linux-x86_64-319.72.run --no-nvidia-modprobe --no-kernel-module --no-kernel-module-source -z
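Separately, the init-step check itself can treat the no-GPU case as a normal outcome. This will not stop the driver's syslog line (that comes from the kernel module probing for hardware), but a minimal sketch of a graceful check looks like:
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    // On a node without GPUs this typically returns cudaErrorNoDevice
    // (or cudaErrorInsufficientDriver if no driver is present at all);
    // either way, fall back to the CPU path instead of treating it as fatal.
    if (err != cudaSuccess || n == 0) {
        printf("No usable GPU (%s); using CPU path.\n", cudaGetErrorString(err));
        return 0;
    }
    printf("Found %d GPU(s).\n", n);
    return 0;
}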

CUDA SDK examples throw various errors in multi-gpu system

I have a Dell Precision rack server running Ubuntu Precise, featuring two Tesla C2075s plus a Quadro 600, which is the display device. I recently finished some tests on my desktop computer and am now trying to port the code to the workstation.
Since CUDA was not present, I installed it according to this guide and adapted the SDK Makefiles according to these suggestions.
What I am now facing is that not a single sample (I tested about ten different ones) runs. These are the errors I am getting:
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 10
-> invalid device ordinal
[deviceQuery] test results...
FAILED
> exiting in 3 seconds: 3...2...1...done!
[MonteCarloMultiGPU] starting...
CUDA error at MonteCarloMultiGPU.cpp:235 code=10 (cudaErrorInvalidDevice) "cudaGetDeviceCount(&GPU_N)"
MonteCarloMultiGPU
==================
Parallelization method = threaded
Problem scaling = weak
Number of GPUs = 0
Total number of options = 0
Number of paths = 262144
main(): generating input data...
main(): starting 0 host threads...
Floating point exception (core dumped)
[reduction] starting...
reduction.cpp(124) : cudaSafeCallNoSync() Runtime API error 10 : invalid device ordinal.
[simplePrintf] starting...
simplePrintf.cu(193) : CUDA Runtime API error 10: invalid device ordinal.
As you can see, most of the errors point to a problem with the cudaGetDeviceCount call, which returns error code 10. According to the manual, the problem is:
cudaErrorInvalidDevice: This indicates that the device ordinal supplied by the user does not correspond to a valid CUDA device.
Unfortunately, the only solution I was able to find suggested checking the devices' power plugs. I did that and there was nothing wrong with them. Restarting the workstation does not help either.
I'd be happy to supply more details on my configuration. Just leave a comment!
Thanks to the comments on my original question, I was able to find a solution: I followed this guide to learn how to set up rc.local correctly (don't forget to chmod your script).
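For reference, the gist of that fix is creating the /dev/nvidia* device nodes at boot on a machine where no X server is running on the compute GPUs to create them for you. A sketch along the lines of the device-node script in NVIDIA's CUDA Getting Started Guide:
#!/bin/bash
# Load the NVIDIA kernel module and create a device node per GPU,
# since no X server is running on the compute GPUs to do it.
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the NVIDIA controllers (3D + VGA) on the PCI bus.
  NVDEVS=$(lspci | grep -i NVIDIA)
  N3D=$(echo "$NVDEVS" | grep -c "3D controller")
  NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")
  N=$((N3D + NVGA - 1))
  for i in $(seq 0 "$N"); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
fi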