CUDA SDK examples throw various errors in multi-GPU system

I have a Dell Precision Rack running Ubuntu Precise with two Tesla C2075 cards plus a Quadro 600 as the display device. I recently finished some tests on my desktop computer and am now trying to port the code to the workstation.
Since CUDA was not present, I installed it according to this guide and adapted the SDK Makefiles according to these suggestions.
What I am now facing is that not a single sample (I tested about ten different ones) runs. These are the errors I am getting:
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 10
-> invalid device ordinal
[deviceQuery] test results...
FAILED
> exiting in 3 seconds: 3...2...1...done!
[MonteCarloMultiGPU] starting...
CUDA error at MonteCarloMultiGPU.cpp:235 code=10 (cudaErrorInvalidDevice) "cudaGetDeviceCount(&GPU_N)"
MonteCarloMultiGPU
==================
Parallelization method = threaded
Problem scaling = weak
Number of GPUs = 0
Total number of options = 0
Number of paths = 262144
main(): generating input data...
main(): starting 0 host threads...
Floating point exception (core dumped)
[reduction] starting...
reduction.cpp(124) : cudaSafeCallNoSync() Runtime API error 10 : invalid device ordinal.
[simplePrintf] starting...
simplePrintf.cu(193) : CUDA Runtime API error 10: invalid device ordinal.
As you can see, most of the errors point to a problem with the cudaGetDeviceCount call, which returns error code 10. (That also explains the MonteCarloMultiGPU crash: with the GPU count left at 0, the sample divides by zero and dies with a floating point exception.) According to the manual, the problem is:
cudaErrorInvalidDevice: This indicates that the device ordinal supplied by the user does not correspond to a valid CUDA device.
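For reference, the failing call can be reproduced outside the SDK with a minimal runtime-API check (a sketch, assuming only the CUDA runtime is installed):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        /* on the machine above this reports error 10 (invalid device ordinal) */
        printf("cudaGetDeviceCount failed: %d (%s)\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", n);
    return 0;
}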
Unfortunately, the only solution I was able to find suggested checking the devices' power plugs. I did that and there was nothing wrong with them. Restarting the workstation does not help either.
I'd be happy to supply more details on my configuration. Just leave a comment!

Thanks to the comments on my original question, I was able to find a solution. I followed this guide to learn how to set up rc.local correctly (don't forget to chmod your script); the sketch below shows the idea.
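The relevant part boils down to loading the nvidia kernel module and creating the /dev/nvidia* device nodes at boot, since no X server runs on the compute cards. A sketch of such an /etc/rc.local, adapted from NVIDIA's getting-started guide (adjust to your setup):

#!/bin/bash
# create NVIDIA device nodes at boot when no X server runs on the Teslas
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
    # count NVIDIA controllers and create one node per device
    N3D=$(lspci | grep -i NVIDIA | grep "3D controller" | wc -l)
    NVGA=$(lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l)
    N=$((N3D + NVGA - 1))
    for i in $(seq 0 $N); do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done
    mknod -m 666 /dev/nvidiactl c 195 255
fi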

Trying to decr ref count of Tcl_Obj allocated in another thread

I compiled the SourceForge Tcl executable; it passes all the supplied tests, yet it segfaults the same way as my downloaded executable, 8.6.9. I'm running on Ubuntu 16.04 (for legacy reasons) on an AMD processor. (I have also run it on Ubuntu 18.04 on my laptop; it has the same failure.)
So next I recompiled with "--enable-symbols=mem" turned on, to see whether a memory leak is causing the segfault, and now it fails immediately with:
Trying to decr ref count of Tcl_Obj allocated in another thread
./runMeg.sh: line 3: 29972 Aborted (core dumped) ../source/main_megatron.
I'm not finding any answer on what to do with this message; can someone advise what it means I need to fix?
All my threads are of the form:
set graphDisplayThread [thread::create {
    after [expr {int(1000)}]
    .....
    puts "...Initialized graphDisplayUpdate_02 ID $c update."
    thread::wait
}]
and:
thread::send $::graphDisplayThread {
    incr b
    graphDisplayUpdate .c
}
All shared variables are referenced only AFTER the mutex is captured, and go through TSV variables. There are 5 threads in the application, which contains no C code at all, around 2000 lines of Tcl in total.
The app runs thousands of cycles and then segfaults at random points with the prior ActiveState 8.6.9 pre-compiled version. So now I'm trying to isolate the failure point with the SourceForge 8.6.9 memory checks as a first step, but the issue above is the first one I encounter, and it occurs immediately after starting.
Update (5/16/19 8:28 EST), with new detail to answer the comments below: This application has no C code in it, and the Tcl_Obj error ONLY appears in the two SourceForge-based 8.6.9 builds I compiled myself, not in the ActiveState 8.6.9 pre-built download. The error occurs in both of the twin builds I made in tandem, with and without MEM_DEBUG. Both passed all install tests.
To summarize:
SourceForge 8.6.9 compile w/ MEM_DEBUG option: Tcl_Obj abort error
SourceForge 8.6.9 compile w/o MEM_DEBUG option: Tcl_Obj abort error
ActiveState 8.6.9 build: does not abort, random segfault
Why should I trust the SourceForge build I made myself more than the ActiveState pre-built executable, which does not exhibit the problem? And if we do trust the SourceForge build, how do I isolate where the Tcl_Obj error is created by the offending Tcl code?
Update (5/16/19 13:34 EST): The same segfault appears with ActiveState 8.6.9 on Ubuntu 18.04. I haven't yet checked how my SourceForge builds behave there.
By methodically hacking out code blocks and watching whether the Tcl_Obj error disappeared, I found two errors (a sketch of the first fix follows below):
I had declared my mutex and condition variables more than once. Now each is created once and referenced everywhere else.
A code remnant removing a TSV was left in a place I no longer wanted it.
Fixing these also fixed the segfault.
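For illustration, the mutex fix amounted to the following pattern (variable names here are placeholders, not my real code):

# wrong (sketch): several procs each created their own mutex,
# so the lock never actually excluded anything:
#   set m [thread::mutex create]   ;# repeated per call site

# right: create the synchronization objects exactly once and
# reference them from everywhere else
set ::graphMutex [thread::mutex create]
set ::graphCond  [thread::cond create]

thread::mutex lock $::graphMutex
tsv::incr counters b               ;# shared state goes through TSV, under the lock
thread::cond notify $::graphCond
thread::mutex unlock $::graphMutex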
Thanks for all the help and hints, mrcalvin.

Multi-process, multi-GPU with TensorFlow on Windows

I'm a little new to TensorFlow, so please be gentle with me.
My problem is creating a second process that loads TensorFlow on a GPU that is already in use.
The error I get is:
\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
\cuda\cuda_dnn.cc:392] error retrieving driver version: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
\cuda\cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
\kernels\conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
Hardware details:
Supermicro 4028GR-TRT
8 × GTX 1080 GPUs
CUDA: 8
cuDNN: 5.1
Windows: 10
TensorFlow: 0.12.1 / 1.0.1
My PC, where everything works fine, for comparison:
Windows 7
GTX 1070 GPU
CUDA 8
cuDNN 5.1
TensorFlow 0.12.1
Can someone tell me why everything is OK on my PC but not on the big one (the Supermicro)?
Is this maybe a Windows or driver issue?
I tried updating the NVIDIA driver; that didn't help.
TensorFlow is not always good at sharing GPUs with other processes (including other instances of itself!). The typical workaround is to use the %CUDA_VISIBLE_DEVICES% environment variable to prevent the two processes from clashing over the same GPU. For example:
C:\>set CUDA_VISIBLE_DEVICES=0
C:\>python tensorflow_program_1.py
While in another command prompt you could tell TensorFlow to use a different GPU as follows:
C:\>set CUDA_VISIBLE_DEVICES=1
C:\>python tensorflow_program_2.py

Setting up GPUDirect for InfiniBand

I am trying to set up GPUDirect so that InfiniBand verbs RDMA calls can operate directly on device memory, without the need for cudaMemcpy.
I have two machines with NVIDIA K80 GPU cards, each with driver version 367.27. CUDA 8 and Mellanox OFED 3.4 are installed.
The Mellanox-NVIDIA GPUDirect plugin is also installed:
-bash-4.2$ service nv_peer_mem status
nv_peer_mem module is loaded.
According to the thread "How to use GPUDirect RDMA with Infiniband", I have all the requirements for GPUDirect, and the following code should run successfully. But it does not: ibv_reg_mr fails with the error "Bad address", as if GPUDirect were not properly installed.
void *gpu_buffer;
struct ibv_mr *mr;
const int size = 64 * 1024;

cudaMalloc(&gpu_buffer, size); // TODO: check errors
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
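Checking the registration call explicitly surfaces the underlying errno; a minimal sketch continuing the snippet above:

if (mr == NULL) {
    perror("ibv_reg_mr"); /* ibv_reg_mr returns NULL and sets errno; prints "Bad address" (EFAULT) here */
    exit(EXIT_FAILURE);
}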
Requested Info:
mlx5 is used.
Last Kernel log:
[Nov14 09:49] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 4430): umem get failed (-14)
Am I missing something? Do I need some other packages, or do I have to activate GPUDirect in my code somehow?
A common reason for the nv_peer_mem module failing is interaction with Unified Memory (UVM). Could you try disabling UVM?
export CUDA_DISABLE_UNIFIED_MEMORY=1
If this does not fix the problem, you should try running the validate and copybw tests from https://github.com/NVIDIA/gdrcopy to check GPUDirect RDMA. If those work, then your Mellanox stack is misconfigured.
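A rough sketch of that check; the build steps and script names follow the gdrcopy README of that era and may differ in newer releases:

git clone https://github.com/NVIDIA/gdrcopy.git
cd gdrcopy
make all          # build the library and the test programs
./insmod.sh       # load the gdrdrv kernel module (needed by the tests)
./validate        # basic correctness test for GPUDirect mappings
./copybw          # bandwidth test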

nsight VSE debugger error "code patching failed due to lack of code patching memory"

I am getting the following Nsight debug error while debugging a CUDA kernel. I have no idea what it means; it looks like something to do with cuFFT. Can anyone give me some pointers? Thanks.
As the error message reported by Nsight indicates, the error is caused by Nsight having insufficient memory available on the device to interactively debug the code you are running. Quoting from the Nsight documentation:
When the CUDA Memory Checker is enabled, it will consume extra memory on the GPU. If there is not enough patch RAM for the CUDA Debugger, it will give the following error:
Internal debugger error occurred while attempting to launch "KernelName - CUmodule 0x04e67f10: code patching failed due to lack of code patching memory.
If this happens, increase the patch RAM factor by going to Nsight > Options > CUDA > Code Patching Memory Factor.
This is a multiplier of the kernel's instruction size, which is added to a base patch RAM size of 64k.
Another option is to disable the shared or global memory checking, in order to use less patch RAM.
The original poster noted that increasing the code patching memory factor from 2 to 16 solved the problem. (Since the patch RAM budget is roughly the 64 KB base plus factor × kernel instruction size, raising the factor from 2 to 16 gives the debugger about eight times more per-kernel patch room.)

cuda-gdb run error:"Failed to read the valid warps mask(dev=0,sm=0,error=16)" [duplicate]

I tried to debug my CUDA application with cuda-gdb but got a weird error.
I built my application with -g -G -O0. I can run the program without cuda-gdb, but it does not produce the correct result, so I decided to use cuda-gdb. However, I get the following error message when running the program under cuda-gdb:
Error: Failed to read the valid warps mask (dev=1, sm=0, error=16).
What does it mean? Why sm=0, and what is the meaning of error=16?
Update 1: I tried using cuda-gdb on the CUDA samples, but it fails with the same problem. I just installed the CUDA 6.0 toolkit following NVIDIA's instructions. Is it a problem with my system?
Update 2:
OS: CentOS 6.5
GPUs: 1 × Quadro 400, 2 × Tesla C2070 (I'm using only one GPU for my program, but I get the same error message whichever GPU I select)
CUDA version: 6.0
GPU driver:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 331.62 Wed Mar 19 18:20:03 PDT 2014
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
Update 3:
I tried to get more information inside cuda-gdb, but I got the following results:
(cuda-gdb) info cuda devices
Error: Failed to read the valid warps mask (dev=1, sm=0, error=16).
(cuda-gdb) info cuda sms
Focus not set on any active CUDA kernel.
(cuda-gdb) info cuda lanes
Focus not set on any active CUDA kernel.
(cuda-gdb) info cuda kernels
No CUDA kernels.
(cuda-gdb) info cuda contexts
No CUDA contexts.
Actually, this issue is specific to some older NVIDIA GPUs (like the Quadro 400, GeForce GT 220, or GeForce GT 330M).
On Liam Kim's setup, cuda-gdb should work fine after setting the CUDA_VISIBLE_DEVICES environment variable so that it runs on the Tesla C2070 GPUs specifically, i.e.:
$ export CUDA_VISIBLE_DEVICES=0 (or 2)
The exact CUDA device indices can be found by running the deviceQuery CUDA sample.
This issue has now been fixed; the fix will be available to CUDA developers in the next CUDA release (expected around early July 2014).
This is an internal cuda-gdb bug; you should report it.
Can you try installing the CUDA toolkit from the package on NVIDIA's site?