ERR_NVGPUCTRPERM error when launching nvprof with all metrics to profile CUDA application

GPU: Tesla M60
Driver: 510.47.03
OS: Ubuntu 20.04.5 LTS
CUDA Version: 11.6
Running the command below to collect the full set of metrics while profiling a CUDA application results in the error below.
Code
nvprof --metrics all ./myapp
Error
==8169== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
I tried using sudo as suggested, but sudo was unable to find the nvprof program.

The easiest solution is to run the profiler as root, as below. Note that it may be necessary to use the fully qualified path to nvprof, since /usr/local/cuda/bin is typically not on sudo's search path.
sudo /usr/local/cuda/bin/nvprof --metrics all ./myapp
There are more permanent solutions available, as described at https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters, such as granting non-root users access to the GPU performance counters via an NVIDIA kernel module option set through modprobe. However, I was not able to get these to work.
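For reference, a sketch of the approach NVIDIA's page documents (the modprobe.d file name here is arbitrary, and again, this did not work in my case):
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvidia-profiling.conf
sudo update-initramfs -u   # rebuild the initramfs so the option is picked up
sudo reboot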

Related

Errors thrown when trying to run basic.sh in sosumi

I was hoping that you could help me. I've been stuck on this problem for quite a while.
When I try to start up the Clover boot loader or run the basic.sh file, I get these errors in the terminal:
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.1 [bit 19]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.2 [bit 20]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.movbe [bit 22]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.aes [bit 25]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.xsave [bit 26]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.avx [bit 28]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.bmi1 [bit 3]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.avx2 [bit 5]
etc.
I have no idea what they mean. Could you please tell me a solution? I've tried uninstalling and reinstalling manually. It didn't work and it threw these errors at me again. I followed the instructions in the readme: https://github.com/foxlet/macOS-Simple-KVM
QEMU and all of its dependencies are installed on my computer.
When I run the Clover boot loader, it just shows a bunch of text and then brings me back to the menu. I hit Enter again; last time I kept ending up in the shell, and I don't know why.
Why does it keep crashing? Could you please tell me how to fix it?
This is the second time I'm struggling with this, please help.
UPDATE: I tried using this repo: https://github.com/kholia/OSX-KVM and got the same errors. It's still not working.
The shell script you're running starts QEMU asking it to provide a guest CPU with various features (including SSE4, AVX and AVX2). With KVM, the only way we can give the guest a CPU with a feature like AVX is if the host CPU has it, because we run guest code directly on the host CPU. QEMU is warning you that you asked for something it can't do, because the host CPU you're running it on doesn't have those features. QEMU removes the features it can't provide from the set of things it tells the guest about via the CPUID registers.
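You can confirm which of these features the host CPU actually has by inspecting /proc/cpuinfo (a standard Linux check; the flag names there use underscores, e.g. sse4_1):
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'sse4_1|sse4_2|movbe|aes|xsave|avx2?|bmi1'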
If the guest OS really needs a CPU with AVX2 and all the rest of it, you need to run on a newer host CPU.
If the guest OS is happy to read the CPUID registers and adjust itself to avoid using features that aren't there, then you could adjust the -cpu options the script is passing to make it request something with fewer features, but all this will do is mean that QEMU won't print the warnings -- it won't change how the guest runs on that kind of CPU.
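For illustration, this is the kind of change that silences the warnings; the exact -cpu line in your copy of basic.sh may differ, so treat the feature list below as an assumption:
# hypothetical -cpu line as a script like basic.sh might pass it:
-cpu Penryn,kvm=on,vendor=GenuineIntel,+sse4.1,+sse4.2,+avx,+avx2
# dropping the "+feature" flags the host lacks stops the warnings
# (it does not make the guest run any better on this CPU):
-cpu Penryn,kvm=on,vendor=GenuineIntel
# "-cpu host" exposes exactly what the host offers, though macOS guests
# tend to be picky about which CPU models they will boot on:
-cpu host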

Why am I getting this cuda out of memory error on my new computer and not on the old one?

I'm currently trying to extract SIFT Features with the following package:
https://github.com/Celebrandil/CudaSift
It comes with a CMakeLists.txt, which I modified. Here it is:
cmake_minimum_required(VERSION 2.6)
project(cudaSift)
set(cudaSift_VERSION_MAJOR 2)
set(cudaSift_VERSION_MINOR 0)
set(cudaSift_VERSION_PATCH 0)
set(CPACK_PACKAGE_VERSION_MAJOR "${cudaSift_VERSION_MAJOR}")
set(CPACK_PACKAGE_VERSION_MINOR "${cudaSift_VERSION_MINOR}")
set(CPACK_PACKAGE_VERSION_PATCH "${cudaSift_VERSION_PATCH}")
set(CPACK_GENERATOR "ZIP")
include(CPack)
find_package(OpenCV REQUIRED)
find_package(CUDA)
if (NOT CUDA_FOUND)
message(STATUS "CUDA not found. Project will not be built.")
endif(NOT CUDA_FOUND)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -msse2 ")
list(APPEND CUDA_NVCC_FLAGS "-lineinfo;-ccbin;/usr/bin/gcc-7;--compiler-options;-O2;-D_FORCE_INLINES;-DVERBOSE_NOT; -arch=sm_75")
cuda_add_library(cudaSift SHARED
src/cudaImage.cu
src/cudaSiftH.cu
src/matching.cu
src/geomFuncs.cpp
src/mainSift.cpp
)
target_link_libraries(cudaSift ${CUDA_cudadevrt_LIBRARY} ${OpenCV_LIBS})
set(PUBLIC_HEADERS include/cudaImage.h include/cudaSift.h)
set_target_properties(cudaSift PROPERTIES PUBLIC_HEADER
"${PUBLIC_HEADERS}"
)
include(GNUInstallDirs)
install(TARGETS cudaSift
LIBRARY DESTINATION "${CMAKE_INSTALL_LIBDIR}"
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
configure_file(cudaSift.pc.in cudaSift.pc @ONLY)
install(FILES ${CMAKE_BINARY_DIR}/cudaSift.pc DESTINATION ${CMAKE_INSTALL_DATAROOTDIR}/pkgconfig)
My GPU is a GeForce RTX 2060, driver version 430.5, and after running:
mkdir build && cd build
cmake ..
sudo make -j
sudo make install
sudo ldconfig
in order to build the package, I try to run my code and get the following error:
safeCall() Runtime API error in file </path/to/CudaSift/src/cudaImage.cu>, line 24 : out of memory.
Precisions:
I run the exact same code on another computer, which has a GeForce GTX 1050, only changing -arch=sm_75 to -arch=sm_61 in CMakeLists.txt, and it executes just fine.
From previous questions, I think this is a compilation problem linked to the -arch=sm_** value, but I changed it and it still doesn't work.
The objects I'm passing to my GPU are images, which I'm sure aren't too big, since the same code works on my other computer, whose GPU has less memory.
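For reference, the -arch values above do match the hardware (a mismatch usually produces an invalid-device-function or no-kernel-image error rather than out of memory). One way to double-check the compute capability, assuming the CUDA samples are installed in the default location:
# GeForce RTX 2060 -> compute capability 7.5 -> -arch=sm_75
# GeForce GTX 1050 -> compute capability 6.1 -> -arch=sm_61
cd /usr/local/cuda/samples/1_Utilities/deviceQuery && sudo make
./deviceQuery | grep -i capability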
UPDATE:
I found the problem: the package was actually compiled properly.
A TensorFlow model was being loaded in the same code, and after removing it, the error didn't happen again.
I don't know why, though; maybe it reserved a lot of GPU memory?
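For what it's worth, TensorFlow by default reserves most of the free GPU memory as soon as it initializes, which would explain the out-of-memory error in unrelated CUDA code. Two quick checks/workarounds (the environment variable is honored by recent TensorFlow releases; verify it for your version):
watch -n 1 nvidia-smi                    # watch GPU memory while the app runs
export TF_FORCE_GPU_ALLOW_GROWTH=true    # ask TensorFlow to allocate lazily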

How to Install the CUDA Driver for TensorFlow (installing from source)

I'm trying to build TensorFlow from source and run it with GPU support. To install the toolkit I used the runfile; to install the driver I used the Additional Drivers tool, since I could not get Ubuntu to boot into text mode as specified in the CUDA documentation, and stopping/starting lightdm does not work either. It gives me (also with sudo):
Name com.ubuntu.Upstart does not exist
So far I have been able to build a release from the TensorFlow repository. However, when I try to run the example as specified in the how-to
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
the GPU apparently cannot be found:
jonas@jonas-Aspire-V5-591G:~/Documents/repos/tensoflow_fork$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 GCC version: gcc version 4.9.2 (Ubuntu 4.9.2-10ubuntu13) """
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])
Aborted
I'm using a clean Ubuntu 15.04 installation on an Acer Notebook with the GTX950M.
Can anybody tell me how to properly install the driver?
Can you run deviceQuery (it comes with the CUDA installation)? Do you see the NVIDIA device in lspci/lsmod/nvidia-smi?
lsmod |grep nvidia
dmesg | grep -i nvidia
lspci | grep -i nvidia
nvidia-smi
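To build and run deviceQuery, assuming the samples were installed to the default location:
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery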
You can also reload the nvidia kernel module and look for error messages:
sudo modprobe -r nvidia
sudo modprobe nvidia
dmesg | tail
sudo dmesg | grep NVRM
Related issue https://github.com/tensorflow/tensorflow/issues/601

CUDA 7.0 Error while compiling samples

I'm trying to install CUDA 7.0 on Ubuntu 14.04. I've followed the installation instructions as outlined here. Specifically, I've followed steps in section 3.6 and Chapter 6. While compiling the examples (Section 6.2.2.2) using make, I'm getting the following error:
make[1]: Entering directory `/usr/local/cuda-7.0/samples/3_Imaging/cudaDecodeGL'
/usr/local/cuda-7.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_20,
code=compute_20 -o cudaDecodeGL FrameQueue.o ImageGL.o VideoDecoder.o
VideoParser.o VideoSource.o cudaModuleMgr.o cudaProcessFrame.o
videoDecodeGL.o -L../../common/lib/linux/x86_64 -L/usr/lib/"nvidia-346"
-lGL -lGLU -lX11 -lXi -lXmu -lglut -lGLEW -lcuda -lcudart -lnvcuvid
/usr/bin/ld: cannot find -lnvcuvid
collect2: error: ld returned 1 exit status
make[1]: *** [cudaDecodeGL] Error 1
make[1]: Leaving directory `/usr/local/cuda-7.0/samples/3_Imaging/cudaDecodeGL'
make: *** [3_Imaging/cudaDecodeGL/Makefile.ph_build] Error 2
If you notice, there is -L/usr/lib/"nvidia-346". In my case, I have nvidia-349 installed. What worked for me was to edit NVIDIA_CUDA-7.0_Samples/3_Imaging/cudaDecodeGL/findgllib.mk and change UBUNTU_PKG_NAME = "nvidia-346" to nvidia-349.
In order to properly install CUDA 7.0 on Ubuntu 14.04, you need an NVIDIA driver of version 346 or higher.
If you're using the .deb installation method, the NVIDIA graphics driver is installed automatically.
If you used the .run file installation method and chose not to install the NVIDIA driver, you can manually install the driver afterwards through the package manager:
sudo apt-add-repository ppa:xorg-edgers/ppa && sudo apt-get update
sudo apt-get install nvidia-346 nvidia-346-dev nvidia-346-uvm libcuda1-346 nvidia-libopencl1-346 nvidia-icd-346
In my case, I installed nvidia-352 afterwards due to a bug in nvidia-346 and I stumbled upon the same error.
andoum's approach of manually changing the hard-coded UBUNTU_PKG_NAME = "nvidia-346" to UBUNTU_PKG_NAME = "nvidia-352" in NVIDIA_CUDA-7.0_Samples/3_Imaging/cudaDecodeGL/findgllib.mk worked fine for me.
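A one-liner to apply that edit, assuming the samples live under ~/NVIDIA_CUDA-7.0_Samples and nvidia-352 is the installed driver (adjust both to your setup):
sed -i 's/UBUNTU_PKG_NAME = "nvidia-346"/UBUNTU_PKG_NAME = "nvidia-352"/' ~/NVIDIA_CUDA-7.0_Samples/3_Imaging/cudaDecodeGL/findgllib.mk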
I met the same issue; the solution is to put the NVIDIA library path into the system environment:
sudo gedit /etc/environment
and add this path to the environment:
LIBRARY_PATH=/usr/lib/your_nvidia_edition:$LIBRARY_PATH
In fact, I encountered this problem when running make. I had installed CUDA 8.0 under Ubuntu 16.04. The problem had been confusing me for several weeks, and after reviewing many suggestions found via Google I was almost ready to reinstall Ubuntu, but I finally solved it myself.
First of all, replace every UBUNTU_PKG_NAME = "nvidia-3xx" with the version of the NVIDIA driver you actually have installed, as recommended above. Then you will probably get a compile error when you run make again. In my case, I had link errors like:
/usr/bin/ld: warning: libGLX.so.0, needed by /usr/lib/nvidia-375/libGL.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libGLdispatch.so.0, needed by /usr/lib/nvidia-375/libGL.so, not found (try using -rpath or -rpath-link)
....
or whatever other missing-link errors appear. Locate the missing files, e.g.:
$ locate libGLX.so.0
/usr/lib/nvidia-375/libGLX.so.0
/usr/lib32/nvidia-375/libGLX.so.0
$ locate libGLdispatch.so.0
/usr/lib/nvidia-375/libGLdispatch.so.0
/usr/lib32/nvidia-375/libGLdispatch.so.0
The error above is probably caused by the linker not finding these files in the default CUDA library paths, so you just need to copy the missing files to /usr/lib/nvidia-3xx/ (the actual path in your case), and this should work (it worked in my case). If it doesn't, you could try symlinking each missing file to the location where it is expected:
$ sudo ln -s <existing file> <expected location>
Hope this will help.

How to prevent syslogging "Error inserting nvidia" on cudaGetDeviceCount()?

I have a tool that can run on both GPU and CPU. In an init step I check cudaGetDeviceCount() for the available GPUs. If the tool is executed on a node without video cards, this results in the following syslog message:
Sep 13 00:21:10 [...] NVRM: No NVIDIA graphics adapter found!
How can I prevent the NVIDIA driver from flooding my syslog server with this message? It's OK that the node doesn't have a video card; it's not critical, so I just want to get rid of the message.
That message gets inserted into the syslog by the NVIDIA driver. So the most direct solution would be to not install the NVIDIA driver on a node that does not have a GPU.
If you need some NVIDIA driver components on that node, for example to build CUDA driver API code on a GPU-less login node, then you will need to use some special switches during driver installation.
You can find out more about driver install switches by using the --help switch on the driver installer package.
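For example (on the installers I have used, --advanced-options prints the full switch list; verify against your installer version):
sh NVIDIA-Linux-x86_64-319.72.run --help
sh NVIDIA-Linux-x86_64-319.72.run --advanced-options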
A sequence of switches like this may do the trick:
sudo sh NVIDIA-Linux-x86_64-319.72.run --no-nvidia-modprobe --no-kernel-module --no-kernel-module-source -z