I'm trying to run the nvprof profiler to see where my program spends most of its time.
But I always get this error:
======== NVPROF is profiling gpu_stuff...
======== Command: gpu_stuff
======== Error: Internal profiler error 15:120
======== Warning: Application returned non-zero code 255
======== Error: failed to read result file.
======== Warning: make sure cudaDeviceReset() is called before application exit to flush profile data.
I am calling cudaDeviceReset() at the end of my code, but the error persists.
Note: I have no X server available, so I need to use the profiler from the command line.
Thanks to Yu Zhou
This happens because your CUDA toolkit version is not compatible with your driver version.
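One rough way to check for such a mismatch is to compare the toolkit release reported by `nvcc --version` with the "CUDA Version" shown in the `nvidia-smi` header (older drivers may not print it). The sketch below assumes GNU `sort -V` is available; the function names `version_le` and `check_cuda_versions` are made up for illustration, and the rule "toolkit must not be newer than what the driver supports" is a simplification that ignores compatibility packages.

```shell
#!/usr/bin/env bash
# Sketch: compare the toolkit version with the newest CUDA version the driver supports.

# version_le A B: succeeds when version A <= version B (relies on GNU sort -V).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# check_cuda_versions TOOLKIT DRIVER: prints a verdict for two version strings.
check_cuda_versions() {
    if version_le "$1" "$2"; then
        echo "ok: toolkit $1 <= driver-supported $2"
    else
        echo "mismatch: toolkit $1 is newer than driver-supported $2"
    fi
}

# Query the real tools only when both are installed.
if command -v nvcc >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    toolkit=$(nvcc --version | sed -n 's/.*release \([0-9.]*\).*/\1/p')
    driver=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
    if [ -n "$toolkit" ] && [ -n "$driver" ]; then
        check_cuda_versions "$toolkit" "$driver"
    else
        echo "could not parse one of the version strings"
    fi
fi
```

If this reports a mismatch, installing a toolkit that matches the driver (or updating the driver) is the thing to try first.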
GPU: Tesla M60
Driver: 510.47.03
OS: Ubuntu 20.04.5 LTS
CUDA Version: 11.6
Running the command below to get full metrics when profiling a CUDA application results in the error below.
Code
nvprof --metrics all ./myapp
Error
==8169== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
I tried using sudo as suggested but was unable to find the nvcc program.
The easiest solution is to run the profiler as root, as below. Note that you may need to give the fully qualified path to the binary if the CUDA bin directory is not on sudo's path.
sudo /usr/local/cuda/bin/nvprof --metrics all ./myapp
There are more permanent solutions available as per https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters such as changing permission settings with modprobe. However, I was not able to get these to work.
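For reference, the modprobe approach described on that page comes down to one driver option plus a check of the driver's parameter listing. The sketch below only builds the option line and inspects a parameter file; the helper names are hypothetical, and I have not confirmed that this works on my setup.

```shell
# Sketch only: the helper names are made up for illustration.

# Prints the modprobe option line that lifts the admin-only profiling restriction.
profiling_conf_line() {
    echo "options nvidia NVreg_RestrictProfilingToAdminUsers=0"
}

# profiling_restricted [PARAMS_FILE]: succeeds while the driver reports that
# profiling is restricted to admin users. Failure means "not restricted" or
# "file not readable", so treat it as a hint only.
profiling_restricted() {
    grep -q '^RmProfilingAdminOnly: 1' "${1:-/proc/driver/nvidia/params}" 2>/dev/null
}
```

A possible usage: `profiling_conf_line | sudo tee /etc/modprobe.d/nvidia-profiling.conf` (the file name is an arbitrary choice), then reboot, then run `profiling_restricted && echo "still restricted"`. On distributions that load the driver from the initramfs, the initramfs may need to be regenerated before the option takes effect.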
The nvidia-smi command correctly executes, showing the expected GPU devices for my server. However, when I attempt to run the clock CUDA sample, I get the following error:
CUDA Clock sample
CUDA error at ../../common/inc/helper_cuda.h:1133 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
Any ideas?
Ah! Figured it out. I had to upgrade the version of CentOS we are using. The unknown error indicated a mismatch in a dependent library.
To fix it, I upgraded the OS and then reinstalled the drivers.
For a computer with Titan GPU (compute_35,sm_35), I compiled some code using this line in CMakeLists.txt:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_35,code=sm_35)
The code compiles and also runs fine.
I wanted to check what compilation problems this code would cause for a friend who uses a GTS 450 (compute_20,sm_21). So, I changed the above line to:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_20,code=sm_21)
The code compiles without any errors on my computer with the Titan. But when I run it (again on my Titan computer), it fails after a thrust::copy call with the following error:
$ ./foobar
terminate called after throwing an instance of 'thrust::system::system_error'
what(): invalid device function
"foobar" terminated by signal SIGABRT (Abort)
Google says the above error is caused by a GPU architecture mismatch.
The strangest part is that with the above line (arch=compute_20,code=sm_21), the code compiles and runs without error on my friend's computer with GTS 450! Except for the GPU, her Ubuntu 12.04, gcc and CUDA SDK 5.5 versions are the same as mine.
Is this the real cause of this error? Why can't the Titan run compute_20 code? Isn't a CUDA GPU supposed to be backwards compatible with PTX or SASS code? Even if it isn't, why can't the driver JIT-compile the compute_20 PTX to SASS for sm_35?
If you specify:
-gencode arch=compute_20,code=compute_20
your code should run (via JIT) on either GPU.
According to the nvcc manual, JIT is directly enabled when you specify a virtual architecture for the code switch. You can make multiple specifications in a single command:
-arch=compute_20 -code=compute_20,sm_21,sm_35
(note this is in lieu of specifying -gencode ...)
which would allow JIT from compute_20 PTX, and non-JIT execution directly on cc2.1 or cc3.5 devices.
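To make the rule behind this concrete, here is a small model of how the runtime picks code from a fat binary. It is a sketch under simplifying assumptions, not how the driver is implemented: SASS (`sm_XY`) runs only on a device with the same major version and an equal or higher minor version, while PTX (`compute_XY`) can be JIT-compiled for any device at least as new. The function name `can_run` and its output strings are made up for illustration.

```shell
# can_run DEVICE_CC TARGET...
# DEVICE_CC is the compute capability without the dot (35 for cc3.5).
# Each TARGET is a value from nvcc's code= list, e.g. sm_21 or compute_20.
can_run() {
    dev=$1
    shift
    # First pass: prefer embedded SASS that the device can execute directly.
    for t in "$@"; do
        case $t in
            sm_*)
                cc=${t#sm_}
                # Same major version (leading digits) and built minor <= device minor.
                if [ "${cc%?}" = "${dev%?}" ] && [ "$cc" -le "$dev" ]; then
                    echo "native ($t)"
                    return 0
                fi
                ;;
        esac
    done
    # Second pass: fall back to PTX that the driver can JIT-compile.
    for t in "$@"; do
        case $t in
            compute_*)
                cc=${t#compute_}
                if [ "$cc" -le "$dev" ]; then
                    echo "jit ($t)"
                    return 0
                fi
                ;;
        esac
    done
    echo "invalid device function"
    return 1
}
```

In this model, `can_run 35 sm_21` fails, which matches the error in the question: with `code=sm_21` alone there is no PTX in the binary for the driver to JIT-compile. `can_run 35 compute_20` succeeds through JIT, and `can_run 35 compute_20 sm_21 sm_35` picks the native sm_35 code.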
I have a Dell Precision Rack running Ubuntu Precise with two Tesla C2075 cards plus a Quadro 600, which is the display device. I recently finished some tests on my desktop computer and have now tried to port the work to the workstation.
Since CUDA was not present, I installed it according to this guide and adapted the SDK Makefiles according to these suggestions.
The problem I am now facing is that not a single sample runs (I tested about 10 different ones). These are the errors I am getting:
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 10
-> invalid device ordinal
[deviceQuery] test results...
FAILED
> exiting in 3 seconds: 3...2...1...done!
[MonteCarloMultiGPU] starting...
CUDA error at MonteCarloMultiGPU.cpp:235 code=23510 (cudaErrorInvalidDevice) "cudaGetDeviceCount(&GPU_N)"
MonteCarloMultiGPU
==================
Parallelization method = threaded
Problem scaling = weak
Number of GPUs = 0
Total number of options = 0
Number of paths = 262144
main(): generating input data...
main(): starting 0 host threads...
Floating point exception (core dumped)
[reduction] starting...
reduction.cpp(124) : cudaSafeCallNoSync() Runtime API error 10 : invalid device ordinal.
[simplePrintf] starting...
simplePrintf.cu(193) : CUDA Runtime API error 10: invalid device ordinal.
As you can see, most of the errors point to a problem with the cudaGetDeviceCount call, which returns error code 10. According to the manual, this means:
cudaErrorInvalidDevice: This indicates that the device ordinal supplied by the user does not correspond to a valid CUDA device.
Unfortunately, the only solution I was able to find suggested checking the devices' power plugs. I did that and there was nothing wrong with them. Restarting the workstation does not help either.
I'd be happy to supply more details on my configuration. Just leave a comment!
Thanks to the comments on my original question, I was able to find a solution. I followed this guide to learn how to set up rc.local correctly (don't forget to chmod your script).
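For anyone landing here later: I am assuming the relevant part of such an rc.local script is creating the NVIDIA device nodes at boot, since they may not exist when no X server loads the driver. The sketch below only prints the `mknod` commands so they can be reviewed first; the function name `nvidia_mknod_cmds` is made up, and the GPU count is something you pass in yourself.

```shell
# nvidia_mknod_cmds COUNT: prints the mknod commands that create device nodes
# for COUNT GPUs plus the control node. It prints only and creates nothing.
nvidia_mknod_cmds() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        # One character device per GPU: major number 195, minor = GPU index.
        echo "mknod -m 666 /dev/nvidia$i c 195 $i"
        i=$((i + 1))
    done
    # The control device uses minor number 255.
    echo "mknod -m 666 /dev/nvidiactl c 195 255"
}
```

For my machine with three GPUs that would be `nvidia_mknod_cmds 3`. In rc.local the printed commands would run as root after `modprobe nvidia` has loaded the driver, and the script itself has to be executable.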
I have a problem running the examples of GPMR (a MapReduce framework). I have successfully compiled the examples included in the framework, but when I run them, I get the following error:
Fatal error in MPI_Comm_rank: Invalid communicator, error stack:
MPI_Comm_rank(106): MPI_Comm_rank(comm=0x8099680, rank=0x97ba5c8) failed
MPI_Comm_rank(64).: Invalid communicator
The commands I issued include "./matmul", "mpiexec -np 2 ./matmul", and "mpirun -np 2 ./matmul", where "matmul" is the binary of a matrix multiplication example. All of them produce the same error.
Your answer would be highly appreciated. I am looking forward to your helpful advice.
Regards,
Jay
I solved the problem by compiling both the library and the program with the same mpicxx. Previously, I had compiled a library with ...\bin\mpicxx but the program with a different mpicxx. That was the problem.
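A quick way to spot this kind of mix-up is to look at where the compiler wrapper and the launcher on your path come from. The sketch below is a coarse heuristic only (symlinks and alternatives systems can hide the real installation), and the helper name `same_dir` is made up for illustration.

```shell
# same_dir A B: succeeds when both paths live in the same directory.
same_dir() {
    [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

# Report the locations only when both tools are on the path.
if command -v mpicxx >/dev/null 2>&1 && command -v mpiexec >/dev/null 2>&1; then
    cxx=$(command -v mpicxx)
    launcher=$(command -v mpiexec)
    echo "compiler wrapper: $cxx"
    echo "launcher:         $launcher"
    if same_dir "$cxx" "$launcher"; then
        echo "both come from the same directory"
    else
        echo "warning: wrapper and launcher come from different directories"
    fi
fi
```

To see which MPI installation a wrapper links against, `mpicxx -show` (MPICH) or `mpicxx --showme` (Open MPI) prints the underlying compiler command line.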