Caffe: Check failed: error == cudaSuccess (48 vs. 0) no kernel image is available for execution on the device error on Jetson TX1

I was able to successfully compile Caffe on an NVIDIA Jetson TX1 board with CUDA 9.0 and OpenCV 3.
However, when I run the following command to test Caffe:
build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0
I get the following error:
F0712 23:05:53.664676 28580 im2col.cu:61] Check failed: error == cudaSuccess (48 vs. 0) no kernel image is available for execution on the device
If I remove the --gpu=0 flag, I don't see the error anymore.
Any help/suggestions on how I can get the code to use the GPU will be much appreciated.

Make sure you have the correct CUDA arch number in Makefile.config for the Jetson architecture. The Jetson TX1 has compute capability 5.3, so the line should look something like this:
-gencode arch=compute_53,code=sm_53 \
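If you're unsure of a board's compute capability, the deviceQuery sample that ships with the toolkit reports it. A minimal check, assuming the default CUDA 9.0 samples location (adjust the path to your install):
cd /usr/local/cuda-9.0/samples/1_Utilities/deviceQuery
make
./deviceQuery | grep "CUDA Capability"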

Related

Unsupported gpu architecture compute_30 on a CUDA 5 capable gpu

I'm currently trying to compile Darknet with the latest CUDA toolkit, which is version 11.1. My GPU is a GeForce 940M, which has compute capability 5.0. However, while rebuilding Darknet with the latest toolkit, it said:
nvcc fatal : Unsupported GPU architecture 'compute_30'
compute_30 targets compute capability 3.0, so how can the build fail when my GPU supports compute capability 5.0?
Is it possible that the build detected my Intel graphics card instead of my NVIDIA GPU? If that's the case, can I change which device it uses?
Support for compute_30 was removed in CUDA versions after 10.2. So if you are using nvcc, make sure Darknet's build system passes a flag targeting the correct architecture:
-gencode=arch=compute_50,code=sm_50
You may also need this flag to avoid a warning that certain architectures are deprecated:
-Wno-deprecated-gpu-targets
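Put together, a hypothetical nvcc invocation for this card would look like the following (the source file name is illustrative):
nvcc -gencode=arch=compute_50,code=sm_50 -Wno-deprecated-gpu-targets -c foo.cu -o foo.o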
I added the following:
with open('Makefile', 'r') as makefiletemp:
    list_of_lines = makefiletemp.readlines()
# Copy Makefile line 15 over line 16, then replace line 17 with the new ARCH setting
list_of_lines[15] = list_of_lines[14]
list_of_lines[16] = "ARCH= -gencode arch=compute_35,code=sm_35 \\\n"
with open('Makefile', 'w') as makefiletemp:
    makefiletemp.writelines(list_of_lines)
right before the
#Compile Darknet
!make
command. That seemed to work!

cuda-gdb run error:"Failed to read the valid warps mask(dev=0,sm=0,error=16)" [duplicate]

I tried to debug my CUDA application with cuda-gdb but got a weird error.
I built my application with the options -g -G -O0. I could run the program without cuda-gdb, but it didn't produce correct results, so I decided to use cuda-gdb. However, I got the following error message while running the program under cuda-gdb:
Error: Failed to read the valid warps mask (dev=1, sm=0, error=16).
What does it mean? Why is sm=0, and what is the meaning of error=16?
Update 1: I tried using cuda-gdb on the CUDA samples, but it fails with the same problem. I just installed the CUDA 6.0 Toolkit following NVIDIA's instructions. Is this a problem with my system?
Update 2:
OS - CentOS 6.5
GPU
1 Quadro 400
2 Tesla C2070
I'm using only one GPU for my program, but I got the same error message from any GPU I selected.
CUDA version - 6.0
GPU Driver
NVRM version: NVIDIA UNIX x86_64 Kernel Module 331.62 Wed Mar 19 18:20:03 PDT 2014
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
Update 3:
I tried to get more information from cuda-gdb, but I got the following results:
(cuda-gdb) info cuda devices
Error: Failed to read the valid warps mask (dev=1, sm=0, error=16).
(cuda-gdb) info cuda sms
Focus not set on any active CUDA kernel.
(cuda-gdb) info cuda lanes
Focus not set on any active CUDA kernel.
(cuda-gdb) info cuda kernels
No CUDA kernels.
(cuda-gdb) info cuda contexts
No CUDA contexts.
Actually, this issue is specific to some older NVIDIA GPUs (like the Quadro 400, GeForce GT 220, or GeForce GT 330M).
On Liam Kim's setup, cuda-gdb should work fine if you set the environment variable CUDA_VISIBLE_DEVICES so that cuda-gdb runs on the Tesla C2070 GPUs specifically, i.e.:
$ export CUDA_VISIBLE_DEVICES=0 (or 2)
The exact CUDA device index can be found by running the CUDA sample deviceQuery.
This issue has now been fixed, and the fix will be available to CUDA developers in the next CUDA release (it will be posted around early July 2014).
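For example, a hypothetical session that hides the Quadro and debugs on one of the Teslas only (the device index comes from deviceQuery; the binary name is illustrative):
$ export CUDA_VISIBLE_DEVICES=2
$ cuda-gdb ./my_cuda_app
(cuda-gdb) run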
This is an internal cuda-gdb bug; you should report it. Can you try installing the CUDA toolkit from the package on NVIDIA's site?

How to force cubin file generation for a higher compute version

In the samples provided with CUDA 6.0, I'm running the following compile command with error output:
foo#foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc cdpSimpleQuicksort.cu
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
cdpSimpleQuicksort.cu(105): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
cdpSimpleQuicksort.cu(114): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
2 errors detected in the compilation of "/tmp/tmpxft_0000241a_00000000-6_cdpSimpleQuicksort.cpp1.ii".
I then altered the command to this, with a new failure:
foo#foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc -gencode arch=compute_35,code=sm_35 cdpSimpleQuicksort.cu
cdpSimpleQuicksort.cu(105): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
cdpSimpleQuicksort.cu(114): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
2 errors detected in the compilation of "/tmp/tmpxft_000024f3_00000000-6_cdpSimpleQuicksort.cpp1.ii".
Does this have anything to do with the fact that the machine I'm on is only compute 2.1 capable, and the build tools are blocking me? What's the resolution? I'm not finding anything in the documentation that clearly handles this error.
I looked at this question, and a link to the documentation is simply not helping. I need to know how to modify the compile command.
Look at the makefile that comes with the cdpSimpleQuicksort project. It shows some additional switches that are needed to compile it, due to CUDA dynamic parallelism (which is essentially what the second set of errors is about). Go back and study that makefile, and see if you can figure out how to combine some of the compile commands there with --cubin.
The Reader's Digest version is that this should compile without error:
nvcc --cubin -rdc=true -I../../common/inc -arch=sm_35 cdpSimpleQuicksort.cu
Having said all that, you should be able to compile for whatever kind of target you want, but you won't be able to run a CDP code on a cc2.1 architecture.
For more detail, see the CUDA dynamic parallelism documentation.
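For reference, building the sample all the way to an executable also needs the device runtime library, as the sample's own makefile does. A sketch, assuming the CUDA 6.0 samples directory layout:
nvcc -rdc=true -I../../common/inc -arch=sm_35 cdpSimpleQuicksort.cu -o cdpSimpleQuicksort -lcudadevrt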

CUDA: Why does compute_20 code fail on compute_35 device?

For a computer with Titan GPU (compute_35,sm_35), I compiled some code using this line in CMakeLists.txt:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_35,code=sm_35)
The code compiles and also runs fine.
I wanted to check what compilation problems this code would cause for a friend who uses a GTS 450 (compute_20,sm_21). So, I changed the above line to:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_20,code=sm_21)
The code compiles without any errors on my computer with the Titan. But when I run it (again on the Titan computer), it fails after a thrust::copy call with the following error:
$ ./foobar
terminate called after throwing an instance of 'thrust::system::system_error'
what(): invalid device function
"foobar" terminated by signal SIGABRT (Abort)
Google says the above error is caused by a GPU architecture mismatch.
The strangest part is that with the above line (arch=compute_20,code=sm_21), the code compiles and runs without error on my friend's computer with the GTS 450! Except for the GPU, her Ubuntu 12.04, gcc, and CUDA SDK 5.5 versions are the same as mine.
Is this the real cause of the error? Why can't the Titan run compute_20 code? Isn't a CUDA GPU supposed to be backwards compatible with PTX or SASS code? Even if it isn't, why can't the driver JIT-compile the compute_20 PTX to the SASS of sm_35?
If you specify:
-gencode arch=compute_20,code=compute_20
your code should run (via JIT) on either GPU. With code=sm_21 alone, only sm_21 SASS is embedded in the binary (no PTX), and SASS is not forward-compatible across GPU generations, so the Titan has nothing it can execute or JIT-compile.
According to the nvcc manual, JIT is directly enabled when you specify a virtual architecture for the code switch. You can make multiple specifications in a single command:
-arch=compute_20 -code=compute_20,sm_21,sm_35
(note this is in lieu of specifying -gencode ...)
which would allow JIT from compute_20 PTX, and non-JIT execution directly on cc2.1 or cc3.5 devices.
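For example, a single build covering both machines might look like this (the file name is taken from the question's ./foobar and is otherwise illustrative):
nvcc -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -gencode arch=compute_35,code=sm_35 foobar.cu -o foobar
This embeds compute_20 PTX for JIT plus native SASS for both the GTS 450 and the Titan.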

CUDA SDK examples throw various errors in multi-gpu system

I have a Dell Precision Rack running Ubuntu Precise, featuring two Tesla C2075s plus a Quadro 600, which is the display device. I recently finished some tests on my desktop computer and am now trying to port the code to the workstation.
Since CUDA was not present, I installed it according to this guide and adapted the SDK Makefiles according to these suggestions.
What I am now facing is that not a single sample is running (I tested about 10 different ones). These are the errors I am getting:
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 10
-> invalid device ordinal
[deviceQuery] test results...
FAILED
> exiting in 3 seconds: 3...2...1...done!
[MonteCarloMultiGPU] starting...
CUDA error at MonteCarloMultiGPU.cpp:235 code=10 (cudaErrorInvalidDevice) "cudaGetDeviceCount(&GPU_N)"
MonteCarloMultiGPU
==================
Parallelization method = threaded
Problem scaling = weak
Number of GPUs = 0
Total number of options = 0
Number of paths = 262144
main(): generating input data...
main(): starting 0 host threads...
Floating point exception (core dumped)
[reduction] starting...
reduction.cpp(124) : cudaSafeCallNoSync() Runtime API error 10 : invalid device ordinal.
[simplePrintf] starting...
simplePrintf.cu(193) : CUDA Runtime API error 10: invalid device ordinal.
As you can see, most of the errors point towards a problem with the cudaGetDeviceCount call, which returns error code 10. According to the manual, the problem is:
cudaErrorInvalidDevice: This indicates that the device ordinal supplied by the user does not correspond to a valid CUDA device.
Unfortunately, the only solution I was able to find suggested checking the devices' power plugs. I did that, and there was nothing wrong with them. Restarting the workstation does not help either.
I'd be happy to supply more details on my configuration. Just leave a comment!
Thanks to the comments on my original question, I was able to find a solution. I followed this guide to learn how to set up rc.local correctly (don't forget to chmod your script).
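For reference, the fix boils down to creating the NVIDIA device nodes at boot when no X server is doing it for you. A sketch along the lines of NVIDIA's Linux driver documentation (verify the device counts against your own system):
#!/bin/bash
# Load the NVIDIA kernel module and create /dev/nvidia* nodes
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # One node per NVIDIA VGA or 3D controller found on the PCI bus
  N=$(expr $(lspci | grep -i NVIDIA | grep -c -E "VGA|3D controller") - 1)
  for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
fi
exit 0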