deviceQuery program - number of multiprocessors = 0 - CUDA

I ran the deviceQuery program from the CUDA SDK. The number of multiprocessors and the number of cores are reported as 0, which I am sure is not true.
What could the reasons be?
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla C2050"
CUDA Driver Version: 4.10
CUDA Runtime Version: 4.10
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 2817982464 bytes
Number of multiprocessors: 0
Number of cores: 0
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: No
Compute mode: Default
(multiple host threads can use this device simultaneously)

Try the following:
Uninstall all old graphics drivers and install the latest NVIDIA graphics driver.
Uninstall all old CUDA toolkits and install the latest CUDA toolkit.
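Independently of the SDK binary, you can also query the multiprocessor count directly through the runtime API; this tells you whether the driver itself reports zero, or whether the deviceQuery binary (for example, one built against a mismatched toolkit) is at fault. A minimal sketch, assuming nvcc is installed (check_sm.cu is just an example file name; compile with nvcc check_sm.cu -o check_sm):
#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t status = cudaGetDeviceProperties(&prop, 0);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(status));
        return 1;
    }
    // multiProcessorCount comes straight from the driver; a Tesla C2050
    // should report 14 SMs here even if an old deviceQuery prints 0.
    printf("SMs: %d, compute capability %d.%d\n",
           prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}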

Make sure you update your NVIDIA drivers to the latest available version and then reboot. If that doesn't fix it, please run the following commands and post the output:
uname -a
nvidia-smi -q
lspci
echo $LD_LIBRARY_PATH
ldd /path/to/deviceQuery
ldconfig -p
./deviceQuery (with all output, not just the card in question)

Related

CUDA failed to launch kernel: no kernel image available for execution

I am trying to run CUDA on a rather old GPU. I tried the CUDA sample vectorAdd, which gives me the following error:
Failed to launch vectorAdd kernel (error code no kernel image is available for execution on the device)!
These are the outputs from the following commands.
deviceQuery:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 580"
CUDA Driver Version / Runtime Version 9.1 / 9.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1467 MBytes (1538392064 bytes)
MapSMtoCores for SM 2.0 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 2.0 is undefined. Default to use 64 Cores/SM
(16) Multiprocessors, ( 64) CUDA Cores/MP: 1024 CUDA Cores
GPU Max Clock rate: 1630 MHz (1.63 GHz)
Memory Clock rate: 2050 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.147 Driver Version: 390.147 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 580 Off | 00000000:03:00.0 N/A | N/A |
| 42% 48C P12 N/A / N/A | 257MiB / 1467MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Now, according to the CUDA compatibility PDF
https://docs.nvidia.com/pdf/CUDA_Compatibility.pdf
I assume I have binary compatibility from CUDA 9.0.176 to the GPU driver. For compute capability support, the table does not list the 390 driver.
Is it even possible to program CUDA on this GPU, or should I get a newer one? If it is possible, what combination of driver and CUDA toolkit versions do I need?
The GPU you are using is a Fermi class (compute capability 2.0) device. Support was officially removed from the CUDA toolkit when CUDA 9.0 was released in September 2017. The last release of the CUDA toolkit with Fermi support was CUDA 8.0. You will have to use that (or something even older) if you wish to use that GPU with CUDA.
[Answer assembled from comments and added as a community wiki entry to get this question off the unanswered list for the CUDA tag]
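For what it's worth, the architecture mismatch can also be detected programmatically rather than by watching a sample fail. A minimal sketch (my own illustration, not part of the CUDA samples): launch an empty kernel and inspect the error code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();
    // Launch errors such as "no kernel image is available for execution
    // on the device" are reported by cudaGetLastError() after the launch.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();  // catch errors raised during execution
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Kernel ran: the binary contains code for this GPU.\n");
    return 0;
}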

Check failed: error == cudaSuccess during training SSD

I am training SSD and I get the following error:
I0116 13:10:31.206343 3447 net.cpp:761] Ignoring source layer drop6
I0116 13:10:31.207219 3447 net.cpp:761] Ignoring source layer drop7
I0116 13:10:31.207229 3447 net.cpp:761] Ignoring source layer fc8
I0116 13:10:31.207233 3447 net.cpp:761] Ignoring source layer prob
F0116 13:10:31.227303 3447 parallel.cpp:130] Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal
*** Check failure stack trace: ***
# 0x7f158382e5cd google::LogMessage::Fail()
# 0x7f1583830433 google::LogMessage::SendToLog()
# 0x7f158382e15b google::LogMessage::Flush()
# 0x7f1583830e1e google::LogMessageFatal::~LogMessageFatal()
# 0x7f158412f7bd caffe::DevicePair::compute()
# 0x7f15841354e0 caffe::P2PSync<>::Prepare()
# 0x7f1584135fee caffe::P2PSync<>::Run()
# 0x40af10 train()
# 0x407608 main
# 0x7f1581fbd830 __libc_start_main
# 0x407ed9 _start
# (nil) (unknown)
Aborted (core dumped)
My graphics card is a Quadro K4200.
./deviceQuery gives me:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro K4200"
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4034 MBytes (4230479872 bytes)
( 7) Multiprocessors, (192) CUDA Cores/MP: 1344 CUDA Cores
GPU Max Clock rate: 784 MHz (0.78 GHz)
Memory Clock rate: 2700 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro K4200
Result = PASS
I can successfully run the SSD library's tests; I only get the error during training.
Is the graphics card not powerful enough to train the library?
I found the error.
When we run the command python examples/ssd/ssd_pascal.py in SSD, the training command it launches next is as follows:
gdb --args ./build/tools/caffe train --solver="models/VGGNet/VOC0712/SSD_300x300/solver.prototxt" --weights="models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel" --gpu 0,1,2,3 2>&1 | tee jobs/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300.log
The --gpu 0,1,2,3 flag is what causes the issue: it requests four GPUs, but this machine has only one, so Caffe fails with "invalid device ordinal". I changed it to --gpu 0 and ran the training command directly as
./build/tools/caffe train --solver="models/VGGNet/VOC0712/SSD_300x300/solver.prototxt" --weights="models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel" --gpu 0 | tee jobs/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300.log
and that solved it.
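More generally, "invalid device ordinal" means some code asked for a device index that does not exist on the machine. A minimal sketch of the check that Caffe is effectively failing here (my own illustration, not Caffe code):
#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    // Requesting GPUs 0,1,2,3 on a single-GPU machine fails for any
    // index >= count with "invalid device ordinal".
    int requested[] = {0, 1, 2, 3};
    for (int i = 0; i < 4; ++i) {
        int id = requested[i];
        if (id >= count) {
            fprintf(stderr, "GPU %d does not exist (only %d device(s) present)\n",
                    id, count);
            continue;
        }
        cudaError_t err = cudaSetDevice(id);
        printf("GPU %d: %s\n", id, cudaGetErrorString(err));
    }
    return 0;
}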

Ethminer Ubuntu 16 not using NVIDIA GPU

I have followed the instructions here and successfully built and set up geth.
Ethminer seems to work, except that it doesn't use the Titan X GPU and the mining rate is only 341022 H/s.
Also, when I try to use the -G option, ethminer says it is an invalid argument; the -G flag also doesn't appear in ethminer's help output.
Your GPU must have a minimum amount of memory to perform mining. Upgrade to a GPU with more memory (a minimum of 4 GB is preferable).
The current DAG size is above 2 GB. That means you can't mine with a GPU that has less than 2 GB of memory.
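If you want to verify how much device memory the CUDA runtime actually sees, for instance to compare against the DAG size, a minimal sketch:
#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        fprintf(stderr, "No usable CUDA device found\n");
        return 1;
    }
    // A DAG above 2 GB cannot be allocated if free memory is below that.
    printf("Free: %zu MiB, Total: %zu MiB\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}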

What utility/binary can I call to determine an nVIDIA GPU's Compute Capability?

Suppose I have a system with a single GPU installed, and suppose I've also installed a recent version of CUDA.
I want to determine the compute capability of my GPU. If I could compile code, that would be easy:
#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d", prop.major * 10 + prop.minor);
}
but - suppose I want to do that without compiling. Can I? I thought nvidia-smi might help me, since it lets you query all sorts of information about devices, but it seems it doesn't let you obtain the compute capability. Maybe there's something else I can do? Maybe something visible via /proc or system logs?
Edit: This is intended to run before a build, on a system which I don't control. So it must have minimal dependencies, run on a command-line and not require root privileges.
Unfortunately, it looks like the answer at the moment is "No", and that one needs to either compile a program or use a binary compiled elsewhere.
Edit: I have adapted a workaround for this issue - a self-contained bash script which compiles a small built-in C program to determine the compute capability. (It is particularly useful to call from within CMake, but it can also be run independently.)
Also, I've filed a feature-requesting bug report at nVIDIA about this.
Here's the script, in a version assuming that nvcc is on your path:
//usr/bin/env nvcc --run "$0" ${1:+--run-args "${@:1}"} ; exit $?
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>

int main(int argc, char *argv[])
{
    cudaDeviceProp prop;
    cudaError_t status;
    int device_count;
    int device_index = 0;

    // An optional first argument selects which device to query.
    if (argc > 1) {
        device_index = atoi(argv[1]);
    }
    status = cudaGetDeviceCount(&device_count);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount() failed: %s\n",
                cudaGetErrorString(status));
        return -1;
    }
    if (device_index >= device_count) {
        fprintf(stderr, "Specified device index %d exceeds the maximum "
                "(the device count on this system is %d)\n",
                device_index, device_count);
        return -1;
    }
    status = cudaGetDeviceProperties(&prop, device_index);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties() for device %d failed: %s\n",
                device_index, cudaGetErrorString(status));
        return -1;
    }
    // Print the compute capability as a two-digit number, e.g. 75 for 7.5.
    int v = prop.major * 10 + prop.minor;
    printf("%d\n", v);
}
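Because the first line is simultaneously a C++ comment and a shell command, the file can be executed directly: the shell runs the first line, which hands the whole file to nvcc --run and exits with nvcc's status. Usage might look like this (compute-capability.sh is just an example name for the saved script, and the printed value is whatever your device reports):
$ chmod +x compute-capability.sh
$ ./compute-capability.sh        # queries device 0
75
$ ./compute-capability.sh 1      # queries device 1, if present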
We can use nvidia-smi --query-gpu=compute_cap --format=csv to get the compute capability.
Sample output:
compute_cap
8.6
This query is available from CUDA toolkit 11.6 onwards.
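For use in scripts, the header line can be suppressed with the noheader modifier:
$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
8.6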
You can use the deviceQuery utility included in the CUDA installation:
# change cwd into the utility's source directory
$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
# build the deviceQuery utility with make as root
$ sudo make
# run deviceQuery
$ ./deviceQuery | grep Capability
CUDA Capability Major/Minor version number: 7.5
# optionally copy deviceQuery into ~/bin for future use
$ cp ./deviceQuery ~/bin
Full output from deviceQuery with an RTX 2080 Ti is as follows:
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce RTX 2080 Ti"
CUDA Driver Version / Runtime Version 11.2 / 10.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 11016 MBytes (11551440896 bytes)
(68) Multiprocessors, ( 64) CUDA Cores/MP: 4352 CUDA Cores
GPU Max Clock rate: 1770 MHz (1.77 GHz)
Memory Clock rate: 7000 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 5767168 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

cuda-gdb sees only the least capable of the four CUDA-capable devices available

There are four CUDA-capable devices available:
teslabot$ ./deviceQuery | grep -i "device [0-9]\|capability"
Device 0: "Tesla C2050 / C2070"
CUDA Capability Major/Minor version number: 2.0
Device 1: "Tesla C2050 / C2070"
CUDA Capability Major/Minor version number: 2.0
Device 2: "GeForce GTX 295"
CUDA Capability Major/Minor version number: 1.3
Device 3: "GeForce GTX 295"
CUDA Capability Major/Minor version number: 1.3
cuda-gdb sees only one of them:
teslabot$ cuda-gdb vector_add
NVIDIA (R) CUDA Debugger
4.0 release
Portions Copyright (C) 2007-2011 NVIDIA Corporation
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
[...]
(cuda-gdb) break vector_add_gpu
Breakpoint 1 at 0x400ddb: file vector_add.cu, line 7.
(cuda-gdb) run
[...]
(cuda-gdb) info cuda devices
Dev Description SM Type SMs Warps/SM Lanes/Warp Max Regs/Lane Active SMs Mask
* 0 gt200 sm_13 30 32 32 128 0x00000001
I have checked that code built with -gencode arch=compute_20,code=sm_20 compiles without errors on said machine, and when compiled for sm_20, using printf in a CUDA kernel works correctly.
How can I make cuda-gdb see all devices (perhaps except the one used for graphics, though in this case I am logged in remotely via SSH), or at least one Tesla / sm_20 device?
When following the advice in Michael Foukarakis' response by setting the CUDA_VISIBLE_DEVICES environment variable to contain only "0,1", i.e. making only the Teslas visible, I get the following error after running info cuda devices:
(cuda-gdb) info cuda devices
fatal: All CUDA devices are used for X11 and cannot be used while debugging. (error code = 24)
How can I check which devices are used by X11 (X.Org), and how can I make the X Window System use the GeForces rather than the Teslas?
Can you make sure the CUDA_VISIBLE_DEVICES environment variable contains all the devices you want to be used? For example:
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C1060"
Device 2: "Quadro FX 3800"
By setting the variable you can make only a subset of them visible to the runtime:
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Quadro FX 3800"