CUDA Peer-to-Peer Memory Access on RTX2080 - cuda

I have four RTX2080 GPUs and I want to enable peer access from device 1 to device 0 in following code.
cudaSetDevice(0);
float* data;
cudaMalloc(&data, 1000 * sizeof(float));
cudaSetDevice(1);
cudaDeviceEnablePeerAccess(0, 0); // This will fail with error: cudaErrorPeerAccessUnsupported
I have checked unifiedAddressing of cudaDeviceProp and the value is 1. Is there anything wrong with my code?
Here is the topology of my GPU connection:
GPU0 GPU1 GPU2 GPU3
GPU0 X NODE SYS SYS
GPU1 NODE X SYS SYS
GPU2 SYS SYS X NODE
GPU3 SYS SYS NODE X
The Driver Version: 430.40
The CUDA Version: 10.1

Turning a comment into an answer:
Peer-to-peer memory access on the RTX2080 is only supported when the nvlink bridge hardware is installed between GPUs. That is why you receive an unsupported error in this case.

Related

CUDA failed to launch kernel : no kernel image available for execution

i am trying to run CUDA on a rather old GPU. I tried the CUDA Samples vectorAdd which gives me the following error:
Failed to launch vectorAdd kernel (error code no kernel image is available for execution on the device)!
These are the outputs from
deviceQuery:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 580"
CUDA Driver Version / Runtime Version 9.1 / 9.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1467 MBytes (1538392064 bytes)
MapSMtoCores for SM 2.0 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 2.0 is undefined. Default to use 64 Cores/SM
(16) Multiprocessors, ( 64) CUDA Cores/MP: 1024 CUDA Cores
GPU Max Clock rate: 1630 MHz (1.63 GHz)
Memory Clock rate: 2050 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.147 Driver Version: 390.147 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 580 Off | 00000000:03:00.0 N/A | N/A |
| 42% 48C P12 N/A / N/A | 257MiB / 1467MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Now according to the CUDA compatibility PDF
https://docs.nvidia.com/pdf/CUDA_Compatibility.pdf
I assume I have Binary Compatibility from CUDA 9.0.176 to the GPU Driver.
For Compute Capability Support, the table does not list the 390 Driver.
Is it even possible to program CUDA on this GPU or should I get a newer one? If it is possible, what combination of driver and CUDA toolkit version do I need?
The GPU you are using is a Fermi class (compute capability 2.0) device. Support was officially removed from the CUDA toolkit when CUDA 9.0 was released in September 2017. The last release of the CUDA toolkit with Fermi support was CUDA 8.0. You will have to use that (or something even older) if you wish to use that GPU with CUDA.
[Answer assembled from comments an added as a community wiki entry to get this question off the unanswered list for the CUDA tag]

Check failed: error == cudaSuccess during training SSD

I am training SSD and I have error as
I0116 13:10:31.206343 3447 net.cpp:761] Ignoring source layer drop6
I0116 13:10:31.207219 3447 net.cpp:761] Ignoring source layer drop7
I0116 13:10:31.207229 3447 net.cpp:761] Ignoring source layer fc8
I0116 13:10:31.207233 3447 net.cpp:761] Ignoring source layer prob
F0116 13:10:31.227303 3447 parallel.cpp:130] Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal
*** Check failure stack trace: ***
# 0x7f158382e5cd google::LogMessage::Fail()
# 0x7f1583830433 google::LogMessage::SendToLog()
# 0x7f158382e15b google::LogMessage::Flush()
# 0x7f1583830e1e google::LogMessageFatal::~LogMessageFatal()
# 0x7f158412f7bd caffe::DevicePair::compute()
# 0x7f15841354e0 caffe::P2PSync<>::Prepare()
# 0x7f1584135fee caffe::P2PSync<>::Run()
# 0x40af10 train()
# 0x407608 main
# 0x7f1581fbd830 __libc_start_main
# 0x407ed9 _start
# (nil) (unknown)
Aborted (core dumped)
My Graphic is Quadro4200.
./deviceQuery gives me
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro K4200"
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4034 MBytes (4230479872 bytes)
( 7) Multiprocessors, (192) CUDA Cores/MP: 1344 CUDA Cores
GPU Max Clock rate: 784 MHz (0.78 GHz)
Memory Clock rate: 2700 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro K4200
Result = PASS
I can successfully test SSD library, just that I have error in training.
Is that Graphic card not powerful enough to train the library?
I found the error.
If we run this command python examples/ssd/ssd_pascal.py in ssd, the next step of training command is as follow.
gdb --args ./build/tools/caffe train --solver="models/VGGNet/VOC0712/SSD_300x300/solver.prototxt" --weights="models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel" --gpu 0,1,2,3 2>&1 | tee jobs/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300.log
this --gpu 0,1,2,3 2>&1 is giving the issue. I changed to --gpu 0 and run from the training command directly as
./build/tools/caffe train --solver="models/VGGNet/VOC0712/SSD_300x300/solver.prototxt" --weights="models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel" --gpu 0 | tee jobs/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300.log,
then it solved.

What utility/binary can I call to determine an nVIDIA GPU's Compute Capability?

Suppose I have a system with a single GPU installed, and suppose I've also installed a recent version of CUDA.
I want to determine what's the compute capability of my GPU. If I could compile code, that would be easy:
#include <stdio.h>
int main() {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("%d", prop.major * 10 + prop.minor);
}
but - suppose I want to do that without compiling. Can I? I thought nvidia-smi might help me, since its lets you query all sorts of information about devices, but it seems it doesn't let you obtain the compute capability. Maybe there's something else I can do? Maybe something visible via /proc or system logs?
Edit: This is intended to run before a build, on a system which I don't control. So it must have minimal dependencies, run on a command-line and not require root privileges.
Unfortunately, it looks like the answer at the moment is "No", and that one needs to either compile a program or use a binary compiled elsewhere.
Edit: I have adapted a workaround for this issue - a self-contained bash script which compiles a small built-in C program to determine the compute capability. (It is particualrly useful to call from with CMake, but can just run independently.)
Also, I've filed a feature-requesting bug report at nVIDIA about this.
Here's the script, in a version assuming that nvcc is on your path:
//usr/bin/env nvcc --run "$0" ${1:+--run-args "${#:1}"} ; exit $?
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>
int main(int argc, char *argv[])
{
cudaDeviceProp prop;
cudaError_t status;
int device_count;
int device_index = 0;
if (argc > 1) {
device_index = atoi(argv[1]);
}
status = cudaGetDeviceCount(&device_count);
if (status != cudaSuccess) {
fprintf(stderr,"cudaGetDeviceCount() failed: %s\n", cudaGetErrorString(status));
return -1;
}
if (device_index >= device_count) {
fprintf(stderr, "Specified device index %d exceeds the maximum (the device count on this system is %d)\n", device_index, device_count);
return -1;
}
status = cudaGetDeviceProperties(&prop, device_index);
if (status != cudaSuccess) {
fprintf(stderr,"cudaGetDeviceProperties() for device device_index failed: %s\n", cudaGetErrorString(status));
return -1;
}
int v = prop.major * 10 + prop.minor;
printf("%d\n", v);
}
We can use nvidia-smi --query-gpu=compute_cap --format=csv to get the compute capability.
Sample output:
compute_cap
8.6
It is available for cuda tool kit 11.6.
You can use deviceQuery utility included in cuda installation
# change cwd into utility source directoy
$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
# build deviceQuery utility with make as root
$ sudo make
# run deviceQuery
$ ./deviceQuery | grep Capability
CUDA Capability Major/Minor version number: 7.5
# optionally copy deviceQuery in ~/bin for future use
$ cp ./deviceQuery ~/bin
Full ouput from deviceQuery with RTX2080Ti is follows:
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce RTX 2080 Ti"
CUDA Driver Version / Runtime Version 11.2 / 10.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 11016 MBytes (11551440896 bytes)
(68) Multiprocessors, ( 64) CUDA Cores/MP: 4352 CUDA Cores
GPU Max Clock rate: 1770 MHz (1.77 GHz)
Memory Clock rate: 7000 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 5767168 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
Thanks.

deviceQuery program - number of multiprocessors = 0

I have executed the deviceQuery program in the CUDA SDK. The number of mutiprocessors and cores are 0 in the file that I'm sure that is not true.
What the reasons can be?
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 3 devices supporting CUDA
Device 0: "Tesla C2050"
CUDA Driver Version: 4.10
CUDA Runtime Version: 4.10
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 2817982464 bytes
Number of multiprocessors: 0
Number of cores: 0
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: No
Compute mode: Default
(multiple host threads can use this device simultaneously)
Ensure that:
Uninstall all the old graphics drivers and install the latest NVIDIA graphics drivers.
Uninstall all the old CUDA toolkits and install the latest CUDA toolkit.
Make sure you update your nvidia drivers to the latest available and then reboot. If that doesn't fix it, please run the following commands and post the output:
uname -a
nvidia-smi -q
lspci
echo $LD_LIBRARY_PATH
ldd /path/to/deviceQuery
ldconfig -p
devicequery (with all output, not just the card in question)

CUDA 3.0 version compatibility with compiler option -arch=sm_12

I have a very simple CUDA program. The program when compiled with -arch=sm_11 option, works correctly as expected. However, when compiled with -arch=sm_12, the results are unexpected.
Here is the kernel code :
__global__ void dev_test(int *test) {
*test = 100;
}
I invoke the kernel code as below :
int *dev_int, val;
val = 0;
cudaMalloc((void **)&dev_int, sizeof(int));
cudaMemset((void *)dev_int, 0, sizeof(int));
cudaMemcpy(dev_int, &val, sizeof(int), cudaMemcpyHostToDevice);
dev_test <<< 1, 1>>> (dev_int);
int *host_int = (int*)malloc(sizeof(int));
cudaMemcpy(host_int, dev_int, sizeof(int), cudaMemcpyDeviceToHost);
printf("copied back from device %d\n",*host_int);
When compiled with -arch=sm_11, the print statement correctly prints 100.
However when compiled with -arch=sm_12, it prints 0 i.e the changes inside the kernel function is not taking effect. I am guessing this is due to some incompatibility between my CUDA version and the nvidia drivers.
CUDA version - 3.0
NVRM version: NVIDIA UNIX x86_64 Kernel Module 195.36.24 Thu Apr 22 19:10:14 PDT 2010
GCC version: gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
Any help is highly appreciated.
My problem finally got resolved.. Not sure which one truly resolved it - i upgraded to Cuda 4.1 and upgraded my nVidia driver and the combination of the two solved the problem.