How do I know if CUDA can be used?

Assuming I compile a program that uses the CUDA Toolkit and run it on hardware that does not support the required compute capability, or that doesn't even have an NVIDIA GPU supporting CUDA, how can I detect this at the programming level, so that I can fall back on CPU procedures or show an error message?

If you have the CUDA Toolkit with the samples installed already, I suggest that you look at the deviceQuery project. It shows how to query the device for attributes such as the compute capability major/minor version numbers.
A short snippet:
int dev = 0; // index of the device to query

cudaSetDevice(dev);
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);

// Console log
int driverVersion = 0, runtimeVersion = 0;
cudaDriverGetVersion(&driverVersion);
cudaRuntimeGetVersion(&runtimeVersion);
printf("  CUDA Driver Version / Runtime Version          %d.%d / %d.%d\n",
       driverVersion / 1000, (driverVersion % 100) / 10,
       runtimeVersion / 1000, (runtimeVersion % 100) / 10);
printf("  CUDA Capability Major/Minor version number:    %d.%d\n",
       deviceProp.major, deviceProp.minor);
As for the case where the system doesn't have a GPU at all, you can use the code snippet below, although I believe you need to link against static CUDA libraries at that point.
int deviceCount = 0;
cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
if (error_id != cudaSuccess)
{
    printf("cudaGetDeviceCount returned %d\n-> %s\n",
           (int)error_id, cudaGetErrorString(error_id));
    exit(EXIT_FAILURE);
}

// cudaGetDeviceCount reports 0 devices if there are no CUDA-capable devices.
if (deviceCount == 0)
{
    printf("There are no available device(s) that support CUDA\n");
}
else
{
    printf("Detected %d CUDA Capable device(s)\n", deviceCount);
}
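Putting the two checks together, here is a minimal sketch of a runtime fallback. The helper name cudaAvailable and the required capability 2.0 are illustrative assumptions, not part of deviceQuery:
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: true only if at least one CUDA device meets the
// compute capability the GPU code paths were built for.
bool cudaAvailable(int requiredMajor, int requiredMinor)
{
    int deviceCount = 0;
    // Fails (or reports 0 devices) when no driver or no CUDA-capable GPU is present.
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0)
        return false;

    for (int dev = 0; dev < deviceCount; ++dev)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;
        if (prop.major > requiredMajor ||
            (prop.major == requiredMajor && prop.minor >= requiredMinor))
            return true;
    }
    return false;
}

int main()
{
    if (cudaAvailable(2, 0)) // example requirement: compute capability 2.0
        printf("Using the GPU path\n");
    else
        printf("Falling back to the CPU path\n");
    return 0;
}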

Related

grid_group not found in CUDA 9

I tried using Cooperative Groups in CUDA 9, but I get an error when compiling.
Does anyone know the solution?
The development environment is as follows:
CUDA 9
Kepler K80
Compute Capability: 3.7
#include <cstdint>
#include <iostream>
#include <vector>
#include <cooperative_groups.h>

__global__ void kernel(uint32_t values[])
{
    using namespace cooperative_groups;
    grid_group g = this_grid();
}

int main(void)
{
    constexpr uint32_t kNum = 1 << 24;
    std::vector<uint32_t> h_values(kNum);
    uint32_t *d_values;
    cudaMalloc(&d_values, sizeof(uint32_t) * kNum);
    cudaMemcpy(d_values, h_values.data(), sizeof(uint32_t) * kNum, cudaMemcpyHostToDevice);

    const uint32_t thread_num = 256;
    const dim3 block(thread_num);
    const dim3 grid((kNum + block.x - 1) / block.x);
    void *params[] = {&d_values};
    cudaLaunchCooperativeKernel((void *)kernel, grid, block, params);

    cudaMemcpy(h_values.data(), d_values, sizeof(uint32_t) * kNum, cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    return 0;
}
$ nvcc -arch=sm_37 test.cu --std=c++11 -o test
test.cu(12): error: identifier "grid_group" is undefined
test.cu(12): error: identifier "this_grid" is undefined
The grid_group features are only supported on the Pascal architecture and later.
You can try compiling for, e.g., sm_60 (of course the executable won't run on your GPU). Additionally, you need to enable relocatable device code (-rdc=true).
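With those two changes, the compile line from the question becomes:
$ nvcc -arch=sm_60 -rdc=true test.cu --std=c++11 -o test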
Unfortunately, the Programming Guide is not very clear about that; I couldn't find this information there. However, it is mentioned in some posts on devblogs.nvidia.com:
From https://devblogs.nvidia.com/cuda-9-features-revealed/
While Cooperative Groups works on all GPU architectures, certain functionality is inevitably architecture-dependent as GPU capabilities have evolved. Basic functionality, such as synchronizing groups smaller than a thread block down to warp granularity, is supported on all architectures, while Pascal and Volta GPUs enable new grid-wide and multi-GPU synchronizing groups.
Or at the very end of https://devblogs.nvidia.com/cooperative-groups/
New features in Pascal and Volta GPUs help Cooperative Groups go farther, by enabling creation and synchronization of thread groups that span an entire kernel launch running on one or even multiple GPUs.
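For reference, once the code compiles, grid-wide synchronization with a grid_group looks like the minimal sketch below (the phase comments are placeholders; a cooperative launch also requires that all blocks of the grid be resident on the device at once):
#include <cstdint>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel(uint32_t values[])
{
    cg::grid_group g = cg::this_grid();
    // ... phase 1: each block writes its part of values ...
    g.sync(); // grid-wide barrier: all blocks finish phase 1 before any starts phase 2
    // ... phase 2: blocks may now read data written by other blocks ...
}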

Getting error 255 when compiling with nvcc [duplicate]

It seems that printf doesn't work inside the kernel of a CUDA program:
#include "Common.h"
#include<cuda.h>
#include <stdio.h>
__device__ __global__ void Kernel(float *a_d , float *b_d ,int size)
{
int idx = threadIdx.x ;
int idy = threadIdx.y ;
//Allocating memory in the share memory of the device
__shared__ float temp[16][16];
//Copying the data to the shared memory
temp[idy][idx] = a_d[(idy * (size+1)) + idx] ;
printf("idx=%d, idy=%d, size=%d\n", idx, idy, size);
for(int i =1 ; i<size ;i++) {
if((idy + i) < size) { // NO Thread divergence here
float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]);
temp[i+idy][idx] = temp[i-1][idx] +((var1) * (temp[i+idy ][idx]));
}
__syncthreads(); //Synchronizing all threads before Next iterat ion
}
b_d[idy*(size+1) + idx] = temp[idy][idx];
}
When compiling, it says:
error: calling a host function("printf") from a __device__/__global__ function("Kernel") is not allowed
The CUDA version is 4.
Quoting the CUDA Programming Guide: "Formatted output is only supported by devices of compute capability 2.x and higher." See the programming guide for additional information.
Devices of compute capability below 2.x can use cuPrintf.
If you are on a device of compute capability 2.x or above and you are trying to use printf, make sure you have specified -arch=sm_20 (or higher). The default is sm_10, which does not have sufficient features to support printf.
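For example (kernel.cu stands in for your source file):
nvcc -arch=sm_20 kernel.cu -o kernel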
NVIDIA offers three source-level debuggers for CUDA. You may find these more useful than printf for inspecting variables:
- Nsight Visual Studio Edition CUDA Debugger
- Nsight Eclipse Edition CUDA Debugger
- cuda-gdb
You need to use cuPrintf, as in this example. Note that printf is a pretty limited way of debugging; the Nsight or Nsight Eclipse Edition IDEs are much nicer.
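For reference, a minimal sketch of the cuPrintf pattern, assuming cuPrintf.cu from the SDK samples is on the include path (the kernel body here is just a placeholder):
#include "cuPrintf.cu" // shipped with the CUDA SDK samples
#include <cstdio>

__global__ void Kernel()
{
    cuPrintf("idx=%d, idy=%d\n", threadIdx.x, threadIdx.y);
}

int main()
{
    cudaPrintfInit();                // allocate the device-side print buffer
    Kernel<<<1, dim3(16, 16)>>>();
    cudaPrintfDisplay(stdout, true); // copy the buffer back to the host and print it
    cudaPrintfEnd();                 // release the buffer
    return 0;
}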

Invalid device error returned by cuPointerGetAttribute()

This question is the same as one previously asked here without being answered:
https://stackoverflow.com/questions/22996075/invalid-device-error-return-by-cupointergetattribute
I use a GTX 680, the CUDA 6.5 toolkit, and the NVIDIA 340.46 kernel module. The GPU has unified addressing capability and compute capability 3.0.
The following code returns CUDA_ERROR_INVALID_DEVICE:
CUDA_DR_ASSERT(cuMemAlloc(&dev_ptr, size));
CUDA_DR_ASSERT(cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_ptr));
Has anyone (Sankar?) had similar problems and found the reason?
edit: this is the code I get the errors from:
CUDA_DR_ASSERT( cuInit(0) );

CUdevice dev;
CUDA_DR_ASSERT( cuDeviceGet(&dev, 0) );
CUDA_ASSERT( cudaSetDevice(dev) );

CUdeviceptr dev_ptr;
std::size_t size = 2 * 65536;
CUDA_DR_ASSERT( cuMemAlloc(&dev_ptr, size) );

unsigned int flag = 1; // set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS (use 0 to unset it)
CUDA_DR_ASSERT( cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_ptr) );

CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
CUDA_DR_ASSERT( cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_ptr) );
I have a system with a Quadro GPU at device 0 and a GeForce GPU at device 1.
Here's a fully worked example:
$ cat t642.cpp
#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>

int main(int argc, char *argv[]){
    int my_dev = 0;
    int dev_count = 0;
    if (argc > 1) my_dev = atoi(argv[1]);
    CUcontext my_ctx;
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&dev_count));
    if (my_dev > dev_count-1) {printf("device does not exist\n"); return 1;}
    char deviceName[256];
    checkCudaErrors(cuDeviceGetName(deviceName, 256, my_dev));
    printf("using device %d, %s\n", my_dev, deviceName);
    checkCudaErrors(cuCtxCreate(&my_ctx, 0, my_dev));
    CUdeviceptr dev_ptr;
    size_t size = 256;
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    checkCudaErrors(cuMemAlloc(&dev_ptr, size));
    checkCudaErrors(cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_ptr));
    printf("success!\n");
    return 0;
}
$ g++ -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc t642.cpp -lcuda -o t642
$ ./t642 0
using device 0, Quadro 5000
success!
$ ./t642 1
using device 1, GeForce GT 640
checkCudaErrors() Driver API error = 0101 "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)" from file <t642.cpp>, line 22.
$
Using a GeForce GPU with this mechanism (which is designed in support of GPUDirect RDMA) is not supported. This is documented in the GPUDirect RDMA documentation, which states:
GPUDirect RDMA is available on both Tesla and Quadro GPUs.
And while it is not the crux of your issue, you may also wish to read the GPUDirect RDMA release notes, which indicate that this token mechanism was deprecated in CUDA 6.0.

Dynamic parallelism cudaDeviceSynchronize() crashes

I have a kernel that calls another, empty kernel. However, when the calling kernel calls cudaDeviceSynchronize(), the kernel crashes and execution goes straight to the host. The memory checker does not report any memory access issues.
Does anyone know what could be the reason for such uncivilized behavior?
The crash seems to happen only if I run the code from the debugger (Visual Studio -> Nsight -> Start CUDA Debugging).
The crash does not happen every time I run the code - sometimes it crashes, and sometimes it finishes ok.
Here is the complete code to reproduce the problem:
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include "device_launch_parameters.h"
#include <stdio.h>

#define CUDA_RUN(x_, err_) {cudaStatus = x_; if (cudaStatus != cudaSuccess) {fprintf(stderr, err_ " %d - %s\n", cudaStatus, cudaGetErrorString(cudaStatus)); int k; scanf("%d", &k); goto Error;}}

struct computationalStorage {
    float rotMat;
};

__global__ void drawThetaFromDistribution() {}

__global__ void chainKernel() {
    computationalStorage *c = (computationalStorage *)malloc(sizeof(computationalStorage));
    if (!c) printf("malloc error\n");
    c->rotMat = 1.0f;

    int n = 1;
    while (n < 1000) {
        cudaError_t err;
        drawThetaFromDistribution<<<1, 1>>>();
        if ((err = cudaGetLastError()) != cudaSuccess)
            printf("drawThetaFromDistribution Sync kernel error: %s\n", cudaGetErrorString(err));
        printf("0");
        if ((err = cudaDeviceSynchronize()) != cudaSuccess)
            printf("drawThetaFromDistribution Async kernel error: %s\n", cudaGetErrorString(err));
        printf("1\n");
        ++n;
    }
    free(c);
}

int main() {
    cudaError_t cudaStatus;

    // Choose which GPU to run on; change this on a multi-GPU system.
    CUDA_RUN(cudaSetDevice(0), "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
    // Set on-chip memory to 16KB shared, 48KB L1
    CUDA_RUN(cudaDeviceSetCacheConfig(cudaFuncCachePreferL1), "Can't set CUDA to use on chip memory for L1");
    // Set a large heap
    CUDA_RUN(cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 10 * 192), "Can't set the Heap size");

    chainKernel<<<10, 192>>>();

    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        printf("Something was wrong! Error code: %d", cudaStatus);
    }

    CUDA_RUN(cudaDeviceReset(), "cudaDeviceReset failed!");

Error:
    int k;
    scanf("%d", &k);
    return 0;
}
If all goes well I expect to see:
00000000000000000000000....0000000000000001
1
1
1
1
....
This is what I get when everything works ok. When it crashes however:
000000000000....0000000000000Something was wrong! Error code: 30
As you can see, the statement err = cudaDeviceSynchronize(); does not finish, and execution goes straight to the host, where its own cudaDeviceSynchronize() fails with an unknown error code (30 = cudaErrorUnknown).
System: CUDA 5.5, NVIDIA Titan (headless), Windows 7 x64, Win32 application.
UPDATE: an additional NVIDIA card drives the display; Nsight 3.2.0.13289.
That last fact may have been the critical one. You don't mention which version of Nsight VSE you are using, nor your exact machine configuration (e.g. are there other GPUs in the machine, and if so, which is driving the display?), but at least until recently it was not possible to debug a dynamic parallelism application in single-GPU mode with Nsight VSE.
The current feature matrix also suggests that single-GPU CDP debugging is not yet supported.
One possible workaround in your case would be to add another GPU to drive the display and make the Titan card headless (i.e. don't attach any monitors to it and don't extend the Windows desktop onto that GPU).
I ran your application with and without cuda-memcheck and it does not appear to me that there are any problems with it.
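For reference, cuda-memcheck simply wraps the executable; the binary name app.exe below is a placeholder:
cuda-memcheck app.exe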
