"device-function-maxrregcount" message while compiling cuda code - cuda

I am trying to write code that computes multiple vector dot products inside a kernel. I'm using the cublasSdot function from the cuBLAS library to compute each dot product. This is my code:
using namespace std;
__global__ void ker(float * a, float * c,long long result_size,int n, int m)
{
float *sum;
int id = blockIdx.x*blockDim.x+threadIdx.x;
float *out1,*out2;
int k;
if(id<result_size)
{
cublasHandle_t handle;
cublasCreate(&handle);
out1 = a + id*m;
for(k=0;k<n;k++)
{
out2 =a + k*m;
cublasSdot(handle, m,out1,1,out2,1,sum);
c[id*n + k]= *sum;
}
}
}
int main()
{
int n=70000,m=100;
long result_size=n;
result_size*=n;
float * dev_data,*dev_result;
float * data = new float [n*m];
float * result = new float [result_size];
for (int i = 0; i< n; i++)
for(int j = 0; j <m;j++)
{
data[i*m+j]=rand();
}
cudaMalloc ((void**)&dev_data,sizeof(float)*m*n);
cudaMalloc ((void**)&dev_result,sizeof(float)*result_size);
cudaMemcpy( dev_data, data, sizeof(float) * m* n, cudaMemcpyHostToDevice);
int block_size=1024;
int grid_size=ceil((float)result_size/(float)block_size);
ker<<<grid_size,block_size>>>(dev_data,dev_result,result_size,n,m);
cudaDeviceSynchronize();
cudaMemcpy(result, dev_result, sizeof(float)*(result_size), cudaMemcpyDeviceToHost);
return 0;
}
I have included the cublas_v2.h header and used the following command to compile the code:
nvcc -lcublas_device -arch=sm_35 -rdc=true askstack.cu -o askstack
But I got the following message:
ptxas info : 'device-function-maxrregcount' is a BETA feature
Can anyone please let me know what I should do about this message?

This message is informational, as said by talonmies.
The --maxrregcount option of NVCC sets a limit on the number of registers that a kernel, and all the device functions it calls, may use:
If a kernel is limited to a certain number of registers with the launch_bounds attribute or the --maxrregcount option, then all functions that the kernel calls must not use more than that number of registers; if they exceed the limit, then a link error will be given.
See : NVCC Doc : 6.5.1. Object Compatibility
It seems that device-function-maxrregcount is used to override this value for device functions only. So you can set different register limits for kernels and for device functions.
For device functions, this option overrides the value specified by --maxrregcount.
Source : The CUDA Handbook

Related

In CUDA why can't I allocate 2d shared memory dynamically?

The following works fine;
__extern__ float dyanimicSh1D[];
But the following does not work:
__extern__ float dyanimicSh2D[][];
I want to understand why it is so?
You can't do it because the compiler needs the width information for the array to generate code that does proper indexing.
If you allocate shared memory in a static fashion like this:
__shared__ float sarr[24][12];
Then not only are you telling how much memory to allocate/provide, you are also giving the width of the array (12 in this example). This is important, because a static 2D array of this type is not treated under the hood as an array of pointers, but instead it is a flat allocation, with indexing created by the compiler, at compile-time.
So later, when you do something like this:
float val = sarr[y][x];
the compiler will take the sarr pointer, and do pointer arithmetic to add x + (y*12) to it, before dereferencing that pointer to retrieve the value. The 12 in that calculation is discovered at compile-time and used by the compiler in generating the code to do the indexing.
Doing something like this:
extern __shared__ float sarr[][];
doesn't supply the array width information to the compiler, so it cannot generate the indexing needed at compile time, and is not allowed.
By the way, this works:
extern __shared__ float sarr[][12];
Here is an example:
$ cat t46.cu
#include <cstdio>
__global__ void k(int x, int y){
extern __shared__ float sarr[][12];
for (int i = 0; i < 32; i ++)
for (int j = 0; j < 12; j++)
sarr[i][j] = i * 256 + j;
float val = sarr[y][x];
printf("%f\n", val);
}
int main(){
k<<<1,1,128*12>>>(3,2);
cudaDeviceSynchronize();
}
$ nvcc -o t46 t46.cu
$ cuda-memcheck ./t46
========= CUDA-MEMCHECK
515.000000
========= ERROR SUMMARY: 0 errors
$
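To make the indexing rule above concrete, here is a minimal host-side C++ sketch (it also compiles as host code in a .cu file). The helper names flat_offset and same_element are made up for illustration, and the width 12 simply matches the sarr[][12] example. It only demonstrates that arr[y][x] on a fixed-width array sits at element offset y*12 + x from the start of the allocation, which is the arithmetic the compiler cannot generate without knowing the width.

```cpp
#include <cassert>
#include <cstddef>

// Row width; it has to be a compile-time constant, as in sarr[][12].
constexpr std::size_t kWidth = 12;

// Hypothetical helper: element offset of arr[y][x] for float arr[][kWidth].
inline std::size_t flat_offset(std::size_t y, std::size_t x) {
    assert(x < kWidth);  // x must stay inside one row
    return y * kWidth + x;
}

// Hypothetical helper: checks that 2D indexing and the manual offset
// computation reach the same address. Byte pointers are used so the
// comparison stays within the bounds of the whole 2D object.
inline bool same_element(float (*arr)[kWidth], std::size_t y, std::size_t x) {
    const char *base = reinterpret_cast<const char *>(arr);
    const char *elem = reinterpret_cast<const char *>(&arr[y][x]);
    return static_cast<std::size_t>(elem - base) ==
           flat_offset(y, x) * sizeof(float);
}
```

With y = 2 and x = 3, as in the kernel launch above, flat_offset gives 27. If kWidth were unknown at compile time, as with sarr[][], there would be nothing to multiply y by.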

cuda::cub error: calling a __host__ function from a __device__ function is not allowed

I use cub::DeviceReduce::Sum to compute the summation of a vector, but it gave me the error :
error: calling a __host__ function("cub::DeviceReduce::Sum<double *, double *> ") from a __device__ function("dotcubdev") is not allowed
error: identifier "cub::DeviceReduce::Sum<double *, double *> " is undefined in device code
The code sample is as follows:
__device__ void sumcubdev(double* a, double *sum, int N)
{
// Declare, allocate, and initialize device-accessible pointers
//for input and output
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sum-reduction
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}
The code can run successfully in the "main{}" body, but it can't work in the function.
To use a cub device-wide function from device code, it is necessary to build your project to support CUDA dynamic parallelism. In the cub documentation, this is indicated here:
Usage Considerations
Dynamic parallelism. DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
For example, you can compile the code you have shown with:
$ cat t1364.cu
#include <cub/cub.cuh>
__device__ void sumcubdev(double* a, double *sum, int N)
{
// Declare, allocate, and initialize device-accessible pointers
//for input and output
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run sum-reduction
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}
$ nvcc -arch=sm_35 -dc t1364.cu
$
(CUDA 9.2, CUB 1.8.0)
This means CUB will be launching child kernels to get the work done.
This is not a complete tutorial on how to use CUDA Dynamic Parallelism (CDP). The above is the compile command only and omits the link step. There are many questions here on the cuda tag which discuss CDP; you can read about it in two blog articles and the programming guide, and there are CUDA sample projects showing how to compile and use it.

CUDA/C - Using malloc in kernel functions gives strange results

I'm new to CUDA/C and new to stack overflow. This is my first question.
I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected.
I read using malloc() in a kernel can lower performance a lot, but I need it anyway so I first tried with a simple int ** array just to test the possibility, then I'll actually need to allocate more complex structs.
In my main I used cudaMalloc() to allocate the space for the array of int *, and then I used malloc() for every thread in the kernel function to allocate the array for every index of the outer array. I then used another thread to check the result, but it doesn't always work.
Here's main code:
#define N_CELLE 1024*2
#define L_CELLE 512
extern "C" {
int main(int argc, char **argv) {
int *result = (int *)malloc(sizeof(int));
int *d_result;
int size_numbers = N_CELLE * sizeof(int *);
int **d_numbers;
cudaMalloc((void **)&d_numbers, size_numbers);
cudaMalloc((void **)&d_result, sizeof(int *));
kernel_one<<<2, 1024>>>(d_numbers);
cudaDeviceSynchronize();
kernel_two<<<1, 1>>>(d_numbers, d_result);
cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
printf("%d\n", *result);
cudaFree(d_numbers);
cudaFree(d_result);
free(result);
}
}
I used extern "C" because I couldn't compile while importing my header, which is not used in this example code. I pasted it since I don't know whether it is relevant.
This is kernel_one code:
__global__ void kernel_one(int **d_numbers) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
for(int j=0; j<L_CELLE;j++)
d_numbers[i][j] = 1;
}
And this is kernel_two code:
__global__ void kernel_two(int **d_numbers, int *d_result) {
int temp = 0;
for(int i=0; i<N_CELLE; i++) {
for(int j=0; j<L_CELLE;j++)
temp += d_numbers[i][j];
}
*d_result = temp;
}
Everything works fine (i.e. the count is correct) as long as I use no more than 1024*2*512 total blocks in device memory. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers.
Any idea of what the problem could be?
Thanks anyone!
In-kernel memory allocation draws memory from a statically allocated runtime heap, which defaults to 8 MB. At larger sizes you exceed that heap, malloc() returns NULL for some threads, and both kernels then read and write through invalid pointers. This produces a runtime error on the device and renders the results invalid. You would already know this if you had either added proper API error checking on the host side or run your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls, should solve the problem.
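The sizes involved can be checked with a small host-side C++ sketch. The helper names payload_bytes and suggested_heap_bytes are hypothetical. The 8 MB default comes from the CUDA programming guide, and the code assumes a 4-byte int. The 2x headroom mirrors the 2*L_CELLE factor in the snippet above; the real per-allocation overhead of the device allocator is not specified here.

```cpp
#include <cassert>
#include <cstddef>

// Default size of the device-side malloc() heap (8 MB per the CUDA
// programming guide); this is an assumption if your setup changes it.
constexpr std::size_t kDefaultHeapBytes = 8u * 1024u * 1024u;

// Hypothetical helper: payload requested when n_cells threads each
// call malloc(cell_len * sizeof(int)).
inline std::size_t payload_bytes(std::size_t n_cells, std::size_t cell_len) {
    assert(n_cells > 0 && cell_len > 0);
    return n_cells * cell_len * sizeof(int);
}

// Hypothetical helper: payload with 2x headroom, matching the
// sizeof(int) * N_CELLE * (2 * L_CELLE) expression in the answer.
inline std::size_t suggested_heap_bytes(std::size_t n_cells,
                                        std::size_t cell_len) {
    return 2 * payload_bytes(n_cells, cell_len);
}
```

For N_CELLE = 1024*2 the payload is 4 MiB, which fits in the default heap. For N_CELLE = 1024*4 the payload alone equals the 8 MB default, so any allocator bookkeeping pushes it over the limit. The value from suggested_heap_bytes is what would be passed to cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).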
I don't know anything about CUDA but these are severe bugs:
You cannot convert from int** to void**. They are not compatible types. Casting doesn't solve the problem, but hides it.
&d_numbers gives the address of a pointer to pointer which is wrong. It is of type int***.
Both of the above bugs result in undefined behavior. If your program somehow seems to work under some conditions, that's just pure (bad) luck.

argument of type "int *" is incompatible with parameter of type "int" in cuda kernel call

I've been trying for a while and have come across seemingly similar issues already posted; however, for some reason I'm still failing to clear the error. I effectively want to pass a 2D matrix to the kernel as a 1D array, as I have seen suggested. I'm not sure where I've gone wrong in my syntax, but there is a clash between the variable I supply to the kernel and the parameter that the kernel expects.
__global__ void calculatePath(int source, int target, int *cost, int distance){
int t_id = blockIdx.x * blockDim.x + threadIdx.x;
int dist[50];
int prev[50];
int selected[50]={0};
int num_path[50];
int d, m, min, start, j;
if ((t_id > 0) && (t_id < N)){
dist[t_id] = IN;
prev[t_id] = -1;
}
This is my kernel function whose parameters are all integers except "cost" which is a pointer to an integer array.
int main(int argc, char **argv){
int h_num_path[N];
int h_distance = 0;
int h_cost[N][N],i,j,co;
int h_source;
int h_target;
printf("\tShortest Path Algorithm(DIJKSRTRA's ALGORITHM\n\n");
for(i=0;i< N;i++)
for(j=0;j< N;j++)
h_cost[i][j] = IN;
//*********************
srand ( time(NULL));
for(int x=1;x< N;x++) {
for (int y = x + 1; y < N; y++) {
h_cost[x][y] = h_cost[y][x] = (rand() % 100) + 1;
}
}
printf("\nEnter The Source: ");
scanf("%d", &h_source);
printf("\nEnter The target: ");
scanf("%d", &h_target);
int *d_num_path;
int *d_cost;
int *d_source;
int *d_target;
int *d_dist;
int *d_prev;
int *d_distance;
cudaMalloc(&d_num_path, sizeof(int)*N);
cudaMalloc(&d_cost, sizeof(int)*N*N);
cudaMalloc((void**) &d_source, sizeof(int));
cudaMalloc((void**) &d_target, sizeof(int));
cudaMalloc((void**) &d_dist, sizeof(int)*N);
cudaMalloc((void**) &d_distance, sizeof(int));
cudaMemcpy(d_source, &h_source, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_target, &h_target, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_cost, h_cost, sizeof(int)*N*N, cudaMemcpyHostToDevice);
cudaMemcpy(d_distance, &h_distance, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_num_path, &h_num_path, sizeof(int)*N, cudaMemcpyHostToDevice);
clock_t before;
before = clock();
calculatePath<<<N/512 + 1, 512>>>(d_source, d_target, d_cost, d_distance);
clock_t time_taken = clock() - before;
cudaMemcpy(&h_num_path, d_num_path, sizeof(int)*N, cudaMemcpyDeviceToHost);
cudaMemcpy(&h_distance, d_distance, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_num_path);
cudaFree(d_cost);
cudaFree(d_source);
cudaFree(d_target);
cudaFree(d_dist);
cudaFree(d_prev);
cudaFree(d_distance);
printf("\nShortest Path: %d \n",co);
printf("%s %.4f %s", "Time taken:", time_taken/1000.0, "seconds");
return 0;
}
On the kernel call, I however receive the error that "argument of type 'int *' is incompatible with parameter of type 'int'" yet I believe my d_cost already is a pointer. I'd appreciate being set straight as I'm sure I'm overlooking something small.
It is not d_cost you are having trouble with. The other three arguments are int*, but the corresponding parameters are declared as int.
The C Programming Language by K&R at page 25 says:
We will generally use parameter for a variable named in the parenthesized list in a function definition, and argument for the value used in a call of the function.
Since your source and target are just single integer values, you don't really need to define device-side variables for them. Just pass the integer value itself as an argument. By doing so, you'll get performance improvements, as talonmies commented:
(With pass by value) there is constant memory cache broadcast within the kernel if it is done that way. Passing pointers for simple constants just increases latency by forcing every thread to dereference the pointer to retrieve the value from global memory, plus all the additional host side memory APIs to allocate them in the first place.
Also, you seem to expect the distance parameter to carry an output value from your kernel. In that case it must be declared as a pointer, so that you can copy it back with cudaMemcpyDeviceToHost after the kernel runs.
__global__ void calculatePath(int source, int target, int *cost, int *distance) // kernel definition
calculatePath<<< (N + 511) / 512, 512 >>>(h_source, h_target, d_cost, d_distance); // kernel launch
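The launch above uses (N + 511) / 512 for the grid size, where the question used N/512 + 1. A minimal host-side C++ sketch of the difference follows; the function names blocks_for and blocks_question are made up for illustration.

```cpp
#include <cassert>

// Ceiling division: the smallest number of blocks such that
// blocks * threads_per_block >= n.
inline int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// The question's formula, kept for comparison. It launches one
// block too many whenever n is an exact multiple of the block size.
inline int blocks_question(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return n / threads_per_block + 1;
}
```

Both forms cover all N elements, so either one still needs the t_id < N guard inside the kernel. The ceiling form just avoids launching an entirely idle block when N divides evenly.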
Three of your parameters are declared as integers, but you are passing pointers to integers. You need to change your kernel signature:
__global__ void calculatePath(int *source, int *target, int *cost, int *distance)

External call of a class method in a kernel

I have a class FPlan that has a number of methods such as permute and packing.
__host__ __device__ void Perturb_action(FPlan *dfp){
dfp->perturb();
dfp->packing();
}
__global__ void Vector_Perturb(FPlan **dfp, int n){
int i=threadIx.x;
if(i<n) Perturb_action(dfp[i]);
}
in main:
FPlan **fp_vec;
fp_vec=(FPlan**)malloc(VEC_SIZE*sizeof(FPlan*));
//initialize the vec
for(int i=0; i<VEC_SIZE;i++)
fp_vec[i]=&fp;
//fp of type FPlan that is initialized
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
//call kernel
dim3 threadsPerBlock(VEC_SIZE);
dim3 numBlocks(1);
Vector_Perturb<<<numBlocks,threadsPerBlock>>> (value,VEC_SIZE);
cudaMemcpy(fp_vec,value,v_sz,cudaMemcpyDeviceToHost);
test=fp_vec[0]->getCost();
printf("the cost after perturb %f\n"test);
test=fp_vec[1]->getCost();
printf("the cost after perturb %f\n"test);
Before the permute, the printed cost for fp_vec[0] is 0.8.
After the permute, fp_vec[0] has the value inf and fp_vec[1] has the value 0.8.
The expected output after the permutation should be something like fp_vec[0] = 0.7 and fp_vec[1] = 0.9. I want to apply these permutations to an array of type FPlan.
What am I missing? Is calling an external function supported in CUDA?
This seems to be a common problem these days:
Consider the following code:
#include <stdio.h>
#include <stdlib.h>
int main() {
int* arr = (int*) malloc(100);
printf("sizeof(arr) = %zu", sizeof(arr)); // %zu is the format specifier for size_t
return 0;
}
What is the expected output? 100? No, it's 4 (at least on a 32-bit machine; 8 on a typical 64-bit one). sizeof() returns the size of the variable's type, not the allocated size of the array it points to.
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
You are allocating 4 (or 8) bytes on the device and copying 4 (or 8) bytes. The result is undefined (and probably garbage every time).
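As a rough host-side C++ sketch of the size that was presumably intended: FPlanStub is a hypothetical stand-in for FPlan, and both helper names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the FPlan class from the question.
struct FPlanStub {
    double cost;
};

// Bytes occupied by an array of `count` pointers. This is the size
// the allocation and copy would need for the whole vector.
inline std::size_t pointer_array_bytes(std::size_t count) {
    assert(count > 0);
    return count * sizeof(FPlanStub *);
}

// What sizeof(fp_vec) measures: the pointer variable itself,
// regardless of how many elements were malloc'ed behind it.
inline std::size_t pointer_variable_bytes(FPlanStub **vec) {
    return sizeof(vec);
}
```

In the question's terms, v_sz would be computed as VEC_SIZE * sizeof(FPlan*), and the copy would take fp_vec rather than &fp_vec as its source. Note that this only fixes the size: the copied pointers would still refer to host memory, so the objects they point to would need to be made available on the device as well.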
Besides that, you should do proper error checking of your CUDA calls.
Have a look: What is the canonical way to check for errors using the CUDA runtime API?