Double "**" in an int function parameter

Can somebody explain why the parameter has a double "**"? I know it's the equivalent of passing "by reference" in C++, but I need a fuller explanation, please.
int crearevect(int **v)
{
    int nr, i;
    scanf("%d", &nr);
    *v = (int *)malloc(nr * sizeof(int)); /* store the new block's address back through v */
    for (i = 0; i < nr; i++)
        printf("%p ", (void *)(*v + i));  /* address of each element; %p expects a void* */
    printf("%p", (void *)v);              /* address of the caller's pointer */
    return nr;
}
// v[i] = *(v+i)
// (*v)[i] = *(*v + i)
void creareMATRICE(int ***a, int *n, int *m)
{
    scanf("%d", n);
    scanf("%d", m);
    *a = (int **)malloc(*n * sizeof(int *)); /* note: sizeof(int *), one row pointer per row */
    int i, j;
    for (i = 0; i < *n; i++)
        (*a)[i] = (int *)malloc(*m * sizeof(int));
    for (i = 0; i < *n; i++)
        for (j = 0; j < *m; j++)
            scanf("%d", &(*a)[i][j]);
    return;
}

A * in C denotes a pointer: a variable that holds a memory address. Similarly, ** denotes a variable that holds the address of a memory address, and likewise *** and so on. You can read more about pointers at https://www.tutorialspoint.com/cprogramming/c_pointers.htm or any other online reference. In general you can link a * with a variable used to address a 1D array (v[i] = *(v+i)), ** with a variable used to address a 2D array (v[i][j] = *(*(v+i) + j)), and so on.
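To see why the extra level of indirection matters, here is a hedged sketch of a caller (the variable names are illustrative, not from the question): passing &vect gives crearevect an int **, so the function can overwrite vect itself with the address malloc returns, whereas passing plain vect would only hand over a copy of the uninitialized pointer.

#include <stdio.h>
#include <stdlib.h>

int crearevect(int **v); /* defined as above */

int main(void)
{
    int *vect = NULL;           /* the pointer the callee will fill in */
    int nr = crearevect(&vect); /* &vect has type int **, so the callee
                                   can store the malloc'd address in vect */
    free(vect);
    return 0;
}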

Related

Passing a row of pointers to __global__ function

I am trying to pass a row of pointers of a two-dimensional array of pointers in CUDA. See my code below; the array of pointers is noLocal. Because I am doing an atomicAdd I am expecting a number different from zero at the line printf("Holaa %d\n", local[0][0]);, but the value I get is 0. Could you help me pass an array in CUDA by reference, please?
__global__ void myadd(int *data[8])
{
    unsigned int x = blockIdx.x;
    unsigned int y = threadIdx.x;
    unsigned int z = threadIdx.y;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    //printf("Ola sou a td %d\n", tid);
    for (int i; i<8; i++)
        atomicAdd(&(*data)[i],10);
}
int main(void)
{
    int local[20][8] = { 0 };
    int *noLocal[20][8];
    for (int d = 0; d < 20; d++) {
        for (int dd = 0; dd < 8; dd++) {
            cudaMalloc(&(noLocal[d][dd]), sizeof(int));
            cudaMemcpy(noLocal[d][dd], &(local[d][dd]), sizeof(int), cudaMemcpyHostToDevice);
        }
        myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
    }
    for (int d = 0; d < 20; d++)
        for (int dd = 0; dd < 8; dd++)
            cudaMemcpy(&(local[d][dd]), noLocal[d][dd], sizeof(int), cudaMemcpyDeviceToHost);
    printf("Holaa %d\n", local[0][0]);
    for (int d = 0; d < 20; d++)
        for (int dd = 0; dd < 8; dd++)
            cudaFree(noLocal[d][dd]);
}
I believe you received good advice in the other answer. I don't recommend this coding pattern. For general reference material on creating 2D arrays in CUDA, see this answer.
When I compile the code you have shown, I get warnings of the form "i is used before its value is set". This kind of warning should not be ignored. It arises from this statement which doesn't make sense to me:
for (int i; i<8; i++)
that should be:
for (int i = 0; i<8; i++)
It's not clear you understand the C++ concepts of pointers and arrays. This:
int local[20][8] = { 0 };
represents an array of 20x8 = 160 integers. If you want to imagine it as an array of pointers, you could pretend that it includes 20 pointers of the form local[0], local[1], ..., local[19]. Each of those "pointers" points to an array of 8 integers. But there is no sensible comparison to suggest that it has 160 pointers in it. Furthermore, the usage pattern you indicate in your kernel does not suggest that you expect 160 pointers to individual integers. But that is exactly what you are creating here:
int *noLocal[20][8]; //this is declaring a 2D array of 160 *pointers*
for (int d = 0; d< 20;d++) { // the combination of these loops means
for (int dd = 0; dd< 8; dd++) { // you will create 160 *pointers*
cudaMalloc(&(noLocal[d][dd]), sizeof(int));
To mimic your host array (local), you want to create 20 pointers, each of which points to an allocation of 8 int quantities. The usage in your kernel code here:
&(*data)[i]
means that you intend to take a single pointer, and offset it by i values ranging from 0 to 7. It does not mean that you expect to receive 8 individual pointers. Again, this is C++ behavior, not unique or specific to CUDA.
In order to make your code "sensible" there were a variety of changes I had to make. Here's a "fixed" version:
$ cat t1858.cu
#include <cstdio>
__global__ void myadd(int data[8])
{
    // unsigned int x = blockIdx.x;
    // unsigned int y = threadIdx.x;
    // unsigned int z = threadIdx.y;
    // int tid = blockDim.x * blockIdx.x + threadIdx.x;
    //printf("Ola sou a td %d\n", tid);
    for (int i = 0; i < 8; i++)
        atomicAdd(data + i, 10);
}
int main(void)
{
    int local[20][8] = { 0 };
    int *noLocal[20];
    for (int d = 0; d < 20; d++) {
        cudaMalloc(&(noLocal[d]), 8 * sizeof(int));
        cudaMemcpy(noLocal[d], local[d], 8 * sizeof(int), cudaMemcpyHostToDevice);
        myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
    }
    for (int d = 0; d < 20; d++)
        cudaMemcpy(local[d], noLocal[d], 8 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("Holaa %d\n", local[0][0]);
    for (int d = 0; d < 20; d++)
        cudaFree(noLocal[d]);
}
$ nvcc -o t1858 t1858.cu
$ cuda-memcheck ./t1858
========= CUDA-MEMCHECK
Holaa 40000
========= ERROR SUMMARY: 0 errors
$
The number 40000 is correct. It comes about because every thread is doing an atomic add of 10, and you have 20x200 threads that are doing that. 10x20x200 = 40000.
You should simply not be doing anything like that. You are wasting time and memory with these excessive allocations, and your kernel would be pretty slow as well. I am 100% certain this is not what you were asked to do, nor what you wanted to do.
Instead, you should:
Allocate a single large buffer on the device to fit the data you need.
Avoid using pointers on the device side, except to that buffer, unless absolutely necessary.
If you somehow have to use a 2D pointer array - add relevant offsets to your buffer's base pointer to get different pointers into it.
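A minimal sketch of that single-buffer approach for the example above (d_buf is an illustrative name; the fixed myadd kernel is unchanged, since its int data[8] parameter decays to int* and can point anywhere into the buffer):

int local[20][8] = { 0 };
int *d_buf;                                     // one allocation holds all 20*8 ints
cudaMalloc(&d_buf, 20 * 8 * sizeof(int));
cudaMemcpy(d_buf, local, 20 * 8 * sizeof(int), cudaMemcpyHostToDevice);
for (int d = 0; d < 20; d++)
    myadd<<<20, dim3(10, 20)>>>(d_buf + d * 8); // base pointer + offset = row d
cudaMemcpy(local, d_buf, 20 * 8 * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_buf);

This replaces 160 allocations and 320 copies with one of each.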

Finding triplets in CUDA kernel

I have p.ntp test particles, and every i-th particle has Cartesian coordinates tp.rh[i].x, tp.rh[i].y, tp.rh[i].z. Within this set I need to find CLUSTERS. That means I am looking for particles closer to the i-th particle than hill2 (tp.D_rel < hill2). The number of such members is stored in N_conv.
I use the cycle for (int i = 0; i < p.ntp; i++), which goes through the data set. For each i-th particle I calculate the squared distances tp.D_rel[idx] relative to the other members of the set. Then I use the first thread (idx == 0) to count the cases which satisfy my condition. At the end, if there is more than one positive case (N_conv > 1), I need to write out all particles forming a possible cluster together (triplets, ...).
My code works well only in cases where i < blockDim.x. Why? Is there a general way to find clusters in a set of data, but write out only triplets and more?
Note: I know that some cases will be found twice.
__global__ void check_conv_system(double t, struct s_tp tp, struct s_mp mp, struct s_param p, double *time_step)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;
    double hill2 = 1.0e+6;
    __shared__ double D[200];
    __shared__ int ID1[200];
    __shared__ int ID2[200];
    if (idx >= p.ntp) return;
    int N_conv;
    for (int i = 0; i < p.ntp; i++)
    {
        tp.D_rel[idx] = (double)((tp.rh[i].x - tp.rh[idx].x)*(tp.rh[i].x - tp.rh[idx].x) +
                                 (tp.rh[i].y - tp.rh[idx].y)*(tp.rh[i].y - tp.rh[idx].y) +
                                 (tp.rh[i].z - tp.rh[idx].z)*(tp.rh[i].z - tp.rh[idx].z));
        __syncthreads();
        N_conv = 0;
        if (idx == 0)
        {
            for (int n = 0; n < p.ntp; n++) {
                if ((tp.D_rel[n] < hill2) && (i != n)) {
                    N_conv = N_conv + 1;
                    D[N_conv] = tp.D_rel[n];
                    ID1[N_conv] = i;
                    ID2[N_conv] = n;
                }
            }
            if (N_conv > 0) {
                for (int k = 1; k < N_conv; k++) {
                    printf("%lf %lf %d %d \n", t/365.2422, D[k], ID1[k], ID2[k]);
                }
            }
        } //end idx == 0
    } //end for cycle for i
}
As RobertCrovella mentioned, without an MCVE it is hard to tell.
However, the tp.D_rel array is written at index idx and then read back, after a __syncthreads(), over the full range of indices n. Note that __syncthreads() only performs synchronization within a block, not across the whole device. As a result, some threads/blocks will access data that has not been calculated yet, hence the failure.
You want to restructure your code so that values computed by one block do not depend on those computed by another.
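One common restructuring (a hedged sketch; count_and_print_cluster, grid, block and ntp are illustrative names, not code from the question) is to split the work into separate kernel launches, because launches on the same stream execute in order, which provides the grid-wide barrier that __syncthreads() cannot:

// Kernel 1: every thread writes its squared distance to particle i.
__global__ void compute_distances(struct s_tp tp, struct s_param p, int i)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint idx = bid * blockDim.x + threadIdx.x;
    if (idx >= p.ntp) return;
    double dx = tp.rh[i].x - tp.rh[idx].x;
    double dy = tp.rh[i].y - tp.rh[idx].y;
    double dz = tp.rh[i].z - tp.rh[idx].z;
    tp.D_rel[idx] = dx*dx + dy*dy + dz*dz;
}

// Host side: by the time the second kernel runs, every block of the
// first one has finished, so tp.D_rel is fully written.
for (int i = 0; i < ntp; i++) {
    compute_distances<<<grid, block>>>(tp, p, i);
    count_and_print_cluster<<<1, 1>>>(t, tp, p, i); // scans D_rel serially, as idx == 0 did
}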

Printing inside a cuda __global__ function kills the thread

I'm having a weird problem with my code. If I try to print the value of a certain variable inside a thread, nothing gets written to the screen and all the threads stop at that point. Here is the code:
#define WINSIZE 1
const int nebsize=(WINSIZE*2+1)*(WINSIZE*2+1);
__global__ void loop(double *img, int *consts, int w, int h, double epsilon){
    int ind=blockIdx.x*blockDim.x+threadIdx.x;
    if(ind<w*h && !consts[ind] && ind%w>=WINSIZE && ind%w<w-WINSIZE && ind/w>=WINSIZE && ind/w<h-WINSIZE){
        int win_inds[nebsize];
        double winI[3*(2*WINSIZE+1)*(2*WINSIZE+1)];
        double winI_re_aux[3*nebsize];
        double pre_win_var[9];
        double win_var[9];
        double win_mu[3];
        double tvals[nebsize*nebsize];
        double detwin;
        int min_i=ind%w-WINSIZE;
        int max_i=ind%w+WINSIZE;
        int min_j=ind/w-WINSIZE;
        int max_j=ind/w+WINSIZE;
        int k;
        int l;
        k=0;
        for(int i=min_i; i<=max_i; i++){
            for(int j=min_j; j<=max_j; j++){
                win_inds[k]=h*i+j;
                k++;
            }
        }
        k=0;
        for(int j=min_j; j<=max_j; j++){
            l=0;
            for(int i=min_i; i<=max_i; i++){
                winI[3*(l*(2*WINSIZE+1)+k)]=img[3*(j*w+i)];
                winI[3*(l*(2*WINSIZE+1)+k)+1]=img[3*(j*w+i)+1];
                winI[3*(l*(2*WINSIZE+1)+k)+2]=img[3*(j*w+i)+2];
                l++;
            }
            k++;
        }
        win_mu[0]=0;
        win_mu[1]=0;
        win_mu[2]=0;
        for(int i=0; i<nebsize; i++){
            win_mu[0]+=winI[3*i];
            win_mu[1]+=winI[3*i+1];
            win_mu[2]+=winI[3*i+2];
        }
        win_mu[0]=win_mu[0]/(double)nebsize;
        win_mu[1]=win_mu[1]/(double)nebsize;
        win_mu[2]=win_mu[2]/(double)nebsize;
        //all ok here
        //this works here
        if(ind==200){
            printf("%f\n", win_var[8]);
        }
        for(int i=0; i<3; i++){
            for(int j=0; j<3; j++){
                pre_win_var[3*i+j]=0;
                for(int n=0; n<nebsize; n++){
                    pre_win_var[3*i+j]+=winI[3*n+i]*winI[3*n+j];
                }
                pre_win_var[3*i+j]=pre_win_var[3*i+j]/(double)nebsize;
                pre_win_var[3*i+j]+=(i==j)*epsilon/(double)nebsize-win_mu[j]*win_mu[i];
            }
        }
        //this kills all threads
        if(ind==200){
            printf("%f\n", win_var[8]);
        }
        detwin=pre_win_var[0]*pre_win_var[4]*pre_win_var[8]+pre_win_var[2]*pre_win_var[3]*pre_win_var[7]+pre_win_var[1]*pre_win_var[5]*pre_win_var[6];
        detwin-=pre_win_var[6]*pre_win_var[4]*pre_win_var[2]+pre_win_var[3]*pre_win_var[1]*pre_win_var[8]+pre_win_var[7]*pre_win_var[5]*pre_win_var[0];
        win_var[0]=(pre_win_var[4]*pre_win_var[8]-pre_win_var[5]*pre_win_var[7])/detwin;
        win_var[3]=-(pre_win_var[3]*pre_win_var[8]-pre_win_var[5]*pre_win_var[6])/detwin;
        win_var[6]=(pre_win_var[3]*pre_win_var[7]-pre_win_var[4]*pre_win_var[6])/detwin;
        win_var[1]=-(pre_win_var[1]*pre_win_var[8]-pre_win_var[2]*pre_win_var[7])/detwin;
        win_var[4]=(pre_win_var[0]*pre_win_var[8]-pre_win_var[2]*pre_win_var[6])/detwin;
        win_var[7]=-(pre_win_var[0]*pre_win_var[7]-pre_win_var[1]*pre_win_var[6])/detwin;
        win_var[2]=(pre_win_var[1]*pre_win_var[5]-pre_win_var[2]*pre_win_var[4])/detwin;
        win_var[5]=-(pre_win_var[0]*pre_win_var[5]-pre_win_var[2]*pre_win_var[3])/detwin;
        win_var[8]=(pre_win_var[0]*pre_win_var[4]-pre_win_var[1]*pre_win_var[3])/detwin;
        //this line gets executed in all threads if I printf nothing
        consts[ind]=666;
    }
}
Printing the values of win_var or pre_win_var is possible only before the values are calculated; if I try to print them after that, it seems to kill all the threads. If I print nothing, the line consts[ind]=666 gets executed in all threads; I know this because I can copy consts back to host memory and print it. So, does anyone have any idea what's wrong?
The problem appears to be one of resource exhaustion. You are getting cudaErrorLaunchOutOfResources at launch with printf enabled, because of the larger register footprint of the kernel once the ABI call is included.
You didn't provide any details about your launch parameters, but reducing the total threads per block to a smaller multiple of 32 should cure the problem.
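To see the failure instead of silently getting nothing, the launch should be followed by standard runtime error checking (a hedged sketch; d_img, d_consts, grid and block are illustrative names):

loop<<<grid, block>>>(d_img, d_consts, w, h, epsilon);
cudaError_t err = cudaGetLastError();  // reports launch-time failures
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();         // reports failures during execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));

With the printf-enabled build, the first check would report the out-of-resources condition rather than letting the kernel silently not run.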

copy global memory by CUDA threads

I need to copy one array in global memory to another array in global memory by CUDA threads (not from the host).
My code is as follows:
__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int start, end;
    start = some_func(idx);
    end = another_func(idx);
    unsigned int i;
    for (i = start; i < end; i++) {
        g_data2[i] = g_data1[idx];
    }
}
It is very inefficient because for some idx the [start, end) region is very large, which makes that thread issue too many copy commands. Is there any way to implement it efficiently?
Thank you,
Zheng
The way you wrote it, I am guessing each thread is trying to write its whole 'start'-to-'end' chunk by itself, which is really inefficient.
You need to do something like this:
__shared__ unsigned sm_start[BLOCK_SIZE];
__shared__ unsigned sm_end[BLOCK_SIZE];
sm_start[threadIdx.x] = start;
sm_end[threadIdx.x] = end;
__syncthreads();
// all threads in the block cooperate on each thread's range, so
// consecutive threads write consecutive addresses (coalesced)
for (int n = 0; n < blockDim.x; n++) {
    int value = g_data1[blockIdx.x * blockDim.x + n]; // the value owned by thread n
    unsigned base = sm_start[n];
    unsigned lim = sm_end[n] - sm_start[n];
    for (unsigned i = threadIdx.x; i < lim; i += blockDim.x) {
        g_data2[base + i] = value;
    }
}
try using this:
CUresult cuMemcpyDtoD(
    CUdeviceptr dst,
    CUdeviceptr src,
    unsigned int bytes
)
UPDATE:
You're right: http://forums.nvidia.com/index.php?showtopic=88745
There is no efficient way to do this properly, because the design of CUDA wants you to use only a small amount of data in a kernel.
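Note that cuMemcpyDtoD is a host-side driver API call, not something a kernel can invoke. If a host-initiated copy is acceptable, the runtime API equivalent is cudaMemcpy with cudaMemcpyDeviceToDevice (a minimal sketch; d_src, d_dst and n are illustrative names):

int *d_src, *d_dst;
cudaMalloc(&d_src, n * sizeof(int));
cudaMalloc(&d_dst, n * sizeof(int));
// ... fill d_src ...
cudaMemcpy(d_dst, d_src, n * sizeof(int), cudaMemcpyDeviceToDevice);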

CUDA memory troubles

I have a CUDA kernel which I'm compiling to a cubin file without any special flags:
nvcc text.cu -cubin
It compiles, though with this message:
Advisory: Cannot tell what pointer points to, assuming global memory space
and a reference to a line in some temporary cpp file. I can get this to work by commenting out some seemingly arbitrary code which makes no sense to me.
The kernel is as follows:
__global__ void string_search(char** texts, int* lengths, char* symbol, int* matches, int symbolLength)
{
    int localMatches = 0;
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadId = threadIdx.x + threadIdx.y * blockDim.x;
    int blockThreads = blockDim.x * blockDim.y;
    __shared__ int localMatchCounts[32];
    bool breaking = false;
    for(int i = 0; i < (lengths[blockId] - (symbolLength - 1)); i += blockThreads)
    {
        if(texts[blockId][i] == symbol[0])
        {
            for(int j = 1; j < symbolLength; j++)
            {
                if(texts[blockId][i + j] != symbol[j])
                {
                    breaking = true;
                    break;
                }
            }
            if (breaking) continue;
            localMatches++;
        }
    }
    localMatchCounts[threadId] = localMatches;
    __syncthreads();
    if(threadId == 0)
    {
        int sum = 0;
        for(int i = 0; i < 32; i++)
        {
            sum += localMatchCounts[i];
        }
        matches[blockId] = sum;
    }
}
If I replace the line
localMatchCounts[threadId] = localMatches;
after the first for loop with this line
localMatchCounts[threadId] = 5;
it compiles with no notices. This can also be achieved by commenting out seemingly random parts of the loop above the line. I have also tried replacing the local memory array with a normal array to no effect. Can anyone tell me what the problem is?
The system is Vista 64bit, for what its worth.
Edit: I fixed the code so it actually works, though it still produces the compiler notice. It does not seem as though the warning is a problem, at least with regards to correctness (it might affect performance).
Arrays of pointers like char** are problematic in kernels, since the kernel has no access to the host's memory.
It is better to allocate a single contiguous buffer and to divide it in a manner that enables parallel access.
In this case I'd define one 1D array which contains all the strings positioned one after another, and another 1D array, sized 2*numberOfStrings, which contains the offset of each string within the first array and its length:
For example - preparation for kernel:
// buffer holds st[0], st[1], st[2], ... concatenated back to back
// (totalLength = sum of all string lengths)
char* buffer = new char[totalLength];
int* metadata = new int[numberOfStrings * 2];
int lastpos = 0;
for (int cnt = 0; cnt < 2 * numberOfStrings; cnt += 2)
{
    metadata[cnt] = lastpos;                  // offset of string cnt/2
    metadata[cnt + 1] = length(st[cnt / 2]);  // its length
    memcpy(buffer + lastpos, st[cnt / 2], length(st[cnt / 2]));
    lastpos += length(st[cnt / 2]);
}
In kernel:
int currentIndex = threadId + blockId * threadsPerBlock; // one string per index
char* currentString = buffer + metadata[2 * currentIndex];
int currentStringLength = metadata[2 * currentIndex + 1];
The problem seems to be associated with the char** parameter. Turning it into a char* removed the warning, so I suspect that CUDA might have problems with this form of data. Perhaps CUDA prefers that one use its dedicated 2D-array facilities in this case.
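Putting the earlier answer's layout into this kernel, a hedged sketch of how string_search might look with the flattened buffer (the matching logic is unchanged and elided here):

__global__ void string_search(char* buffer, int* metadata, char* symbol, int* matches, int symbolLength)
{
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    char* text = buffer + metadata[2 * blockId]; // this block's string
    int length = metadata[2 * blockId + 1];      // its length
    // ... same matching logic as before, reading text[i] and length
    //     instead of texts[blockId][i] and lengths[blockId] ...
}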