Passing a row of pointers to __global__ function - cuda

I am trying to pass a row of a two-dimensional array of pointers to a CUDA kernel. See my code below; the array of pointers is noLocal. Because I am doing an atomicAdd I am expecting a nonzero number at the line printf("Holaa %d\n", local[0][0]);, but the value I get is 0. Could you help me pass an array in CUDA by reference, please?
__global__ void myadd(int *data[8])
{
    unsigned int x = blockIdx.x;
    unsigned int y = threadIdx.x;
    unsigned int z = threadIdx.y;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    //printf("Ola sou a td %d\n", tid);
    for (int i; i<8; i++)
        atomicAdd(&(*data)[i],10);
}
int main(void)
{
    int local[20][8] = { 0 };
    int *noLocal[20][8];
    for (int d = 0; d < 20; d++) {
        for (int dd = 0; dd < 8; dd++) {
            cudaMalloc(&(noLocal[d][dd]), sizeof(int));
            cudaMemcpy(noLocal[d][dd], &(local[d][dd]), sizeof(int), cudaMemcpyHostToDevice);
        }
        myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
    }
    for (int d = 0; d < 20; d++)
        for (int dd = 0; dd < 8; dd++)
            cudaMemcpy(&(local[d][dd]), noLocal[d][dd], sizeof(int), cudaMemcpyDeviceToHost);
    printf("Holaa %d\n", local[0][0]);
    for (int d = 0; d < 20; d++)
        for (int dd = 0; dd < 8; dd++)
            cudaFree(noLocal[d][dd]);
}

I believe you received good advice in the other answer. I don't recommend this coding pattern. For general reference material on creating 2D arrays in CUDA, see this answer.
When I compile the code you have shown, I get warnings of the form "i is used before its value is set". This kind of warning should not be ignored. It arises from this statement which doesn't make sense to me:
for (int i; i<8; i++)
that should be:
for (int i = 0; i<8; i++)
It's not clear you understand the C++ concepts of pointers and arrays. This:
int local[20][8] = { 0 };
represents an array of 20x8 = 160 integers. If you want to imagine it as an array of pointers, you could pretend that it includes 20 pointers of the form local[0], local[1], ..., local[19]. Each of those "pointers" points to an array of 8 integers. But there is no sensible interpretation under which it contains 160 pointers. Furthermore, the usage pattern you indicate in your kernel does not suggest that you expect 160 pointers to individual integers. But that is exactly what you are creating here:
int *noLocal[20][8];                 // this is declaring a 2D array of 160 *pointers*
for (int d = 0; d < 20; d++) {       // the combination of these loops means
    for (int dd = 0; dd < 8; dd++) { // you will create 160 *pointers*
        cudaMalloc(&(noLocal[d][dd]), sizeof(int));
To mimic your host array (local) you want to create 20 pointers each of which is pointing to an allocation of 8 int quantities. The usage in your kernel code here:
&(*data)[i]
means that you intend to take a single pointer, and offset it by i values ranging from 0 to 7. It does not mean that you expect to receive 8 individual pointers. Again, this is C++ behavior, not unique or specific to CUDA.
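To see the distinction in plain C++, independent of CUDA (a small illustrative snippet, not from the original code):
#include <cassert>
int main() {
    int a[8] = {0};
    int b[8] = {0};
    int *p[2] = {a, b};  // an array of pointers
    int **data = p;      // what a parameter declared as "int *data[8]" decays to
    assert(&(*data)[3] == a + 3);  // &(*data)[i] offsets into the allocation behind data[0]...
    assert(&(*data)[3] != b + 3);  // ...it never reaches the allocation behind data[1]
    return 0;
}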
In order to make your code "sensible" there were a variety of changes I had to make. Here's a "fixed" version:
$ cat t1858.cu
#include <cstdio>
__global__ void myadd(int data[8])
{
    // unsigned int x = blockIdx.x;
    // unsigned int y = threadIdx.x;
    // unsigned int z = threadIdx.y;
    // int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // printf("Ola sou a td %d\n", tid);
    for (int i = 0; i < 8; i++)
        atomicAdd(data+i, 10);
}
int main(void)
{
    int local[20][8] = { 0 };
    int *noLocal[20];
    for (int d = 0; d < 20; d++) {
        cudaMalloc(&(noLocal[d]), 8*sizeof(int));
        cudaMemcpy(noLocal[d], local[d], 8*sizeof(int), cudaMemcpyHostToDevice);
        myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
    }
    for (int d = 0; d < 20; d++)
        cudaMemcpy(local[d], noLocal[d], 8*sizeof(int), cudaMemcpyDeviceToHost);
    printf("Holaa %d\n", local[0][0]);
    for (int d = 0; d < 20; d++)
        cudaFree(noLocal[d]);
}
$ nvcc -o t1858 t1858.cu
$ cuda-memcheck ./t1858
========= CUDA-MEMCHECK
Holaa 40000
========= ERROR SUMMARY: 0 errors
$
The number 40000 is correct. It comes about because every thread performs an atomic add of 10 to each element, and each kernel launch consists of 20 blocks of 10x20 = 200 threads, all doing that to the same 8 values: 10 x 20 x 200 = 40000.

You should simply not be doing anything like that. You are wasting time and memory with these excessive allocations, and your kernel would be pretty slow as well. I am 100% certain this is not what you were asked to do, nor what you wanted to do.
Instead, you should:
Allocate a single large buffer on the device to fit the data you need.
Avoid using pointers on the device side, except to that buffer, unless absolutely necessary.
If you somehow have to use a 2D pointer array - add relevant offsets to your buffer's base pointer to get different pointers into it (see the sketch below).
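A minimal sketch of that single-buffer approach, reusing the myadd kernel from the fixed version above (d_buf is an illustrative name, not from the original code):
// one cudaMalloc covers all 20x8 ints; a "row" is just an offset
// from the base pointer, so no device-side pointer array is needed
int local[20][8] = { 0 };
int *d_buf;
cudaMalloc(&d_buf, 20 * 8 * sizeof(int));
cudaMemcpy(d_buf, local, 20 * 8 * sizeof(int), cudaMemcpyHostToDevice);
for (int d = 0; d < 20; d++)
    myadd<<<20, dim3(10, 20)>>>(d_buf + d * 8);  // row d starts at base + d*8
cudaMemcpy(local, d_buf, 20 * 8 * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_buf);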


Cuda printf() overlapping when using multiple devices

I have a printf in my __global__ code. It works as intended most of the time. However, when using a multi-GPU system (it typically happens when run on a 4-8 GPU system), once in a while the prints will merge. By "once in a while" I mean about 100-500 lines out of 167000.
I was wondering how this situation can be remedied without adding too much overhead from transferring the data back to the host (if possible). I was thinking of trying a mutex lock for printing, but I don't think that sort of thing exists for use in a kernel. Any other solutions I could try?
Note: The actual kernel is a long-running kernel, usually taking around 20-50 minutes to complete depending on the GPU.
Note2: I barely know what I'm doing with C/C++.
Example of merged Output
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869fxLuZJu3ZkX
qVOuKlQ0ZcMrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
What it should look like
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869f
MrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a
qVOuKlQ0Zc,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
xLuZJu3ZkX,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
My Example Code:
#define BLOCKS 384
#define THREADS 64
typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;
pthread_mutex_t solutionLock;
__global__ void kernel(unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine */
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine */
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Print to console - randomStr,Hash */
    printf("%s,%s\n", random, data);
}
void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(rngSeed);
        cudaDeviceSynchronize();
    }
    cudaDeviceReset();
    return NULL;
}
int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    pthread_mutex_lock(&solutionLock);
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
I spent 4 days trying different things to no avail. I really don't understand memory management in C/C++ well enough to get past the endless segmentation fault errors.
What I ended up doing was using Unified Memory, as it seemed the easiest way to handle the memory for both device and host, and it doesn't seem to add too much overhead to the whole process. Then each CPU thread (one per GPU) can write to its own file. From a couple of nvprof runs it seemed that, after the initial cudaMallocManaged setup, the remaining overhead was measured in microseconds. Since each loop takes 20 minutes, this is barely noticeable.
I created two __device__ functions to copy the data over to the host-accessible arrays, because I wanted to utilize the #pragma unroll feature. Not really sure if that helps or what it even does, but I decided to do things this way.
If anyone has further suggestions on ways to improve I am open to trying more things out.
Here is my new example code:
#define BLOCKS 384
#define THREADS 64
typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;
__device__ void mycpydigest(BYTE * __restrict__ dst, const BYTE * __restrict__ src) {
    #pragma unroll 64
    for (BYTE i = 0; i < 64; i++) {
        dst[i] = src[i];
    }
    dst[64] = '\0';
}
__device__ void mycpyrandom(BYTE * __restrict__ dst, const BYTE * __restrict__ src) {
    #pragma unroll 10
    for (BYTE i = 0; i < 10; i++) {
        dst[i] = src[i];
    }
    dst[10] = '\0';
}
__global__ void kernel(BYTE **d_random, BYTE **d_hashes, unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine */
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine */
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Send to host - randomStr & Hash */
    mycpydigest(d_hashes[idx], data);
    mycpyrandom(d_random[idx], random);
}
void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    int threadBlocks = BLOCKS * THREADS;
    BYTE **randoms;
    BYTE **hashes;
    cudaMallocManaged(&randoms, sizeof(BYTE *) * threadBlocks, cudaMemAttachGlobal);
    cudaMallocManaged(&hashes, sizeof(BYTE *) * threadBlocks, cudaMemAttachGlobal);
    for (int i = 0; i < threadBlocks; i++) {
        cudaMallocManaged(&randoms[i], sizeof(BYTE) * RANDOM_LEN, cudaMemAttachGlobal);
        cudaMallocManaged(&hashes[i], sizeof(BYTE) * DIGEST_LEN, cudaMemAttachGlobal);
    }
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(randoms, hashes, rngSeed);
        cudaDeviceSynchronize();
        print2File(randoms, hashes, threadBlocks, hi->device);
    }
    cudaFree(hashes);
    cudaFree(randoms);
    cudaDeviceReset();
    return NULL;
}
int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
I want to thank @paleonix for the help in the comments. I was working on this issue for a week before I posted, and your comments helped guide me down a different path.

Difficulty using atomicMin to find minimum value in a matrix

I'm having trouble using atomicMin to find the minimum value in a matrix in cuda. I'm sure it has something to do with the parameters I'm passing into the atomicMin function. The findMin function is the function to focus on, the popmatrix function is just to populate the matrix.
#include <stdio.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define SIZE 4
__global__ void popMatrix(unsigned *matrix) {
    unsigned id, num;
    curandState_t state;
    id = threadIdx.x * blockDim.x + threadIdx.y;
    // Populate matrix with random numbers
    curand_init(id, 0, 0, &state);
    num = curand(&state)%100;
    matrix[id] = num;
}
__global__ void findMin(unsigned *matrix, unsigned *temp) {
    unsigned id;
    id = threadIdx.x * blockDim.y + threadIdx.y;
    atomicMin(temp, matrix[id]);
    printf("old: %d, new: %d", matrix[id], temp);
}
int main() {
    dim3 block(SIZE, SIZE, 1);
    unsigned *arr, *harr, *temp;
    cudaMalloc(&arr, SIZE*SIZE*sizeof(unsigned));
    popMatrix<<<1,block>>>(arr);
    // Print matrix of random numbers to see if min number was picked right
    cudaMemcpy(harr, arr, SIZE*SIZE*sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned i = 0; i < SIZE; i++) {
        for (unsigned j = 0; j < SIZE; j++) {
            printf("%d ", harr[i*SIZE+j]);
        }
        printf("\n");
    }
    temp = harr[0];
    findMin<<<1, block>>>(harr);
    return 0;
}
harr is not allocated. You should allocate it on the host side, for example with malloc, before calling cudaMemcpy. As a result, the printed values you see are garbage. It is quite surprising that the program did not segfault on your machine.
Moreover, when you call the kernel findMin at the end, its argument harr (which, judging from its name, is supposed to be on the host side) should be a device pointer for the atomic operation to be performed correctly. As written, the kernel call is invalid.
As pointed out by @RobertCrovella, a cudaDeviceSynchronize() call is missing at the end. Moreover, you need to free your memory using cudaFree.
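Putting those fixes together, the host side could look something like this (a sketch using the kernels shown above; h_min is an added variable):
unsigned *arr, *temp, h_min;
unsigned *harr = (unsigned *)malloc(SIZE * SIZE * sizeof(unsigned)); // host allocation
cudaMalloc(&arr, SIZE * SIZE * sizeof(unsigned));
cudaMalloc(&temp, sizeof(unsigned));                                 // device allocation
popMatrix<<<1, block>>>(arr);
cudaMemcpy(harr, arr, SIZE * SIZE * sizeof(unsigned), cudaMemcpyDeviceToHost);
cudaMemcpy(temp, &harr[0], sizeof(unsigned), cudaMemcpyHostToDevice); // seed with the first element
findMin<<<1, block>>>(arr, temp);   // both arguments are device pointers
cudaMemcpy(&h_min, temp, sizeof(unsigned), cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();            // also flushes the in-kernel printf output
printf("minimum: %u\n", h_min);
cudaFree(arr);
cudaFree(temp);
free(harr);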

Having an issue detecting problem in my code | CUDA c

I'm having a hard time understanding where the bug in my code is.
The purpose of the project is to multiply two matrices and compare the time between the sequential and parallel versions.
When I print the matrices, I see that the device matrix is basically empty.
Also, I treated the matrices as arrays of size n*n.
Thanks!
//This program computes the multiplication of two matrices on the GPU using CUDA
#include <stdio.h>
#include <cassert>
__global__ void matrixMul(int *m, int *n, int *p, int size)
{
    //Calculate Row and Column
    int row=threadIdx.y*blockDim.y+threadIdx.y;
    int column=threadIdx.x*blockDim.x+threadIdx.x;
    int p_sum=0;
    for (int i = 0; i < size; i++)
    {
        p_sum += m[row*size + i] * n[i*size +column];
    }
    p[row*size + column] = p_sum;
}
void matrixMul_seq(int *m, int *n, int *p, int size){
    for(int i = 0; i < size; i++){
        for(int j = 0; j < size; j++){
            for(int k = 0; k < size; k++){
                p[i*size +j] += m[i*size +k] +n[k*size +j];
            }
        }
    }
}
//Initialize matrices
void init_matricies(int *mat, int n){
    for(int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
        {
            mat[i*n+j]=rand()%1024;
        }
    }
}
int main(int argc,char **argv)
{
//Set our problem Size(Default = 2^10 == 1024)
int n = 1<<10;
printf("Square Matrix of size:%d\n",n);
//Size in Bytes
size_t bytes=n*n*sizeof(bytes);
//Host matricies
int *h_m;
int *h_p;
int *h_n;
int *h_p_seq;
//Host matricies
int *d_m;
int *d_p;
int *d_n;
//Memory allocation for Host Matricies
h_m=(int*)malloc(bytes);
h_n=(int*)malloc(bytes);
h_p=(int*)malloc(bytes);
h_p_seq=(int*)malloc(bytes);
init_matricies(h_m,n);
init_matricies(h_n,n);
//Allocate memory on device side
cudaMalloc(&d_n, bytes);
cudaMalloc(&d_m, bytes);
cudaMalloc(&d_p, bytes);
//Copy data to Device
cudaMemcpy(d_m,h_m, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_n,h_n, bytes, cudaMemcpyHostToDevice);
int threads_per_block =16;
dim3 block_size(threads_per_block,threads_per_block);
dim3 grid_size( n / block_size.x , n / block_size.y);
printf("Grid size X:%d, Grid size y:%d\n",grid_size.x,grid_size.y);
printf("THE RESULT OF THE SIZES: 2^6 * 2^4 * 2^6 * 2^4 \n");
matrixMul <<<grid_size,block_size>>>(d_m,d_n,d_p,n);
matrixMul_seq(h_m,h_n,h_p_seq,n);
cudaMemcpy(h_p,d_p, bytes, cudaMemcpyDeviceToHost);
for(int i = 0; i < n; i++){
for(int j = 0; j < n; j++){
//printf("Grid size X:%d, Grid size y:%d\n",h_p[ n * i + j],h_p_seq[ n * i + j]);
assert(h_p[ n * i + j]==h_p_seq[ n * i + j]);
}
}
free(h_m);
free(h_p);
free(h_n);
free(h_p_seq);
cudaFree(d_m);
cudaFree(d_n);
cudaFree(d_p);
return 0;
}
You have a variety of problems in your code:
You are calculating kernel index variables incorrectly. This is incorrect:
int row=threadIdx.y*blockDim.y+threadIdx.y;
int column=threadIdx.x*blockDim.x+threadIdx.x;
it should be:
int row=blockIdx.y*blockDim.y+threadIdx.y;
int column=blockIdx.x*blockDim.x+threadIdx.x;
The matrix operations in your calculation functions don't match each other. Kernel:
p_sum += m[row*size + i] * n[i*size +column];
                         ^
                         multiplication
host code:
p[i*size +j] += m[i*size +k] +n[k*size +j];
                             ^
                             addition
We also observe, from the above, that the host code is doing a summation into the output variable (+=), whereas the kernel is doing an assignment to the output variable (=):
p[row*size + column] = p_sum;
This has implications for the next issue.
malloc doesn't initialize data. Since this operation is creating the output array that will be used by the host code, which is doing a summation to it, we must initialize this allocation to zero:
h_p_seq=(int*)malloc(bytes);
memset(h_p_seq, 0, bytes); // must add this line to initialize to zero
The calculation of the size of your arrays in bytes is too large. You have defined your arrays to be of type int. But your size calculation is like this:
size_t bytes=n*n*sizeof(bytes);
An int is a 4-byte quantity, whereas a size_t variable like bytes is an 8-byte quantity. This doesn't break anything, but it allocates twice as much memory as you actually need. I would suggest changing it to:
size_t bytes=n*n*sizeof(int);
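For reference, here are just the changed lines collected in one place (a sketch of the fixes, not a complete program):
// kernel: the block index, not the thread index, selects the tile
int row    = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
// host reference: multiply, to match the kernel
p[i*size + j] += m[i*size + k] * n[k*size + j];
// zero the host output array before accumulating into it
memset(h_p_seq, 0, bytes);
// size in bytes based on the element type
size_t bytes = n * n * sizeof(int);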
With the above items addressed, your code runs correctly for me.

Segmentation Fault with 3D array

I am trying to work with 3D arrays in CUDA (200x200x100).
The moment I change my z dimension (model_num) from 4 to 5, I get a segmentation fault. Why, and how can I fix it?
const int nrcells = 200;
const int nphicells = 200;
const int model_num = 5; //So far, 4 is the maximum model_num that works. At 5 and after, there is a segmentation fault
__global__ void kernel(float* mgridb)
{
    const unsigned long long int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    if (tx >= 0 && tx < nphicells && ty >= 0 && ty < nrcells && tz >= 0 && tz < model_num) {
        //Do stuff with mgridb[i]
    }
}
int main (void)
{
    unsigned long long int size_matrices = nphicells*nrcells*model_num;
    unsigned long long int mem_size_matrices = sizeof(float) * size_matrices;
    float *h_mgridb = (float *)malloc(mem_size_matrices);
    float mgridb[nphicells][nrcells][model_num];
    for (int k = 0; k < model_num; k++){
        for (int j = 0; j < nrcells; j++){
            for (int i = 0; i < nphicells; i++){
                mgridb[i][j][k] = 0;
            }
        }
    }
    float *d_mgridb;
    cudaMalloc( (void**)&d_mgridb, mem_size_matrices );
    cudaMemcpy(d_mgridb, h_mgridb, mem_size_matrices, cudaMemcpyHostToDevice);
    int threads = nphicells;
    uint3 blocks = make_uint3(nrcells,model_num,1);
    kernel<<<blocks,threads>>>(d_mgridb);
    cudaMemcpy( h_mgridb, d_mgridb, mem_size_matrices, cudaMemcpyDeviceToHost);
    cudaFree(d_mgridb);
    return 0;
}
This is getting stored on the stack:
float mgridb[nphicells][nrcells][model_num];
Your stack space is limited. When you exceed the amount you can store on the stack, you get a seg fault, either at the point of allocation or as soon as you try to access it.
Use malloc instead. That allocates heap storage, which has much higher limits.
None of the above has anything to do with CUDA. Furthermore, it's not unique or specific to "3D" arrays; any large stack-based allocation (e.g. a 1D array) is going to have the same trouble.
You may also have to adjust how you access the array, but it's not difficult to handle a flattened array using pointer indexing, for example:
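(A sketch with your dimensions; the index arithmetic flattens the conceptual [i][j][k] layout.)
// heap allocation instead of the large stack array
float *mgridb = (float *)malloc(sizeof(float) * nphicells * nrcells * model_num);
// element [i][j][k] of the conceptual 3D array:
mgridb[((size_t)i * nrcells + j) * model_num + k] = 0.0f;
// ... use it, copy it to the device as one contiguous block ...
free(mgridb);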
Your code actually looks strange, because you are creating an appropriately sized array h_mgridb using malloc and then copying that array to the device (into d_mgridb). It's not clear what purpose mgridb serves in your code; h_mgridb and mgridb are not the same.

Read an Array With Threads in CUDA

I was wondering if it is possible, and what the best way would be, to read cells from an array with threads in CUDA. To simplify what I mean, here is an example:
I have an array {1,2,3,4,5,6,...} and I would like each thread to read n cells of my array, depending mainly on its size.
I have been trying a few things, but it does not seem to work, so if anyone could point out a (right) way to do it, that would be great.
Thank you.
Generally you want contiguous threads to read contiguous array indices. Doing so results in "coalesced" memory transactions. The simple way to think of it is that if 32 threads are running physically in parallel, and they all do a load, then if all 32 loads fall into the same cache line, then a single memory access can be performed to fill the cache line, rather than 32 separate ones.
So what you want to do is have each thread access n cells that are strided by the number of threads, like this (assuming input data is in the float array data).
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < numElements; i += stride) {
    float element = data[i];
    process(element);
}
If your algorithm requires that each thread reads n contiguous data elements, then you are going to incur non-coalesced loads, which will be much more expensive. In this case, I would consider re-designing the algorithm so this type of access is not required.
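For reference, the stride pattern above wrapped in a complete kernel (the name scaleArray and the scaling operation are illustrative stand-ins for process()):
__global__ void scaleArray(float *data, int numElements, float factor)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // grid-stride loop: contiguous threads touch contiguous elements,
    // so each trip through the loop produces coalesced accesses
    for (int i = idx; i < numElements; i += stride) {
        data[i] *= factor;
    }
}
// the launch configuration is independent of the array size, e.g.:
// scaleArray<<<64, 256>>>(d_data, numElements, 2.0f);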
You said that the threads have to look at the n next numbers. So you can use:
#define N 2
#define NTHREAD 1024
#define ARRAYSIZE N*NTHREAD
// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;
    // each thread writes its own stride of N contiguous elements
    for(int i=0; i<N; i++){
        array[startId+i]=tid;
    }
}
// call the kernel by:
accessArray<<<NTHREAD/256, 256>>>(d_array);
Dump out the array and check whether the threads accessed it the way you want.
Full code:
#include <cuda.h>
#include <stdio.h>
#define N 2
#define NTHREAD 1024
#define ARRAYSIZE N*NTHREAD
// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;
    // each thread writes its own stride of N contiguous elements
    for(int i=0; i<N; i++){
        array[startId+i]=tid;
    }
}
int main()
{
    int h_array[ARRAYSIZE];
    int *d_array;
    size_t memsize = ARRAYSIZE * sizeof(int); // elements are int, not float
    for (int i = 0; i < ARRAYSIZE; i++) {
        h_array[i] = 0;
    }
    cudaMalloc(&d_array, memsize);
    cudaMemcpy(d_array, h_array, memsize, cudaMemcpyHostToDevice);
    accessArray<<<NTHREAD/256, 256>>>(d_array);
    cudaMemcpy(h_array, d_array, memsize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < ARRAYSIZE; i++)
        printf("A[%d] => %d\n", i, h_array[i]);
    cudaFree(d_array);
    return 0;
}