2D threads in CUDA - cuda

I'm trying to use 2D threads in CUDA. threadIdx.x and blockIdx.x work fine, but threadIdx.y and blockIdx.y don't work: the .y values are always 0.
Here is my code:
#define N 16

__global__ void add(int* a) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    a[i] = j;
}

int main(int argc, char **argv)
{
    int a[N];
    const int size = N*sizeof(int);
    int *da;

    cudaMalloc((void**)&da, size);
    add<<<1, N>>>(da);
    cudaMemcpy(a, da, size, cudaMemcpyDeviceToHost);

    printf("Thread indices:\n");
    for(int i = 0; i < N; i++)
    {
        printf("%d ", a[i]);
    }

    cudaFree(da);
    return 0;
}
The result for a[i] = j; or a[j] = j;
Thread indices:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
and for a[i] = i;
Thread indices:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
I tried using
#define M 4
#define N 4
...
int i = (blockDim.x * blockIdx.x) + threadIdx.x;
int j = (blockDim.y * blockIdx.y) + threadIdx.y;
...
add<<<M, N>>>(da);
...
and the result is the same: the .x values are fine but the .y values are all 0. Can anyone help me fix this? Thanks

You are confusing blocks and threads with dimensions.
add<<<M,N>>> is interpreted as add<<<dim3(M,1,1),dim3(N,1,1)>>>, where M is the number of blocks and N is the number of threads per block.
If you want an MxN grid of blocks with MxN threads per block, call add<<<dim3(M,N),dim3(M,N)>>>.
I would recommend the Udacity CUDA course for beginners; it is very beginner friendly.
I want M blocks with N threads per block.
Well then add<<<M,N>>> is correct, but it is one-dimensional; there is no y to it. If you want to locate the thread, use this:
int index = threadIdx.x + blockDim.x * blockIdx.x;
There is no y in it; the whole launch is 1D. Each block can only hold a limited number of threads (the hardware limit is typically 1024), which is why the work is split across threads and blocks. There are a lot of nuances to it. I would recommend the Udacity course; it helped me a lot.
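To get nonzero threadIdx.y values at all, the launch itself has to be two-dimensional, as the first answer suggests. A minimal sketch (not the OP's exact code; M and N are reused from the question, and the 2D thread index is flattened into the 1D array):

#define M 4
#define N 4

__global__ void add(int* a) {
    int i = threadIdx.x;              // 0..M-1 within the block
    int j = threadIdx.y;              // 0..N-1 within the block
    a[j * M + i] = j;                 // flatten the 2D thread index into the 1D array
}

// in main():
dim3 threadsPerBlock(M, N);           // an M x N arrangement of threads in a single block
add<<<1, threadsPerBlock>>>(da);      // da must hold at least M*N ints

Copying M*N ints back and printing them should then give 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3, i.e. threadIdx.y finally varies.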

Related

Is one CUDA block dimension faster than the other?

I have a simple CUDA code that assigns the values of an NxN matrix A to matrix B. In one case, I declare block sizes block(1,32) and have each thread loop over the entries in the first matrix dimension. In the second case, I declare block sizes block(32,1) and have each thread loop over entries in the second matrix dimension.
Is there some really obvious reason why, in my code below, the threads that loop over the stride-1 memory are significantly slower than those that loop over the stride-N memory? I would have thought it was the other way around (if there is any difference at all).
Am I missing something really obvious (a bug, perhaps)?
The complete code is below.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

__global__ void addmat_x(int m, int n, int* A, int *B)
{
    int idx, ix;
    int iy = threadIdx.y + blockIdx.y*blockDim.y;

    if (iy < n)
        for(ix = 0; ix < m; ix++) {
            idx = iy*m + ix;   /* iy*m is constant */
            B[idx] = A[idx];
        }
}

__global__ void addmat_y(int m, int n, int* A, int *B)
{
    int ix = threadIdx.x + blockIdx.x*blockDim.x;
    int idx, iy;

    if (ix < m)
        for(iy = 0; iy < n; iy++) {
            idx = iy*m + ix;
            B[idx] = A[idx];
        }
}

double cpuSecond()
{
    struct timeval tp;
    gettimeofday(&tp,NULL);
    return (double) tp.tv_sec + (double)tp.tv_usec*1e-6;
}

int main(int argc, char** argv)
{
    int *A, *B;
    int *dev_A, *dev_B;
    size_t m, n, nbytes;
    double etime, start;

    m = 1 << 14;
    n = 1 << 14;
    nbytes = m*n*sizeof(int);

    A = (int*) malloc(nbytes);
    B = (int*) malloc(nbytes);
    memset(A,0,nbytes);

    cudaMalloc((void**) &dev_A, nbytes);
    cudaMalloc((void**) &dev_B, nbytes);
    cudaMemcpy(dev_A, A, nbytes, cudaMemcpyHostToDevice);

#if 1
    /* One thread per row */
    dim3 block(1,32);
    dim3 grid(1,(n+block.y-1)/block.y);
    start = cpuSecond();
    addmat_x<<<grid,block>>>(m,n,dev_A, dev_B);
#else
    /* One thread per column */
    dim3 block(32,1);
    dim3 grid((m+block.x-1)/block.x,1);
    start = cpuSecond();
    addmat_y<<<grid,block>>>(m,n,dev_A, dev_B);
#endif

    cudaDeviceSynchronize();
    etime = cpuSecond() - start;
    printf("GPU Kernel %10.3g (s)\n",etime);

    cudaFree(dev_A);
    cudaFree(dev_B);
    free(A);
    free(B);
    cudaDeviceReset();
}
Let's compare the global memory indexing generated by each thread, in each case.
addmat_x:
Your block dimension is (1,32). This means 1 thread wide in x, 32 threads "long" in y. The threadIdx.x value for each thread will be 0. The threadIdx.y value for the threads in the warp will range from 0 to 31, as you move from thread to thread in the warp. With that, let's inspect your creation of idx in that kernel:
m = 1 << 14;
...
int iy = threadIdx.y + blockIdx.y*blockDim.y;
idx = iy*m + ix;
let's choose the first block, whose blockIdx.y is 0. Then:
idx = threadIdx.y*(1<<14) + ix;
For the first loop iteration, ix is 0. The idx values generated by each thread will be:
threadIdx.y: |  idx:
      0      |  0
      1      |  (1<<14)
      2      |  2*(1<<14)
     ...     |  ...
     31      |  31*(1<<14)
For a given loop iteration, the distance between the load or store index of one thread and the next will be 1<<14, i.e. not adjacent: scattered.
addmat_y:
Your block dimension is (32,1). This means 32 threads wide in x, 1 thread "long" in y. The threadIdx.y value for each thread will be 0. The threadIdx.x value for the threads in the warp will range from 0 to 31, as you move from thread to thread. Now let's inspect your creation of idx in that kernel:
m = 1 << 14;
...
int ix = threadIdx.x + blockIdx.x*blockDim.x;
idx = iy*m + ix;
Let's choose the first block, whose blockIdx.x is 0. Then:
idx = iy*m + threadIdx.x;
For the first loop iteration, iy is 0, so we simply have:
idx = threadIdx.x;
This generates the following index pattern across the warp:
threadIdx.x: |  idx:
      0      |  0
      1      |  1
      2      |  2
     ...     |  ...
     31      |  31
These indices are adjacent; it is not a scattered load or store, the addresses will coalesce nicely, and this represents "efficient" use of global memory. It will perform faster than the first case.
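As a sketch of how to keep the coalesced pattern with a fully 2D launch (this variant is not part of the original post; it reuses the names from the question's code), put threadIdx.x on the contiguous dimension and drop the per-thread loop:

__global__ void addmat_2d(int m, int n, const int* A, int* B)
{
    int ix = threadIdx.x + blockIdx.x*blockDim.x;   /* fast-varying, contiguous dimension */
    int iy = threadIdx.y + blockIdx.y*blockDim.y;

    if (ix < m && iy < n) {
        size_t idx = (size_t)iy*m + ix;             /* adjacent threads touch adjacent addresses */
        B[idx] = A[idx];
    }
}

/* in main(), e.g.: */
dim3 block2(32, 8);
dim3 grid2((m + block2.x - 1)/block2.x, (n + block2.y - 1)/block2.y);
addmat_2d<<<grid2, block2>>>(m, n, dev_A, dev_B);

Each warp then reads and writes 32 adjacent ints, so the accesses coalesce no matter how the grid covers the y dimension.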

Issues accessing an array based on an offset in CUDA

This question more than likely has a simple solution.
Each of the threads I spawn is to be initialized to a starting value. For example, if I have a character set, char charSet[27] = "abcdefghijklmnopqrstuvwxyz", I spawn 26 threads. So threadIdx.0 corresponds to charSet[0] = a. Simple enough.
I thought I figured out a way to do this, until I examined what my threads were doing...
Here's an example program that I wrote:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
__global__ void example(int offset, int reqThreads){
//Declarations
unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(idx < reqThreads){
unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x; //Used to initialize array <-----Problem is here
printf("%d ", tid);
}
}
int main(void){
//Declarations
int minLength = 1;
int maxLength = 2;
int offset;
int totalThreads;
int reqThreads;
int base = 26;
int maxThreads = 512;
int blocks;
int i,j;
for(i = minLength; i<=maxLength; i++){
offset = i;
//Calculate parameters
reqThreads = (int) pow((double) base, (double) offset); //Casting I would never do, but works here
totalThreads = reqThreads;
for(j = 1;(totalThreads % maxThreads) != 0; j++) totalThreads += 1; //Create a multiple of 512
blocks = totalThreads/maxThreads;
//Call the kernel
example<<<blocks, totalThreads>>>(offset, reqThreads);
cudaThreadSynchronize();
printf("\n\n");
}
system("pause");
return 0;
}
My reasoning was that this statement
unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x;
would allow me to introduce an offset. If offset were 2, threadIdx.0 * offset = 0, threadIdx.1 * offset = 2, threadIdx.2 * offset = 4, and so forth. That definitely does not happen. The output of the above program works when offset == 1:
0 1 2 3 4 5...26.
But when offset == 2:
1344 1346 1348 1350...
In fact, those values are way outside the bounds of my array. So I'm not sure what is going wrong.
Any constructive input is appreciated.
I believe your kernel call should look like this:
example<<<blocks, maxThreads>>>(offset, reqThreads);
Your intent is to launch thread blocks of 512 threads, so that number (maxThreads) should be your second kernel config parameter, which is the number of threads per block.
Also, this is deprecated:
cudaThreadSynchronize();
Use this instead:
cudaDeviceSynchronize();
And if you use printf from the kernel for a large amount of output, you can lose some of the output if you exceed the buffer.
Finally, I'm not sure your reasoning is correct about the range of indices that will be printed.
When offset = 2 (the second pass through the loop), then 26^2 = 676, and you will then end up with 1024 threads (in 2 thread blocks, if you make the above fixes). The second threadblock will have
tid = (2*threadIdx.x) + blockDim.x*blockIdx.x;
      (0..163)           (512)        (1)
The idx < reqThreads guard limits threadIdx.x to 0..163 in this block (163 = 675 - 512), so the second threadblock should print out indices from 512 (minimum) up to (2*163) + 512 = 838.
The first threadblock should print out indices of:
tid = (2*threadIdx.x) + blockDim.x * blockIdx.x
(0..511) (512) (0)
i.e. 0 to 1022
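Putting those fixes together, a minimal sketch of the corrected host loop (reusing the question's variable names; a rounded-up block count stands in for the original j-loop) might look like:

for(i = minLength; i <= maxLength; i++){
    offset = i;
    reqThreads = (int) pow((double) base, (double) offset);
    blocks = (reqThreads + maxThreads - 1) / maxThreads;   //round up so blocks*maxThreads covers reqThreads
    example<<<blocks, maxThreads>>>(offset, reqThreads);   //maxThreads = threads per block
    cudaDeviceSynchronize();                               //replaces the deprecated cudaThreadSynchronize()
    printf("\n\n");
}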

Take average over blocks Cuda

The CUDA documentation says that shared memory can only be shared by threads within the same block, but a block can have at most 1024 threads. What if I have a huge matrix and want to take the average of all its elements with as many threads as possible?
Take this as an example. (I didn't use the maximum number of threads per block, it's just a demo.)
#include <iostream>
#include <stdio.h>
#include <stdlib.h>

__global__ void
kernel(int *a, int dimx, int dimy)
{
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    int idx = iy * dimx + ix;

    __shared__ int array[64];

    a[idx] = a[idx] + 1;
    array[idx] = a[idx];
    __syncthreads();

    int sum = 0;
    for(int i=0; i<dimx*dimy; i++)
    {
        sum += array[i];
    }
    int average = sum/(dimx*dimy+1.0f);
    a[idx] = average;
}

int
main()
{
    int dimx = 8;
    int dimy = 8;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers
    h_a = (int*)malloc(num_bytes);

    for (int i=0; i < dimx*dimy; i++){
        *(h_a+i) = i;
    }

    cudaMalloc( (void**)&d_a, num_bytes );
    //cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( d_a, h_a, num_bytes, cudaMemcpyHostToDevice);

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    kernel<<<grid, block>>>(d_a, dimx, dimy);

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    std::cout << "the array a is:" << std::endl;
    for (int row = 0; row < dimy; row++)
    {
        for (int col = 0; col < dimx; col++)
        {
            std::cout << h_a[row * dimx + col] << " ";
        }
        std::cout << std::endl;
    }

    free(h_a);
    cudaFree(d_a);
}
I create four blocks, and want the result to be the average over all of them. Now the result is:
the array a is:
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
3 3 3 3 4 4 4 4
11 11 11 11 12 12 12 12
11 11 11 11 12 12 12 12
11 11 11 11 12 12 12 12
11 11 11 11 12 12 12 12
Each block has its own average, rather than the overall average. How can I take the average over all the blocks?
I'm new to CUDA. Any relevant answer is welcome.
The easiest way is to launch multiple kernels, such that you do your per-block average, write those out to global memory, then launch another kernel to work on the per-block results from the previous kernel. Depending on your data dimensions you might have to repeat this multiple times.
e.g. (in pseudo-code)
template <typename T>
__global__ void reduce(T* data, T* block_avgs)
{
    //find the per-block average, write it out to block_avgs
    //...
}

//in your caller:
loop while you have more than 1 block:
    call kernel using result from prev. iteration
    update grid_dim and block_dim
This is necessary as there's no inter-block synchronization in CUDA. Your problem is a pretty straightforward application of reduction. Take a look at the parallel reduction sample at the nvidia samples page to get a better feel for reductions.
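As a rough illustration of that pattern (assumed names and a fixed block size of 256, not code taken from the answer), a first kernel can produce one partial sum per block in global memory, and the host or a follow-up kernel then turns the per-block sums into the overall average:

__global__ void block_sum(const int *in, int *block_sums, int n)
{
    __shared__ int s[256];                        // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];  // one partial sum per block
}

// Host side (sketch): copy block_sums back, add them up, divide by n to get the
// overall average, then write that value back into the matrix if that's what you need.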

Relation between number of blocks of threads

This code is from the book CUDA by Example:
#include "../common/book.h"
#define N (33 * 1024)
__global__ void add( int *a, int *b, int *c ) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while (tid < N) {
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
.
.
.
add<<<128,128>>>( dev_a, dev_b, dev_c );
33*1024 = 33792
128 * 128 = 16384
33792 > 16384
So, do I have to increase the number of threads per block in this case for it to work?
Notice the second statement in the body of the while loop, i.e. tid += blockDim.x * gridDim.x;. It makes the kernel work even for arrays bigger than 16384 elements.
Thread with ID 0 sums the items of arrays in the positions 0, 16384, 32768,...
Thread with ID 1 sums the items of arrays in the positions 1, 16385, 32769,...
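So the launch does not need more threads. If you wanted each thread to handle exactly one element instead of striding, a hedged alternative (not from the book) is to round the block count up to cover N:

int threadsPerBlock = 128;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // 264 blocks for N = 33*1024
add<<<blocks, threadsPerBlock>>>( dev_a, dev_b, dev_c );    // every tid < N handles one element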

CUDA loop over lower triangular matrix

I have a matrix and I only want to access the lower triangular part of it. I am trying to find a good thread index, but so far I have not managed it. Any ideas?
I need an index to loop over the lower triangular part; say this is my matrix:
1 2 3 4
5 6 7 8
9 0 1 2
3 5 6 7
the index should go for
1
5 6
9 0 1
3 5 6 7
in this example, positions 0,4,5,8,9,10,12,13,14,15 of a 1D array.
The CPU loop is:
for(i = 0; i < N; i++){
for(j = 0; j <= i; j++){
.......
where N is the number of rows. I was trying something in the kernel:
__global__ void Kernel(int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if((row < N) && (col <= row))
        printf("%d\n", row+col);
}
and then call it this way:
dim3 Blocks(1,1);
dim3 Threads(N,N);
Kernel<<< Blocks, Threads>>>(N);
but it doesn't work at all.
What I get:
0
1
2
2
3
4
You're launching a grid of threads and then disabling all those above the diagonal, i.e. ~50% of the threads will do nothing, which is very inefficient.
The simple fix for your code is to fix the index:
__global__ void Kernel(int N)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if((row < N) && (col <= row))
        printf("%d\n", row * N + col);
}
Perhaps a more efficient, but more complex, solution would be to launch the correct number of threads and convert the index. Check out this answer for starting points...
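A minimal sketch of that idea (the closed-form mapping below is my assumption, not code from the linked answer): launch N*(N+1)/2 threads in 1D and convert each linear index k into a (row, col) pair inside the lower triangle, so no thread lands above the diagonal:

__global__ void KernelTri(int N, int total)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;            // linear index into the triangle
    if (k < total) {
        int row = (int)((sqrt(8.0 * k + 1.0) - 1.0) / 2.0);   // triangular-root row lookup
        int col = k - row * (row + 1) / 2;
        printf("%d\n", row * N + col);
    }
}

// launch:
int total = N * (N + 1) / 2;
KernelTri<<<(total + 255) / 256, 256>>>(N, total);

For very large N the floating-point square root would need a rounding guard, but for the 4x4 example it maps k = 0..9 exactly onto positions 0,4,5,8,9,10,12,13,14,15.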
The problem is that we are indexing a 1D array, so in order to map (row, col) onto it we need to multiply the row index by the number of columns. Following the example:
__global__ void Kernel(int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if((row < N) && (col <= row))
        printf("%d\n", row*N + col);
}