finding thread index and block index in CUDA - cuda

The following code computes the sum of two vectors:
// Compute vector sum C = A + B
for (i = 0; i < 1000; i++)
    C[i] = A[i] + B[i];
The grid consists of 20 one-dimensional blocks and the block size (blockDim.x) is 50.
The iteration with i = 400 will be assigned to a thread. Can anyone help me with how to find the threadIdx.x and blockIdx.x of this thread?

threadIdx.x and blockIdx.x inside your kernel will give you exactly that.
In your case you can calculate the global index with:
int threadID = blockIdx.x * blockDim.x + threadIdx.x;

Related

Issues accessing an array based on an offset in CUDA

This question more than likely has a simple solution.
Each of the threads I spawn is to be initialized to a starting value. For example, if I have a character set, char charSet[27] = "abcdefghijklmnopqrstuvwxyz", I spawn 26 threads. So threadIdx.0 corresponds to charSet[0] = a. Simple enough.
I thought I figured out a way to do this, until I examined what my threads were doing...
Here's an example program that I wrote:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

__global__ void example(int offset, int reqThreads){
    //Declarations
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < reqThreads){
        unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x; //Used to initialize array <-----Problem is here
        printf("%d ", tid);
    }
}
int main(void){
    //Declarations
    int minLength = 1;
    int maxLength = 2;
    int offset;
    int totalThreads;
    int reqThreads;
    int base = 26;
    int maxThreads = 512;
    int blocks;
    int i,j;

    for(i = minLength; i <= maxLength; i++){
        offset = i;
        //Calculate parameters
        reqThreads = (int) pow((double) base, (double) offset); //Casting I would never do, but works here
        totalThreads = reqThreads;
        for(j = 1; (totalThreads % maxThreads) != 0; j++) totalThreads += 1; //Create a multiple of 512
        blocks = totalThreads/maxThreads;
        //Call the kernel
        example<<<blocks, totalThreads>>>(offset, reqThreads);
        cudaThreadSynchronize();
        printf("\n\n");
    }
    system("pause");
    return 0;
}
My reasoning was that this statement
unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x;
would allow me to introduce an offset. If offset were 2, threadIdx.0 * offset = 0, threadIdx.1 * offset = 2, threadIdx.2 * offset = 4, and so forth. That definitely does not happen. The output of the above program works when offset == 1:
0 1 2 3 4 5...26.
But when offset == 2:
1344 1346 1348 1350...
In fact, those values are way outside the bounds of my array. So I'm not sure what is going wrong.
Any constructive input is appreciated.
I believe your kernel call should look like this:
example<<<blocks, maxThreads>>>(offset, reqThreads);
Your intent is to launch thread blocks of 512 threads, so that number (maxThreads) should be your second kernel config parameter, which is the number of threads per block.
Also, this is deprecated:
cudaThreadSynchronize();
Use this instead:
cudaDeviceSynchronize();
And if you use printf from the kernel for a large amount of output, you can lose some of the output if you exceed the buffer.
Finally, I'm not sure your reasoning is correct about the range of indices that will be printed.
When offset = 2 (the second pass through the loop), then 26^2 = 676, and you will then end up with 1024 threads (in 2 thread blocks, if you make the above fixes). The second threadblock will have
tid = (2*threadIdx.x) + blockDim.x*blockIdx.x;
      (0..163)          (512)       (1)
So the second threadblock should print out indices of 512 (minimum) up to (2*163) + 512 = 838
(163 = 675 - 512)
The first threadblock should print out indices of:
tid = (2*threadIdx.x) + blockDim.x * blockIdx.x
(0..511) (512) (0)
i.e. 0 to 1022

cuda grid 2-Dimension Thread Identifier

Hi, I have a 2-dimensional grid and 1-dimensional blocks:
dim3 dimGrid(K,N);
dim3 dimBlock(F);
How can I calculate the unique thread identifier?
Thanks
EDIT:
Sorry, dimBlock is F, not K; F, K, and N are all different.
The local thread id:
unsigned ltid = threadIdx.x; // Varies from 0 to F-1
The linear index of the current block:
unsigned num_blocks = blockIdx.y * gridDim.x + blockIdx.x;
The number of threads before the current block:
unsigned boff = num_blocks * blockDim.x; // A multiple of F
Adding the current thread Id to the number of threads before the current block, you can get the global unique id.
unsigned gtid = ltid + boff;
EDIT
Modified the answer. The original was written under the wrong assumptions.
Purely for the sake of clarity (the other answers are correct as well, but I find this approach more conducive to learning), the global index of any given thread for 2D blocks and grids can be found via:
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
int grid_width = gridDim.x * blockDim.x;
//get the global index
int global_idx = index_y * grid_width + index_x;
This may be useful if you ever introduce a second dimension for your block size, as it'll handle that case automatically.
The calculation I would use would be something like this:
int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
You may also be interested in the answer I posted to this question.

CUDA loop over lower triangular matrix

I have a matrix and I only want to access the lower triangular part of it. I am trying to find a good thread index, but so far I have not managed it. Any ideas?
I need an index to loop over the lower triangular matrix; say this is my matrix:
1 2 3 4
5 6 7 8
9 0 1 2
3 5 6 7
the index should go for
1
5 6
9 0 1
3 5 6 7
in this example, positions 0,4,5,8,9,10,12,13,14,15 of a 1D array.
The CPU loop is:
for(i = 0; i < N; i++){
    for(j = 0; j <= i; j++){
        .......
where N is the number of rows. I was trying something in the kernel:
__global__ void Kernel(int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if((row < N) && (col <= row))
        printf("%d\n", row+col);
}
and then call it this way:
dim3 Blocks(1,1);
dim3 Threads(N,N);
Kernel<<< Blocks, Threads>>>(N);
but it doesn't work at all.
What I get:
0
1
2
2
3
4
You're launching a grid of threads and then disabling all those above the diagonal, i.e. ~50% of the threads will do nothing, which is very inefficient.
The simple fix for your code is to correct the index:
__global__ void Kernel(int N)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if((row < N) && (col <= row))
        printf("%d\n", row * N + col);
}
Perhaps a more efficient, but more complex, solution would be to launch the correct number of threads and convert the index. Check out this answer for starting points...
The problem is that we are indexing a 1D array, so in order to map (row, col) onto it we need to multiply the row index by the number of columns. Following the example:
__global__ void Kernel(int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if((row < N) && (col <= row))
        printf("%d\n", row*N + col);
}

Matrix multiplication using CUDA -- wrong results

I have the following kernel code for matrix multiplication. Matrix A is 1*3 and matrix B is 3*3; the resultant matrix C would be 1*3. In the following code, Width would be 3.
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if(row >= Width || col >= Width){ // matrix range
        return;
    }
    float P_val = 0.0f;
    for (int k = 0; k < Width; ++k) {
        float M_elem = d_M[row * Width + k];
        float N_elem = d_N[k * Width + col];
        P_val += M_elem * N_elem;
    }
    d_P[row * Width + col] = P_val;
}
The kernel is called as follows:
int block_size = 32;
dim3 dimGrid(Width/block_size, Width/block_size);
dim3 dimBlock(block_size, block_size);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, 3);
But I am getting wrong results. I am getting results as zero always.
Can anyone help me please.
The code looks like it's for multiplication of two square matrices of the same size.
Width is the number of columns of the first matrix.
You have to provide this as an argument to the function.

Calculating differences between consecutive indices fast

Given that I have the array below, and letting Sum be 16:
dintptr = { 0, 2, 8, 11, 13, 15 }
I want to compute the difference between consecutive indices using the GPU. So the final array should be as follows:
count = { 2, 6, 3, 2, 2, 1 }
Below is my kernel:
//for this function n is 6
__global__ void kernel(int *dintptr, int *count, int n){
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ int indexes[256];
    int need = (n % 256 == 0) ? 0 : 1;
    int allow = 256 * (n/256 + need);
    while(id < allow){
        if(id < n){
            indexes[threadIdx.x] = dintptr[id];
        }
        __syncthreads();
        if(id < n - 1){
            if(threadIdx.x % 255 == 0){
                count[id] = indexes[threadIdx.x + 1] - indexes[threadIdx.x];
            }else{
                count[id] = dintptr[id+1] - dintptr[id];
            }
        }//end if id<n-1
        __syncthreads();
        id += (gridDim.x * blockDim.x);
    }//end while
}//end kernel
// For the last element, explicitly set count[n-1] = Sum - dintptr[n-1]
Two questions:
Is this kernel fast? Can you suggest a faster implementation?
Does this kernel handle arrays of arbitrary size? (I think it does.)
I'll bite.
__global__ void kernel(int *dintptr, int *count, int n)
{
    for (int id = blockDim.x * blockIdx.x + threadIdx.x;
         id < n-1;
         id += gridDim.x * blockDim.x)
        count[id] = dintptr[id+1] - dintptr[id];
}
(Since you said you "explicitly" set the value of the last element, and you didn't in your kernel, I didn't bother to set it here either.)
I don't see a lot of advantage to using shared memory in this kernel as you do: the L1 cache on Fermi should give you nearly the same advantage since your locality is high and reuse is low.
Both your kernel and mine appear to handle arbitrary-sized arrays. Yours however appears to assume blockDim.x == 256.