CUDA grid 2-dimensional thread identifier

Hi, I have a 2-dimensional grid and a 1-dimensional block:
dim3 dimGrid(K,N);
dim3 dimBlock(F);
How can I calculate the unique thread identifier?
Thanks.
EDIT:
Sorry, the block dimension is F, not K; F, K, and N are all different values.

The local thread id:
unsigned ltid = threadIdx.x; // Varies from 0 to F-1
The linear index of the current block:
unsigned block_id = blockIdx.y * gridDim.x + blockIdx.x;
The number of threads in all blocks before the current one:
unsigned boff = block_id * blockDim.x; // A multiple of F
Adding the local thread id to the number of threads before the current block gives the globally unique id:
unsigned gtid = ltid + boff;
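Putting the pieces together, here is a minimal sketch of a kernel that uses this global id (the kernel name and the data array are illustrative, not from the question):
__global__ void writeGtid(unsigned *data)
{
    unsigned ltid = threadIdx.x;                              // local id within the block
    unsigned block_id = blockIdx.y * gridDim.x + blockIdx.x;  // linear block index
    unsigned gtid = block_id * blockDim.x + ltid;             // global unique id
    data[gtid] = gtid;                                        // one element per thread
}
Launched as writeGtid<<<dimGrid, dimBlock>>>(d_data) with dimGrid(K, N) and dimBlock(F), this covers K * N * F threads, each with a distinct gtid.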
EDIT
Modified the answer. The original was written under the wrong assumptions.

Purely for the sake of clarity (the other answers are correct as well, but I find this approach more conducive to learning), the global index of any given thread for 2D blocks and grids can be found via:
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
int grid_width = gridDim.x * blockDim.x;
//get the global index
int global_idx = index_y * grid_width + index_x;
This may be useful if you ever introduce a second dimension for your block size, as it will handle that case automatically.
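Assuming such a 2D block, a minimal sketch with the usual bounds check for partial blocks (the kernel name, width, and height are illustrative):
__global__ void fill2d(float *out, int width, int height)
{
    int index_x = blockIdx.x * blockDim.x + threadIdx.x;
    int index_y = blockIdx.y * blockDim.y + threadIdx.y;
    if (index_x >= width || index_y >= height)
        return; // guard threads that fall outside the array
    out[index_y * width + index_x] = 1.0f;
}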

The calculation I would use would be something like this:
int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
You may also be interested in the answer I posted to this question.

Related

finding thread index and block index in CUDA

The following code computes the sum of two vectors:
// Compute vector sum C = A+B
for (i = 0; i < 1000; i++)
    C[i] = A[i] + B[i];
The grid consists of 20 one-dimensional blocks and the block size (blockDim.x) is 50.
The iteration with i=400 will be assigned a thread. Can anyone help me with how to find threadIdx.x and blockIdx.x of this thread?
threadIdx.x and blockIdx.x inside your kernel will give you exactly that.
In your case you can calculate global index by:
int threadID = blockIdx.x * blockDim.x + threadIdx.x;
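For the concrete case in the question, a minimal sketch of the kernel (the names vecAdd and n are illustrative), followed by the inverse mapping for i = 400:
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global index
    if (i < n)
        C[i] = A[i] + B[i];
}
With blockDim.x = 50, the element i = 400 is handled by blockIdx.x = 400 / 50 = 8 and threadIdx.x = 400 % 50 = 0.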

Calculating indices for nested loops in CUDA

I'm trying to learn CUDA and I'm a bit confused about calculating thread indices. Let's say I have this loop I'm trying to parallelize:
...
for(int x = 0; x < DIM_x; x++){
    for(int y = 0; y < DIM_y; y++){
        for(int dx = 0; dx < psize; dx++){
            array[y*DIM_x + x + dx] += 1;
        }
    }
}
In PyCUDA, I set:
block = (8, 8, 8)
grid = (96, 96, 16)
Most of the examples I've seen for parallelizing loops calculate thread indices like this:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int dx = blockIdx.z * blockDim.z + threadIdx.z;
if (x >= DIM_x || y >= DIM_y || dx >= psize)
    return;
atomicAdd(&array[y*DIM_x + x + dx], 1);
DIM_x = 580, DIM_y = 550, psize = 50
However, if I print x, I see that multiple threads with the same thread Id are created, and the final result is wrong.
Instead, if I use this (3D grid of 3D blocks):
int blockId = blockIdx.x + blockIdx.y * gridDim.x
            + gridDim.x * gridDim.y * blockIdx.z;
int x = blockId * (blockDim.x * blockDim.y * blockDim.z)
      + (threadIdx.z * (blockDim.x * blockDim.y))
      + (threadIdx.y * blockDim.x) + threadIdx.x;
It fixes the duplicate thread-id problem for x, but I'm not sure how I'd parallelize y and dx.
If anyone could help me understand where I'm going wrong, and show me the right way to parallelize the loops, I'd really appreciate it.
"However, if I print x, I see that multiple threads with the same thread Id are created, and the final result is wrong."
It would be normal for you to see multiple threads with the same x thread ID in a multi-dimensional grid, as it would also be normal to observe many iterations of the loops in your host code with the same x value. If the result is wrong, it has nothing to do with any of the code you have shown, viz:
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <assert.h>

void host(int* array, int DIM_x, int DIM_y, int psize)
{
    for(int x = 0; x < DIM_x; x++){
        for(int y = 0; y < DIM_y; y++){
            for(int dx = 0; dx < psize; dx++){
                array[y*DIM_x + x + dx] += 1;
            }
        }
    }
}

__global__
void kernel(int* array, int DIM_x, int DIM_y, int psize)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int dx = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= DIM_x || y >= DIM_y || dx >= psize)
        return;
    atomicAdd(&array[y*DIM_x + x + dx], 1);
}

int main()
{
    dim3 block(8, 8, 8);
    dim3 grid(96, 96, 16);
    int DIM_x = 580, DIM_y = 550, psize = 50;

    std::vector<int> array_h(DIM_x * DIM_y * psize, 0);
    std::vector<int> array_hd(DIM_x * DIM_y * psize, 0);
    thrust::device_vector<int> array_d(DIM_x * DIM_y * psize, 0);

    kernel<<<grid, block>>>(thrust::raw_pointer_cast(array_d.data()), DIM_x, DIM_y, psize);
    host(&array_h[0], DIM_x, DIM_y, psize);
    thrust::copy(array_d.begin(), array_d.end(), array_hd.begin());
    cudaDeviceSynchronize();

    for(int i=0; i<DIM_x * DIM_y * psize; i++) {
        assert( array_h[i] == array_hd[i] );
    }
    return 0;
}
which when compiled and run
$ nvcc -arch=sm_52 -std=c++11 -o looploop loop_the_loop.cu
$ cuda-memcheck ./looploop
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
emits no errors and passes the check of all elements against the host code in your question.
If you are getting incorrect results, it is likely that you have a problem with initialization of the device memory before running the kernel. Otherwise I fail to see how incorrect results could be emitted by the code you have shown.
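For example, zero-initializing the device buffer before the launch might look like this (d_array and nbytes are illustrative names, not from the question):
cudaMemset(d_array, 0, nbytes); // clear the accumulator buffer on the device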
In general, performing a large number of atomic memory transactions, as your code does, is not the optimal way to perform computation on the GPU. Using non-atomic transactions would probably require relying on a priori information about the structure of the problem (such as a graph decomposition or a precise description of the problem's write patterns).
In a 3D grid with 3D blocks, the thread ID is:
unsigned long blockId = blockIdx.x
                      + blockIdx.y * gridDim.x
                      + gridDim.x * gridDim.y * blockIdx.z;
unsigned long threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                       + (threadIdx.z * (blockDim.x * blockDim.y))
                       + (threadIdx.y * blockDim.x)
                       + threadIdx.x;
This is not the x you computed; x is only the x index within that 3D grid.
There is a nice cheat sheet in this blog post.
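To sanity-check uniqueness, here is a small sketch that marks each global thread id once (markUnique and flags are illustrative names; every slot should end at exactly 1 after the launch):
__global__ void markUnique(unsigned int *flags)
{
    unsigned long blockId = blockIdx.x
                          + blockIdx.y * gridDim.x
                          + gridDim.x * gridDim.y * blockIdx.z;
    unsigned long threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                           + (threadIdx.z * (blockDim.x * blockDim.y))
                           + (threadIdx.y * blockDim.x)
                           + threadIdx.x;
    atomicAdd(&flags[threadId], 1u); // any count above 1 would reveal a collision
}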

Issues accessing an array based on an offset in CUDA

This question more than likely has a simple solution.
Each of the threads I spawn is to be initialized to a starting value. For example, if I have a character set, char charSet[27] = "abcdefghijklmnopqrstuvwxyz", I spawn 26 threads, so thread 0 corresponds to charSet[0] = 'a'. Simple enough.
I thought I figured out a way to do this, until I examined what my threads were doing...
Here's an example program that I wrote:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
__global__ void example(int offset, int reqThreads){
//Declarations
unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(idx < reqThreads){
unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x; //Used to initialize array <-----Problem is here
printf("%d ", tid);
}
}
int main(void){
//Declarations
int minLength = 1;
int maxLength = 2;
int offset;
int totalThreads;
int reqThreads;
int base = 26;
int maxThreads = 512;
int blocks;
int i,j;
for(i = minLength; i<=maxLength; i++){
offset = i;
//Calculate parameters
reqThreads = (int) pow((double) base, (double) offset); //Casting I would never do, but works here
totalThreads = reqThreads;
for(j = 1;(totalThreads % maxThreads) != 0; j++) totalThreads += 1; //Create a multiple of 512
blocks = totalThreads/maxThreads;
//Call the kernel
example<<<blocks, totalThreads>>>(offset, reqThreads);
cudaThreadSynchronize();
printf("\n\n");
}
system("pause");
return 0;
}
My reasoning was that this statement
unsigned int tid = (offset * threadIdx.x) + blockIdx.x * blockDim.x;
would allow me to introduce an offset. If offset were 2, thread 0 would give offset * threadIdx.x = 0, thread 1 would give 2, thread 2 would give 4, and so forth. That definitely does not happen. The output of the above program is correct when offset == 1:
0 1 2 3 4 5...26.
But when offset == 2:
1344 1346 1348 1350...
In fact, those values are way outside the bounds of my array. So I'm not sure what is going wrong.
Any constructive input is appreciated.
I believe your kernel call should look like this:
example<<<blocks, maxThreads>>>(offset, reqThreads);
Your intent is to launch thread blocks of 512 threads, so that number (maxThreads) should be your second kernel config parameter, which is the number of threads per block.
Also, this is deprecated:
cudaThreadSynchronize();
Use this instead:
cudaDeviceSynchronize();
And if you use printf from the kernel for a large amount of output, you can lose some of the output if you exceed the buffer.
Finally, I'm not sure your reasoning is correct about the range of indices that will be printed.
When offset = 2 (the second pass through the loop), 26^2 = 676, and you will then end up with 1024 threads (in 2 thread blocks, if you make the above fixes). The second threadblock will have
tid = (2*threadIdx.x) + blockDim.x*blockIdx.x;
      (0..163)          (512)      (1)
So the second threadblock should print out indices of 512 (minimum) up to (2*163) + 512 = 838
(163 = 675 - 512, the largest threadIdx.x that still satisfies idx < reqThreads)
The first threadblock should print out indices of:
tid = (2*threadIdx.x) + blockDim.x * blockIdx.x
      (0..511)          (512)        (0)
i.e. 0 to 1022

Optimize the kernel function based on parellel reduction

In one of my previous posts I asked how it was possible to improve a kernel function. The kernel computes the squared Euclidean distance between the corresponding rows of two equal-sized matrices. Eric gave a very good tip: use one thread block per row and then apply parallel reduction. I am writing this post because I did not want to make the previous one more complicated, and I give my thanks to Eric. Below I attach the .cu code, which does not give me the correct results.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int c = blockDim.x * blockIdx.x + threadIdx.x; // rows
    unsigned int r = blockDim.y * blockIdx.y + threadIdx.y; // cols

    sdata[ tid ] = ( A[ r*cols + c ] - B[ r*cols + c ] ) * ( A[ r*cols + c ] - B[ r*cols + c ] );
    __syncthreads();

    for ( unsigned int s = 1; s < blockDim.x; s*=2 ){
        if ( tid % (2*s) == 0 ){
            sdata[ tid ] += sdata[ tid + s ];
        }
    }
    __syncthreads();

    if ( tid == 0 ) C[blockIdx.x] = sdata[0];
}
The code is based on http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf. It is not the optimized version; I just want to grasp the basic idea. I think there is a problem where I initialize sdata. The kernel is also launched this way:
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
dim3 dimBlock(1, threadsPerBlock);
dim3 dimGrid(blocksPerGrid, 1);
cudaEuclid<<<dimGrid, dimBlock>>>( d_A, d_B, d_C, rows, cols );
Thank you and sorry for my ignorance.
You're using dynamically allocated shared memory, yet you're not actually allocating any shared memory. The kernel launch should have an additional parameter for the size of shared memory per block.
cudaEuclid<<<dimGrid, dimBlock, threadsPerBlock*sizeof(float)>>>( d_A, d_B, d_C, rows, cols );
Consider using CUB for the reduction; it saves you from reimplementing it from scratch and is well tuned.
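As a sketch of what a CUB-based version might look like with one block per row (the kernel name rowSquaredDiff, BLOCK_THREADS, and the strided loop over columns are my assumptions, not code from the question):
#include <cub/cub.cuh>

template <int BLOCK_THREADS>
__global__ void rowSquaredDiff(const float *A, const float *B, float *C, int cols)
{
    typedef cub::BlockReduce<float, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread strides over one row, accumulating its share of (A-B)^2.
    float thread_sum = 0.0f;
    for (int c = threadIdx.x; c < cols; c += BLOCK_THREADS) {
        float d = A[blockIdx.x * cols + c] - B[blockIdx.x * cols + c];
        thread_sum += d * d;
    }

    // Block-wide sum; the result is only valid in thread 0.
    float row_sum = BlockReduce(temp_storage).Sum(thread_sum);
    if (threadIdx.x == 0)
        C[blockIdx.x] = row_sum;
}
A launch might then look like rowSquaredDiff<256><<<rows, 256>>>(d_A, d_B, d_C, cols);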
If you want to code it yourself, there's a more recent version of the example than the version from CUDA 1.1-beta!
sdata[ tid ] += sdata[ tid ]; ==> you are just adding the same value twice. You need to do:
sdata[ tid ] += sdata[ tid + s ];

Calculating differences between consecutive indices fast

Given that I have the array (let Sum be 16):
dintptr = { 0, 2, 8, 11, 13, 15 }
I want to compute the difference between consecutive indices using the GPU. The final array should be:
count = { 2, 6, 3, 2, 2, 1 }
Below is my kernel:
//for this function n is 6
__global__ void kernel(int *dintptr, int *count, int n){
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ int indexes[256];
    int need = (n % 256 == 0) ? 0 : 1;
    int allow = 256 * ( n/256 + need );

    while(id < allow){
        if(id < n){
            indexes[threadIdx.x] = dintptr[id];
        }
        __syncthreads();

        if(id < n - 1){
            if(threadIdx.x % 255 == 0){
                count[id] = indexes[threadIdx.x + 1] - indexes[threadIdx.x];
            }else{
                count[id] = dintptr[id+1] - dintptr[id];
            }
        }//end if id<n-1
        __syncthreads();
        id += (gridDim.x * blockDim.x);
    }//end while
}//end kernel
// For the last element, explicitly set count[n-1] = Sum - dintptr[n-1]
Two questions:
Is this kernel fast? Can you suggest a faster implementation?
Does this kernel handle arrays of arbitrary size? (I think it does.)
I'll bite.
__global__ void kernel(int *dintptr, int *count, int n)
{
    for (int id = blockDim.x * blockIdx.x + threadIdx.x;
         id < n-1;
         id += gridDim.x * blockDim.x)
        count[id] = dintptr[id+1] - dintptr[id];
}
(Since you said you "explicitly" set the value of the last element, and you didn't in your kernel, I didn't bother to set it here either.)
I don't see a lot of advantage to using shared memory in this kernel as you do: the L1 cache on Fermi should give you nearly the same advantage since your locality is high and reuse is low.
Both your kernel and mine appear to handle arbitrary-sized arrays. Yours however appears to assume blockDim.x == 256.
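For completeness, a minimal host-side sketch of launching this kernel (d_dintptr and d_count are assumed device allocations; n = 6 and Sum = 16 as in the question):
int threads = 256;
int blocks = (n - 1 + threads - 1) / threads; // enough threads for the n-1 differences
kernel<<<blocks, threads>>>(d_dintptr, d_count, n);
// As the question notes, the last element is handled separately:
// count[n-1] = Sum - dintptr[n-1]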