CUDA transpose with more than one thread per block

I'm trying to transpose a square matrix using tiling (the block method) in CUDA. I have done it successfully, but only when launching one thread per block in each dimension, as shown below in the host function:
dim3 dimGrid((nEven + TILE_DIM - 1) / TILE_DIM, (nEven + TILE_DIM - 1) / TILE_DIM, 1);
dim3 dimBlock(1, 1, 1);
Here nEven is the size of the matrix and TILE_DIM is the size of a tile.
I really have trouble understanding how threads work on the GPU, so I have managed to write the kernel below, which works with only one thread per block:
__global__ void transposeMain(int *idata)
{
    __shared__ int tile[TILE_DIM][TILE_DIM];
    int yy = blockIdx.y * TILE_DIM + threadIdx.y;
    int xx = blockIdx.x * TILE_DIM + threadIdx.x;
    if (xx < nEven && yy < nEven)
    {
        // with one thread per block, this single thread copies the whole tile
        for (int i = 0; i < TILE_DIM; i++)
            for (int j = 0; j < TILE_DIM; j++)
                tile[i][j] = idata[(i + xx)*nEven + (j + yy)];
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i++)
            for (int j = 0; j < TILE_DIM; j++){
                int temp1 = tile[i][j];
                idata[(j + yy)*nEven + (i + xx)] = temp1;
            }
    }
}
Please help me understand how to use more than one thread per block in my tiling. I feel I am missing something: I tried many approaches, but they all access memory out of bounds and give wrong data.
Many thanks.

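Each thread in a block has an index in the range [0..TILE_DIM-1] in both the x and y dimensions, provided the block is launched with TILE_DIM x TILE_DIM threads, i.e. dim3 dimBlock(TILE_DIM, TILE_DIM, 1). A single statement indexed by xx and yy is then executed once per thread, so it covers the whole tile. There is no need for the additional for loops.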
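A minimal sketch of what that could look like, under a few assumptions that are not in the question: the transpose is written out-of-place (separate `in` and `out` buffers, which sidesteps blocks overwriting data that other blocks still need to read), the matrix size is passed as a parameter `n`, and `TILE_DIM` is 16. The names `transposeTiled` and `runTranspose` are made up for this example.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE_DIM 16  // assumed tile size; one thread per tile element

// Out-of-place transpose of an n x n matrix: out[c][r] = in[r][c].
__global__ void transposeTiled(const int *in, int *out, int n)
{
    // The extra column of padding avoids shared-memory bank conflicts.
    __shared__ int tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;  // row in the input
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    // Kept outside the bounds check so every thread of the block reaches it.
    __syncthreads();

    // Swap the block indices: this block writes the mirrored tile.
    x = blockIdx.y * TILE_DIM + threadIdx.x;  // column in the output
    y = blockIdx.x * TILE_DIM + threadIdx.y;  // row in the output
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: transposes an n x n test matrix and checks it.
bool runTranspose(int n)
{
    std::vector<int> h_in(n * n), h_out(n * n, -1);
    for (int i = 0; i < n * n; i++) h_in[i] = i;

    int *d_in, *d_out;
    size_t bytes = (size_t)n * n * sizeof(int);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);  // TILE_DIM x TILE_DIM threads
    dim3 dimGrid((n + TILE_DIM - 1) / TILE_DIM, (n + TILE_DIM - 1) / TILE_DIM, 1);
    transposeTiled<<<dimGrid, dimBlock>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            if (h_out[c * n + r] != h_in[r * n + c]) return false;
    return true;
}

int main()
{
    bool ok = runTranspose(100);  // 100 is not a multiple of TILE_DIM
    printf("transpose %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```

The grid size is computed exactly as in the question; only the block size changes from (1, 1, 1) to (TILE_DIM, TILE_DIM, 1), and the bounds checks handle matrix sizes that are not a multiple of the tile size.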

Related

Find max of matrix with window size in CUDA [duplicate]

I have just started with CUDA and I have a question.
I have an N*N matrix and an 8x8 window. I want to subdivide the matrix into sub-matrices and find the max value of each one.
For example, a 64*64 matrix gives 64 sub-matrices of size 8*8, so I have to find 64 max values. Finally I save all the max values into a new array, but their order always changes. I want a solution that keeps them in the right order.
__global__ void calculate_emax_kernel(float emap[],float emax[], int img_height, int img_width,int windows_size)
{
int x_index = blockIdx.x*blockDim.x+threadIdx.x;
int y_index = blockIdx.y*blockDim.y+threadIdx.y;
int num_row_block = img_height/windows_size;
int num_col_block = img_width/windows_size;
__shared__ float window_elements[256];
__shared__ int counter;
__shared__ int emax_count;
if (threadIdx.x == 0) emax_count = 0;
__syncthreads();
int index;
int emax_idx = 0;
if(y_index >= img_height|| x_index >= img_width) return;
for(int i = 0; i < num_row_block; i++)
{
for(int j = 0; j < num_col_block; j++)
{
counter = 0;
if(y_index >= i*windows_size && y_index < (i+1)*windows_size
&& x_index >= j*windows_size && x_index < (j+1)*windows_size)
{
int idx = y_index*img_height + x_index;
index = atomicAdd(&counter, 1);
window_elements[index] = emap[idx];
__syncthreads();
// reduction
unsigned int k = (windows_size*windows_size)/2;
while(k != 0)
{
if(index < k)
{
window_elements[index] = fmaxf(window_elements[index], window_elements[index+k]);
}
k /= 2;
}
if(index == 0)
{
emax[i*num_row_block+j] = window_elements[index];
}
}
__syncthreads();
}
__syncthreads();
}
__syncthreads();
}
This is my configuration:
void construct_emax(float *input,float *output, int img_height, int img_width)
{
int windows_size = 4;
float * d_input, * d_output;
cudaMalloc(&d_input, img_width*img_height*sizeof(float));
cudaMalloc(&d_output, img_width*img_height*sizeof(float));
cudaMemcpy(d_input, input, img_width*img_height*sizeof(float), cudaMemcpyHostToDevice);
dim3 blocksize(16,16);
dim3 gridsize;
gridsize.x=(img_width+blocksize.x-1)/blocksize.x;
gridsize.y=(img_height+blocksize.y-1)/blocksize.y;
calculate_emax_kernel<<<gridsize,blocksize>>>(d_input,d_output,img_height,img_width,windows_size);
}
With CUDA, parallel reduction is tricky; segmented parallel reduction is trickier. Now you are doing it in 2-D, and your segment/window is smaller than the thread block.
For a large window size, I don't think it is a problem. You could use one thread block to reduce one window. For example, if you have a 16x16 window, you could simply use a 16x16 thread block. If you have an even larger window, for example 64x64, you could still use a 16x16 thread block: first reduce the 64x64 window to 16x16 elements during data loading, then reduce those to 1 scalar within the thread block.
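As a rough sketch of the large-window case (one 16x16 block per window, reduction during loading followed by a shared-memory tree reduction), something like the following could work. The names `windowMaxKernel`, `runWindowMax` and `BLOCK_DIM` are made up for this example, and the test pattern is arbitrary.

```cuda
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_DIM 16  // assumed block edge: 16 x 16 = 256 threads

// One thread block per win x win window; the grid has one block per window.
__global__ void windowMaxKernel(const float *in, float *out,
                                int height, int width, int win)
{
    __shared__ float smax[BLOCK_DIM * BLOCK_DIM];
    const int tid = threadIdx.y * BLOCK_DIM + threadIdx.x;
    const int row0 = blockIdx.y * win;  // top-left corner of this window
    const int col0 = blockIdx.x * win;

    // Step 1: each thread strides over the window and keeps a private maximum.
    float tmax = -FLT_MAX;
    for (int r = threadIdx.y; r < win; r += BLOCK_DIM)
        for (int c = threadIdx.x; c < win; c += BLOCK_DIM)
            if (row0 + r < height && col0 + c < width)
                tmax = fmaxf(tmax, in[(row0 + r) * width + (col0 + c)]);
    smax[tid] = tmax;
    __syncthreads();

    // Step 2: tree reduction of the 256 private maxima down to one value.
    for (int s = BLOCK_DIM * BLOCK_DIM / 2; s > 0; s >>= 1) {
        if (tid < s) smax[tid] = fmaxf(smax[tid], smax[tid + s]);
        __syncthreads();
    }
    // The output index depends only on the block index, so the order is fixed.
    if (tid == 0) out[blockIdx.y * gridDim.x + blockIdx.x] = smax[0];
}

// Hypothetical host helper: runs the kernel and compares with a CPU reference.
bool runWindowMax(int height, int width, int win)
{
    const int outW = (width + win - 1) / win;
    const int outH = (height + win - 1) / win;
    std::vector<float> h_in(height * width), h_out(outW * outH);
    for (int i = 0; i < height * width; i++)
        h_in[i] = (float)((i * 37) % 1009);  // arbitrary test pattern

    float *d_in, *d_out;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(BLOCK_DIM, BLOCK_DIM);
    dim3 grid(outW, outH);
    windowMaxKernel<<<grid, block>>>(d_in, d_out, height, width, win);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int wy = 0; wy < outH; wy++)
        for (int wx = 0; wx < outW; wx++) {
            float ref = -FLT_MAX;
            for (int r = wy * win; r < std::min((wy + 1) * win, height); r++)
                for (int c = wx * win; c < std::min((wx + 1) * win, width); c++)
                    ref = fmaxf(ref, h_in[r * width + c]);
            if (h_out[wy * outW + wx] != ref) return false;
        }
    return true;
}

int main()
{
    bool ok = runWindowMax(128, 128, 64);
    printf("window max %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```

Because each block writes to a slot derived from its block index, the results land in window order regardless of the order in which the blocks execute.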
For a window size smaller than the block size, you will have to reduce multiple windows per thread block for higher performance. You could use your current block/grid configuration, where each 256-thread block (16x16) is responsible for sixteen 4x4 windows. But this will not be optimal, because each 32-thread warp is laid out as two rows of 16 threads (2x16). This is not good for coalesced global memory access, and it is hard to map a 2x16 warp to one or more 4x4 windows for efficient parallel reduction.
Alternatively, I would suggest a 1-D thread block with 256 threads, in which every m consecutive threads reduce one mxm window. You can then use a 2-D grid to cover the whole image.
const int m = window_size;
dim3 blocksize(256);
dim3 gridsize((img_width+255)/256, (img_height+m-1)/m);
In the kernel function, you could:
1. reduce each mxm window to a 1xm vector during global data loading;
2. use a tree reduction to reduce the 1xm vector to a scalar.
The following code is a conceptual demo that works when m is a power of 2 and m <= 32. You could modify it further to support arbitrary m and better boundary checking.
#include <assert.h>
#include <cuda.h>
#include <thrust/device_vector.h>
__global__ void calculate_emax_kernel(const float* input, float* output,
                                      int height, int width, int win_size,
                                      int out_width) {
  const int tid = threadIdx.x;
  const int i = blockIdx.y * win_size;   // first image row of this window row
  const int j = blockIdx.x * 256 + tid;  // image column handled by this thread
  const int win_id = j % win_size;       // column position inside the window
  __shared__ float smax[256];
  float tmax = -1e20f;
  // Step 1: reduce one column of the window while loading from global memory.
  if (j < width) {
    for (int tile = 0; tile < win_size; tile++) {
      if (i + tile < height) {
        tmax = fmaxf(tmax, input[(i + tile) * width + j]);
      }
    }
  }
  smax[tid] = tmax;
  __syncthreads();  // all column maxima must be stored before they are read
  // Step 2: tree reduction of the win_size column maxima of each window.
  for (int shift = win_size / 2; shift > 0; shift /= 2) {
    if (win_id < shift) {
      smax[tid] = fmaxf(smax[tid], smax[tid + shift]);
    }
    __syncthreads();  // finish this level before the next one starts
  }
  // The first thread of each window writes the result in window order.
  if (win_id == 0 && j < width) {
    output[blockIdx.y * out_width + (j / win_size)] = smax[tid];
  }
}
int main() {
  const int height = 1024;
  const int width = 1024;
  const int m = 4;
  thrust::device_vector<float> in(height * width);
  thrust::device_vector<float> out(
      ((height + m - 1) / m) * ((width + m - 1) / m));
  dim3 blocksize(256);
  dim3 gridsize((width + 255) / 256, (height + m - 1) / m);
  assert(m == 2 || m == 4 || m == 8 || m == 16 || m == 32);
  calculate_emax_kernel<<<gridsize, blocksize>>>(
      thrust::raw_pointer_cast(in.data()),
      thrust::raw_pointer_cast(out.data()),
      height, width, m, (width + m - 1) / m);
  cudaDeviceSynchronize();  // wait for the kernel before leaving main
  return 0;
}
If you are willing to use a library, here are a few pointers:
NPP, a set of primitives from NVIDIA:
https://docs.nvidia.com/cuda/npp/group__image__filter__max.html
CUB, a lower-level library from NVIDIA / NVlabs, for other reduce operations and finer-grained control over how you use the hardware:
http://nvlabs.github.io/cub/

Reduction Algorithm for Dot Product of Two 1D Vectors

I've been trying to work out an algorithm to get the dot product of two vectors within a CUDA program via reduction and seem to be stuck :/
In essence, I'm trying to write this code in CUDA:
for (int i = 0; i < n; i++)
h_h += h_a[i] * h_b[i];
Where h_a and h_b are arrays of floats and h_h sums up the dot product.
I'm trying to use reduction here - so far I've got this...
__global__ void dot_product(int n, float * d_a, float * d_b){
int i = threadIdx.x;
for (int stride = 1; i + stride < n; stride <<= 1) {
if (i % (2 * stride) == 0){
d_a[i] += d_a[i + stride] * d_b[i + stride];
}
__syncthreads();
}
}
If I change the main line to d_a[i] += d_a[i + stride];, it sums up the array just fine. I seem to be running into a parallel issue here from what I gather. Can someone point out my issue?
My kernel call is:
dot_product<<<1, n>>>(n, d_a, d_b);, where n is the size of each array.
There are two problems here:
1. As pointed out in the comments, you never calculate the product of the first elements (this is a minor issue).
2. Your dot product calculation is incorrect. The parallel reduction should sum the individual products of corresponding elements. Your code performs the product at every stage of the reduction, so products are multiplied again as they are summed.
You want to do something like this:
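__global__ void dot_product(int n, float * d_a, float * d_b){
    int i = threadIdx.x;
    d_a[i] = d_a[i] * d_b[i]; // d_a now contains products
    __syncthreads();
    // The loop bound is the same for every thread, so all threads reach each
    // __syncthreads() call; the bounds check is done inside the if instead.
    for (int stride = 1; stride < n; stride <<= 1) {
        if (i % (2 * stride) == 0 && i + stride < n){
            d_a[i] += d_a[i + stride]; // which are summed by reduction
        }
        __syncthreads();
    }
    // the dot product is left in d_a[0]
}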
[disclaimer: written in browser, never compiled or test, use at own risk]
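For vectors longer than one block can hold, a common variant multiplies first into a per-thread sum and then reduces in shared memory. The sketch below is one way to do that and is not the answer's code: unlike the kernel above it leaves d_a untouched, it combines the per-block results with atomicAdd, and the names `dotKernel`, `dotProduct` and `THREADS` are made up for this example.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size; must be a power of two

__global__ void dotKernel(int n, const float *a, const float *b, float *result)
{
    __shared__ float partial[THREADS];

    // Step 1: each thread sums the products of the elements it owns
    // (grid-stride loop, so any n works with any grid size).
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Step 2: tree reduction of the per-thread sums; only additions here.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Step 3: one thread per block adds the block's sum to the final result.
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);
}

// Hypothetical host helper; a and b are assumed to have the same length.
float dotProduct(const std::vector<float> &a, const std::vector<float> &b)
{
    const int n = (int)a.size();
    float *d_a, *d_b, *d_result, h_result = 0.0f;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_a, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));  // the accumulator starts at 0

    int blocks = (n + THREADS - 1) / THREADS;
    if (blocks < 1) blocks = 1;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the stride loop covers the rest
    dotKernel<<<blocks, THREADS>>>(n, d_a, d_b, d_result);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_result);
    return h_result;
}

int main()
{
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f);
    printf("dot = %f\n", dotProduct(a, b));
    return 0;
}
```

The order in which blocks reach atomicAdd is not fixed, so for general floating-point data the last digits of the result can vary between runs.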

cuda reach device function from global

I am trying to call a device function from a global function. This function only fills an array that is used by all threads. My problem is that when I print the array, its elements are not in the order I set them. Is it because every thread is creating the array again? I am confused about threads. If so, can I find out which thread runs first in the global function, and allow only that thread to fill the array for the others? Thanks.
Here is my function to create the array:
__device__ float myArray[20][20];
__device__ void calculation(int no){
filterWidth = 3+(2*no);
filterHeight = 3+(2*no);
int arraySize = filterWidth;
int middle = (arraySize - 1) / 2;
int startIndex = middle;
int stopIndex = middle;
// at first , all values of array are 0
for(int i=0; i<arraySize; i++)
for (int j = 0; j < arraySize; j++)
{
myArray[i][j] = 0;
}
// until middle line of the array, required indexes are 1
for (int i = 0; i < middle; i++)
{
for (int j = startIndex; j <= stopIndex; j++)
{ myArray[i][j] = 1; sum+=1; }
startIndex -= 1;
stopIndex += 1;
}
// for middle line
for (int i = 0; i < arraySize; i++)
{myArray[middle][i] = 1; sum+=1;}
// after middle line of the array, required indexes are 1
startIndex += 1;
stopIndex -= 1;
for (int i = (middle + 1); i < arraySize; i++)
{
for (int j = startIndex; j <= stopIndex; j++)
{ myArray[i][j] = 1; sum+=1; }
startIndex +=1 ;
stopIndex -= 1;
}
filterFactor = 1.0f / sum;
}
And the global function:
__global__ void FilterKernel(Format24bppRgb* imageData)
{
int tidX = threadIdx.x + blockIdx.x * blockDim.x;
int tidY = threadIdx.y + blockIdx.y * blockDim.y;
Colour Cpixel = Colour (imageData[tidX + tidY*imageWidth] );
float depthPixel = Colour(depthData[tidX + tidY*imageWidth]).Red;
float absoluteDistanceFromFocus = fabs (depthPixel - focusDepth);
if(depthPixel == 0)
return;
Colour Cresult = Cpixel;
for (int i=0;i<8;i++)
{
calculation(i);
...
...
}
}
If you really want to select and force one thread to call the function and the rest to wait for it to do so, use __shared__ memory for the array created by the device function so that all threads in a block see the same one, and you can call it with:
for (int i=0;i<8;i++)
{
    if (threadIdx.x == 0 && threadIdx.y == 0)
        calculation(i);
    __syncthreads();
    ...
}
Of course, this won't work between blocks: in a __global__ function you have no control over the order in which blocks are executed.
Instead, if you can, you should do the initialization calculation (which only one thread needs to do) on the CPU and memcpy the result to the GPU before launching your kernel. It looks like you'll use 8x the memory, because you need one copy of myArray for each of the 8 values of i, but it'll dramatically speed up your computation.
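A sketch of that precomputation, under several assumptions: the eight tables are stored in `__constant__` memory (8 x 20 x 20 floats, about 12.5 KB), the diamond pattern built by calculation() is expressed through the equivalent test |i - middle| + |j - middle| <= middle, and the names `d_masks`, `d_factors`, `buildMask`, `readMaskKernel` and `countOnesOnDevice` are invented for this example. The demo kernel only reads a table back so the result can be checked; the real filter kernel would index `d_masks[i]` instead of calling calculation(i).

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_FILTERS 8  // the question's kernel calls calculation(i) for i = 0..7
#define MAX_SIZE 20    // same bounds as myArray[20][20] in the question

// Read-only tables visible to every thread of every block.
__constant__ float d_masks[NUM_FILTERS][MAX_SIZE][MAX_SIZE];
__constant__ float d_factors[NUM_FILTERS];

static float h_masks[NUM_FILTERS][MAX_SIZE][MAX_SIZE];
static float h_factors[NUM_FILTERS];

// Host-side counterpart of calculation(no): a diamond of ones inside a
// (3 + 2*no) x (3 + 2*no) square, zeros elsewhere.
static void buildMask(int no)
{
    const int size = 3 + 2 * no;
    const int middle = (size - 1) / 2;
    int sum = 0;
    for (int i = 0; i < MAX_SIZE; i++)
        for (int j = 0; j < MAX_SIZE; j++) {
            bool inside = i < size && j < size &&
                          abs(i - middle) + abs(j - middle) <= middle;
            h_masks[no][i][j] = inside ? 1.0f : 0.0f;
            sum += inside ? 1 : 0;
        }
    h_factors[no] = 1.0f / sum;  // plays the role of filterFactor
}

// Hypothetical demo kernel: each thread reads one entry of table `no`.
__global__ void readMaskKernel(int no, float *out)
{
    out[threadIdx.y * MAX_SIZE + threadIdx.x] =
        d_masks[no][threadIdx.y][threadIdx.x];
}

// Uploads all tables once, then counts the ones the device sees in table `no`.
int countOnesOnDevice(int no)
{
    for (int f = 0; f < NUM_FILTERS; f++) buildMask(f);
    cudaMemcpyToSymbol(d_masks, h_masks, sizeof(h_masks));
    cudaMemcpyToSymbol(d_factors, h_factors, sizeof(h_factors));

    float h_out[MAX_SIZE * MAX_SIZE];
    float *d_out;
    cudaMalloc(&d_out, sizeof(h_out));
    readMaskKernel<<<1, dim3(MAX_SIZE, MAX_SIZE)>>>(no, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int count = 0;
    for (int k = 0; k < MAX_SIZE * MAX_SIZE; k++)
        if (h_out[k] == 1.0f) count++;
    return count;
}

int main()
{
    printf("cells set for no=0: %d\n", countOnesOnDevice(0));
    return 0;
}
```

Since no thread writes to the tables during the kernel, the ordering problem from the question cannot occur and no __syncthreads() is needed for them.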

Complicated for loop to be ported to a CUDA kernel

I have the following nested for loop and I would like to port it to CUDA to run on a GPU:
int current=0;
int ptr=0;
for (int i=0; i < Nbeans; i++){
for(int j=0;j< NbeamletsPerbeam[i];j++){
current = j + ptr;
for(int k=0;k<Nmax;k++){
......
}
ptr+=NbeamletsPerbeam[i];
}
}
I would be very happy if anybody has an idea of how this can be done.
We are talking about Nbeams=5 and NbeamletsPerBeam around 200 each.
This is what I currently have, but I am not sure it is right...
for (int i= blockIdx.x; i < d_params->Nbeams; i += gridDim.x){
for (int j= threadIdx.y; j < d_beamletsPerBeam[i]; j+= blockDim.y){
currentBeamlet= j+k;
for (int ivoxel= threadIdx.x; ivoxel < totalVoxels; ivoxel += blockDim.x){
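I would suggest the following idea, although you may need some minor modifications for your code. It uses one block per beam and lets the threads of that block stride over the beam's beamlets. NbeamletsPerbeam is assumed to be an array in device memory, and ptr is assumed to be the index of the beam's first beamlet.
dim3 blocks(NoOfThreads, 1);
dim3 grid(Nbeans, 1);
kernel<<<grid, blocks>>>(NbeamletsPerbeam, Nmax);
__global__ void kernel(const int *NbeamletsPerbeam, int Nmax)
{
    // ptr = number of beamlets in all earlier beams
    int ptr = 0;
    for (int b = 0; b < blockIdx.x; b++)
        ptr += NbeamletsPerbeam[b];
    // thread t handles beamlets t, t + blockDim.x, t + 2*blockDim.x, ...
    for (int j = threadIdx.x; j < NbeamletsPerbeam[blockIdx.x]; j += blockDim.x){
        int current = j + ptr;
        for(int k=0;k<Nmax;k++){
            ......
        }
    }
}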
This should do the trick and give you better parallelization.
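One way to avoid recomputing the running offset in every block is to build the offsets once on the host (an exclusive prefix sum of the beamlet counts) and pass them to the kernel. The sketch below assumes made-up beamlet counts and uses hypothetical names (`beamletKernel`, `beamOffset`, `beamOfBeamlet`, `runBeamlets`); the per-beamlet work is replaced by a placeholder that records which beam each beamlet belongs to.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Blocks stride over beams, threads stride over the beamlets of a beam.
__global__ void beamletKernel(const int *beamletsPerBeam, const int *beamOffset,
                              int nBeams, int *beamOfBeamlet)
{
    for (int i = blockIdx.x; i < nBeams; i += gridDim.x) {
        for (int j = threadIdx.x; j < beamletsPerBeam[i]; j += blockDim.x) {
            int current = j + beamOffset[i];  // global beamlet index
            beamOfBeamlet[current] = i;       // placeholder for the real work
        }
    }
}

// Hypothetical host helper: runs the kernel and checks every beamlet index.
bool runBeamlets()
{
    const int nBeams = 5;
    const int h_counts[nBeams] = {200, 190, 210, 205, 195};  // made-up sizes

    // Exclusive prefix sum: offset of the first beamlet of each beam.
    int h_offsets[nBeams];
    int total = 0;
    for (int i = 0; i < nBeams; i++) {
        h_offsets[i] = total;
        total += h_counts[i];
    }

    int *d_counts, *d_offsets, *d_out;
    cudaMalloc(&d_counts, nBeams * sizeof(int));
    cudaMalloc(&d_offsets, nBeams * sizeof(int));
    cudaMalloc(&d_out, total * sizeof(int));
    cudaMemcpy(d_counts, h_counts, nBeams * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets, nBeams * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0xFF, total * sizeof(int));  // every entry starts at -1

    beamletKernel<<<nBeams, 128>>>(d_counts, d_offsets, nBeams, d_out);

    std::vector<int> h_out(total);
    cudaMemcpy(h_out.data(), d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counts);
    cudaFree(d_offsets);
    cudaFree(d_out);

    // Every beamlet must have been visited and tagged with its own beam.
    for (int i = 0; i < nBeams; i++)
        for (int j = 0; j < h_counts[i]; j++)
            if (h_out[h_offsets[i] + j] != i) return false;
    return true;
}

int main()
{
    bool ok = runBeamlets();
    printf("beamlet mapping %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```

With only 5 beams this launches 5 blocks, which leaves most of a GPU idle; the inner loop over k (or over voxels, as in the question's attempt) is where additional parallelism would have to come from.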

How to write CUDA global function for this?

I want to convert the following function into CUDA.
void fun()
{
for(i = 0; i < terrainGridLength; i++)
{
for(j = 0; j < terrainGridWidth; j++)
{
//CODE of function
}
}
}
I wrote the function like this :
__global__ void fun()
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if((i < terrainGridLength)&&(j<terrainGridWidth))
{
//CODE of function
}
}
I declared both terrainGridLength and terrainGridWidth as constants and assigned the value 120 to both. I am calling the function like this:
fun<<<30,500>>>()
But I am not getting the correct output.
Is the code I wrote correct? I don't understand much about the parallel execution of the code. Please explain how the code will work and correct me if I have made any mistakes.
You use the y dimension, which means you need a 2D arrangement of threads. With a 1D launch, threadIdx.y and blockIdx.y are always 0, so j is always 0. You therefore cannot invoke the kernel with only:
int numBlock = 30;
int numThreadsPerBlock = 500;
fun<<<numBlock,numThreadsPerBlock>>>()
The invocation should be (note that the blocks now have 2D threads):
dim3 dimGrid(GRID_SIZE, GRID_SIZE); // 2D grid with GRID_SIZE*GRID_SIZE blocks
dim3 dimBlocks(BLOCK_SIZE, BLOCK_SIZE); // 2D blocks with BLOCK_SIZE*BLOCK_SIZE threads
fun<<<dimGrid, dimBlocks>>>()
Refer to the CUDA Programming Guide for further info. Also, if you want to work with 2D or 3D arrays, you are better off using cudaMallocPitch or cudaMalloc3D.
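As for your code, I think this would work (I haven't tried it, but I hope you can grasp the idea). A block can hold at most 1024 threads, so 120x120 points cannot fit in a single block; use several smaller blocks and round the grid size up:
//main
dim3 dimBlocks(16, 16); // 2D blocks of 16*16 = 256 threads
dim3 dimGrid((Width + 15) / 16, (Height + 15) / 16); // enough blocks to cover Width x Height
fun<<<dimGrid, dimBlocks>>>(Width, Height);
//kernel
__global__ void fun(int Width, int Height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // the grid is rounded up, so some threads fall outside the data
    if((i < Width)&&(j<Height))
    {
        //CODE of function
    }
}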
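To make the mapping concrete, here is a small self-contained sketch for the 120x120 case. The loop body is unknown, so it is replaced by a placeholder that stores i + j into an output array; `fillKernel`, `runFun` and the 16x16 block size are assumptions made for this example.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one (i, j) pair of the original nested loops.
__global__ void fillKernel(int *out, int length, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < length && j < width) {
        out[i * width + j] = i + j;  // placeholder for "CODE of function"
    }
}

// Hypothetical host helper: launches the kernel and checks every element.
bool runFun(int length, int width)
{
    std::vector<int> h_out(length * width, -1);
    int *d_out;
    cudaMalloc(&d_out, h_out.size() * sizeof(int));

    dim3 dimBlocks(16, 16);  // 256 threads per block
    dim3 dimGrid((length + dimBlocks.x - 1) / dimBlocks.x,
                 (width + dimBlocks.y - 1) / dimBlocks.y);
    fillKernel<<<dimGrid, dimBlocks>>>(d_out, length, width);

    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    for (int i = 0; i < length; i++)
        for (int j = 0; j < width; j++)
            if (h_out[i * width + j] != i + j) return false;
    return true;
}

int main()
{
    bool ok = runFun(120, 120);  // sizes from the question
    printf("2D launch %s\n", ok ? "ok" : "FAILED");
    return ok ? 0 : 1;
}
```

For 120x120 this launches an 8x8 grid of 16x16 blocks, i.e. 128x128 threads, and the bounds check discards the extra ones.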