I'm new to CUDA. I have a 2D image (width, height) with 3 channels (colors).
What I want is to launch a kernel that has 3D blocks and a 2D grid, like this:
kernel_2D_3D<<<dim3(1,m,n), dim3(3,TILEy,TILEz)>>>(float *in, float *out)
I use x for colors, y for width and z for height. My question is:
How can I calculate the row and column of the image:
unsigned int Row = ?
unsigned int Col = ?
and then I use this function to calculate the global unique index:
__device__ int getGlobalIdx_2D_3D()
{
int blockId = blockIdx.x+ blockIdx.y * gridDim.x;
int Idx = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x)
+ threadIdx.x;
return Idx;
}
If you are using y for width and z for height, then the row and column of the image will be calculated like this inside the kernel:
unsigned int row = blockIdx.z * blockDim.z + threadIdx.z;
unsigned int col = blockIdx.y * blockDim.y + threadIdx.y;
The current channel would be equal to threadIdx.x.
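To make the mapping concrete, here is a minimal host-side sketch in plain C++ (not a kernel) that mirrors the index arithmetic above. The names `Dim3`, `PixelCoords`, `pixel_coords`, and `interleaved_offset` are made up for illustration, and the interleaved layout (three consecutive values per pixel) is only an assumption about how `in` is stored.

```cpp
#include <cassert>

// Stand-in for CUDA's built-in index types so the arithmetic runs on the host.
struct Dim3 { unsigned x, y, z; };

struct PixelCoords { unsigned row, col, channel; };

// Same formulas as in the answer: z indexes rows (height),
// y indexes columns (width), x is the color channel.
PixelCoords pixel_coords(Dim3 blockIdx, Dim3 blockDim, Dim3 threadIdx) {
    PixelCoords p;
    p.row     = blockIdx.z * blockDim.z + threadIdx.z;
    p.col     = blockIdx.y * blockDim.y + threadIdx.y;
    p.channel = threadIdx.x;
    return p;
}

// Offset into an image ASSUMED to be interleaved:
// R,G,B of pixel 0, then R,G,B of pixel 1, and so on, row by row.
unsigned interleaved_offset(PixelCoords p, unsigned width, unsigned channels) {
    assert(p.col < width && p.channel < channels);
    return (p.row * width + p.col) * channels + p.channel;
}
```

Inside the real kernel you would still guard with `row < height && col < width` before touching memory, because the grid usually overshoots the image when the size is not a multiple of the tile size.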
Related
I'm trying to learn CUDA and I'm a bit confused about calculating thread indices. Let's say I have this loop I'm trying to parallelize:
...
for(int x = 0; x < DIM_x; x++){
for(int y = 0; y < DIM_y; y++){
for(int dx = 0; dx < psize; dx++){
array[y*DIM_x + x + dx] += 1;
}
}
}
In PyCUDA, I set:
block = (8, 8, 8)
grid = (96, 96, 16)
Most of the examples I've seen for parallelizing loops calculate thread indices like this:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int dx = blockIdx.z * blockDim.z + threadIdx.z;
if (x >= DIM_x || y >= DIM_y || dx >= psize)
return;
atomicAdd(&array[y*DIM_x + x + dx], 1);
DIM_x = 580, DIM_y = 550, psize = 50
However, if I print x, I see that multiple threads with the same thread Id are created, and the final result is wrong.
Instead, if I use this (3D grid of 3D blocks):
int blockId = blockIdx.x + blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;
int x = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x) + threadIdx.x;
It fixes the multiple same thread Ids problem for x, but I'm not sure how I'd parallelize y and dx.
If anyone could help me understand where I'm going wrong, and show me the right way to parallelize the loops, I'd really appreciate it.
However, if I print x, I see that multiple threads with the same
thread Id are created, and the final result is wrong.
It would be normal for you to see multiple threads with the same x thread ID in a multi-dimensional grid, as it would also be normal to observe many iterations of the loops in your host code with the same x value. If the result is wrong, it has nothing to do with any of the code you have shown, viz:
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <assert.h>
void host(int* array, int DIM_x, int DIM_y, int psize)
{
for(int x = 0; x < DIM_x; x++){
for(int y = 0; y < DIM_y; y++){
for(int dx = 0; dx < psize; dx++){
array[y*DIM_x + x + dx] += 1;
}
}
}
}
__global__
void kernel(int* array, int DIM_x, int DIM_y, int psize)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int dx = blockIdx.z * blockDim.z + threadIdx.z;
if (x >= DIM_x || y >= DIM_y || dx >= psize)
return;
atomicAdd(&array[y*DIM_x + x + dx], 1);
}
int main()
{
dim3 block(8, 8, 8);
dim3 grid(96, 96, 16);
int DIM_x = 580, DIM_y = 550, psize = 50;
std::vector<int> array_h(DIM_x * DIM_y * psize, 0);
std::vector<int> array_hd(DIM_x * DIM_y * psize, 0);
thrust::device_vector<int> array_d(DIM_x * DIM_y * psize, 0);
kernel<<<grid, block>>>(thrust::raw_pointer_cast(array_d.data()), DIM_x, DIM_y, psize);
host(&array_h[0], DIM_x, DIM_y, psize);
thrust::copy(array_d.begin(), array_d.end(), array_hd.begin());
cudaDeviceSynchronize();
for(int i=0; i<DIM_x * DIM_y * psize; i++) {
assert( array_h[i] == array_hd[i] );
}
return 0;
}
which when compiled and run
$ nvcc -arch=sm_52 -std=c++11 -o looploop loop_the_loop.cu
$ cuda-memcheck ./looploop
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
emits no errors and passes the check of all elements against the host code in your question.
If you are getting incorrect results, it is likely that you have a problem with initialization of the device memory before running the kernel. Otherwise I fail to see how incorrect results could be emitted by the code you have shown.
In general, performing a large number of atomic memory transactions, as your code does, is not the optimal way to perform computation on the GPU. Using non-atomic transactions would probably need to rely on other a priori information about the structure of the problem (such as a graph decomposition or a precise description of the write patterns of the problem).
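As one illustration of using the write pattern: for the loop nest in the question, the final value of each element can be computed directly, so one thread per output element could store its result with a plain write and no atomics. The sketch below is host-side C++ that only checks the arithmetic. `pairs_with_sum`, `gather_count`, and `scatter_reference` are hypothetical helper names, and it assumes the array starts out zeroed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Number of (x, dx) pairs with x + dx == s, 0 <= x < dim_x, 0 <= dx < psize.
int pairs_with_sum(int s, int dim_x, int psize) {
    int lo = std::max(0, s - (dim_x - 1));  // smallest dx that keeps x < dim_x
    int hi = std::min(psize - 1, s);        // largest dx that keeps x >= 0
    return std::max(0, hi - lo + 1);
}

// Gather formulation: the value element i holds after the whole loop nest.
// On the GPU this would be the body of one thread per output element.
int gather_count(int i, int dim_x, int dim_y, int psize) {
    int total = 0;
    for (int y = 0; y < dim_y; ++y) {
        int s = i - y * dim_x;  // the required x + dx for this y
        if (s < 0) break;       // larger y only makes s smaller
        total += pairs_with_sum(s, dim_x, psize);
    }
    return total;
}

// The original scatter loop from the question, kept as a reference.
std::vector<int> scatter_reference(int dim_x, int dim_y, int psize) {
    assert(dim_x > 0 && dim_y > 0 && psize > 0);
    std::vector<int> array(dim_x * dim_y * psize, 0);
    for (int x = 0; x < dim_x; x++)
        for (int y = 0; y < dim_y; y++)
            for (int dx = 0; dx < psize; dx++)
                array[y * dim_x + x + dx] += 1;
    return array;
}
```

Whether this is actually faster than the atomic version depends on the sizes involved. The point is only that a known write pattern lets each output be owned by exactly one thread.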
In a 3D grid with 3D blocks, the thread ID is:
unsigned long blockId = blockIdx.x
+ blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;
unsigned long threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x)
+ threadIdx.x;
That is not the x you computed. Your x is only the coordinate along the x axis of the 3D grid, so many threads legitimately share the same value.
There is a nice cheatsheet in this blog
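To see the difference between the flat thread ID and the per-axis x coordinate, here is a small host-side C++ sketch of both formulas. `Dim3`, `flat_thread_id`, and `x_coordinate` are illustrative names, not part of CUDA. On the device you would use the built-in `gridDim`, `blockDim`, `blockIdx`, and `threadIdx` instead of passing them in.

```cpp
#include <cassert>

// Stand-in for CUDA's built-in index types so the arithmetic runs on the host.
struct Dim3 { unsigned x, y, z; };

// Flat ID of a thread in a 3D grid of 3D blocks; unique across the launch.
unsigned long flat_thread_id(Dim3 gridDim, Dim3 blockDim,
                             Dim3 blockIdx, Dim3 threadIdx) {
    assert(blockIdx.x < gridDim.x && blockIdx.y < gridDim.y && blockIdx.z < gridDim.z);
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y && threadIdx.z < blockDim.z);
    unsigned long blockId = blockIdx.x
                          + blockIdx.y * gridDim.x
                          + static_cast<unsigned long>(gridDim.x) * gridDim.y * blockIdx.z;
    unsigned long threadsPerBlock =
        static_cast<unsigned long>(blockDim.x) * blockDim.y * blockDim.z;
    return blockId * threadsPerBlock
         + threadIdx.z * (blockDim.x * blockDim.y)
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}

// Coordinate along the x axis only; every thread that differs
// only in y or z shares this value.
unsigned x_coordinate(Dim3 blockDim, Dim3 blockIdx, Dim3 threadIdx) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}
```

For a grid of (2, 3, 2) blocks with (2, 2, 2) threads each, there are 96 threads with 96 distinct flat IDs, but only 4 distinct x coordinates, so each x value is shared by 24 threads.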
I have a kernel, how can I get the number of used registers per thread when launching the kernels? I mean in a PyCuda way.
A simple example will be:
__global__
void
make_blobs(float* matrix, float2 *pts, int num_pts, float sigma, int rows, int cols) {
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
if (x < cols && y < rows) {
int idx = y*cols + x;
float temp = 0.f;
for (int i = 0; i < num_pts; i++) {
float x_0 = pts[i].x;
float y_0 = pts[i].y;
temp += exp(-(pow(x - x_0, 2) + pow(y - y_0, 2)) / (2 * sigma*sigma));
}
matrix[idx] = temp;
}
}
Is there any way to get the number without crashing the program if the number actually used exceeds the maximum?
The kernel above is fine; it does not exceed the maximum on my machine. I just want a convenient way to get the number. Thanks!
PyCUDA already provides this as part of the CUDA function object. The property is called pycuda.driver.Function.num_regs.
Below is a small example that shows how to use it:
import pycuda.autoinit
from pycuda.compiler import SourceModule
kernel_src = """
__global__ void
make_blobs(float* matrix, float2 *pts, int num_pts, float sigma, int rows, int cols) {
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
if (x < cols && y < rows) {
int idx = y*cols + x;
float temp = 0.f;
for (int i = 0; i < num_pts; i++) {
float x_0 = pts[i].x;
float y_0 = pts[i].y;
temp += exp(-(pow(x - x_0, 2) + pow(y - y_0, 2)) / (2 * sigma*sigma));
}
matrix[idx] = temp;
}
}"""
compiledKernel = SourceModule(kernel_src)
make_blobs = compiledKernel.get_function("make_blobs")
print(make_blobs.num_regs)
Note that you don't need to use SourceModule. You can also load the module from e.g. a cubin file. More details can be found in the documentation.
I am having an issue with the following code. It takes an input image and should save a grayscale version of it. It seems to do the right thing, but it processes only part of the image rather than the whole. The problem seems to occur in the cudaMemcpy from device to host.
I believe I probably have a problem with how I am allocating memory in CUDA.
__global__ void rgb2grayCudaKernel(unsigned char *inputImage, unsigned char *grayImage, const int width, const int height)
{
int ty = (blockIdx.x * blockDim.x) + threadIdx.x;
//int tx = (blockIdx.x * blockDim.x) + threadIdx.x;
int tx = (blockIdx.y * blockDim.y) + threadIdx.y;
if( (ty < height && tx<width) )
{
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[(ty * width) + tx]);
float g = static_cast< float >(inputImage[(width * height) + (ty * width) + tx]);
float b = static_cast< float >(inputImage[(2 * width * height) + (ty * width) + tx]);
grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
grayImage[(ty * width) + tx] = static_cast< unsigned char >(grayPix);
}
}
//***************************************rgb2gray function, call of kernel in here *************************************
void rgb2grayCuda(unsigned char *inputImage, unsigned char *grayImage, const int width, const int height)
{
unsigned char *inputImage_c, *grayImage_c;
const int sizee= (width*height);
// **********memory allocation for pointers and cuda******************
cudaMalloc((void **) &inputImage_c, sizee);
checkCudaError("im not alloc!");
cudaMalloc((void **) &grayImage_c, sizee);
checkCudaError("gray not alloc !");
//***********copy to device*************************
cudaMemcpy(inputImage_c, inputImage, sizee*sizeof(unsigned char), cudaMemcpyHostToDevice);
checkCudaError("im not send !");
cudaMemcpy(grayImage_c, grayImage, sizee*sizeof(unsigned char), cudaMemcpyHostToDevice);
checkCudaError("gray not send !");
dim3 thrb(32,32);
dim3 numb (ceil(width*height/1024));
//**************Execute Kernel (Timer in here)**************************
NSTimer kernelTime = NSTimer("kernelTime", false, false);
kernelTime.start();
rgb2grayCudaKernel<<<numb,1024>>> (inputImage_c, grayImage_c, width, height);
checkCudaError("kernel!");
kernelTime.stop();
//**************copy back to host*************************
printf("/c");
cudaMemcpy(grayImage, grayImage_c, sizee*sizeof(unsigned char), cudaMemcpyDeviceToHost);
checkCudaError("Receiving data from CPU failed!");
//*********************free memory***************************
cudaFree(inputImage_c);
cudaFree(grayImage_c);
//**********************print time****************
cout << fixed << setprecision(6);
cout << "rgb2gray (cpu): \t\t" << kernelTime.getElapsed() << " seconds." << endl;
}
const int sizee= (width*height);
should be:
const int sizee= (width*height*3);
for rgb data (1 byte per channel).
I believe in bitmap images, the colors should be interleaved as in:
rgb of pixel1, rgb of pixel 2 ... rgb of pixel width*height
Therefore your kernel should be:
__global__ void rgb2grayCudaKernel(unsigned char *inputImage, unsigned char *grayImage, const int width, const int height)
{
int tx = (blockIdx.y * blockDim.y) + threadIdx.y;
int ty = (blockIdx.x * blockDim.x) + threadIdx.x;
if( (ty < height && tx<width) )
{
unsigned int pixel = ty*width+tx;
float grayPix = 0.0f;
float r = static_cast< float >(inputImage[pixel*3]);
float g = static_cast< float >(inputImage[pixel*3+1]);
float b = static_cast< float >(inputImage[pixel*3+2]);
grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
grayImage[pixel] = static_cast< unsigned char >(grayPix);
}
}
Also, from what I have seen, luminosity is usually calculated as 0.21 R + 0.72 G + 0.07 B.
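One more thing to check in the question's code is the launch configuration. The kernel reads `threadIdx.y` and `blockIdx.y`, but `<<<numb,1024>>>` is a 1D launch, so both are always 0 and `tx` never moves past the first column. In addition, `width*height/1024` is an integer division, so it truncates before `ceil` is applied. Below is a rough host-side C++ sketch of the grid sizing plus a CPU reference for the interleaved conversion. `ceil_div`, `Grid2D`, `grid_for`, and `gray_reference` are hypothetical names, and the 32x32 block is just the value from `thrb` in the question.

```cpp
#include <cassert>
#include <vector>

// Integer ceiling division: smallest n with n * b >= a.
unsigned ceil_div(unsigned a, unsigned b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

struct Grid2D { unsigned x, y; };

// The kernel maps x to rows (ty) and y to columns (tx),
// so grid.x has to cover the height and grid.y the width.
Grid2D grid_for(unsigned width, unsigned height, unsigned block_x, unsigned block_y) {
    Grid2D g;
    g.x = ceil_div(height, block_x);
    g.y = ceil_div(width, block_y);
    return g;
}

// CPU reference for interleaved RGB input (3 bytes per pixel),
// using the weights from the question's kernel.
std::vector<unsigned char> gray_reference(const std::vector<unsigned char>& rgb,
                                          unsigned width, unsigned height) {
    assert(rgb.size() == static_cast<std::vector<unsigned char>::size_type>(width) * height * 3);
    std::vector<unsigned char> gray(width * height);
    for (unsigned pixel = 0; pixel < width * height; ++pixel) {
        float r = rgb[pixel * 3];
        float g = rgb[pixel * 3 + 1];
        float b = rgb[pixel * 3 + 2];
        gray[pixel] = static_cast<unsigned char>(0.3f * r + 0.59f * g + 0.11f * b);
    }
    return gray;
}
```

With these helpers the launch would look something like `rgb2grayCudaKernel<<<dim3(g.x, g.y), dim3(32, 32)>>>(...)`, and the CPU reference gives you something to compare the device output against.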
I'm implementing a CUDA program for transposing an image. I created 2 kernels. The first kernel does out of place transposition and works perfectly for any image size.
Then I created a kernel for in-place transposition of square images. However, the output is incorrect. The lower triangle of the image is transposed but the upper triangle remains the same. The resulting image has a stairs like pattern in the diagonal and the size of each step of the stairs is equal to the 2D block size which I used for my kernel.
Out-of-Place Kernel:
Works perfectly for any image size if src and dst are different.
template<typename T, int blockSize>
__global__ void kernel_transpose(T* src, T* dst, int width, int height, int srcPitch, int dstPitch)
{
__shared__ T block[blockSize][blockSize];
int col = blockIdx.x * blockSize + threadIdx.x;
int row = blockIdx.y * blockSize + threadIdx.y;
if((col < width) && (row < height))
{
int tid_in = row * srcPitch + col;
block[threadIdx.y][threadIdx.x] = src[tid_in];
}
__syncthreads();
col = blockIdx.y * blockSize + threadIdx.x;
row = blockIdx.x * blockSize + threadIdx.y;
if((col < height) && (row < width))
{
int tid_out = row * dstPitch + col;
dst[tid_out] = block[threadIdx.x][threadIdx.y];
}
}
In-Place Kernel:
template<typename T, int blockSize>
__global__ void kernel_transpose_inplace(T* srcDst, int width, int pitch)
{
__shared__ T block[blockSize][blockSize];
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int tid_in = row * pitch + col;
int tid_out = col * pitch + row;
if((row < width) && (col < width))
block[threadIdx.x][threadIdx.y] = srcDst[tid_in];
__threadfence();
if((row < width) && (col < width))
srcDst[tid_out] = block[threadIdx.x][threadIdx.y];
}
Wrapper Function:
int transpose_8u_c1(unsigned char* pSrcDst, int width,int pitch)
{
//pSrcDst is allocated using cudaMallocPitch
dim3 block(16,16);
dim3 grid;
grid.x = (width + block.x - 1)/block.x;
grid.y = (width + block.y - 1)/block.y;
kernel_transpose_inplace<unsigned char,16><<<grid,block>>>(pSrcDst,width,pitch);
assert(cudaSuccess == cudaDeviceSynchronize());
return 1;
}
Sample Input & Wrong Output:
I know this problem has something to do with the logic of in-place transpose. This is because my out of place transpose kernel which is working perfectly for different source and destination, also gives the same wrong result if I pass it a single pointer for source and destination.
What am I doing wrong? Help me in correcting the In-place kernel.
Your in-place kernel overwrites data in the image that another thread will subsequently read for its own transpose operation. So for a square image, you should buffer the destination data before overwriting it, then place that data in its proper transposed location. Since each thread effectively performs 2 copies with this method, only half as many threads need to do any work. Something like this should work:
template<typename T, int blockSize>
__global__ void kernel_transpose_inplace(T* srcDst, int width, int pitch)
{
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int tid_in = row * pitch + col;
int tid_out = col * pitch + row;
if((row < width) && (col < width) && (row<col)) {
T temp = srcDst[tid_out];
srcDst[tid_out] = srcDst[tid_in];
srcDst[tid_in] = temp;
}
}
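If you want to convince yourself of the swap logic without a GPU, here is a host-side C++ emulation of the same idea. `transpose_inplace_thread` and `transpose_inplace` are made-up names. The sketch assumes `pitch` is counted in elements, which matches the `unsigned char` case in the question where a byte pitch and an element pitch are the same number.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// What a single "thread" at (row, col) does: only threads strictly above
// the diagonal act, and each one swaps its element with the mirrored one.
template <typename T>
void transpose_inplace_thread(std::vector<T>& srcDst, int width, int pitch,
                              int row, int col) {
    if (row < width && col < width && row < col) {
        std::swap(srcDst[row * pitch + col], srcDst[col * pitch + row]);
    }
}

// Visits every (row, col) the grid would cover, including the overshoot
// past width when width is not a multiple of the block size.
template <typename T>
void transpose_inplace(std::vector<T>& srcDst, int width, int pitch, int block) {
    assert(block > 0 && pitch >= width);
    assert(static_cast<int>(srcDst.size()) >= width * pitch);
    int covered = (width + block - 1) / block * block;
    for (int row = 0; row < covered; ++row)
        for (int col = 0; col < covered; ++col)
            transpose_inplace_thread(srcDst, width, pitch, row, col);
}
```

Each pair of mirrored elements is touched by exactly one thread, so the order in which threads run does not matter and no synchronization is needed.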
I have the following kernel code for matrix manipulation. Matrix A is 1*3 and matrix B is 3*3, so the resulting matrix C is 1*3. In the following code, Width is 3.
__global__ void MatrixMulKernel(float* d_M,float* d_N,float* d_P,int Width) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if(row>=Width || col>=Width){ // matrix range
return;
}
float P_val = 0.0f;
for (int k = 0; k < Width; ++k) {
float M_elem = d_M[row * Width + k];
float N_elem = d_N[k * Width + col];
P_val += M_elem * N_elem;
}
d_P[row*Width+col] = P_val;
}
The kernel is called as follows:
int block_size = 32;
dim3 dimGrid(Width/block_size, Width/block_size);
dim3 dimBlock(block_size, block_size);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P,3);
But I am getting wrong results. I am getting results as zero always.
Can anyone help me please.
The code looks like it is for multiplication of 2 square matrices of the same size.
Width is the number of columns of the first matrix.
You have to provide this as an argument to the function.
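Two details in the question's code are worth spelling out. With `Width` equal to 3 and `block_size` equal to 32, the integer division `Width/block_size` is 0, so the grid has zero blocks and the launch is rejected. The kernel never runs, which would explain an output of all zeros. Checking `cudaGetLastError()` after the launch should show this. The kernel also uses `Width` for every dimension, so a 1*3 matrix A would need its own row count. The sketch below is host-side C++ with hypothetical names (`blocks_needed`, `matmul_reference`, and the `m`, `k`, `n` parameters) showing the rounded-up grid size and a reference product with separate dimensions.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover n elements, rounded up.
unsigned blocks_needed(unsigned n, unsigned block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Reference for C (m x n) = A (m x k) * B (k x n), all row-major.
// The two outer loops correspond to the (row, col) a GPU thread would own.
std::vector<float> matmul_reference(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int m, int k, int n) {
    assert(static_cast<int>(A.size()) == m * k);
    assert(static_cast<int>(B.size()) == k * n);
    std::vector<float> C(m * n, 0.0f);
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < k; ++i) {
                sum += A[row * k + i] * B[i * n + col];
            }
            C[row * n + col] = sum;
        }
    }
    return C;
}
```

In the kernel, the matching changes would be a guard of `row >= m || col >= n` and a grid of `dim3(blocks_needed(n, block_size), blocks_needed(m, block_size))`.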