The following kernel, which performs a matrix copy, comes from this article:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
__global__ void copy(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
}
I am confused by the notation used. From what I understand, the data is in row-major format: "y" corresponds to rows and "x" corresponds to columns, so the linear index is calculated as data[y][x] = data[y*width+x].
How is odata[(y+j)*width + x] coalesced? In row-major order, elements in the same row are in successive locations, so accessing elements in the fashion (y,x), (y,x+1), (y,x+2), ... is contiguous.
However, "j" above is added to "y", which does not seem coalesced.
Is my understanding of the notation incorrect or am I missing something here?
Coalescing memory transactions only requires that threads from the same warp read from and write to a contiguous block of memory which can be served by a single transaction. Your code
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
odata[(y+j)*width + x] = idata[(y+j)*width + x];
produces coalesced access because j is constant across every thread in a warp. So the access patterns become:
0. (y * width); (y * width + 1); (y * width + 2); .....
1. (y * width + width); (y * width + width + 1); (y * width + width + 2); .....
2. (y * width + 2 * width); (y * width + 2 * width + 1); (y * width + 2 * width + 2); .....
Within each warp, at any value of j, the access is still to sequential elements in memory, so reads and writes will coalesce.
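For reference, here is a minimal host-side sketch of how such a kernel is typically launched. The blog post uses TILE_DIM = 32 and BLOCK_ROWS = 8; the matrix size and the device pointers d_idata/d_odata are assumptions for illustration:

#define TILE_DIM   32
#define BLOCK_ROWS  8

// assumed: a 1024 x 1024 float matrix, already allocated and filled on the device
const int nx = 1024, ny = 1024;
dim3 block(TILE_DIM, BLOCK_ROWS);         // 32 x 8 = 256 threads per block
dim3 grid(nx / TILE_DIM, ny / TILE_DIM);  // each block copies one 32 x 32 tile
copy<<<grid, block>>>(d_odata, d_idata);
cudaDeviceSynchronize();

With blockDim.x = 32, each warp consists of 32 threads with consecutive threadIdx.x and the same threadIdx.y, so every load and store in the loop touches 32 consecutive floats, i.e. a single 128-byte transaction.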
Related
I have to implement a box filter on the GPU with CUDA, and I'm doing it on Google Colab. The code runs without any errors, but my resulting image is all black.
This is my blurring function:
__global__ void apply_box_blur(int height, int width, unsigned char* buffer, unsigned char* out) {
    int i, j;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < 2 || col < 2 || row >= height - 3 || col >= width - 3) return;

    float v = 1.0 / 9.0;
    float kernel[3][3] = { {v,v,v},
                           {v,v,v},
                           {v,v,v} };

    float sum0 = 0.0;
    float sum1 = 0.0;
    float sum2 = 0.0;

    for (i = -1; i <= 1; i++)
    {
        for (j = -1; j <= 1; j++)
        {
            // convolve the kernel with every color plane
            sum0 = sum0 + (float)kernel[i + 1][j + 1] * buffer[((row + i) * width + (col + j)) * 3 + 0];
            sum1 = sum1 + (float)kernel[i + 1][j + 1] * buffer[((row + i) * width + (col + j)) * 3 + 1];
            sum2 = sum2 + (float)kernel[i + 1][j + 1] * buffer[((row + i) * width + (col + j)) * 3 + 2];
        }
    }

    out[(row * width + col) * 3 + 0] = (unsigned char)sum0;
    out[(row * width + col) * 3 + 1] = (unsigned char)sum1;
    out[(row * width + col) * 3 + 2] = (unsigned char)sum2;
}
And my main function:
// device copies
unsigned char* d_buffer;
unsigned char* d_out;
// allocate space for device copies
cudaMalloc((void**)&d_buffer, size * 3 * sizeof(unsigned char));
cudaMalloc((void**)&d_out, size * 3 * sizeof(unsigned char));
// Copy inputs to device
cudaMemcpy(d_buffer, buffer, size * 3 * sizeof(unsigned char), cudaMemcpyHostToDevice);
// perform the Box blur and store the resulting pixels in the output buffer
dim3 block(16, 16);
dim3 grid(width / 16, height / 16);
apply_box_blur <<<grid, block>>> (height, width, d_buffer, d_out);
cudaMemcpy(out, d_out, size * 3 * sizeof(unsigned char), cudaMemcpyDeviceToHost);
Am I doing something wrong with the block and grid sizes? Or is there something wrong with my blurring function? Is it maybe a Google Colab issue?
Found the issue.
The block and grid sizes should've been this:
dim3 blockSize(16, 16, 1);
dim3 gridSize((size*3)/blockSize.x, (size*3)/blockSize.y, 1);
Also my Google Colab wasn't connected to a GPU.
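A general note: checking the return status of CUDA calls makes both failure modes (a bad launch configuration, or no GPU attached to the runtime) visible immediately instead of silently producing an untouched, all-black output buffer. A minimal sketch, reusing the variable names from the code above:

#include <cstdio>

#define CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) \
        printf("CUDA error '%s' at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
} while (0)

CHECK(cudaMalloc((void**)&d_buffer, size * 3 * sizeof(unsigned char)));
CHECK(cudaMalloc((void**)&d_out, size * 3 * sizeof(unsigned char)));
CHECK(cudaMemcpy(d_buffer, buffer, size * 3 * sizeof(unsigned char), cudaMemcpyHostToDevice));
apply_box_blur<<<grid, block>>>(height, width, d_buffer, d_out);
CHECK(cudaGetLastError());       // reports launch / configuration errors
CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel runs
CHECK(cudaMemcpy(out, d_out, size * 3 * sizeof(unsigned char), cudaMemcpyDeviceToHost));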
I'm trying to solve the 2D Laplace equation with shared memory, but one strange thing is that the blockDim.y value is always 1. Could someone help me?
host code
checkCudaErrors(cudaMalloc((void**)&d_A, h*h * sizeof(float)));
checkCudaErrors(cudaMalloc((void**)&d_out, h*h * sizeof(float)));
checkCudaErrors(cudaMemcpy(d_A, A, h*h * sizeof(float), cudaMemcpyHostToDevice));
dim3 blockSize = (BLOCK_SIZE, BLOCK_SIZE);
dim3 gridSize = ((h+BLOCK_SIZE-1)/BLOCK_SIZE, (h + BLOCK_SIZE - 1) / BLOCK_SIZE);
LaplaceDifference << <gridSize, blockSize >> > (d_A, h, d_out);
checkCudaErrors(cudaMemcpy(B, d_out, h*h * sizeof(float), cudaMemcpyDeviceToHost));
kernel code
int idx = blockIdx.x*blockDim.x + threadIdx.x;
int idy = blockIdx.y*blockDim.y + threadIdx.y;
__shared__ float A_ds[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
int n = 1;
//Load data in shared memory
int halo_index_left = (blockIdx.x - 1)*blockDim.x + threadIdx.x;
int halo_index_right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
int halo_index_up = (blockIdx.y - 1)*blockDim.y + threadIdx.y;
int halo_index_down = (blockIdx.y + 1)*blockDim.y + threadIdx.y;
A_ds[n + threadIdx.y][n + threadIdx.x] = A[idy * h +idx];
if (threadIdx.x >= blockDim.x - n) {
    A_ds[threadIdx.y + n][threadIdx.x - (blockDim.x - n)] = (halo_index_left < 0) ? 0 : A[idy*h + halo_index_left];
}
if (threadIdx.x < n) {
    A_ds[threadIdx.y + n][blockDim.x + n + threadIdx.x] = (halo_index_right >= h) ? 0 : A[idy*h + halo_index_right];
}
if (threadIdx.y >= blockDim.y - n) {
    A_ds[threadIdx.y - (blockDim.y - n)][threadIdx.x + n] = (halo_index_up < 0) ? 0 : A[halo_index_up*h + idx];
}
if (threadIdx.y < n) {
    A_ds[blockDim.y + n + threadIdx.y][threadIdx.x + n] = (halo_index_down >= h) ? 0 : A[halo_index_down*h + idx];
}
__syncthreads();
P[idy*h + idx] = 0.25*(A_ds[threadIdx.y + n - 1][threadIdx.x + n] + A_ds[threadIdx.y + n + 1][threadIdx.x + n] + A_ds[threadIdx.y + n][threadIdx.x + n - 1] + A_ds[threadIdx.y + n][threadIdx.x + n + 1]);
(I spent quite some time looking for a dupe, but could not find it.)
A dim3 variable is a particular data type defined in the CUDA header file vector_types.h.
It provides several constructors. Here are a couple valid uses of constructors for this variable:
dim3 grid(gx, gy, gz);
dim3 grid = dim3(gx, gy, gz);
What you have shown:
dim3 blockSize = (BLOCK_SIZE, BLOCK_SIZE);
won't work the way you expect.
Since there is no dim3 usage on the right hand side of the equal sign, the compiler will use some other method to process what is there. It is not a syntax error, because both the use of parentheses and the comma are legal in this form, from a C++ language perspective.
Hopefully you understand how parentheses work in C++. I'm not going to try to describe the comma operator; you can read about it here and here. The net effect is that the compiler evaluates both expressions (the one on the left of the comma and the one on the right) and takes the value of the overall expression to be the value produced by the expression on the right. So this:
(BLOCK_SIZE, BLOCK_SIZE)
becomes this:
BLOCK_SIZE
which is quite obviously a scalar quantity, not multi-dimensional.
When you assign a scalar to a dim3 variable:
dim3 blockSize = BLOCK_SIZE;
You end up with a dim3 variable that has these dimensions:
(BLOCK_SIZE, 1, 1)
One method to fix what you have is as follows:
dim3 blockSize = dim3(BLOCK_SIZE, BLOCK_SIZE);
                 ^^^^
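A short host-only sketch (compiled as a .cu file with nvcc; the variable names are arbitrary) makes the difference visible:

#include <cstdio>

int main() {
    const unsigned int BLOCK_SIZE = 16;
    dim3 a = (BLOCK_SIZE, BLOCK_SIZE);      // comma operator: equivalent to dim3 a = BLOCK_SIZE;
    dim3 b = dim3(BLOCK_SIZE, BLOCK_SIZE);  // dim3 constructor
    printf("a = (%u, %u, %u)\n", a.x, a.y, a.z);  // prints a = (16, 1, 1)
    printf("b = (%u, %u, %u)\n", b.x, b.y, b.z);  // prints b = (16, 16, 1)
    return 0;
}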
This line:
dim3 blockSize = (BLOCK_SIZE, BLOCK_SIZE);
initializes a 1D block size. What you want is:
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);
I'm trying to learn CUDA and I'm a bit confused about calculating thread indices. Let's say I have this loop I'm trying to parallelize:
...
for(int x = 0; x < DIM_x; x++){
    for(int y = 0; y < DIM_y; y++){
        for(int dx = 0; dx < psize; dx++){
            array[y*DIM_x + x + dx] += 1;
        }
    }
}
In PyCUDA, I set:
block = (8, 8, 8)
grid = (96, 96, 16)
Most of the examples I've seen for parallelizing loops calculate thread indices like this:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int dx = blockIdx.z * blockDim.z + threadIdx.z;
if (x >= DIM_x || y >= DIM_y || dx >= psize)
return;
atomicAdd(&array[y*DIM_x + x + dx], 1);
DIM_x = 580, DIM_y = 550, psize = 50
However, if I print x, I see that multiple threads with the same thread Id are created, and the final result is wrong.
Instead, if I use this (3D grid of 3D blocks):
int blockId = blockIdx.x + blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;
int x = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x) + threadIdx.x;
It fixes the multiple same thread Ids problem for x, but I'm not sure how I'd parallelize y and dx.
If anyone could help me understand where I'm going wrong, and show me the right way to parallelize the loops, I'd really appreciate it.
However, if I print x, I see that multiple threads with the same thread Id are created, and the final result is wrong.
It would be normal for you to see multiple threads with the same x thread ID in a multi-dimensional grid, as it would also be normal to observe many iterations of the loops in your host code with the same x value. If the result is wrong, it has nothing to do with any of the code you have shown, viz:
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <assert.h>
void host(int* array, int DIM_x, int DIM_y, int psize)
{
    for(int x = 0; x < DIM_x; x++){
        for(int y = 0; y < DIM_y; y++){
            for(int dx = 0; dx < psize; dx++){
                array[y*DIM_x + x + dx] += 1;
            }
        }
    }
}

__global__
void kernel(int* array, int DIM_x, int DIM_y, int psize)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int dx = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= DIM_x || y >= DIM_y || dx >= psize)
        return;
    atomicAdd(&array[y*DIM_x + x + dx], 1);
}

int main()
{
    dim3 block(8, 8, 8);
    dim3 grid(96, 96, 16);
    int DIM_x = 580, DIM_y = 550, psize = 50;

    std::vector<int> array_h(DIM_x * DIM_y * psize, 0);
    std::vector<int> array_hd(DIM_x * DIM_y * psize, 0);
    thrust::device_vector<int> array_d(DIM_x * DIM_y * psize, 0);

    kernel<<<grid, block>>>(thrust::raw_pointer_cast(array_d.data()), DIM_x, DIM_y, psize);
    host(&array_h[0], DIM_x, DIM_y, psize);

    thrust::copy(array_d.begin(), array_d.end(), array_hd.begin());
    cudaDeviceSynchronize();

    for(int i=0; i<DIM_x * DIM_y * psize; i++) {
        assert( array_h[i] == array_hd[i] );
    }
    return 0;
}
which when compiled and run
$ nvcc -arch=sm_52 -std=c++11 -o looploop loop_the_loop.cu
$ cuda-memcheck ./looploop
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
emits no errors and passes the check of all elements against the host code in your question.
If you are getting incorrect results, it is likely that you have a problem with initialization of the device memory before running the kernel. Otherwise I fail to see how incorrect results could be emitted by the code you have shown.
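For example, cudaMalloc does not zero memory, so an accumulator allocated that way has to be cleared explicitly before the kernel adds into it (the thrust::device_vector in the code above is zero-initialized, which is why the test passes). A minimal sketch, assuming the same names and sizes as above:

int* array_d;
size_t bytes = DIM_x * DIM_y * psize * sizeof(int);
cudaMalloc(&array_d, bytes);
cudaMemset(array_d, 0, bytes);  // zero the accumulator before the kernel increments it
kernel<<<grid, block>>>(array_d, DIM_x, DIM_y, psize);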
In general, performing a large number of atomic memory transactions, as your code does, is not the optimal way to perform computation on the GPU. Using non-atomic transactions would probably require other a priori information about the structure of the problem (such as a graph decomposition or a precise description of its write patterns).
In a 3D grid with 3D blocks, the thread ID is:
unsigned long blockId = blockIdx.x
+ blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;
unsigned long threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x)
+ threadIdx.x;
That is not the x you computed. Your x is only the x index within that 3D grid.
There is a nice cheat sheet in this blog post.
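As a small illustration (a hypothetical kernel, not taken from the question): many threads share the same per-axis x, while the flat threadId computed above is unique per thread:

#include <cstdio>

__global__ void show_ids()
{
    unsigned long blockId = blockIdx.x
                          + blockIdx.y * gridDim.x
                          + gridDim.x * gridDim.y * blockIdx.z;
    unsigned long threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                           + (threadIdx.z * (blockDim.x * blockDim.y))
                           + (threadIdx.y * blockDim.x)
                           + threadIdx.x;
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // repeats for every y/z position
    if (x == 0)
        printf("x = %d, threadId = %llu\n", x, (unsigned long long)threadId);
}

Launching this with, say, show_ids<<<dim3(2, 2, 2), dim3(4, 4, 4)>>>() prints x = 0 many times, each line with a different threadId.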
How can I count the number of cycles consumed by a function like the following? Should I simply count the number of additions, multiplications and divisions? Where can I check how many cycles an addition takes in CUDA?
__global__
void mandelbrotSet_per_element(Grayscale *image){
    float minR = -2.0f, maxR = 1.0f;
    float minI = -1.2f, maxI = minI + (maxR-minR) * c_rows / c_cols;
    float realFactor = (maxR - minR) / (c_cols-1);
    float imagFactor = (maxI - minI) / (c_rows-1);

    bool isInSet;
    float c_real, c_imag, z_real, z_imag;

    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int x = blockDim.x * blockIdx.x + threadIdx.x;

    while (y < c_rows){
        while (x < c_cols) {
            c_real = minR + x * realFactor;
            c_imag = maxI - y * imagFactor;
            z_real = c_real; z_imag = c_imag;

            isInSet = true;
            for (int k = 0; k < c_iterations; k++){
                float z_real2 = z_real * z_real;
                float z_imag2 = z_imag * z_imag;
                if (z_real2 + z_imag2 > 4){
                    isInSet = false;
                    break;
                }
                z_imag = 2 * z_real * z_imag + c_imag;
                z_real = z_real2 - z_imag2 + c_real;
            }
            if (isInSet) image[y*c_cols+x] = 255;
            else         image[y*c_cols+x] = 0;

            x += blockDim.x * gridDim.x;
        }
        x = blockDim.x * blockIdx.x + threadIdx.x;
        y += blockDim.y * gridDim.y;
    }
}
Instruction throughput is described in the programming guide here.
You can also try measuring a sequence of instructions using the native clock() function described here.
The compiler tends to obscure actual counts of operations at the source code level (increasing or possibly decreasing apparent arithmetic intensity), so if you want to identify exactly what the machine is doing, you may want to inspect the PTX (nvcc -ptx ...) or possibly the machine-level assembly code, called SASS, which you can extract from an executable using the cuobjdump utility.
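A minimal sketch of the clock()-based approach (using clock64(), the 64-bit variant; the loop body and pointer names are placeholders): read the counter before and after the code of interest, and keep the result live so the compiler cannot remove the loop:

__global__ void time_fma(float a, float b, long long *cycles, float *sink)
{
    float acc = a;
    long long start = clock64();
    for (int i = 0; i < 1000; i++)
        acc = acc * b + a;          // chain of dependent multiply-adds
    long long stop = clock64();
    *cycles = stop - start;
    *sink = acc;                    // prevents the loop from being optimized away
}

Dividing the measured cycle count by the iteration count gives a rough per-iteration figure for that one thread. For exact instruction counts, inspecting the SASS (for example with cuobjdump -sass ./a.out) is still the most reliable option.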
I have the following kernel code for matrix multiplication. Matrix A is 1x3 and matrix B is 3x3, so the resulting matrix C would be 1x3. In the following code, Width would be 3.
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= Width || col >= Width) { // matrix range
        return;
    }
    float P_val = 0.0f;
    for (int k = 0; k < Width; ++k) {
        float M_elem = d_M[row * Width + k];
        float N_elem = d_N[k * Width + col];
        P_val += M_elem * N_elem;
    }
    d_P[row*Width+col] = P_val;
}
The kernel is called as follows:
int block_size = 32;
dim3 dimGrid(Width/block_size, Width/block_size);
dim3 dimBlock(block_size, block_size);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P,3);
But I am getting wrong results; the result is always zero.
Can anyone help me, please?
The code looks like it's for multiplication of two square matrices of the same size.
Width is the number of columns of the first matrix.
You have to provide this as an argument to the function.
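Note also that the launch configuration in the question rounds down: with Width = 3 and block_size = 32, Width/block_size is 0, so dimGrid is (0, 0), the launch is rejected as an invalid configuration, and d_P is never written. A minimal sketch of a launch using the usual ceiling-division pattern (variable names as in the question):

int block_size = 32;
dim3 dimBlock(block_size, block_size);
dim3 dimGrid((Width + block_size - 1) / block_size,   // rounds up, so even Width = 3 gets one block
             (Width + block_size - 1) / block_size);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);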