Confusion over grid and block dimensions - CUDA

I am trying to solve the problem at the end of lesson 1 of the Udacity course but I'm not sure if I have just made a typo or if the actual code is wrong.
void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage, unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
  size_t totalPixels = numRows * numCols;
  size_t gridRows = totalPixels / 32;
  size_t gridCols = totalPixels / 32;
  const dim3 blockSize(32,32,1);
  const dim3 gridSize(gridCols,gridRows,1);
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
The other method is:
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage, unsigned char* const greyImage, int numRows, int numCols)
{
  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
  uchar4 rgba = rgbaImage[x * numCols + y];
  float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
  greyImage[x * numCols + y] = channelSum;
}
The error message says the following:
libdc1394 error: failed to initialize libdc1394
Cuda error at student_func.cu:76
unspecified launch failure cudaGetLastError()
we were unable to execute your code. Did you set the grid and/or block size correctly?
But then, it says that the code has compiled:
Your code compiled!
error output: libdc1394 error: Failed to initialize libdc1394
Cuda error at student_func.cu:76
unspecified launch failure cudaGetLastError()
Line 76 is the last line in the first code block and as far as I'm aware I haven't changed anything in it. Line 76 is as follows,
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
I can't actually find the declaration of cudaGetLastError().
I'm mainly concerned with my understanding of setting up the grid/block dimensions, and whether the first method's approach was right with regard to mapping between a 1D array of pixel positions and my threads.
EDIT:
I guess I've misunderstood something. Is numRows the number of pixels in the vertical direction, and numCols the number of pixels in the horizontal direction?
My block is made up of 8 x 8 threads, where each thread represents one pixel? If so, I'm assuming that's why I had to divide by 4 when calculating gridRows, since the image is not square? I'm assuming I could also have made a block with a 2:1 ratio of columns to rows?
EDIT 2:
I just tried changing my block so that it had a 2:1 ratio, so I could then divide numRows and numCols by the same number, but it's now showing blank areas at the bottom and side. Why are there blank areas at both the bottom and the side? I haven't changed the y dimensions of my grid or block.

Each block processes 32*32 pixels, and there are (totalPixels / 32) * (totalPixels / 32) blocks, so you process totalPixels^2 pixels in total - that seems wrong.
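To put made-up numbers on it: a 512 x 512 image has totalPixels = 262144, so the posted code would launch an 8192 x 8192 grid of 32 x 32 blocks - about 6.9 * 10^10 threads for only 262144 pixels, and almost every one of them indexes far outside the image, which is consistent with the "unspecified launch failure" above.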
The first version was wrong; this should be the correct one:
const dim3 blockSize(32,32,1);
size_t gridCols = (numCols + blockSize.x - 1) / blockSize.x;
size_t gridRows = (numRows + blockSize.y - 1) / blockSize.y;
This is a pretty common pattern for 2D grids - worth remembering. In the sample, the image size is not a power of two, and you want the blocks to cover the whole image (or even slightly more), so the following must hold:
gridCols * blockSize.x >= numCols
gridRows * blockSize.y >= numRows
You choose a block size and, based on it, compute how many blocks you need to cover the whole image.
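For a concrete example with made-up dimensions: a 557 x 313 image with 32 x 32 blocks gives gridCols = (557 + 31) / 32 = 18 and gridRows = (313 + 31) / 32 = 10, which covers 18 * 32 = 576 >= 557 columns and 10 * 32 = 320 >= 313 rows.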
After that, in the kernel, you must check that you are not "out of the image", to handle the cases where the image size is not an exact multiple of the block size.
Another problem is in the kernel: the index must be (y * numCols + x), not the opposite.
kernel:
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if(x < numCols && y < numRows)
{
  uchar4 rgba = rgbaImage[y * numCols + x];
  float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
  greyImage[y * numCols + x] = channelSum;
}
calling code:
const dim3 blockSize(4,32,1); // can be any shape, as long as the total thread count stays within the device's per-block limit
size_t gridCols = (numCols + blockSize.x - 1) / blockSize.x;
size_t gridRows = (numRows + blockSize.y - 1) / blockSize.y;
const dim3 gridSize(gridCols,gridRows,1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
cudaDeviceSynchronize();
checkCudaErrors(cudaGetLastError());
Damn it, I feel like I am making things even harder to understand (

Related

Udacity parallel programming, unspecified launch failure cudaGetLastError()

I am trying to complete homework #2 for the Udacity parallel programming course. I have run into a CUDA error that I just can't get around. The error is encountered when I launch a kernel that is meant to separate an image in the format "RGBRGBRGB" into three separate arrays of "RRR", "GGG" and "BBB". Seeing as the error "unspecified launch failure" does not give me anything specific to go on, I am not sure how to troubleshoot my issue.
Here is the "main" function called to start the entire process. I left out the rest after the error is encountered so that I don't post the rest of my work for someone to find later.
void your_gaussian_blur(const uchar4 * const h_inputImageRGBA, uchar4 * const d_inputImageRGBA, uchar4* const d_outputImageRGBA, const size_t numRows, const size_t numCols,
unsigned char *d_redBlurred,
unsigned char *d_greenBlurred,
unsigned char *d_blueBlurred,
const int filterWidth)
{
  // Maximum number of threads per block = 512; do this
  // to keep this compatible with CUDA 5 and lower
  // MAX > threadsX * threadsY * threadsZ
  int MAXTHREADSx = 16;
  int MAXTHREADSy = 16; // 16 x 16 x 1 = 256 <= 512
  // We want to fill the blocks so we don't waste this block's threads
  // I wonder if blocks can intermix in a physical core?
  // Either way this method makes things "clean"; one thread per px
  int nBlockX = numCols / MAXTHREADSx + 1;
  int nBlockY = numRows / MAXTHREADSy + 1;
  const dim3 blockSize(MAXTHREADSx, MAXTHREADSy, 1);
  const dim3 gridSize(nBlockX, nBlockY, 1);
  separateChannels<<<gridSize, blockSize>>>(
      h_inputImageRGBA,
      numRows,
      numCols,
      d_red,
      d_green,
      d_blue);
  // Call cudaDeviceSynchronize(), then call checkCudaErrors() immediately after
  // launching your kernel to make sure that you didn't make any mistakes.
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
And here is the function separateChannels
//This kernel takes in an image represented as a uchar4 and splits
//it into three images consisting of only one color channel each
__global__
void separateChannels(const uchar4* const inputImageRGBA,
int numRows,
int numCols,
unsigned char* const redChannel,
unsigned char* const greenChannel,
unsigned char* const blueChannel)
{
  //const int2 thread_2D_pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y);
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  const int row = blockIdx.y * blockDim.y + threadIdx.y;
  //if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows)
  //  return;
  if (col >= numCols || row >= numRows)
    return;
  //const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x;
  int arrayPos = row * numCols + col;
  uchar4 rgba = inputImageRGBA[arrayPos];
  redChannel[arrayPos] = rgba.x;
  greenChannel[arrayPos] = rgba.y;
  blueChannel[arrayPos] = rgba.z;
}
I think I've included everything necessary; please let me know if not.
Without seeing the rest of the code I cannot tell for sure, but I believe you are passing a pointer to host memory as a parameter to a CUDA kernel - not a good thing to do. In the kernel launch you are passing in h_inputImageRGBA, while I believe you want to pass in d_inputImageRGBA.
Typically the h_ prefix stands for host memory, while the d_ prefix stands for device memory.
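A minimal sketch of the corrected launch, reusing the names from the question (assuming, as the h_/d_ convention implies, that d_inputImageRGBA was already allocated and filled by the surrounding scaffolding):
// pass the device copy; dereferencing the host pointer h_inputImageRGBA
// on the GPU is what produces the unspecified launch failure
separateChannels<<<gridSize, blockSize>>>(
    d_inputImageRGBA,
    numRows,
    numCols,
    d_red,
    d_green,
    d_blue);
If the device copy did not exist yet, it would be created along these lines (the local name d_input is illustrative):
uchar4 *d_input = NULL;
checkCudaErrors(cudaMalloc((void **)&d_input, numRows * numCols * sizeof(uchar4)));
checkCudaErrors(cudaMemcpy(d_input, h_inputImageRGBA, numRows * numCols * sizeof(uchar4), cudaMemcpyHostToDevice));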

CUDA: creating logical picture of grid

I've just started learning CUDA and I'm stuck on a fundamental concept: grids. I've read that a grid is just a logical collection of blocks (?), but I'm unable to form a picture of the scene in my mind. I have a clear picture of threads and blocks in my mind and know where they fit in the physical GPU: blocks go to cores and threads go to stream processors. But where does the grid fit into this picture?
Some analogies would be appreciated and would make understanding easier.
P.s.- I'm learning from udacity.
#include "reference_calc.cpp"
#include "utils.h"
#include <stdio.h>
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
unsigned char* const greyImage,
int numRows, int numCols)
{
  int x, y, i; // i is index for 1D array greyImage. x and y for rgbaImage
  i = (blockIdx.y * blockDim.x) + blockIdx.x;
  x = (blockIdx.x * blockDim.x) + threadIdx.x;
  y = (blockIdx.y * blockDim.y) + threadIdx.y;
  if(x < numCols && y < numRows)
  {
    greyImage[i] = (0.299f * rgbaImage[y].x) + (0.587f * rgbaImage[y].y) + (0.114f * rgbaImage[y].z);
  }
}
void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
  //You must fill in the correct sizes for the blockSize and gridSize
  //currently only one block with one thread is being launched
  const dim3 blockSize(10, 10, 1); //TODO
  size_t gridSizeX, gridSizeY;
  gridSizeX = numCols + (10 - (numCols % 10)); //adding some number to make it a multiple of 10
  gridSizeY = numRows + (10 - (numRows % 10)); //adding some number to make it a multiple of 10
  const dim3 gridSize(gridSizeX, gridSizeY, 1); //TODO
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
Actually, threads go to compute cores (at least if we are referring to the marketing term "CUDA cores") and thread blocks are associated with streaming multiprocessors (SMs, or SMXs in Kepler-speak).
The GRID is all threads created by a kernel launch. You can call it the collection of blocks if you want to, since a grid is first hierarchically broken down into blocks, (then warps,) then threads.
For a pictorial representation of this hierarchy, refer to slide 9 of this webinar deck.
You can disregard the statement on that slide that "only one kernel can be launched at a time". That was true in 2009 when that deck was created, but is no longer true on newer devices today.
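As a rough sketch of how a single thread locates itself inside that hierarchy (a hypothetical launch shape, not tied to any code above): a launch of whereAmI<<<dim3(4,2,1), dim3(16,16,1)>>>(out, 64) creates one grid of 4*2 = 8 blocks of 16*16 = 256 threads each, 2048 threads in total.
__global__ void whereAmI(int *out, int numCols)
{
  // global 2D position of this thread within the whole grid
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  // record which block this position belongs to (sizes assumed to divide evenly)
  out[y * numCols + x] = blockIdx.y * gridDim.x + blockIdx.x;
}
The hardware is then free to distribute those 8 blocks across whatever SMs are available; the grid itself is purely the logical set of all blocks created by the launch.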

tex object access always returns zero -- any ideas?

I'm running CUDA 5.0, with compute_30,sm_30 set using a 670.
I create a mipmapped array via:
cudaExtent size;
size.width = window_width; // 600
size.height = window_height; // 600
size.depth = 1;
int levels = getMipMapLevels(size);
levels = MIN(levels, 9); // 9
cudaChannelFormatDesc fp32;
fp32.f = cudaChannelFormatKindFloat;
fp32.x = fp32.y = fp32.z = fp32.w = 32;
cudaMipmappedArray_t A;
checkCuda(cudaMallocMipmappedArray(&A, &fp32, size, levels, cudaArraySurfaceLoadStore));
I load the first level of A with surf2Dwrites. I know that works since I copy that array to the host and dump it to an image file. I now wish to fill the other miplevels of A with the mipmaps. One iteration through that loop looks like:
width >>= 1; width = MAX(1, width);
height >>= 1; height = MAX(1, height);
cudaArray_t from, to;
checkCuda(cudaGetMipmappedArrayLevel(&from, A, newlevel-1));
checkCuda(cudaGetMipmappedArrayLevel(&to, A, newlevel));
cudaTextureObject_t from_texture;
create_texture_object(from, true, &from_texture);
cudaSurfaceObject_t to_surface;
create_surface_object(to, &to_surface);
dim3 blocksize(16, 16, 1);
dim3 gridsize((width+blocksize.x-1)/blocksize.x,(height+blocksize.y-1)/blocksize.y, 1);
d_mipmap<<<gridsize, blocksize>>>(to_surface, from_texture, width, height);
checkCuda(cudaDeviceSynchronize());
checkCuda(cudaGetLastError());
uncreate_texture_object(&from_texture);
uncreate_surface_object(&to_surface);
The create_surface_object() code is known to work. Just in case, here's the create_texture_object() code:
static void create_texture_object(cudaArray_t tarray, bool filter_linear, cudaTextureObject_t *tobject)
{
  assert(tarray && tobject);
  // build the resource
  cudaResourceDesc color_res;
  memset(&color_res, 0, sizeof(cudaResourceDesc));
  color_res.resType = cudaResourceTypeArray;
  color_res.res.array.array = tarray;
  // the texture descriptor
  cudaTextureDesc texdesc;
  memset(&texdesc, 0, sizeof(cudaTextureDesc));
  texdesc.addressMode[0] = cudaAddressModeClamp;
  texdesc.addressMode[1] = cudaAddressModeClamp;
  texdesc.addressMode[2] = cudaAddressModeClamp;
  texdesc.filterMode = filter_linear ? cudaFilterModeLinear : cudaFilterModePoint;
  texdesc.normalizedCoords = 1;
  checkCuda(cudaCreateTextureObject(tobject, &color_res, &texdesc, NULL));
}
The d_mipmap device function is the following:
__global__ void
d_mipmap(cudaSurfaceObject_t out, cudaTextureObject_t in, int w, int h)
{
  float x = blockIdx.x * blockDim.x + threadIdx.x;
  float y = blockIdx.y * blockDim.y + threadIdx.y;
  float dx = 1.0/float(w);
  float dy = 1.0/float(h);
  if ((x < w) && (y < h))
  {
#if 0
    float4 color =
      (tex2D<float4>(in, (x + .25f) * dx, (y + .25f) * dy)) +
      (tex2D<float4>(in, (x + .75f) * dx, (y + .25f) * dy)) +
      (tex2D<float4>(in, (x + .25f) * dx, (y + .75f) * dy)) +
      (tex2D<float4>(in, (x + .75f) * dx, (y + .75f) * dy));
    color /= 4.0f;
    surf2Dwrite(color, out, x * sizeof(float4), y);
#endif
    float4 color0 = tex2D<float4>(in, (x + .25f) * dx, (y + .25f) * dy);
    surf2Dwrite(color0, out, x * sizeof(float4), y);
  }
}
That contains both the mipmap sampling code (#if'd out) and the debugging code.
The problem is, color0 is always uniformly zero, and I've been unable to understand why. I've changed the filtering to point (from linear) with no success. I've checked for errors. Nothing.
I am using CUDA/OpenGL interop here, but the mipmap generation is being done on CUDA arrays only.
I really really do not want to have to use texture references.
Any suggestions on where to look?
The bug turns out to be in the use of cudaMipmappedArrays (either the array or the texture object - I'm unable to tell which is broken).
When I modify the code to use plain cudaArrays only, the texture object starts working again.
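A hedged sketch of that workaround, reusing the fp32 descriptor and helpers from the question (the level0 name is illustrative):
// allocate a plain, non-mipmapped array with the same float32 channel layout
cudaArray_t level0 = NULL;
checkCuda(cudaMallocArray(&level0, &fp32, size.width, size.height, cudaArraySurfaceLoadStore));
// the existing helpers can then target this array directly instead of a
// level returned by cudaGetMipmappedArrayLevel()
cudaTextureObject_t tex;
create_texture_object(level0, true, &tex);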
Since the bindless texture program sample works, the bug appears to be limited to float32 channel mipmapped textures only. (I have a test program that shows the bug occurs with both 1 and 4 channel float32 mipmapped textures.)
I've reported the bug to Nvidia.

2D kernel calling and launch parameters for non-square matrix

I am attempting to port the following (simplified) nested loop as a CUDA 2D kernel. The sizes of NgS and NgO will increase with larger data sets; for now I just want to get this kernel to output the correct results for all values:
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
int NgS = 1859;
int NgO = 900;
// 1D flattened matrices have been initialized as:
Radio_cpu = new double [NgS*NgO];
Result_cpu = new double [NgS*NgO];
// ignoring the part where they are filled w/ data
for (m=0; m<NgO; m++) {
  for (n=0; n<NgS; n++) {
    Result_cpu[idx(n,m,NgO)] = k0*Radio_cpu[idx(n,m,NgO)];
  }
}
The examples I have come across usually deal with square loops, and I have been unable to get the correct output for all the GPU array indices compared to the CPU version. Here is the host code calling the kernel:
dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;
// Result_gpu and Radio_gpu are allocated versions of the CPU variables on GPU
trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, Radio_gpu, Result_gpu);
Here is the kernel:
__global__ void trans(int NgO, int NgS,
double k0, double * Radio, double * Result) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  int m = blockIdx.y * blockDim.y + threadIdx.y;
  if(n > NgS || m > NgO) return;
  // map the two 2D indices to a single linear, 1D index
  int grid_width = gridDim.x * blockDim.x;
  int idxxx = m + (n * grid_width);
  Result[idxxx] = k0 * Radio[idxxx];
}
With the current code, I compared the Result_cpu variable with the Result_gpu variable once it was copied back. When I cycle through the values I get:
// matches from NgS = 0...913
Result_gpu[NgS = 913][NgO = 0]: -56887.2
Result_cpu[NgS = 913][NgO = 0]: -56887.2
// mismatches from NgS = 914...1858
Result_gpu[NgS = 914][NgO = 0]: -12.2352
Result_cpu[NgS = 914][NgO = 0]: 79448.6
This pattern is the same regardless of the value of NgO. I have been trying to figure out where I made a mistake by looking at various examples for a few hours and trying out changes, but so far this scheme has worked except for the obvious issue at hand, whereas the others have caused kernel invocation errors or left the GPU array uninitialized for all values. Since I clearly cannot see the mistake, I'd really appreciate it if someone could point me in the right direction towards a fix. I'm pretty sure it's right under my nose and I can't see it.
In case it matters, I'm testing this code on a Kepler card, compiling using MSVC 2010, CUDA 4.2 and 304.79 driver and have compiled the code with both arch=compute_20,code=sm_20 and arch=compute_30,code=compute_30 flags with no difference.
@vaca_loca: I tested the following kernel (it works for me also with non-square block dimensions):
#include <cstdio>
#include <cstdlib>   // srand48/lrand48 (POSIX)
#include <ctime>
#include <cmath>
#include <cuda_runtime.h>

__global__ void trans(int NgO, int NgS,
                      double k0, double * Radio, double * Result) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  int m = blockIdx.y * blockDim.y + threadIdx.y;
  if(n >= NgO || m >= NgS) return; // note >=: n == NgO or m == NgS is out of bounds too
  int ofs = m * NgO + n;
  Result[ofs] = k0 * Radio[ofs];
}
void test() {
  int NgS = 1859, NgO = 900;
  int data_sz = NgS * NgO, bytes = data_sz * sizeof(double);
  cudaSetDevice(0);
  double *Radio_cpu = new double [data_sz*3],
         *Result_cpu = Radio_cpu + data_sz,
         *Result_gpu = Result_cpu + data_sz;
  double k0 = -1.7961233;
  srand48(time(NULL));
  int n, m;
  for(m=0; m<NgO; m++) {
    for (n=0; n<NgS; n++) {
      Radio_cpu[m + n*NgO] = lrand48() % 234234;
      Result_cpu[m + n*NgO] = k0*Radio_cpu[m + n*NgO];
    }
  }
  double *g_Radio, *g_Result;
  cudaMalloc((void **)&g_Radio, bytes * 2);
  g_Result = g_Radio + data_sz;
  cudaMemcpy(g_Radio, Radio_cpu, bytes, cudaMemcpyHostToDevice);
  dim3 dimBlock(16, 16);
  dim3 dimGrid;
  dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
  dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;
  trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, g_Radio, g_Result);
  cudaMemcpy(Result_gpu, g_Result, bytes, cudaMemcpyDeviceToHost);
  for(m=0; m<NgO; m++) {
    for (n=0; n<NgS; n++) {
      double c1 = Result_cpu[m + n*NgO],
             c2 = Result_gpu[m + n*NgO];
      if(std::abs(c1-c2) > 1e-4)
        printf("(%d;%d): %.7f %.7f\n", n, m, c1, c2);
    }
  }
  cudaFree(g_Radio);
  delete []Radio_cpu;
}
Though, in my opinion, accessing data from global memory using quads might not be very cache-friendly, since the access stride is pretty large. You might consider using 2D textures instead if it's critical for your algorithm to access data with 2D locality.
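For what it's worth, here is a hedged sketch of a pitched allocation - not the 2D-texture path suggested above, just the simpler pitched-linear-memory variant of the same idea (the names g_Radio2D and pitchBytes are illustrative):
// each row is padded to a runtime-chosen pitch, keeping row starts aligned
double *g_Radio2D = NULL;
size_t pitchBytes = 0;
cudaMallocPitch((void **)&g_Radio2D, &pitchBytes, NgO * sizeof(double), NgS);
// inside a kernel, element (m, n) would then be addressed as:
//   double *rowPtr = (double *)((char *)g_Radio2D + m * pitchBytes);
//   double v = rowPtr[n];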

My kernel only works in block (0,0)

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my Kernel function
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
  const int tx = threadIdx.x;
  const int ty = threadIdx.y;
  const int bx = blockIdx.x;
  const int by = blockIdx.y;
  const int row = (by * blockDim.y + ty);
  const int col = (bx * blockDim.x + tx);
  //Pvalue stores the Pd element that is computed by the thread
  int Pvalue = 0;
  for (int k = 0; k < Width; k++)
  {
    Pvalue += Md[row * Width + k] * Nd[k * Width + col];
  }
  __syncthreads();
  //Write the matrix to device memory; each thread writes one element
  Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?
The problem was with my CUDA kernel invocation. The grid was far too small for the matrices being processed.
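For reference, a sketch of a grid sized with the ceil-divide pattern used elsewhere in this thread, so any Width is covered (the bounds check is an addition the posted kernel would also need once the grid can overshoot the matrix):
dim3 dimBlock(16, 16, 1);
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,
             (Width + dimBlock.y - 1) / dimBlock.y, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
// and inside the kernel, before touching memory:
// if (row >= Width || col >= Width) return;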