Determine dimGrid in CUDA - cuda

I'm tiling a 2D matrix into blocks of a fixed size BLOCK_DIM of 16*16. I found online that dimGrid should be:
dim3 dimGrid((NColumns - 1)/16 + 1, (NRows - 1)/16 + 1);
Isn't this reversed? Shouldn't NRows come first?

If I were writing the code, I would probably write it the way you have shown.
I think of x,y cartesian space this way:
Y
^
|
|
+------->X
That is, the "X" axis is the "horizontal" axis and the "Y" axis is the vertical axis. There is no reason it has to be this way -- it's just a mental model. But I think it's fairly common.
Now, if the x,y space is used to represent a 2D image, then as I move from right to left (i.e. along the horizontal axis) I am moving from one column to another in the image. As I move up and down (i.e. along the vertical axis) I am moving from one row to another in the image.
Therefore, with this mental model, the Y coordinate indicates the row and the X coordinate indicates the column of the image. The X coordinate will therefore have a maximum (logical) value equal to the number of columns in the image, and the Y coordinate will have a maximum value equal to the number of rows in the image. For the proposed dimGrid variable definition:
dim3 dimGrid((NColumns - 1)/16 + 1, (NRows - 1)/16 + 1);
since the x grid dimension appears first, we see that this "mental model" is consistent with the definition of dimGrid.
This sort of usage also would typically mean that for an image-processing algorithm in CUDA, adjacent threads in X would have "naturally"-calculated 2D indices:
int idx = threadIdx.x + blockDim.x*blockIdx.x;
int idy = threadIdx.y + blockDim.y*blockIdx.y;
or:
int col = threadIdx.x + blockDim.x*blockIdx.x;
int row = threadIdx.y + blockDim.y*blockIdx.y;
such that they would have adjacent X values in the image or adjacent "columns". In C-style row-major storage, having adjacent threads in X (in the grid) access adjacent columns in the image is usually a good recipe for achieving coalesced access in your kernel.
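For illustration, here is a minimal kernel sketch that uses this indexing together with the dimGrid above; the names img, width, height, and scale are just placeholders of mine, and the image is assumed to be stored in C-style row-major order:
__global__ void scale(float *img, int width, int height, float factor)
{
    int col = threadIdx.x + blockDim.x * blockIdx.x;  // column index (X)
    int row = threadIdx.y + blockDim.y * blockIdx.y;  // row index (Y)

    if (col < width && row < height) {
        // Row-major layout: adjacent threads in X touch adjacent memory
        // locations, which gives coalesced access.
        img[row * width + col] *= factor;
    }
}

// Launch configuration matching the dimGrid discussed above:
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid((width - 1)/16 + 1, (height - 1)/16 + 1);
//   scale<<<dimGrid, dimBlock>>>(d_img, width, height, 2.0f);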

Related

How to perform padding operation if the kernel stride is greater than the input shape dimensions in case of Maxpooling

I am trying to perform a Maxpooling operation in caffe.
The input size is 6 x 6 x 1 x 1024, whereas the kernel size is 7 x 7.
Am I supposed to do padding in order to perform MaxPooling?
First, you haven't specified the stride; the kernel dimensions are 7x7, larger than your input, but that's the size, not the stride. Stride is how far you move between iterations, such as shifting 2 pixels (which effectively halves the size of the output).
You probably want to pad so that the center of the kernel (element (3, 3) with 0-based indexing) can be over each pixel of your input. This means that you need a 3-pixel pad ( (7-1)/2 ) in each direction.
Is that what you needed?
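For illustration, a minimal sketch of that padding arithmetic, assuming a stride of 1 and the sizes from the question:
#include <stdio.h>

int main()
{
    int input  = 6;   // input height/width from the question
    int kernel = 7;   // kernel height/width

    // Pad so that the kernel center (element (3,3), 0-based) can sit
    // over every input pixel: (7 - 1) / 2 = 3 pixels on each side.
    int pad = (kernel - 1) / 2;

    printf("pad = %d, padded size = %d x %d\n",
           pad, input + 2 * pad, input + 2 * pad);  // pad = 3, padded size = 12 x 12
    return 0;
}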

CUDA C programing guide: how do thread and block indexing calculations work?

In the CUDA C Programming Guide, Chapter 2, Thread Hierarchy:
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ....
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ....
}
I'm new to this and can't make sense of "int i = blockIdx.x*blockDim.x + threadIdx.x". Why is it written this way?
Can anyone explain it to me?
Thanks a lot.
For example, how do you find thread (1,1) in block (1,1) using "i" and "j"?
I found the answer in CUDA Programming: A Developer's Guide to Parallel Computing with GPUs by Shane Cook.
Chapter 5 has a clear explanation of this.
For a 2D array, we need dim3 to create a 2D layout of threads.
"dim3 threadsPerBlock(16,16)" means that a single block has 16 threads along its x axis and 16 threads along its y axis.
"dim3 numBlocks(N/threadsPerBlock.x, N/threadsPerBlock.y)" means that a single grid has N/threadsPerBlock.x blocks along the x axis and N/threadsPerBlock.y blocks along the y axis.
gridDim.x and gridDim.y give the number of blocks along the x/y axis of the grid.
blockDim.x and blockDim.y give the number of threads along the x/y axis of a block.
threadIdx.x and threadIdx.y give the thread index along the x/y axis within a block.
blockIdx.x and blockIdx.y give the block index along the x/y axis within the grid.
So if we want the absolute thread index, we need to count how many threads come before the current one: all the threads in the preceding blocks (blockIdx.x * blockDim.x) plus the thread's offset inside its own block (threadIdx.x). It is the same idea as addressing a row-major 2D array, where the byte offset of an element is row * (sizeof(element) * width) + sizeof(element) * column. So we get i = blockIdx.x * blockDim.x + threadIdx.x.
[Figure: grid, block, and thread dimensions in CUDA]
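For the case asked about above, a minimal sketch (the kernel name whereAmI and the launch numbers are just placeholders of mine):
#include <cstdio>

__global__ void whereAmI()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // For thread (1,1) in block (1,1) with blockDim = (16,16):
    //   i = 1 * 16 + 1 = 17
    //   j = 1 * 16 + 1 = 17
    if (blockIdx.x == 1 && blockIdx.y == 1 &&
        threadIdx.x == 1 && threadIdx.y == 1)
        printf("global index: i = %d, j = %d\n", i, j);
}

// Launch with, for example:
//   dim3 threadsPerBlock(16, 16);
//   dim3 numBlocks(4, 4);
//   whereAmI<<<numBlocks, threadsPerBlock>>>();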

Libgdx - Optimizing the usage of TextureRegion, also is there an upper bound on texture size?

In an attempt to make my stuff faster, I was going to put all my graphics and sprites on a single huge texture (as recommended). However this would include all the graphics, not just a tile map of uniformly sized tiles (so it'd have 4x4's, 16x16's, maybe some 256x128's, etc.).
1) Is there an upper bound on the size a Texture can be? Can I make, for example, a 256 x 393216 pixel texture from a Pixmap and be okay with passing it off to the GPU? No performance penalties or any unexpected weirdness I should expect? Is there an upper bound on the size I can send, or is that strictly dependent on hardware memory constraints?
2) Does it matter where my TextureRegion x/y offsets are? Example:
int x = 16;
int y = 23; // Should this be some power of two?
int width = 16;
int height = 16;
new TextureRegion(texture, x, y, width, height);
Is the above okay for speed as long as the width/height are powers of two? Or should I align x and y with some power of two boundary?

How to handle boundaries in conv/pool of conv-nets?

When a convolution uses a kernel size of 4 and a stride of 4, but the input size is only 10, the third convolution at the boundary of the input will fail. So, should the input be implicitly padded with zeros at the boundary to avoid this problem? Is there any problem if I pad with other real numbers? Is it equivalent to automatically increasing the input size?
Besides, if I expect to get an output feature map of the same size, a kernel size of 3 and a pad size of 1 can usually be used, but when the kernel size is an even number, how do I decide the pad size on each side of the input?
Yes, the input must be padded with zeros to overcome the small input size problem. To compute the output feature map size at each level, use the following formulas:
H_out = ( H_in + 2 x Padding_Height - Kernel_Height ) / Stride_Height + 1
W_out = (W_in + 2 x Padding_Width - Kernel_Width) / Stride_Width + 1
You may keep the padding in accordance with the above formula.
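For illustration, a minimal sketch that plugs the question's numbers (input 10, kernel 4, stride 4) into these formulas:
#include <stdio.h>

// Output size of a conv/pool layer, following the formulas above.
int out_size(int in, int pad, int kernel, int stride)
{
    return (in + 2 * pad - kernel) / stride + 1;
}

int main()
{
    printf("no padding:  %d\n", out_size(10, 0, 4, 4)); // 2 (boundary pixels dropped)
    printf("1-pixel pad: %d\n", out_size(10, 1, 4, 4)); // 3 (boundary covered)
    return 0;
}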

Practice computing grid size for CUDA

dim3 block(4, 2);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows.
Can you explain how the grid size is computed? Why is block.x - 1 added to nx before dividing?
There is a preview (https://books.google.com/books?id=_Z7rnAEACAAJ&printsec=frontcover#v=onepage&q&f=false) but page 53 is missing.
This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x), that is, figure out how many blocks are needed to cover the desired size, then round up.
But full floating point division and ceil are more expensive than necessary. Instead, since C defines integer division as a "floor" operation, you can add the divisor - 1 before dividing to get the effect of a "ceiling" operation.
Try a few examples: if nx = 10, then nx + block.x - 1 is 13, and by integer division, you need 3 blocks of size 4.
As you noted in the comment, +block.x pushes the floor up to a ceiling, and the -1 handles numbers that divide evenly by the divisor: e.g. (12 + 4)/4 would be 4 when we actually want (12 + 4 - 1)/4, which is 3.
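A minimal sketch of the idiom, using the block size from the question:
#include <stdio.h>

int main()
{
    int block_x = 4;

    // (nx + block_x - 1) / block_x is integer ceiling division.
    for (int nx = 9; nx <= 13; ++nx) {
        int grid_x = (nx + block_x - 1) / block_x;
        printf("nx = %2d -> grid_x = %d\n", nx, grid_x); // 9..12 -> 3, 13 -> 4
    }
    return 0;
}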