dim3 block(4, 2);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows.
Can you explain how the grid size is computed? Why is block.x added to nx and then 1 subtracted?
There is a preview (https://books.google.com/books?id=_Z7rnAEACAAJ&printsec=frontcover#v=onepage&q&f=false) but page 53 is missing.
This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. It could be expressed as ceil(nx / (float)block.x): figure out how many blocks are needed to cover the desired size, then round up.
But full floating-point division plus ceil is more expensive than necessary. Instead, since C defines integer division of non-negative values as a "floor" operation, you can add the divisor minus 1 before dividing to get the effect of a "ceiling" operation.
Try a few examples: if nx = 10, then nx + block.x - 1 is 13, and by integer division 13 / 4 = 3, so you need 3 blocks of size 4.
As noted in the comment, adding block.x pushes the floor up to the ceiling, and the -1 handles numbers that divide evenly into the divisor: (12 + 4)/4 would be 4 when we actually want (12 + 4 - 1)/4, which is 3.
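To see the idiom in action, here is a minimal host-side sketch (plain C; the helper name ceil_div is my own, not from the book) comparing the integer trick against floating-point ceil for a few sizes:

#include <stdio.h>
#include <math.h>

/* Minimum number of blocks of size b needed to cover n elements. */
unsigned int ceil_div(unsigned int n, unsigned int b)
{
    return (n + b - 1) / b;
}

int main(void)
{
    unsigned int block_x = 4;
    for (unsigned int nx = 9; nx <= 13; ++nx) {
        printf("nx=%2u  idiom=%u  ceil=%u\n", nx,
               ceil_div(nx, block_x),
               (unsigned int)ceil((double)nx / block_x));
    }
    return 0;  /* both columns agree: 3, 3, 3, 3, 4 */
}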
When the kernel size is odd, we can manually calculate the padding needed to make the output the same dimension as the input, i.e. "same" padding.
But how can we calculate the padding dimensions for kernels with even sizes (e.g. 2x2)?
Note these two formulas:

1) pad = (filter_size - 1) / 2
2) output feature map dimension = (input feature map dimension - filter_size + 2 * pad) / stride + 1

Let's assume you have an input dimension of 28x28, and you want "same" padding, which implies your output dimension should also be 28x28. I am also assuming a stride of 1.

Now let us calculate the padding amount:

pad = (2 - 1) / 2 = 1/2

Substituting this value into equation 2):

output feature map = (28 - 2 + 2 * (1/2)) / 1 + 1 = 28

Hence the output feature map has the same dimension as the input (hence verified).
I used a padding of 1 and a dilation of 2, which produced "same" padding.
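For reference, the arithmetic behind that comment can be written out as a small sketch (the helper name and layout are mine, not from the question): dilation stretches the 2x2 kernel to an effective extent of 3, which is odd, so the usual pad formula applies cleanly.

#include <stdio.h>

/* Effective kernel extent once dilation is applied. */
int effective_kernel(int k, int dilation)
{
    return dilation * (k - 1) + 1;
}

int main(void)
{
    int k_eff = effective_kernel(2, 2);          /* 2*(2-1)+1 = 3 */
    int pad = (k_eff - 1) / 2;                   /* (3-1)/2 = 1   */
    int out = (28 - k_eff + 2 * pad) / 1 + 1;    /* 28, stride 1  */
    printf("k_eff=%d pad=%d out=%d\n", k_eff, pad, out);
    return 0;
}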
I am trying to perform a max-pooling operation in Caffe.
The input size is 6 x 6 x 1 x 1024, whereas the kernel size is 7 x 7.
Am I supposed to pad the input in order to perform the max pooling?
First, you haven't specified the stride. The kernel dimensions are 7x7, larger than your input, but that's the size, not the stride; the stride is how far you move between iterations, e.g. shifting 2 pixels (which effectively halves the size of the output).
You probably want to pad so that the center of the kernel (element (3, 3) with 0-based indexing) can be over each pixel of your input. This means that you need a 3-pixel pad ( (7-1)/2 ) in each direction.
Is that what you needed?
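Plugging those numbers into the usual output-size formula confirms it (a quick sketch; the variable names are mine):

int in = 6, k = 7, stride = 1, pad = 3;
int out = (in + 2 * pad - k) / stride + 1;   /* (6 + 6 - 7)/1 + 1 = 6 */
/* With a 3-pixel pad per side, the 7x7 kernel's center visits all 6x6 positions. */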
In the CUDA C Programming Guide, Chapter 2, "Thread Hierarchy":
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ....
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ....
}
I'm new to this and can't make sense of int i = blockIdx.x * blockDim.x + threadIdx.x. Why is it computed this way?
Can anyone explain it to me?
Thanks a lot.
For example, how do you identify Thread(1,1) in Block(1,1) using i and j?
I found the answer in CUDA Programming: A Developer's Guide to Parallel Computing with GPUs by Shane Cook.
Chapter 5 has a clear explanation of this.
For a 2D array, we need dim3 to create a 2D layout of threads.
dim3 threadsPerBlock(16, 16) means that a single block has 16 threads along its x axis and 16 threads along its y axis.
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y) means that a single grid has N/threadsPerBlock.x blocks along the x axis and N/threadsPerBlock.y blocks along the y axis.
gridDim.x or gridDim.y means how many blocks along the x/y axis in a grid.
blockDim.x or blockDim.y means how many threads along the x/y axis in a block.
threadIdx.x or threadIdx.y means the thread index along the x/y axis in a block.
blockIdx.x or blockIdx.y means the block index along the x/y axis in a grid.
So if we want the absolute thread index, we need to know how many threads come before the current one: all the threads in the blocks preceding the current block (blockIdx.x * blockDim.x) plus the thread's offset within its own block (threadIdx.x). It is the same idea as row-major byte addressing, row * (sizeof(array_element) * width) + sizeof(array_element) * offset. So we get i = blockIdx.x * blockDim.x + threadIdx.x.
(There is a figure, not reproduced here, showing the grid, block, and thread dimensions.)
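To answer the concrete example in the question: with threadsPerBlock = (16, 16), the arithmetic for Thread(1,1) in Block(1,1) is just the formula above (no new API here):

int i = 1 * 16 + 1;   /* blockIdx.x * blockDim.x + threadIdx.x = 17 */
int j = 1 * 16 + 1;   /* blockIdx.y * blockDim.y + threadIdx.y = 17 */
/* Thread(1,1) in Block(1,1) therefore handles element C[17][17]. */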
When a convolution uses a kernel size of 4 and a stride of 4 while the input size is only 10, the third convolution window runs off the boundary of the input and fails. So should the input be implicitly padded with zeros on the boundary to avoid this problem? Is there any problem if I pad with other real numbers? Is this equivalent to automatically increasing the input size?
Besides, if I expect to get an output feature map of the same size, a kernel size of 3 and a pad size of 1 can usually be used; but when the kernel size is an even number, how do I decide the pad size on each side of the input?
Yes, the input must be padded with zeros to overcome the small input size problem. To compute the output feature map dimensions at each level, use the following formulas:
H_out = ( H_in + 2 x Padding_Height - Kernel_Height ) / Stride_Height + 1
W_out = (W_in + 2 x Padding_Width - Kernel_Width) / Stride_Width + 1
You may keep the padding in accordance with the above formula.
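Applied to the case in the question (input 10, kernel 4, stride 4), a short sketch (the helper name out_dim is mine) shows that a pad of 1 per side makes the third window fit exactly:

#include <stdio.h>

int out_dim(int in, int k, int pad, int stride)
{
    return (in + 2 * pad - k) / stride + 1;
}

int main(void)
{
    printf("no pad : %d\n", out_dim(10, 4, 0, 4));  /* (10 - 4)/4 + 1 = 2 */
    printf("pad = 1: %d\n", out_dim(10, 4, 1, 4));  /* (12 - 4)/4 + 1 = 3 */
    return 0;
}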
I'm tiling a 2D matrix into blocks of fixed size BLOCK_DIM = 16*16. Then I found this dimGrid (from the internet):
dim3 dimGrid((NColumns - 1)/16 + 1, (NRows - 1)/16 + 1);
Isn't this reversed? Shouldn't it be Nrows first?
If I were writing the code, I would probably write it the way you have shown.
I think of x,y cartesian space this way:
Y
^
|
|
+------->X
That is, the "X" axis is the "horizontal" axis and the "Y" axis is the vertical axis. There is no reason it has to be this way -- it's just a mental model. But I think it's fairly common.
Now, if the x,y space is used to represent a 2D image, then as I move from left to right (i.e. along the horizontal axis) I am moving from one column to another in the image. As I move up and down (i.e. along the vertical axis) I am moving from one row to another in the image.
Therefore, with this mental model, the Y coordinate indicates the row and the X coordinate indicates the column of the image. The X coordinate will therefore have a maximum (logical) value equal to the number of columns in the image, and the Y coordinate will have a maximum value equal to the number of rows in the image. For the proposed dimGrid variable definition:
dim3 dimGrid((NColumns - 1)/16 + 1, (NRows - 1)/16 + 1);
since the x grid dimension appears first, we see that this "mental model" is consistent with the definition of dimGrid.
This sort of usage also would typically mean that for an image-processing algorithm in CUDA, adjacent threads in X would have "naturally"-calculated 2D indices:
int idx = threadIdx.x + blockDim.x*blockIdx.x;
int idy = threadIdx.y + blockDim.y*blockIdx.y;
or:
int col = threadIdx.x + blockDim.x*blockIdx.x;
int row = threadIdx.y + blockDim.y*blockIdx.y;
such that they would have adjacent X values in the image or adjacent "columns". In C-style row-major storage, having adjacent threads in X (in the grid) access adjacent columns in the image is usually a good recipe for achieving coalesced access in your kernel.
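As a minimal illustration (a sketch assuming a row-major float image with dimensions width x height; the kernel name and parameters are mine), a kernel built on those indices reads consecutive addresses across a warp:

__global__ void scaleImage(float *img, int width, int height, float s)
{
    int col = threadIdx.x + blockDim.x * blockIdx.x;  /* adjacent threads -> adjacent columns */
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < width && row < height) {
        /* Row-major storage: consecutive col values map to consecutive
           addresses, so a warp's accesses coalesce into few transactions. */
        img[row * width + col] *= s;
    }
}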