Handling Boundary Conditions in OpenCL/CUDA - cuda

Given a 3D uniform grid, I would like to set the values of the border cells relative to the values of their nearest neighbor inside the grid. E.g., given a 10x10x10 grid, for a voxel at coordinate (0, 8, 8), I'd like to set a value as follows : val(0, 8, 8)=a*val(1,8,8).
Since, a could be any real number, I do not think texture + samplers can be used in this case. In addition, the method should work on normal buffers as well.
Also, since a boundary voxel coordinate could be either part of the grid's corner, edge, or face, 26 (= 8 + 12 + 6) different choices for looking up the nearest neighbor exist (e.g. if the coordinate was at (0,0,0) its nearest neighbor insided the grid would be (1, 1, 1)). So there is a lot of potential branching.
Is there a "elegant" way to accomplish this in OpenCL/CUDA? Also, is it advisable to handle boundary using a seperate kernel?

The most usual way of handling borders in CUDA is to check for all possible border conditions and act accordingly, that is:
If "this element" is out of bounds, then return (this is very useful in CUDA, where you will probably launch more threads than strictly necessary, so the extra threads must exit early in order to avoid writing on out-of-bounds memory).
If "this element" is at/near left border (minimum x) then do special operations for left border.
Same for right, up, down (and front and back, in 3D) borders.
Fortunately, on most occasions you can use max/min to simplify these operations, so you avoid too many ifs. I like to use an expression of this form:
source_pixel_x = max(0, min(thread_2D_pos.x + j, MAX_X));
source_pixel_y = ... // you get the idea
The result of these expressions is always bound between 0 and some MAX, thus clamping the out_of_bounds source pixels to the border pixels.
EDIT: As commented by DarkZeros, it is easier (and less error prone) to use the clamp() function. Not only it checks both min and max, it also allows vector types like float3 and clamps each dimension separately. See: clamp
Here is an example I did as an exercise, a 2D gaussian blur:
__global__
void gaussian_blur(const unsigned char* const inputChannel,
unsigned char* const outputChannel,
int numRows, int numCols,
const float* const filter, const int filterWidth)
{
const int2 thread_2D_pos = make_int2( blockIdx.x * blockDim.x + threadIdx.x,
blockIdx.y * blockDim.y + threadIdx.y);
const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x;
if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows)
{
return; // "this output pixel" is out-of-bounds. Do not compute
}
int j, k, jn, kn, filterIndex = 0;
float value = 0.0;
int2 pixel_2D_pos;
int pixel_1D_pos;
// Now we'll process input pixels.
// Note the use of max(0, min(thread_2D_pos.x + j, numCols-1)),
// which is a way to clamp the coordinates to the borders.
for(k = -filterWidth/2; k <= filterWidth/2; ++k)
{
pixel_2D_pos.y = max(0, min(thread_2D_pos.y + k, numRows-1));
for(j = -filterWidth/2; j <= filterWidth/2; ++j,++filterIndex)
{
pixel_2D_pos.x = max(0, min(thread_2D_pos.x + j, numCols-1));
pixel_1D_pos = pixel_2D_pos.y * numCols + pixel_2D_pos.x;
value += ((float)(inputChannel[pixel_1D_pos])) * filter[filterIndex];
}
}
outputChannel[thread_1D_pos] = (unsigned char)value;
}

In OpenCL you could use Image3d to handle your 3d grid. Boundary handling could be achived with a sampler and a specific adress mode:
CLK_ADDRESS_REPEAT - out-of-range image coordinates are wrapped to the valid range. This address mode can only be used with normalized coordinates. If normalized coordinates are not used, this addressing mode may generate image coordinates that are undefined.
CLK_ADDRESS_CLAMP_TO_EDGE - out-of-range image coordinates are clamped to the extent.
CLK_ADDRESS_CLAMP32 - out-of-range image coordinates will return a border color. The border color is (0.0f, 0.0f, 0.0f, 0.0f) if image channel order is CL_A, CL_INTENSITY, CL_RA, CL_ARGB, CL_BGRA or CL_RGBA and is (0.0f, 0.0f, 0.0f, 1.0f) if image channel order is CL_R, CL_RG, CL_RGB or CL_LUMINANCE.
CLK_ADDRESS_NONE - for this address mode the programmer guarantees that the image coordinates used to sample elements of the image refer to a location inside the image; otherwise the results are undefined.
Additionally you can define the filter mode for the interpolation (nearest neighbor or linear).
Does this fit your needs? Otherwise, please give us more detail about you data and its boundary requirements.

Related

Concurrent Writing CUDA

I am new to CUDA and I am facing a problem with a basic projection kernel. What I am trying to do is to project a 3D point cloud into a 2D image. In case multiple points project to the same pixel, only the point with the smallest depth (the closest one) should be written on the matrix.
Suppose two 3D points fall in an image pixel (0, 0), the way I am implementing the depth check here is not working if (depth > entry.depth), since the two threads (from two different blocks) execute this "in parallel". In the printf statement, in fact, both entry.depth give the numeric limit (the initialization value).
To solve this problem I thought of using a tensor-like structure, each image pixel corresponds to an array of values. After the array is reduced and only the point with the smallest depth is kept. Are there any smarter and more efficient ways of solving this problem?
__global__ void kernel_project(CUDAWorkspace* workspace_, const CUDAMatrix* matrix_) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid >= matrix_->size())
return;
const Point3& full_point = matrix_->at(tid);
float depth = 0.f;
Point2 image_point;
// full point as input, depth and image point as output
const bool& is_good = project(image_point, depth, full_point); // dst, dst, src
if (!is_good)
return;
const int irow = (int) image_point.y();
const int icol = (int) image_point.x();
if (!workspace_->inside(irow, icol)) {
return;
}
// get pointer to entry
WorkspaceEntry& entry = (*workspace_)(irow, icol);
// entry.depth is set initially to a numeric limit
if (depth > entry.depth) // PROBLEM HERE
return;
printf("entry depth %f\n", entry.depth) // BOTH PRINT THE NUMERIC LIMIT
entry.point = point;
entry.depth = depth;
}

(cudaBindTexture2D) How to bind a pitched-array from the middle

I am trying to bind a pitched array from the middle partly (not from the beginning of the array), like followings.
/1. allocate/
cudaMallocPitch((void**)&d_texinput, &FloatPitch, cols*sizeof(float), rows);
cudaMallocPitch((void**)&d_output, &FloatPitch, cols*sizeof(float), rows);
/2. set row-length of target region (i.e., dividing rows 10 times)/
int row_div_times = 10;
int part_rows = rows / row_div_times;
int part_offset = part_rows*FloatPitch/sizeof(float);
dim3 threads(16,16);
dim3 Part_Blocks((cols + threads.x - 1) / threads.x, (Part_rows + threads.y - 1) / threads.y);
/3. processing divided rows, iteratively/
for (int i = 0; i < row_div_times; i++)
{
size_t offsetsize= i*part_offset;
/*computing values of "d_tex_input"*/
calibration << <Part_Blocks, threads, 0, stream[i] >> >
(d_texinput + i*part_offset );
/*
//###(QUESTION point!) I want to bind the device memory "d_texinput" to texture "tex_mem" only partly like below.
cudaBindTexture2D(0, tex_mem, &d_texinput[i*part_offset], channelDesc_flt, cols, Part_rows, FloatPitch); //tentative code a;
,,, or something like,,,
cudaBindTexture2D(&offsetsize, tex_mem, &d_texinput, channelDesc_flt, cols, Part_rows, FloatPitch); //tentative code b;
*/
//final computaion with texture
final_computationwithtexture << <Part_Blocks, threads, 0, stream[i] >> >
( d_output + i*part_offset );
cudaUnbindTexture(tex_mem);
}
Please kindly allow me to ask your instruction, advice how to bind the target region of the device memory array partly by revising above( QUESTION point!)?
I tried to understand first argument of cudaBindTExture2D as "offset". but it is not value. it is address. according to the documentation.
i still could not understand the documentation.
I hope I can understand what that is by knowing adequate inputting way to the cudaBindTexture2D.
The offset parameter is not an input, it is an output. That's why it is a pointer. The function will set the offset in bytes. If you want to bind in the middle of an allocation, you set the devPtr argument (third) appropriately and then the function will give you the offset required for texture accesses.
Here is how to understand this: Textures can only be bound with a certain alignment. Memory allocations are always properly aligned. Therefore it is not an issue in most cases. However, if you provide an arbitrary memory address, CUDA has to round down to the alignment and you have to apply the proper offset later on.
Let's say you bind &float[66], the proper alignment might be &float[64], so CUDA starts its texture at that offset and you have to add an offset of 8 bytes for each access to get the desired result. I'm picking random numbers here, I don't know the alignment requirements.

How to calculate separable convolutions in the Xception paper?

I read the Xception paper (there's even a Keras Model for the describe NN) and it talks about separable convolutions.
I was trying to understand how exactly they are calculated. Rather than leaving it to imprecise words, I have included the piece of pseudo-code below that summarizes my understanding. The code maps from a feature map 18x18x728 to a 18x18x1024 one :
XSIZE = 18;
YSIZE = 18;
ZSIZE = 728;
ZSIXE2 = 1024;
float mapin[XSIZE][YSIZE][ZSIZE]; // Input map
float imap[XSIZE][YSIZE][ZSIZE2]; // Intermediate map
float mapout[XSIZE][YSIZE][ZSIZE2]; // Output map
float wz[ZSIZE][ZSIZE2]; // Weights for 1x1 convs
float wxy[3][3][ZSIZE2]; // Weights for 3x3 convs
// Apply 1x1 convs
for(y=0;y<YSIZE;y++)
for(x=0;x<XSIZE;x++)
for(o=0;o<ZSIZE2;o++){
s=0.0;
for(z=0;z<ZSIZE;z++)
s+=mapin[x][y][z]*wz[z][o];
imap[x][y][o]=s;
}
// Apply 2D 3x3 convs
for(o=0;o<ZSIZE2;o++)
for(y=0y<YSIZE;y++)
for(x=0;x<XSIZE;x++){
s=0.0;
for(i=-1;i<2;i++)
for(j=-1;j<2;j++)
s+=imap[x+j][y+i][o]*wxy[j+1][i+1][o]; // This value is 0 if falls off the edge
mapout[x][y][o]=s;
}
Is this correct ? If not, can you suggest fixes similarly written in C or pseudo-C ?
Thank you very much in advance.
I found tf.nn.separable_conv2d in Tensorflow that does exactly this. So I built a very simple graph and, with the help of random numbers, I tried to get the code above to match the result. The correct code is :
XSIZE = 18;
YSIZE = 18;
ZSIZE = 728;
ZSIXE2 = 1024;
float mapin[XSIZE][YSIZE][ZSIZE]; // Input map
float imap[XSIZE][YSIZE][ZSIZE]; // Intermediate map
float mapout[XSIZE][YSIZE][ZSIZE2]; // Output map
float wxy[3][3][ZSIZE]; // Weights for 3x3 convs
float wz[ZSIZE][ZSIZE2]; // Weights for 1x1 convs
// Apply 2D 3x3 convs
for(o=0;o<ZSIZE;o++)
for(y=0y<YSIZE;y++)
for(x=0;x<XSIZE;x++){
s=0.0;
for(i=-1;i<2;i++)
for(j=-1;j<2;j++)
s+=mapin[x+j][y+i][o]*wxy[j+1][i+1][o]; // This value is 0 if falls off the edge
imap[x][y][o]=s;
}
// Apply 1x1 convs
for(y=0;y<YSIZE;y++)
for(x=0;x<XSIZE;x++)
for(o=0;o<ZSIZE2;o++){
s=0.0;
for(z=0;z<ZSIZE;z++)
s+=imap[x][y][z]*wz[z][o];
mapout[x][y][o]=s;
}
The main difference is in the order that the two groups of convolutions are performed.
To my surprise, the order is important even when ZSIZE==ZSIZE2.

Does this Cuda scan kernel only work within a single block, or across multiple blocks?

I am doing a homework and have been given a Cuda kernel that performs a primitive scan operation. From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadInx.x). Is this true?
//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
//we are creating an extra space for every numElement so the size of the array needs to be 2*numElements
//cuda does not like dynamic array in shared memory so it might be necessary to explicitly state
//the size of this mememory allocation
__shared__ int temp[1024 * 2];
//instantiate variables
int id = threadIdx.x;
int pout = 0, pin = 1;
// // load input into shared memory.
// // Exclusive scan: shift right by one and set first element to 0
temp[id] = (id > 0) ? in_data[id - 1] : 0;
__syncthreads();
//for each thread, loop through each of the steps
//each step, move the next resultant addition to the thread's
//corresponding space to manipulted for the next iteration
for (int offset = 1; offset < numElements; offset <<= 1)
{
//these switch so that data can move back and fourth between the extra spaces
pout = 1 - pout;
pin = 1 - pout;
//IF: the number needs to be added to something, make sure to add those contents with the contents of
//the element offset number of elements away, then move it to its corresponding space
//ELSE: the number only needs to be dropped down, simply move those contents to its corresponding space
if (id >= offset)
{
//this element needs to be added to something; do that and copy it over
temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
}
else
{
//this element just drops down, so copy it over
temp[pout * numElements + id] = temp[pin * numElements + id];
}
__syncthreads();
}
// write output
out_data[id] = temp[pout * numElements + id];
}
I would like to modify this kernel to work across multiple blocks, I want it to be as simple as changing the int id... to int id = threadIdx.x + blockDim.x * blockIdx.x. But the shared memory is only within the block, meaning the scan kernels across blocks cannot share the proper information.
From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadInx.x). Is this true?
Not exactly. This kernel will work regardless of how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how id is calculated:
int id = threadIdx.x;
This id is independant of blockIdx, and therefore identical across blocks, no matter their number.
If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxilliary array to store the per-block sums. Then, run a similar scan on that array, calculating per-block increments. Finally, run a last kernel to add those per-block increments to the block elements. If memory serves there is a similar kernel in the CUDA SDK samples.
Since Kepler the above code could be rewritten much more efficiently, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads and may improve performance. A combination of both these improvements would allow you to get rid of shared memory and work only with registers for maximal performance.

cublasSetVector() vs cudaMemcpy()

I am wondering if there is a difference between:
// cumalloc.c - Create a device on the device
HOST float * cudamath_vector(const float * h_vector, const int m)
{
float *d_vector = NULL;
cudaError_t cudaStatus;
cublasStatus_t cublasStatus;
cudaStatus = cudaMalloc(&d_vector, sizeof(float) * m );
if(cudaStatus == cudaErrorMemoryAllocation) {
printf("ERROR: cumalloc.cu, cudamath_vector() : cudaErrorMemoryAllocation");
return NULL;
}
/* THIS: */ cublasSetVector(m, sizeof(*d_vector), h_vector, 1, d_vector, 1);
/* OR THAT: */ cudaMemcpy(d_vector, h_vector, sizeof(float) * m, cudaMemcpyHostToDevice);
return d_vector;
}
cublasSetVector() has two arguments incx and incy and the documentation says:
The storage spacing between consecutive elements is given by incx for
the source vector x and for the destination vector y.
In the NVIDIA forum someone said:
iona_me: "incx and incy are strides measured in floats."
So does this mean that for incx = incy = 1 all elements of a float[] will be sizeof(float)-aligned and for incx = incy = 2 there would be a sizeof(float)-padding between each element?
Except for those two parameters and the cublasHandle - does cublasSetVector() anything else what cudaMalloc() doesn't do?
Would it be save to pass a vector/matrix which was not created with their respective cublas*() function to other CUBLAS functions to manipulate them?
There is a comment in a thread of the NVIDIA Forum provided by Massimiliano Fatica confirming my statement in the above comment (or, saying it better, my comment originated by a recall of having read the post I linked to). In particular
cublasSetVector, cubblasGetVector, cublasSetMatrix, cublasGetMatrix are thin wrappers around cudaMemcpy and cudaMemcpy2D. Therefore, no significant performance differences are expected between the two sets of copy functions.
Accordingly, you can safely pass any array created by cudaMalloc as input to cublasSetVector.
Concerning the strides, perhaps there is a misprint in the guide (as of CUDA 6.0), which says that
The storage spacing between consecutive elements is given by incx for the source
vector x and for the destination vector y.
but perhaps should be read as
The storage spacing between consecutive elements is given by incx for the source
vector x and incy for the destination vector y.