(cudaBindTexture2D) How to bind a pitched-array from the middle - cuda

I am trying to bind only part of a pitched array, starting from the middle (not from the beginning of the array), like the following.
/* 1. allocate */
cudaMallocPitch((void**)&d_texinput, &FloatPitch, cols*sizeof(float), rows);
cudaMallocPitch((void**)&d_output,   &FloatPitch, cols*sizeof(float), rows);

/* 2. set row length of the target region (i.e., divide the rows into 10 parts) */
int row_div_times = 10;
int part_rows = rows / row_div_times;
int part_offset = part_rows * FloatPitch / sizeof(float);

dim3 threads(16, 16);
dim3 Part_Blocks((cols + threads.x - 1) / threads.x, (part_rows + threads.y - 1) / threads.y);

/* 3. process the divided rows, iteratively */
for (int i = 0; i < row_div_times; i++)
{
    size_t offsetsize = i * part_offset;

    /* compute the values of "d_texinput" */
    calibration<<<Part_Blocks, threads, 0, stream[i]>>>(d_texinput + i * part_offset);

    /*
    //### (QUESTION point!) I want to bind the device memory "d_texinput" to the texture "tex_mem" only partly, like below.
    cudaBindTexture2D(0, tex_mem, &d_texinput[i*part_offset], channelDesc_flt, cols, part_rows, FloatPitch); // tentative code a
    // ... or something like ...
    cudaBindTexture2D(&offsetsize, tex_mem, &d_texinput, channelDesc_flt, cols, part_rows, FloatPitch); // tentative code b
    */

    // final computation with the texture
    final_computationwithtexture<<<Part_Blocks, threads, 0, stream[i]>>>(d_output + i * part_offset);

    cudaUnbindTexture(tex_mem);
}
Could you please advise how to bind only the target region of the device memory array to the texture, by revising the code above at the QUESTION point?
I tried to understand the first argument of cudaBindTexture2D as an "offset", but according to the documentation it is not a value, it is an address, and I still could not understand the documentation.
I hope I can understand what it is once I know the proper way to call cudaBindTexture2D.

The offset parameter is not an input, it is an output. That's why it is a pointer. The function will set the offset in bytes. If you want to bind in the middle of an allocation, you set the devPtr argument (third) appropriately and then the function will give you the offset required for texture accesses.
Here is how to understand this: Textures can only be bound with a certain alignment. Memory allocations are always properly aligned. Therefore it is not an issue in most cases. However, if you provide an arbitrary memory address, CUDA has to round down to the alignment and you have to apply the proper offset later on.
Let's say you bind at element 66 of a float array; the proper alignment might be at element 64, so CUDA starts its texture there and you have to add an offset of 8 bytes (two floats) to each access to get the desired result. I'm picking random numbers here; I don't know the actual alignment requirements.
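As a hedged sketch of how the bind might look inside the loop above (this assumes the legacy texture-reference API with a globally declared tex_mem and channelDesc_flt; passing the returned offset into the kernel is an illustrative addition, not part of the original code):

size_t tex_offset = 0;  // output parameter: CUDA fills in the byte offset

// Bind starting at the current part. Because this pointer lies on a row
// boundary of the pitched allocation, the returned offset is often zero,
// but it must still be honored when it is not.
cudaBindTexture2D(&tex_offset, tex_mem,
                  d_texinput + i * part_offset,   // devPtr may point into the middle of the allocation
                  channelDesc_flt,
                  cols, part_rows, FloatPitch);

// Hypothetical kernel call: the byte offset is forwarded so the kernel can
// shift its tex2D() coordinates before sampling.
final_computationwithtexture<<<Part_Blocks, threads, 0, stream[i]>>>(
    d_output + i * part_offset, tex_offset);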

Related

CUDA: number of data elements cannot be divided evenly by the CUDA threads

For example, there are two sets of 4 threads, but I have 5 data elements; the first elements 0-3 can be mapped to the first 4 threads, but what about the rest? It only says there might be a runtime error, but how do I fix it?
I think I asked this question in the wrong direction; now suppose I have
perfromwork<<<2,2>>>;
Now the dataIndex calculated by this pseudocode is smaller than the number of data elements (N = 5), so what should I do with the last one (5 - 2x2 = 1)? If I use another block for it, I come across the same problem: the <<<2, 2>>> launch will create a larger dataIndex.
There are two canonical approaches here.
1. Size the grid to be larger than or equal to the data set size, and make sure to use a "thread check" that prevents unneeded extra threads from doing any work.
2. Use a grid-stride loop, which allows the grid size to be determined independently of the data set size (if you wish) while still providing correct results.
Vector add example kernels for each:
__global__ void vectorAdd(float *x, float *y, float *z, int size){
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < size) // thread check
        z[idx] = x[idx] + y[idx];
}
The above kernel does not use a grid-stride loop. It will require that you size the grid to be larger than or equal to the data set size, in order for all elements to be processed. That sizing code might look like this:
int size = MY_DATA_SET_SIZE;
dim3 block(256); // this is threads per block, the choice here is not critical for correctness, but must be 1 or larger and less than or equal to 1024;
dim3 grid((size+block.x-1)/block.x);
vectorAdd<<<grid,block>>>(...);
A kernel implementing a grid-stride loop to do the same thing might look like this:
__global__ void vectorAdd(float *x, float *y, float *z, int size){
    for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < size; idx += blockDim.x * gridDim.x)
        z[idx] = x[idx] + y[idx];
}
In this case, grid sizing can be arbitrary (1 or larger) and still yield correct results.
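For example, a minimal launch sketch for the grid-stride version (the grid size of 160 and the device pointer names d_x, d_y, d_z are illustrative assumptions):

int size = MY_DATA_SET_SIZE;
dim3 block(256);  // threads per block
dim3 grid(160);   // any value >= 1 gives correct results; tune for occupancy if desired
vectorAdd<<<grid, block>>>(d_x, d_y, d_z, size);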

prefix sum using CUDA

I am having trouble understanding CUDA code for a naive prefix sum.
The code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
In example 39-1 (naive scan), we have a code like this:
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // Load input into shared memory.
    // This is exclusive scan, so shift right by one
    // and set first element to 0
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout; // swap double buffer indices
        pin = 1 - pout;
        if (thid >= offset)
            temp[pout*n+thid] += temp[pin*n+thid - offset];
        else
            temp[pout*n+thid] = temp[pin*n+thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout*n+thid]; // write output
}
My questions are
Why do we need to create a shared-memory temp?
Why do we need "pout" and "pin" variables? What do they do? Since we only use one block and 1024 threads at maximum here, can we only use threadId.x to specify the element in the block?
In CUDA, do we use one thread to do one add operation? Is it like, one thread does what could be done in one iteration if I use a for loop (loop the threads or processors in OpenMP given one thread for one element in an array)?
My previous two questions may seem to be naive... I think the key is I don't understand the relation between the above implementation and the pseudocode as following:
for d = 1 to log2 n do
    for all k in parallel do
        if k >= 2^d then
            x[k] = x[k - 2^(d-1)] + x[k]
This is my first time using CUDA, so I'll appreciate it if anyone can answer my questions...
1- It's faster to put the data in shared memory and do the calculations there rather than working in global memory. It's important to sync the threads after loading shared memory, hence the __syncthreads.
2- The pout and pin variables implement the double buffering: they toggle which half of the shared array temp is read from and which is written to on each iteration:
temp[pout*n+thid] += temp[pin*n+thid - offset];
First iteration: pout = 1 and pin = 0. Second iteration: pout = 0 and pin = 1.
So the writes go to the half offset by n on odd iterations, and the reads come from that half on even iterations. To come back to your question, you can't achieve the same thing with threadIdx.x alone, because it would not change within the loop.
3 & 4 - CUDA runs the kernel with many threads, and each thread executes that code separately. If you compare the pseudocode with the CUDA code, the inner "for all k in parallel" loop is what the threads parallelize (one thread per element k), while the outer loop over d remains as the for loop inside the kernel. Each thread runs that loop to the end, synchronizing with the other threads at every step, before the result is written to global memory.
Hope it helps.

generating redheffer matrix using cuda

I have an assignment that requires me to generate Redheffer matrix on GPU using Cuda.
A Redheffer matrix is a matrix where each entry a[i][j] is defined by
a[i][j] = 1 if j = 1,
          1 if j is divisible by i,
          0 otherwise.
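For reference, a minimal host-side sketch of this definition (1-based matrix indices, 0-based storage; this is just an illustration, not my actual verification code):

void redheffer_reference(int *M, int size)
{
    for (int i = 1; i <= size; ++i)
        for (int j = 1; j <= size; ++j)
            M[(i - 1) * size + (j - 1)] = (j == 1 || j % i == 0) ? 1 : 0;
}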
Here is my code
#define SIZE 20000
#define BLOCK_WIDTH 16
/* Launch the CUDA kernel */
int numBlocks = ceil(SIZE / BLOCK_WIDTH);
dim3 dimGrid(BLOCK_WIDTH,BLOCK_WIDTH,1);
dim3 dimBlock(numBlocks,numBlocks,1);
redhefferMatrix<<<dimGrid, dimBlock>>>(d_M, SIZE);
I have code to verify whether the output is right; it returns an error message when a computed matrix element value is not correct.
When I run my program, I get this error.
GPU number 0 is assigned to this job
Row 0 column 5000 is incorrect. Should be:1 Is actually: 0
My logic to compute values is
int Row = blockIdx.y*blockDim.y + threadIdx.y;
int Col = blockIdx.x*blockDim.x + threadIdx.x;
.
.
if (i < 20000 && j < 20000)
{
    {
        if (j == 1 || j % i == 0)
            d_M[i*SIZE + j] = 1;
        else
            d_M[i*SIZE + j] = 0;
    }
}
Can someone give me an idea of where I might be wrong? Thank you in advance.
Since you haven't provided a complete code, it's not possible to determine all the issues that may be present. But you have a misinterpretation of block and grid dimensions (you have them reversed):
#define SIZE 20000
#define BLOCK_WIDTH 16
/* Launch the CUDA kernel */
int numBlocks = ceil(SIZE / BLOCK_WIDTH);
dim3 dimGrid(BLOCK_WIDTH,BLOCK_WIDTH,1);
dim3 dimBlock(numBlocks,numBlocks,1);
redhefferMatrix<<<dimGrid, dimBlock>>>(d_M, SIZE);
The first kernel configuration parameter should be the dimensions of the grid, in terms of number of blocks (in x and y, in this case). Your first kernel config parameter is dimGrid, which you have defined as a dim3(BLOCK_WIDTH,BLOCK_WIDTH) quantity, i.e. 16x16 blocks. I don't think that's what you intended, but it is not actually illegal.
Your second kernel configuration parameter should be the dimensions of the block, in terms of number of threads (in x and y, in this case). Your second kernel config parameter is dimBlock, which you have defined as a dim3(20000/16, 20000/16) quantity, i.e. 1250x1250 threads. This is illegal, as CUDA threadblocks are limited to a total of 1024 threads, i.e. the product of the dimensions cannot exceed 1024.
So your kernel launch is illegal and your kernel is not even running. If you use proper cuda error checking and/or run your code with cuda-memcheck, you would discover this.
The fix may be fairly simple - reverse your sense of these config parameters:
dim3 dimBlock(BLOCK_WIDTH,BLOCK_WIDTH,1);
dim3 dimGrid(numBlocks,numBlocks,1);
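Put together, a hedged sketch of a corrected kernel and launch might look like the following (the element type of d_M and the 1-based interpretation of the matrix indices are assumptions, since the complete code was not shown):

#define SIZE 20000
#define BLOCK_WIDTH 16

__global__ void redhefferMatrix(int *d_M, int size)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // 0-based storage indices
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < size && col < size) {
        int i = row + 1, j = col + 1;                 // 1-based indices for the definition
        d_M[row * size + col] = (j == 1 || j % i == 0) ? 1 : 0;
    }
}

// Host-side launch: 16x16 threads per block, enough blocks to cover the matrix.
int numBlocks = (SIZE + BLOCK_WIDTH - 1) / BLOCK_WIDTH;
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);
dim3 dimGrid(numBlocks, numBlocks, 1);
redhefferMatrix<<<dimGrid, dimBlock>>>(d_M, SIZE);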
Again, I cannot say this is the only issue, since you have not shown a complete code that I could actually test (which SO expects for questions like this.)
If you make the above change and things are still not working, I would suggest the following:
1. Add proper cuda error checking and run your code with cuda-memcheck, as I already suggested (a minimal error-checking sketch follows below).
2. Provide a complete MCVE, i.e. a complete code that somebody else could copy, paste, and run. Also provide the output of cuda-memcheck and of the error checking on your system.
You should do the above 2 things before you ask for debugging help here on SO.
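For reference, one common error-checking pattern looks roughly like this (a sketch of the usual macro approach, not the only way to do it):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Usage after a kernel launch:
redhefferMatrix<<<dimGrid, dimBlock>>>(d_M, SIZE);
CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during kernel execution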

Handling Boundary Conditions in OpenCL/CUDA

Given a 3D uniform grid, I would like to set the values of the border cells relative to the values of their nearest neighbors inside the grid. E.g., given a 10x10x10 grid, for the voxel at coordinate (0, 8, 8) I'd like to set its value as follows: val(0, 8, 8) = a*val(1, 8, 8).
Since a could be any real number, I do not think textures + samplers can be used in this case. In addition, the method should work on normal buffers as well.
Also, since a boundary voxel coordinate could lie on one of the grid's corners, edges, or faces, there are 26 (= 8 + 12 + 6) different cases for looking up the nearest neighbor (e.g. if the coordinate were (0,0,0), its nearest neighbor inside the grid would be (1,1,1)). So there is a lot of potential branching.
Is there an "elegant" way to accomplish this in OpenCL/CUDA? Also, is it advisable to handle the boundary in a separate kernel?
The most usual way of handling borders in CUDA is to check for all possible border conditions and act accordingly, that is:
If "this element" is out of bounds, then return (this is very useful in CUDA, where you will probably launch more threads than strictly necessary, so the extra threads must exit early in order to avoid writing on out-of-bounds memory).
If "this element" is at/near left border (minimum x) then do special operations for left border.
Same for right, up, down (and front and back, in 3D) borders.
Fortunately, on most occasions you can use max/min to simplify these operations, so you avoid too many ifs. I like to use an expression of this form:
source_pixel_x = max(0, min(thread_2D_pos.x + j, MAX_X));
source_pixel_y = ... // you get the idea
The result of these expressions is always bound between 0 and some MAX, thus clamping the out_of_bounds source pixels to the border pixels.
EDIT: As commented by DarkZeros, it is easier (and less error prone) to use the clamp() function. Not only does it check both min and max, it also allows vector types like float3 and clamps each dimension separately. See: clamp
Here is an example I did as an exercise, a 2D gaussian blur:
__global__
void gaussian_blur(const unsigned char* const inputChannel,
                   unsigned char* const outputChannel,
                   int numRows, int numCols,
                   const float* const filter, const int filterWidth)
{
    const int2 thread_2D_pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
                                         blockIdx.y * blockDim.y + threadIdx.y);
    const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x;

    if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows)
    {
        return; // "this output pixel" is out-of-bounds. Do not compute
    }

    int j, k, jn, kn, filterIndex = 0;
    float value = 0.0;
    int2 pixel_2D_pos;
    int pixel_1D_pos;

    // Now we'll process input pixels.
    // Note the use of max(0, min(thread_2D_pos.x + j, numCols-1)),
    // which is a way to clamp the coordinates to the borders.
    for (k = -filterWidth/2; k <= filterWidth/2; ++k)
    {
        pixel_2D_pos.y = max(0, min(thread_2D_pos.y + k, numRows-1));
        for (j = -filterWidth/2; j <= filterWidth/2; ++j, ++filterIndex)
        {
            pixel_2D_pos.x = max(0, min(thread_2D_pos.x + j, numCols-1));
            pixel_1D_pos = pixel_2D_pos.y * numCols + pixel_2D_pos.x;
            value += ((float)(inputChannel[pixel_1D_pos])) * filter[filterIndex];
        }
    }

    outputChannel[thread_1D_pos] = (unsigned char)value;
}
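Applied to your 3D case, the same clamping idea might look roughly like this (the linear indexing scheme, the kernel name, and the parameters are illustrative assumptions, and a 3D launch is assumed):

__global__ void fill_border(float *grid, float a, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;  // out-of-bounds threads exit early

    // Only border voxels need work.
    bool border = (x == 0 || x == nx - 1 ||
                   y == 0 || y == ny - 1 ||
                   z == 0 || z == nz - 1);
    if (!border) return;

    // Clamp each coordinate one step into the interior; this picks the nearest
    // inside neighbor for faces, edges, and corners alike, with no 26-way branching.
    int sx = max(1, min(x, nx - 2));
    int sy = max(1, min(y, ny - 2));
    int sz = max(1, min(z, nz - 2));
    grid[(z * ny + y) * nx + x] = a * grid[(sz * ny + sy) * nx + sx];
}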
In OpenCL you could use a 3D image (image3d_t) to handle your 3D grid. Boundary handling could be achieved with a sampler and a specific address mode:
CLK_ADDRESS_REPEAT - out-of-range image coordinates are wrapped to the valid range. This address mode can only be used with normalized coordinates. If normalized coordinates are not used, this addressing mode may generate image coordinates that are undefined.
CLK_ADDRESS_CLAMP_TO_EDGE - out-of-range image coordinates are clamped to the extent.
CLK_ADDRESS_CLAMP - out-of-range image coordinates will return a border color. The border color is (0.0f, 0.0f, 0.0f, 0.0f) if the image channel order is CL_A, CL_INTENSITY, CL_RA, CL_ARGB, CL_BGRA or CL_RGBA, and is (0.0f, 0.0f, 0.0f, 1.0f) if the image channel order is CL_R, CL_RG, CL_RGB or CL_LUMINANCE.
CLK_ADDRESS_NONE - for this address mode the programmer guarantees that the image coordinates used to sample elements of the image refer to a location inside the image; otherwise the results are undefined.
Additionally you can define the filter mode for the interpolation (nearest neighbor or linear).
Does this fit your needs? Otherwise, please give us more detail about your data and its boundary requirements.

Does this Cuda scan kernel only work within a single block, or across multiple blocks?

I am doing homework and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell, this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?
//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    // we are creating an extra space for every element, so the size of the array needs to be 2*numElements;
    // CUDA does not like a dynamic array in shared memory, so it might be necessary to explicitly state
    // the size of this memory allocation
    __shared__ int temp[1024 * 2];

    // instantiate variables
    int id = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[id] = (id > 0) ? in_data[id - 1] : 0;
    __syncthreads();

    // for each thread, loop through each of the steps;
    // each step, move the next resultant addition to the thread's
    // corresponding space to be manipulated for the next iteration
    for (int offset = 1; offset < numElements; offset <<= 1)
    {
        // these switch so that data can move back and forth between the extra spaces
        pout = 1 - pout;
        pin = 1 - pout;

        // IF: the number needs to be added to something, make sure to add those contents with the contents of
        // the element offset number of elements away, then move it to its corresponding space
        // ELSE: the number only needs to be dropped down, simply move those contents to its corresponding space
        if (id >= offset)
        {
            // this element needs to be added to something; do that and copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
        }
        else
        {
            // this element just drops down, so copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id];
        }
        __syncthreads();
    }

    // write output
    out_data[id] = temp[pout * numElements + id];
}
I would like to modify this kernel to work across multiple blocks. I want it to be as simple as changing int id... to int id = threadIdx.x + blockDim.x * blockIdx.x, but shared memory is only visible within a block, meaning the per-block scans cannot share the proper information.
From what I can tell, this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?
Not exactly. This kernel will work regardless of how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how id is calculated:
int id = threadIdx.x;
This id is independent of blockIdx, and therefore identical across blocks, no matter their number.
If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxiliary array to store the per-block sums. Then, run a similar scan on that array, calculating per-block increments. Finally, run a last kernel to add those per-block increments to the block elements (a sketch of that last step follows below). If memory serves, there is a similar kernel in the CUDA SDK samples.
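A minimal sketch of that last "uniform add" step, assuming block_increments already holds the exclusive scan of the per-block sums and that this kernel is launched with the same block size as the per-block scan (all names here are illustrative, not taken from the original code):

__global__ void addBlockIncrements(unsigned int *out_data,
                                   const unsigned int *block_increments,
                                   size_t numElements)
{
    size_t id = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (id < numElements)
        out_data[id] += block_increments[blockIdx.x];  // add this block's increment
}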
Since Kepler the above code could be rewritten much more efficiently, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads and may improve performance. A combination of both these improvements would allow you to get rid of shared memory and work only with registers for maximal performance.