Is mask adaptive in __shfl_up_sync call? - cuda

Basically, it is a materialized version of this post. Suppose a warp needs to process 4 objects (say, pixels in an image), and each group of 8 lanes processes one object:
lane0:lane7 | lane0:lane7 | lane0:lane7 | lane0:lane7
Now I need to do shuffle operations internally while processing one object (i.e. among the 8 lanes of that object). It worked for each object when I just set the mask to 0xff:
uint32_t mask = 0xff;
__shfl_up_sync(mask, val, 1);
However, to my understanding, setting the mask to 0xff should only force lane0:lane7 of object0 (or object3? I'm also stuck on this point) to participate, yet after a mass of trials I have verified that the usage above works for every object. So, my question is: does the __shfl_up_sync call adapt the mask argument so that the corresponding lanes of each group participate?
Update
Actually, this problem came from the code of libSGM that I was trying to work through. In particular, it solves the minimal-cost path with dynamic programming in a decently parallel way. The program reaches this line after launching the kernel aggregate_vertical_path_kernel with this execution configuration:
//MAX_DISPARITY is 128 and BLOCK_SIZE is 256
//Basically, each block processes 32 pixels, and each warp processes 4.
const int gdim = (width + PATHS_PER_BLOCK - 1) / PATHS_PER_BLOCK;
const int bdim = BLOCK_SIZE;
aggregate_vertical_path_kernel<1, MAX_DISPARITY><<<gdim, bdim, 0, stream>>>(...)
An object dp is instantiated from DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE>:
static constexpr unsigned int DP_BLOCK_SIZE = 16u;
...
//MAX_DISPARITY is 128
static const unsigned int SUBGROUP_SIZE = MAX_DISPARITY / DP_BLOCK_SIZE;
...
DynamicProgramming<DP_BLOCK_SIZE, SUBGROUP_SIZE> dp;
Following the program further, dp.update() is invoked, in which __shfl_up_sync is used to access the last element of the previous DP_BLOCK and __shfl_down_sync is used to access the first element of the next DP_BLOCK. Also, every 8 lanes in one warp are grouped together:
//So every 8 threads are grouped together to process one pixel, and each lane contributes one DP_BLOCK of the corresponding pixel.
const unsigned int lane_id = threadIdx.x % SUBGROUP_SIZE;
Here it comes; once the program reaches this line:
//mask is specified as 0xff (255)
const uint32_t prev = __shfl_up_sync(mask, dp[DP_BLOCK_SIZE - 1], 1);
each lane in the warp does the shuffle with the same mask 0xff, which is what prompted my question above.

It's confusing when you do this:
lane0:lane7 | lane0:lane7 | lane0:lane7 | lane0:lane7
because a warp doesn't have 4 sets of lanes numbered lane 0 to lane 7. It has one set of lanes, numbered lane 0 to lane 31:
lane 31 | lane 30 | ... | lane 0
Note that I have ordered the lanes this way because that corresponds to the bit order in the mask. It should be evident which bit corresponds to which lane: bit 0 in the mask parameter corresponds to lane 0, and so on.
This confusion is compounded by the fact that you are only specifying 8 bits, i.e. 8 lanes, in your mask:
uint32_t mask = 0xff;
If you want the warp to be able to use all 32 lanes to process all 4 objects correctly, you must specify a mask that names all 32 lanes:
uint32_t mask = 0xffffffff;
There is no "adaptation" of an 8-bit mask to apply to each group of 8 lanes in the warp. You must explicitly specify the mask for each of the 32 lanes. This is true even if the width parameter is used (see below).
If you want the shuffle operation to work only within each 8-lane group (i.e. as 4 logical shuffles), that is what the width parameter is for:
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
(the width parameter, at the end of the prototype, is the one that matters here)
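For example, here is a minimal sketch (my own, not taken from libSGM) of four independent 8-lane shuffles in one warp: the mask names all 32 lanes that execute the call, and width=8 is what keeps the shuffle inside each group:
__global__ void grouped_shuffle_demo(int *out)
{
    int val = threadIdx.x % 32;        // per-lane value, just for illustration
    unsigned mask = 0xffffffffu;       // all 32 lanes of the warp participate
    // width = 8 segments the warp into four 8-lane groups; the shuffle does not
    // cross group boundaries, so lane 0 of every group keeps its own value
    // instead of pulling from the previous group.
    int prev = __shfl_up_sync(mask, val, 1, 8);
    out[threadIdx.x] = prev;
}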

Related

(cudaBindTexture2D) How to bind a pitched-array from the middle

I am trying to bind a pitched array partly from the middle (not from the beginning of the array), like the following.
/* 1. allocate */
cudaMallocPitch((void**)&d_texinput, &FloatPitch, cols*sizeof(float), rows);
cudaMallocPitch((void**)&d_output, &FloatPitch, cols*sizeof(float), rows);
/* 2. set row-length of target region (i.e., dividing rows 10 times) */
int row_div_times = 10;
int part_rows = rows / row_div_times;
int part_offset = part_rows*FloatPitch/sizeof(float);
dim3 threads(16,16);
dim3 Part_Blocks((cols + threads.x - 1) / threads.x, (part_rows + threads.y - 1) / threads.y);
/* 3. processing divided rows, iteratively */
for (int i = 0; i < row_div_times; i++)
{
size_t offsetsize = i*part_offset;
/* computing values of "d_texinput" */
calibration <<<Part_Blocks, threads, 0, stream[i]>>>
(d_texinput + i*part_offset);
/*
//### (QUESTION point!) I want to bind the device memory "d_texinput" to texture "tex_mem" only partly, like below.
cudaBindTexture2D(0, tex_mem, &d_texinput[i*part_offset], channelDesc_flt, cols, part_rows, FloatPitch); //tentative code a;
... or something like ...
cudaBindTexture2D(&offsetsize, tex_mem, &d_texinput, channelDesc_flt, cols, part_rows, FloatPitch); //tentative code b;
*/
// final computation with texture
final_computationwithtexture <<<Part_Blocks, threads, 0, stream[i]>>>
(d_output + i*part_offset);
cudaUnbindTexture(tex_mem);
}
Please kindly allow me to ask for your instruction/advice on how to bind only the target region of the device memory array by revising the above (QUESTION point!).
I tried to understand the first argument of cudaBindTexture2D as an "offset", but according to the documentation it is not a value, it is an address.
I still could not understand the documentation.
I hope I can understand what it is by learning the proper way to pass inputs to cudaBindTexture2D.
The offset parameter is not an input, it is an output. That's why it is a pointer. The function will set the offset in bytes. If you want to bind in the middle of an allocation, you set the devPtr argument (third) appropriately and then the function will give you the offset required for texture accesses.
Here is how to understand this: Textures can only be bound with a certain alignment. Memory allocations are always properly aligned. Therefore it is not an issue in most cases. However, if you provide an arbitrary memory address, CUDA has to round down to the alignment and you have to apply the proper offset later on.
Let's say you try to bind at element 66 of a float array; the nearest proper alignment might be element 64, so CUDA starts its texture at that address and you have to add an offset of 8 bytes (2 floats) to each access to get the desired result. I'm picking random numbers here; I don't know the actual alignment requirements.
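To make that concrete, here is a hedged sketch of what the QUESTION point could look like with the legacy texture-reference API, reusing the identifiers from the question (tex_mem, d_texinput, part_offset, part_rows, cols, FloatPitch, channelDesc_flt); whether the returned offset is zero depends on the alignment, exactly as described above:
size_t offset = 0;
// Bind starting at the i-th block of rows; CUDA rounds the pointer down to the
// required alignment and reports the difference, in bytes, through &offset.
cudaBindTexture2D(&offset, tex_mem,
                  d_texinput + i * part_offset,
                  channelDesc_flt,
                  cols, part_rows, FloatPitch);
// part_offset is a whole number of pitched rows and cudaMallocPitch returns
// suitably aligned storage, so offset should normally come back as 0 here; if
// it is non-zero, it has to be passed to the kernel and folded into the
// tex2D() fetch coordinates.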

prefix sum using CUDA

I am having trouble understanding a CUDA code for a naive prefix sum.
This code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
In example 39-1 (naive scan), we have a code like this:
__global__ void scan(float *g_odata, float *g_idata, int n)
{
extern __shared__ float temp[]; // allocated on invocation
int thid = threadIdx.x;
int pout = 0, pin = 1;
// Load input into shared memory.
// This is exclusive scan, so shift right by one
// and set first element to 0
temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
__syncthreads();
for (int offset = 1; offset < n; offset *= 2)
{
pout = 1 - pout; // swap double buffer indices
pin = 1 - pout;
if (thid >= offset)
temp[pout*n+thid] += temp[pin*n+thid - offset];
else
temp[pout*n+thid] = temp[pin*n+thid];
__syncthreads();
}
g_odata[thid] = temp[pout*n+thid]; // write output
}
My questions are
Why do we need to create a shared-memory temp?
Why do we need "pout" and "pin" variables? What do they do? Since we only use one block and 1024 threads at maximum here, can we only use threadId.x to specify the element in the block?
In CUDA, do we use one thread to do one add operation? Is it like, one thread does what could be done in one iteration if I use a for loop (loop the threads or processors in OpenMP given one thread for one element in an array)?
My previous two questions may seem to be naive... I think the key is I don't understand the relation between the above implementation and the pseudocode as following:
for d = 1 to log2 n do
for all k in parallel do
if k >= 2^(d-1) then
x[k] = x[k – 2^(d-1)] + x[k]
This is my first time using CUDA, so I'll appreciate it if anyone can answer my questions...
1- It's faster to put the data in shared memory and do the calculations there rather than working in global memory, and the shared temp array is also how the threads see each other's partial results between iterations. It's important to sync the threads after loading shared memory, hence the __syncthreads().
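As a side note on that shared allocation (my own addition, not from the original answer): the temp[] array is sized at launch time via the third execution-configuration parameter, and since the kernel double-buffers, it needs 2*n floats. A hedged launch sketch, assuming d_odata and d_idata are device arrays of n floats:
int n = 1024;                                   // one thread per element, n <= 1024
scan<<<1, n, 2 * n * sizeof(float)>>>(d_odata, d_idata, n);  // dynamic shared memory for both buffers
cudaDeviceSynchronize();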
2- These variables implement the double buffering in the algorithm. They simply toggle which half of temp is read and which is written:
temp[pout*n+thid] += temp[pin*n+thid - offset];
First iteration: pout = 1 and pin = 0. Second iteration: pout = 0 and pin = 1.
The output half is offset by n on odd iterations and the input half is offset by n on even iterations. To come back to your question, you can't achieve the same thing with threadIdx.x alone because it wouldn't change within the loop.
3 & 4 - CUDA launches many threads to run the kernel, meaning that each thread runs this code on its own element. If you compare the pseudocode with the CUDA code, the "for all k in parallel" loop is the part that has been parallelized: each thread is one k. Each thread then runs the for loop in the kernel to the end, waiting for the other threads at each __syncthreads, before writing its element to global memory.
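To make that mapping explicit (my annotation, not part of the original answer; in the kernel, offset plays the role of 2^(d-1)):
// "for all k in parallel"     ->  one thread per element: thid = threadIdx.x
// "for d = 1 to log2 n"       ->  for (offset = 1; offset < n; offset *= 2)
// "if k >= 2^(d-1)"           ->  if (thid >= offset)
// x[k], x[k - 2^(d-1)]        ->  the double-buffered temp[]: pin selects the values from
//                                 the previous iteration, pout the ones being written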
Hope it helps.

Iterative image processing in CUDA

I have written a CUDA kernel to process an image. But depending on the output of the processed image, I have to call the kernel again, to re-tune the image.
For example, let us consider an image having 9 pixels
1 2 3
4 5 6
7 8 9
Suppose that, depending on its neighboring values, the value 9 changes to 10. Since the value has changed, I have to re-process the new image, with the same kernel.
1 2 3
4 5 6
7 8 10
I have already written the algorithm to process the image in a single iteration. The way I'm planning to implement the iterations in CUDA is the following:
__global__ void process_image_GPU(unsigned int *d_input, unsigned int *d_output, int dataH, int dataW, unsigned int *val) {
__shared__ unsigned int sh_map[TOTAL_WIDTH][TOTAL_WIDTH];
// Do processing
// If during processing, anywhere any thread changes the value of the image call
{ atomicAdd(val, 1); }
}
int main(int argc, char *argv[]) {
// Allocate d_input, d_output and call cudaMemcpy
unsigned int *x, *val;
x = (unsigned int *)malloc(sizeof(unsigned int));
x[0] = 0;
cudaMalloc((void **)&val, sizeof(unsigned int));
cudaMemcpy((void *)val, (void *)x, sizeof(unsigned int), cudaMemcpyHostToDevice);
process_image_GPU<<<dimGrid, dimBlock>>>(d_input, d_output, rows, cols, val);
cudaMemcpy((void *)x, (void *)val, sizeof(unsigned int), cudaMemcpyDeviceToHost);
if (x[0] != 0)
// Reset val to 0 on the device and call the kernel again
}
Is this the only way to do it? Is there any other, more efficient way to implement the same thing?
Thanks a lot for your time.
I hazard an answer, despite the very limited information you provided. Hope it helps.
From what you have said, you have already set up an updating rule for your pixels based on the values of the adjacent pixels. Let x^(k)_ij be the value of pixel ij at iteration k, and let
x^(k+1)_ij = f(x^(k)_(i-1)j, x^(k)_ij, x^(k)_(i+1)j, x^(k)_i(j-1), x^(k)_i(j+1))
I'm assuming the typical stencil-based updating rule, but of course other rules would be possible.
At this point, you have to set up a stopping rule, namely, a rule that indicates if your algorithm has reached convergence. For example, you could evaluate the norm of the difference between the two images at steps k+1 and k.
Once the problem is formulated in this way, I would say that you have the following two possibilities:
Rouy-Tourin-like scheme: all the computational pixels are updated in a brute-force way "simultaneously" until convergence is reached;
Fast sweeping method: the computational grid is swept (selective update) along a prefixed number of directions until convergence is reached;
Depending on the kind of problem you are dealing with, I would say that you have the additional possibility:
Fast iterative method: the computational pixels are selectively updated with the aid of a heap structure.
All the above methods have been compared, for the solution of the eikonal equation, here.
Of course, you will need to show convergence of the above computational schemes for the particular problem of your interest.
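For the first (brute-force) option, the host loop sketched in the question only needs two small additions: resetting the device-side flag before every pass and feeding the updated image back in. A hedged sketch, reusing the names from the question (the ping-pong swap of d_input and d_output is my assumption about how the re-processing is fed):
unsigned int h_changed = 1;
while (h_changed != 0) {
    h_changed = 0;
    // reset the device-side change counter before each pass
    cudaMemcpy(val, &h_changed, sizeof(unsigned int), cudaMemcpyHostToDevice);
    process_image_GPU<<<dimGrid, dimBlock>>>(d_input, d_output, rows, cols, val);
    cudaMemcpy(&h_changed, val, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    // the freshly written image becomes the input of the next pass
    unsigned int *tmp = d_input; d_input = d_output; d_output = tmp;
}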

Low memory copy throughput Host to Device

I have a vector of vectors vector<vector<double>> data.
I want to copy only the information contained in that "2D matrix" as there are no vectors in CUDA.
So the first approach I used was
vector<vector<double>> *values;
vector<vector<double>>::iterator it;
double *d_values;
double *dst;
checkCudaErr(
cudaMalloc((void**)&d_values, sizeof(double)*M*N)
);
dst = d_values;
for (it = values->begin(); it != values->end(); ++it){
double *src = &((*it)[0]);
size_t s = it->size();
checkCudaErr(
cudaMemcpy(dst, src, sizeof(double)*s, cudaMemcpyHostToDevice)
);
dst += s;
}
After profiling with NVVP I got a very low cudaMemcpy throughput. I think this is logical, as I'm sending a very small number of bytes in each cudaMemcpy call.
So I decided to change the code a little to try to improve this; the second approach is
double *h_values = new double[M*N];
dst = h_values;
for (it = values->begin(); it != values->end(); ++it){
double *src = &((*it)[0]);
size_t s = it->size();
memcpy(dst, src, sizeof(double)*s);
dst += s;
}
checkCudaErr(
cudaMemcpy(d_values, h_values, sizeof(double)*M*N, cudaMemcpyHostToDevice)
);
the result after profiling is still a low memcpy throughput.
So, my question is, how can I improve the copies from host to device?
I'm using a Quadro K4000. I'm getting 25 MB/s for the first case and about 2 GB/s for the second one. M = 5 and N = 2000000. I must say that M = 5 is a common value, but sometimes it can go up to 50.
A reason for your slow throughput can be that you allocate your double matrix with new. That memory is not page-locked. You can either use a system function (I don't know which OS you are on) or the CUDA function that provides this functionality, cudaMallocHost.
Just remove your = new double[M*N] and allocate h_values with cudaMallocHost(&h_values, sizeof(double)*M*N) (and of course don't delete it, but free it with cudaFreeHost).
Btw., the theoretical top speed is 8 GB/s (PCIe 2.0 x16), but in practice you will stay below that (around 6 GB/s).
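A hedged sketch of that change, applied to the second approach from the question (M, N, values, d_values and checkCudaErr as defined there):
double *h_values = nullptr;
checkCudaErr(cudaMallocHost(&h_values, sizeof(double) * M * N));   // page-locked host buffer

double *dst = h_values;
for (auto it = values->begin(); it != values->end(); ++it) {
    memcpy(dst, &((*it)[0]), sizeof(double) * it->size());
    dst += it->size();
}
checkCudaErr(cudaMemcpy(d_values, h_values, sizeof(double) * M * N,
                        cudaMemcpyHostToDevice));
// ... and when the buffer is no longer needed:
checkCudaErr(cudaFreeHost(h_values));   // not delete[]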

cuda register pressure

I have a kernel that does a linear least-squares fit. It turns out the threads are using too many registers and, therefore, the occupancy is low. Here is the kernel:
__global__
void strainAxialKernel(
float* d_dis,
float* d_str
){
int i = threadIdx.x;
float a = 0;
float c = 0;
float e = 0;
float f = 0;
int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
int j;
__shared__ float dis[WINDOW_PER_LINE];
__shared__ float str[WINDOW_PER_LINE];
// fetch data from global memory
dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
__syncthreads();
// least square fit
for (j=-shift; j<NEIGHBOURS-shift; j++)
{
a += j;
c += j*j;
e += dis[i+j];
f += (float(j))*dis[i+j];
}
str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;
// compensate attenuation
if (COMPEN_EXP>0 && COMPEN_BASE>0)
{
str[i]
= (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));
}
// write back to global memory
if (!SIGN_PRESERVE && str[i]<0)
{
d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];
}
else
{
d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];
}
}
I have 32x404 blocks with 96 threads in each block. On the GTS 250, the SM should be able to handle 8 blocks. Yet the Visual Profiler shows I have 11 registers per thread; as a result, occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so registers are the problem.
The performance is not the end of the world. I am just curious whether there is any way I can get around this. Thanks.
There is always a trade-off between the fast but limited registers/shared memory and the slow but large global memory. There's no way to "get around" that trade-off. If you reduce register usage by using global memory instead, you should get higher occupancy but slower memory access.
That said, here are some ideas to use fewer registers:
Can shift be precomputed and stored in constant memory? Then each thread just needs to look up shift[i].
Do a and c have to be floats?
Or, can a and c be removed from the loop and computed once? And thus removed completely?
a is the sum of a simple arithmetic sequence (the loop runs j from -shift to NEIGHBOURS-shift-1, i.e. NEIGHBOURS terms), so reduce it to a closed form... (something like this)
a = NEIGHBOURS * ((-shift) + (NEIGHBOURS - 1 - shift)) / 2
or
a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2
so instead, drop a += j; from the loop and do something like the following (you can probably reduce these expressions further):
str[i] = AMP*((NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2)*e - NEIGHBOURS*f);
str[i] /= ((NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2)*(NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2) - NEIGHBOURS*c);
str[i] /= (float)BLOCK_SPACING;
Occupancy is NOT a problem.
The SM in GTS 250 (compute capability 1.1) may be able to hold 8 blocks (8x96 threads) simultaneously in its registers, but it only has 8 execution units, meaning that only 8 out of 8x96 (or, in your case, 5x96) threads would be advancing at any given moment of time. There's very little value in trying to squeeze more blocks onto the overloaded SM.
In fact, you could try playing with the -maxrregcount option to INCREASE the number of registers; that could have a positive effect on performance.
You can use launch bounds to instruct the compiler to generate a register mapping for a maximum number of threads and a minimum number of blocks per multiprocessor. This can reduce register counts so that you can achieve the desired occupancy.
For your case, Nvidia's occupancy calculator shows a theoretical peak occupancy of 63%, which seems to be what you're achieving. This is due to your register count, as you mention, but it is also due to the number of threads per block. Increasing the number of threads per block to 128 and decreasing the register count to 10 yields 100% theoretical peak occupancy.
To control the launch bounds for your kernel:
__global__ void
__launch_bounds__(128, 6)
MyKernel(...)
{
...
}
Then just launch with a block size of 128 threads and enjoy your occupancy. The compiler should generate your kernel such that it uses 10 or fewer registers.