Only half of the shared memory array is assigned - cuda

Using Nsight, when I step past s_f[sidx] = 5; I see that only half of the shared memory array is assigned:
__global__ void BackProjectPixel(double* val,
                                 double* projection,
                                 double* focalPtPos,
                                 double* pxlPos,
                                 double* pxlGrid,
                                 double* detPos,
                                 double *detGridPos,
                                 unsigned int nN,
                                 unsigned int nS,
                                 double perModDetAngle,
                                 double perModSpaceAngle,
                                 double perModAngle)
{
    const double fx = focalPtPos[0];
    const double fy = focalPtPos[1];
    //extern __shared__ double s_f[64];
    __shared__ double s_f[64];
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
    unsigned int idx = j*nN + i;
    unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int threadsPerSharedMem = 64;
    if (sidx < threadsPerSharedMem)
    {
        s_f[sidx] = 5;
    }
    __syncthreads();
    //double * angle;
    if (sidx < threadsPerSharedMem)
    {
        s_f[idx] = TriPointAngle(detGridPos[0], detGridPos[1], fx, fy, pxlPos[idx*2], pxlPos[idx*2+1], nN);
    }
}
Here is what I observed:
I am wondering why there are only thirty-two 5s. Shouldn't there be sixty-four 5s in s_f? Thanks.

Threads are executed in groups of 32, called warps. Warps group threads in order: in your case one warp gets threads 0-31 and the other gets threads 32-63. In your debugging context, you are probably seeing the results of only the warp that contains threads 0-31.

I am wondering why there are only thirty-two 5s.
There are 32 fives because, as mete says, kernels are executed simultaneously by groups of 32 threads, so-called warps in CUDA terminology.
Shouldn't there be sixty-four 5s in s_f?
There will be 64 fives after the synchronization barrier, i.e. __syncthreads(). So if you place your breakpoint on the first instruction after the __syncthreads() call, you'll see all fives. That's because by that time all the warps of the block will have finished executing all the code prior to __syncthreads().
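For illustration, here is a minimal, self-contained kernel (not your kernel, just a hedged sketch of the same situation) with comments marking where a breakpoint shows 32 versus 64 assigned elements:
#include <cstdio>

__global__ void fillShared()   // launch e.g. fillShared<<<1, dim3(8, 8)>>>();
{
    __shared__ int s_f[64];
    unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;
    if (sidx < 64)
        s_f[sidx] = 5;   // breaking here, you only see the writes of the warps executed so far
    __syncthreads();
    if (sidx == 0)       // breaking here instead, all 64 slots already hold 5
        printf("s_f[63] = %d\n", s_f[63]);
}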
How can I see all warps with Nsight?
You can easily see the values for all threads by putting this into the watch field:
s_f[sidx]
However, the sidx value may become undefined due to optimizations, so it is safer to watch the value of:
s_f[((blockIdx.y * blockDim.y) + threadIdx.y) * nN + (blockIdx.x * blockDim.x) + threadIdx.x]
And indeed, if you want to investigate the values for a particular warp, then, as Robert Crovella points out, you should use conditional breakpoints. If you want to break within the second warp, something like this should work for a two-dimensional grid of two-dimensional blocks (which I presume you are using):
((blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x) == 32
Because 32 is the index of the first thread within the second warp. For other combinations of block and grid dimensions see this useful cheatsheet.
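If it helps, here is a hedged sketch of a small device helper (the name linearThreadId is mine, not from the question) that computes the same linearized index the breakpoint condition uses; dividing it by 32 gives the warp index within the block:
__device__ unsigned int linearThreadId()   // hypothetical helper, 2D grid of 2D blocks
{
    unsigned int blockId  = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int threadId = threadIdx.y * blockDim.x + threadIdx.x;
    return blockId * (blockDim.x * blockDim.y) + threadId;
}
// linearThreadId() == 32 picks the first thread of the second warp, as in the condition above.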

Related

Modifying the basic example VECADD to use the shared memory

I wrote the following kernel to use shared memory in the basic CUDA example vecAdd (sum of two vectors). The code works, but the elapsed time for the kernel execution is the same as for the basic original code. Can someone suggest a way to easily speed up such a code?
__global__ void vecAdd(float *in1, float *in2, float *out, long int len)
{
    __shared__ float s_in1[THREADS_PER_BLOCK];
    __shared__ float s_in2[THREADS_PER_BLOCK];
    unsigned int xIndex = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    s_in1[threadIdx.x] = in1[xIndex];
    s_in2[threadIdx.x] = in2[xIndex];
    out[xIndex] = s_in1[threadIdx.x] + s_in2[threadIdx.x];
}
Can someone suggest a way to easily speed up such a code
There are basically no useful optimizations to make on an operation like vector addition. Because of the nature of the calculation, the code could only ever hope to reach 50% peak arithmetic throughput, and the requirement for three memory transactions per FLOP makes this an intrinsically memory bandwidth bound operation.
As a result, this:
__global__ void vecAdd(float *in1, float *in2, float *out, unsigned int len)
{
    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    if (xIndex < len) {
        float x = in1[xIndex];
        float y = in2[xIndex];
        out[xIndex] = x + y;
    }
}
is about the best-performing variant on most recent hardware, provided the block size is selected for maximum occupancy and len is sufficiently large. For example:
int minGrid, minBlockSize;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &minBlockSize, vecAdd);
int nblocks = (len / minBlockSize) + ((len % minBlockSize > 0) ? 1 : 0);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);
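If you want to confirm that the kernel really is bandwidth bound, a minimal timing sketch (assuming x, y and z are device pointers of length len, as above) could look like this:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// Two reads and one write per element: 3 * len * sizeof(float) bytes moved in total.
double gbs = 3.0 * len * sizeof(float) / (ms * 1.0e6);
printf("elapsed %.3f ms, ~%.1f GB/s\n", ms, gbs);
Comparing that figure against the device's peak memory bandwidth shows how little headroom is left for "optimization".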

read 4 char per thread in 1 transaction in cuda

I have been learning CUDA recently, and I have a question about memory transactions.
What I understand is that in each transaction, a group of 32 consecutive threads (in the same block), called a warp, can access 128 consecutive bytes (32 single-precision words) of memory concurrently.
But in the examples, each thread always accesses a (4-byte) word as one whole variable. So my question is: if my array in global memory is of type char, can all 32 threads still access this piece of memory, each reading 4 consecutive chars at the same time?
So, for example, if I write the code:
__global__
void kernel(char *d_mask)
{
    extern __shared__ char s_tmp[];
    const unsigned int thId = threadIdx.x;
    const unsigned int elementId = 4 * (threadIdx.x + blockDim.x * blockIdx.x);
    s_tmp[4 * thId]     = d_mask[elementId];      // each thread copies its own 4 bytes
    s_tmp[4 * thId + 1] = d_mask[elementId + 1];
    s_tmp[4 * thId + 2] = d_mask[elementId + 2];
    s_tmp[4 * thId + 3] = d_mask[elementId + 3];
    __syncthreads();
    /* calculation */
}
Then will each thread read its 4 bytes concurrently? And if not, how can I manage to do that? Should I use some API like memcpy?
In order to get a properly efficient read, it's necessary to combine the bytes being read into a single transaction; we generally can't do this by breaking things up across several lines of code.
To combine things into a single transaction, there are vector types which combine multiple elements into a single type. As long as we pay attention to proper alignment, we can treat char or unsigned char arrays as arrays of e.g. uchar4, a vector type that combines four characters into a single (32-bit) type. You can find lots more goodies in the CUDA header files vector_types.h and vector_functions.h.
Anyway, we could re-write your sample like this, to take advantage of a "vector load":
__global__
void kernel(char *d_mask)
{
    extern __shared__ char s_tmp[];
    const unsigned int thId = threadIdx.x;
    const unsigned int elementId = threadIdx.x + blockDim.x * blockIdx.x;
    uchar4 *s_tmp_v  = reinterpret_cast<uchar4 *>(s_tmp);
    uchar4 *d_mask_v = reinterpret_cast<uchar4 *>(d_mask);
    s_tmp_v[thId] = d_mask_v[elementId];
    __syncthreads();
    /* calculation */
}
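A hedged host-side sketch of a matching launch (maskLen and the sizes below are illustrative names, not from the question): each thread now handles 4 chars, so the dynamic shared memory size is 4 bytes per thread, and cudaMalloc allocations are aligned well beyond the 4-byte alignment uchar4 requires, so the reinterpret_cast is safe as long as d_mask is not offset by a non-multiple of 4.
int threadsPerBlock = 256;
int sharedBytes     = 4 * threadsPerBlock;                            // 4 chars per thread
int blocks          = (maskLen / 4 + threadsPerBlock - 1) / threadsPerBlock;
kernel<<<blocks, threadsPerBlock, sharedBytes>>>(d_mask);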

Can multiple blocks and threads write to the same output?

I have the following CUDA kernel code which computes the sum squared error of two arrays.
__global__ void kSquaredError(double* data, double* recon, double* error,
                              unsigned int num_elements)
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        *error += pow(data[i] - recon[i], 2);
    }
}
I need a single scalar output (error). In this case, it seems like all threads are writing to error simultaneously. Is there some way I need to synchronize it?
Currently I'm getting a bad result so I'm guessing there is some issue.
Your current implementation is subject to race conditions because all threads try to update the same global memory address at the same time. You could simply use an atomicAdd instead of *error += pow..., but that suffers from performance issues because every update is serialized.
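For reference, a hedged sketch of that naive atomic variant (note that atomicAdd on double requires compute capability 6.0 or newer):
__global__ void kSquaredErrorAtomic(double* data, double* recon, double* error,
                                    unsigned int num_elements)
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        atomicAdd(error, pow(data[i] - recon[i], 2)); // one serialized atomic per element
    }
}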
Instead, you should do a reduction using shared memory, as follows:
__global__ void kSquaredError(double* data, double* recon, double* error,
                              unsigned int num_elements)
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int tid = threadIdx.x;
    extern __shared__ double serror[]; // one slot per thread; pass blockDim.x * sizeof(double) at launch

    double sum = 0.0; // temporary storage of each thread's error
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        sum += pow(data[i] - recon[i], 2);
    }
    serror[tid] = sum; // put each thread's value in shared memory
    __syncthreads();

    for (int i = blockDim.x >> 1; i > 0; i >>= 1) { // reduction in shared memory, halving the active threads each step
        if (tid < i) {
            serror[tid] += serror[tid + i];
        }
        __syncthreads(); // outside the if: make sure all threads reach the barrier and have written to shared memory
    }

    if (tid == 0) { // thread 0 updates the value in global memory
        atomicAdd(error, serror[0]); // same as *error += serror[0]; but atomic (double atomicAdd needs CC 6.0+)
    }
}
It works by the following principle: each thread has its own temporary variable in which it accumulates the error for all of its input; when it has finished, all threads converge at the __syncthreads instruction to ensure that all data is complete.
Then half of the threads in the block take one value each from the corresponding other half and add it to their own; the number of active threads is halved again and again until you are left with one thread (thread 0), which holds the total sum.
Finally, thread 0 updates global memory with an atomicAdd to avoid race conditions with other blocks, if there are any.
If we just used the first example with an atomicAdd on every update, there would be num_elements atomic operations that would be serialized; now there are only gridDim.x atomic operations, which is a lot less.
See Optimizing Parallel Reduction in CUDA for further reading on how reduction in CUDA works.
Edit
Added if statement in the reduction for loop, forgot that.
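For completeness, a minimal host-side launch sketch (d_data, d_recon, d_error and the sizes are illustrative names, not from the question): *error must be zeroed before the launch, and the kernel above needs blockDim.x * sizeof(double) bytes of dynamic shared memory:
int threads = 256;
int blocks  = 64;
cudaMemset(d_error, 0, sizeof(double));   // all-zero bytes is 0.0 for a double
kSquaredError<<<blocks, threads, threads * sizeof(double)>>>(d_data, d_recon,
                                                             d_error, num_elements);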

CUDA efficient division?

I would like to know if there is, by any chance, an efficient way of dividing the elements of an array. I am running with 10000x10000 matrices and it takes a considerable amount of time in comparison with other kernels. Division is an expensive operation, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if((row < N) && (col <= row)){
        if( B[row*N+col] > 0 )
            A[row*N+col] /= (float)B[row*N+col];
    }
}
kernel launched with
int N = 10000;
int threads = 32;
int blocks = (N + threads - 1)/threads;
dim3 t(threads, threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int kmax = N*(N+1)/2;
    int row, col;
    if(k < kmax){
        row = (int)(sqrt(0.25+2.0*k)-0.5);
        col = k - (row*(row+1))>>1;
        if( B[row*N+col] > 0 )
            A[row*N+col] /= (float)B[row*N+col];
    }
}
launched with
int threads = 192;
int totalThreadsNeeded = N*(N+1)/2;
int blocks = (totalThreadsNeeded + threads - 1)/threads;
division<<<blocks, threads>>>(N, A, B);
Why is option B giving a wrong result even though the thread IDs are correct? What is missing here?
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then, because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So an enormous number of GPU cycles is being wasted for no particularly good reason. Further, the access pattern you are using isn't allowing coalesced memory access, which is going to further reduce the performance of the threads which are actually doing useful work.
If I understand your problem correctly, the kernel is only performing element-wise division on the lower triangle of a square array. If this is the case, it could equally be done using something like this:
__global__
void division(int N, float* A, int* B)
{
    for(int row = blockIdx.x; row < N; row += gridDim.x) {
        for(int col = threadIdx.x; col <= row; col += blockDim.x) {
            int val = max(1, B[row*N+col]);
            A[row*N+col] /= (float)val;
        }
    }
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one-dimensional grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SMs with a full complement of 8 blocks and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you pay for the division for all 32 threads in a warp even if both if conditions are true for just one of them. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.
You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel (a / b == a * (1 / b)).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
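As a hedged sketch, the division statement from the first kernel could be rewritten with that intrinsic like this:
if (B[row*N+col] > 0)
    A[row*N+col] = __fdividef(A[row*N+col], (float)B[row*N+col]); // faster, slightly less accurate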
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here; just exchange rows and columns to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.

CUDA 2D Convolution kernel

I'm a beginner in CUDA and I'm trying to implement a Sobel Edge detection kernel.
I'm using this code for it but it doesn't work.
Can anyone tell me what is wrong with it? I just get some -1s and some really big values.
__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, int *gpu_P,
                               int *gpu_Hor, int W, int H)
{
    int X = threadIdx.x;
    int Y = threadIdx.y;
    int sum = 0;
    int k1, k2;
    int min1, min2;
    for (k1 = 0; k1 < 3; k1++)
        for (k2 = 0; k2 < 3; k2++)
            sum += gpu_Hor[k1*3+k2] * gpu_P[(X-k1)*H+Y-k2];
    gpu_Edge_Hor[X*H+Y] = sum/5000;
}
I call this kernel like this:
dim3 dimBlock(W,H);
dim3 dimGrid(1,1);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);
First, your problem is that you process an image of 480x720 pixels, which is 345,600 threads. CUDA supports a maximum thread block size of 1024 threads for compute capability 2.0 and greater, and 512 for earlier devices, so you can't execute that many threads in one block. The line dim3 dimBlock(W,H); is incorrect. You should divide your threads among several blocks.
Another problem is that CUDA processes data in row-major order, so you should change your memory access pattern.
The right memory access pattern for 2D arrays in CUDA is
BaseAddress + width * Y + X
where
unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
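Putting both points together, a hedged launch sketch for the WxH image would split the work into, say, 16x16 blocks and guard against out-of-range threads inside the kernel:
dim3 dimBlock(16, 16);
dim3 dimGrid((W + dimBlock.x - 1) / dimBlock.x,
             (H + dimBlock.y - 1) / dimBlock.y);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);
// inside the kernel, after computing X and Y as above: if (X >= W || Y >= H) return;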