CUDA: writing to shared memory increses kernel time execution a lot - cuda

I am trying to reduce 65536 elements array (calculate sum of elements in it) with help of CUDA. Kernel looks like following (please, ignore *dev_distanceFloats and index arguments for now)
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
}
ant it launched as one block with 256 threads:
kernel_calcSum <<<1, 256 >>>(dev_spFloats1, dev_distanceFloats, index);
So far, so good, each of 256 threads takes 256 elements from global memory and calculates it's sum in local variable mySum. Kernel execution time is about 45 milliseconds.
Next step is to introduce shared memory among those 256 threads in block (to calculate sum of mySum), so kernel becomes as following:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = mySum;
}
I just added writing to shared memory, but execution time increases from 45 milliseconds to 258 milliseconds (I am cheking this with help of NVidia Visual Profiler 5.0.0).
I realize that there are 8 bank conflicts for each thread when writing to sdata variable (I am on GTX670 which have capability 3.0 with 32 banks). As an experiment - I tried to reduce of threads to 32 when launching kernel - but time still 258 milliseconds.
Question 1: why writing to shared memory takes so long in my case ?
Question 2: is there any tool, which show in details kinda "execution plan" (timings for memory access, conflicts, etc) ?
Thanks for your suggestions.
Update:
playing with kernel - I set sdata to some constant for each thread:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = 111;
}
and timings are back to 48 millisec.
So, changing
sdata[tid] = mySum;
to
sdata[tid] = 111;
made this.
Is this compiler optimization (may be it just removed this line?) or by some reason copying from local memory (register?) to shared takes long?

Both of your kernels do not do anything, because they do not write out results to memory that would still be accessible after the kernel finishes.
In the first case, the compiler is clever enough to notice this and optimize away the whole calculation.
In the second case where shared memory is involved, the compiler does not notice this as the flow of information through shared memory would be harder to track. It thus leaves the calculation in.
Pass in a pointer to global memory (as you already do) and write out results via this pointer.

Shared memory is not the right thing for this. What you need are warp atomic operations, to sum up within a warp, then communicate the intermediary results between warps. There's example code demonstrating this shipping with CUDA.
Summing up elements is one of those tasks where massive parallization won't help much and the GPU in fact can be outperformed by a CPU.

Related

CUDA shared memory bank conflict unexpected timing

I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) access 32 integers of size 32-bits each in the following 2 scenarios:
When there is no bank conflict (offset=1)
When there is a bank conflict (offset=32, all threads are accessing bank 0)
Here is a sample of the code (only the kernel):
__global__ void kernel(int offset) {
__shared__ uint32_t shared_memory[MEMORY_SIZE];
// init shared memory
if (threadIdx.x == 0) {
for (int i = 0; i < MEMORY_SIZE; i++)
shared_memory[i] = i;
}
__syncthreads();
uint32_t index = threadIdx.x * offset;
// 2048 / 32 = 64
for (int i = 0; i < 64; i++)
{
shared_memory[index] += index * 10;
index += 32;
index %= MEMORY_SIZE;
__syncthreads();
}
}
I expected the version with offset=32 to run slower than the one with offset=1 as access should be serialized but found out that they have similar output time. How is that possible ?
You have only 1 working warp, so biggest problem with your performance is that each (or most) GPU command awaits for finishing previous one. This hides most shared memory conflicts slowdown. You also have a lot of work per each shared memory access. How many small commands there are in cosf? Try simple integer arithmetics instead.

CUDA dot product gives wrong results

I wrote a dot product code with CUDA to compute the dot product of two double vectors. The kernel was invoked by N threads(N<1024) 1 block. But it can't give correct results. I can't figure it out.
__global__ void dotthread(double* a, double *b,double *sum, int N)
{
int tid = threadIdx.x;
sum[0] = sum[0]+a[tid] * b[tid]; //every thread write to the sum[0]
__syncthreads();
}
Let's look at two of your three lines of code:
sum[0] = sum[0]+a[tid] * b[tid]; //every thread write to the sum[0]
__syncthreads();
The first line contains a memory race. Every thread in the block will simultaneously attempt to write to sum[0]. There is nothing in the cuda execution model which can stop this from happening. There is no automatic serialization or memory protections which can stop this behaviour.
The second line is an instruction barrier. This means that each warp of threads will be blocked until every warp of threads has reached the barrier. It has no effect whatsoever on prior instructions, it has no effect whatsoever on memory consistency or on the behaviour of any memory transactions which your code issues.
The code you have written is irreversibly broken. The canonical way to perform this sort of operation would be via a parallel reduction. There are a number of different ways this can be done. It is probably the most described and documented parallel algorithm for GPUs. It you have installed the CUDA toolkit, you already have both a complete working example and a comprehensive paper describing the algorithm as it would be implemented using shared memory. I suggest you study it.
You can see an (almost) working implementation of a dot product using shared memory here which I recommend you study as well. You can also find optimal implementations of the parallel block reduction in libraries such as cub
I wrote two versions of dot product procedures. The first one uses the atomiAdd function, the second one allocates one shared variable for each block.
The computational time is 3.33 ms and 0.19 ms respectively compared to 0.17 ms and 411.43 ms of the reduction dot product and a one thread dot product.
GPU Device 0: "GeForce GTX 1070 Ti" with compute capability 6.1
2000000flow array allocated 2147483647
naive elapsedTimeIn Ms 411.437042 Ms
sum is 6.2831853071e+06
thread atomic add elapsedTimeIn Ms 3.3371520042 Ms
sum is 6.2831853071e+06
cache reduction elapsedTimeIn Ms 0.1764480025 Ms
sum is 6.2831853072e+06
cache atomicadd elapsedTimeIn Ms 0.1914239973 Ms
sum is 6.2831853072e+06
__global__ void dotatomicAdd(double* a, double *b,double *sum, int N)
{
int tid = blockDim.x * blockIdx.x + threadIdx.x;
while (tid < N){
double t=a[tid] * b[tid];
atomicAdd(sum,t);
tid+=blockDim.x * gridDim.x;
}
}
__global__ void dotcache(double* a, double *b,double *c, int N)
{
__shared__ double cache;
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
double temp = 0.0;
cache=0.0;
__syncthreads();
while (tid < N) {
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
atomicAdd(&cache,temp);
__syncthreads();
if (cacheIndex == 0) c[blockIdx.x] = cache;
}

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels my self for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D, I first allocate enough space on the device for X and X_transpose*X. I copy X to the device, my kernel takes two inputs, the X matrix, then the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution(just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *(float*)((char*)Xsub.data + row*Xsub.pitch) + col;
XTs[col][row] = *(float*)((char*)XTsub.data + row*XTsub.pitch) + col;
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running the on a Kepler card, GTX 670 4GB.
Smaller block size like 16x16 or 8x8 may be faster. This slides also demos larger non-square size of block/shared mem may be faster for particular matrix size.
For shared mem allocation, add a dumy element on the leading dimension by using [BLOCK_SIZE][BLOCK_SIZE+1] to avoid the bank conflict.
Try to unroll the inner for loop by using #pragma unroll
On the other hand, You probably won't be much faster than matlab GPU code for large enough A'*A. Since the performance bottleneck of matlab is the invoking overhead rather than the kernel performance.
The cuBLAS routine culas_gemm() may have highest performance for matrix multiplication. You could compare yours with it.
MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's a open source project. You may also get some ideas from their code.

Improving CUDA SDK matrixMul by the prefetching intrinsic

Here is the part of CUDA SDK (2.3) matrixMultiply kernel:
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep) {
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
int XI=wA * ty + tx;
int XII=wB * ty + tx;
////////////////////
// PREFETCH BLOCK //
////////////////////
AS(ty, tx) = A[a + XI];
BS(ty, tx) = B[b + XII];
__syncthreads();
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(ty, k) * BS(k, tx);
__syncthreads();
}
This version of matrix multiply brings a tile into shared memory and performs the calculation at the shared memory bandwidth. I want to improve the performance by prefetching the data of next iteration into L1 cache. I use the prefetch intrinsic as suggested here and inserted the following commands into the PREFETCH BLOCK above:
long long int k,kk;
k=((long long int)A+aStep); if(k<=aEnd) prefetch_l1(k+XI);
kk=((long long int)B+bStep); if(kk<=aEnd) prefetch_l1(kk+XII);
After test, two versions (with or without prefetching) perform very similar (average of 3 runs):
without prefetching: 6434.866211 (ms)
with prefetching: 6480.041016 (ms)
Question:
I expect to see some speedup out of the prefetching but I'm confused with the results. Any body has any justification why these two implementations perform very close? Maybe I am performing a wrong prefetching.
Thank you in advance.
Further informations:
GPU: Tesla C2050
CUDA version: 4.0
inline __device__ void prefetch_l1 (unsigned int addr)
{
asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}
Prefetching (on any architecture) is only of benefit if:
you have memory bandwidth to spare and
you can initiate the prefetch at the correct time, i.e. sufficiently far ahead of time before the data is actually needed and at a time when spare memory bandwidth is available.
If you can't meet these criteria then prefetching is not going to help, and may even do more harm than good.

CUDA: Shared memory allocation with overlapping borders

Is there an easy way (google hasn't delivered...) to allocate per-block shared memory regions from a single input array such that there can be an overlap?
The simple example is string searching; Saw I want to dice up the input text, have each thread in each block search for a pattern starting from text[thread_id], but want the data assigned to each block to overlap by the pattern length so matching cases that fall across the border are still found.
I.e the total memory size allocated to shared memory on each block is
(blocksize+patternlength)*sizeof(char)
I'm probably missing something simple and am currently diving through the CUDA guide, but would appreciate some guidance.
UPDATE: I suspect some people have misunderstood my question (or I mis-explained it).
Say I have a dataset QWERTYUIOP, and i want to search for a 3 character match, and i dice up the dataset (arbitrarily) into 4's for each thread block; QWER TYUI OPxx
This is simple enough to accomplish but the algorithm fails if the 3 character match is actually looking for IOP.
In this case, what I want is for each block to have in shared memory:
QWERTY TYUIOP OPxxxx
ie each block gets assigned the blocksize+patternlength-1 characters so no memory border issues occur.
Hope that explains things better.
Since #jmilloy is being persistent... :P
//VERSION 1: Simple
__global__ void gpuSearchSimple(char *T, int lenT, char *P, int lenP, int *pFound)
{
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (T[startIndex+i] != P[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
//VERSION 2: Texture
__global__ void gpuSearchTexture(int lenT, int lenP, int *pFound)
{
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (tex1Dfetch(texT,startIndex+i) != tex1Dfetch(texP,i)) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
//Version 3: Shared
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
extern __shared__ char shaP[];
for (int i=0;threadIdx.x+i<lenP; i+=blockDim.x){
shaP[threadIdx.x+i]= tex1Dfetch(texP,threadIdx.x+i);
}
__syncthreads();
//At this point shaP is populated with the pattern
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
// only continue if an earlier instance hasn't already been found
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (tex1Dfetch(texT,startIndex+i) != shaP[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
What I would like to have done is to put the text into shared memory chunks, as described in the rest of the question, instead of keeping the text in texture memory for the later versions.
I am not sure that question makes all that much sense. You can dynamically size a shared allocation memory at runtime like this:
__global__ void kernel()
{
extern __shared__ int buffer[];
....
}
kernel<<< gridsize, blocksize, buffersize >>>();
but the contents of the buffer are undefined at the beginning of the kernel. You will have to devise a scheme in the kernel to load from global memory with the overlap that you want to ensure that your pattern matching will work as you want it to.
No. Shared memory is shared between threads in a block, and is ONLY accessible to the block it is assigned to. You cannot have shared memory that is available to two different blocks.
As far as I know, shared memory actually resides on the multiprocessors, and a thread can only access the shared memory from the multiprocessor that it is running on. So this is a physical limitation. (I guess if two blocks reside on one mp, a thread from one block may be able to unpredictably access the shared memory that was allocated to the other block).
Remember that you need to explicitly copy the data from global memory to shared memory. It is a simple matter to copy overlapping regions of the string to non-overlapping shared memory.
I think getting your data where you need it is the majority of the work required in developing CUDA programs. My guidance is that you start with a version that solves the problem without using any shared memory first. In order for that to work, you will solve your overlapping problem and the shared memory implementation will be easy!
edit 2
after answer was marked as correct
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
extern __shared__ char* shared;
char* shaP = &shared[0];
char* shaT = &shared[lenP];
//copy pattern into shaP in parallel
if(threadIdx.x < lenP)
shaP[threadIdx.x] = tex1Dfetch(texP,threadIdx.x);
//determine texT start and length for this block
blockStartIndex = blockIdx.x * gridDim.x/lenT;
lenS = gridDim.x/lenT + lenP - 1;
//copy text into shaT in parallel
shaT[threadIdx.x] = tex1Dfetch(texT,blockStartIndex + threadIdx.x);
if(threadIdx.x < lenP)
shaP[blockDim.x + threadIdx.x] = text1Dfetch(texT,blockStartIndex + blockDim.x + threadIdx.x)
__syncthreads();
//We have one pattern in shaP for each thread in the block
//We have the necessary portion of the text (with overlaps) in shaT
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (shaT[threadIdx.x+i] != shaP[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, blockStartIndex + threadIdx.x);
}
key notes:
we only need one copy of the pattern in shared memory per block - they can all use it
shared memory needed per block is lenP + lenS (where lenS is your blocksize + patternlength)
the kernel assumes that gridDim.x * blockDim.x = lenT (the same as version 1)
we can copy into shared memory in parallel (no need for for loops if you have enough threads)
Overlapping shared memory is not good, the thread will have to synchronize each time they want to access the same address in shared memory (although in architecture >= 2.0 this has been mitigated).
The simplest idea that comes into my mind is to duplicate the portion of the text that you want to be overlapped.
Instead of reading from the global memory in exact chuncks:
AAAA BBBB CCCC DDDD EEEE
Read with overlapping:
AAAA BBBB CCCC CCCC DDDD EEEEE