CUDA shared memory bank conflict unexpected timing - cuda

I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) access 32 integers of size 32-bits each in the following 2 scenarios:
When there is no bank conflict (offset=1)
When there is a bank conflict (offset=32, all threads are accessing bank 0)
Here is a sample of the code (only the kernel):
__global__ void kernel(int offset) {
__shared__ uint32_t shared_memory[MEMORY_SIZE];
// init shared memory
if (threadIdx.x == 0) {
for (int i = 0; i < MEMORY_SIZE; i++)
shared_memory[i] = i;
}
__syncthreads();
uint32_t index = threadIdx.x * offset;
// 2048 / 32 = 64
for (int i = 0; i < 64; i++)
{
shared_memory[index] += index * 10;
index += 32;
index %= MEMORY_SIZE;
__syncthreads();
}
}
I expected the version with offset=32 to run slower than the one with offset=1 as access should be serialized but found out that they have similar output time. How is that possible ?

You have only 1 working warp, so biggest problem with your performance is that each (or most) GPU command awaits for finishing previous one. This hides most shared memory conflicts slowdown. You also have a lot of work per each shared memory access. How many small commands there are in cosf? Try simple integer arithmetics instead.

Related

Free memory occupied by cudaMemGetInfo

I have the following simple code to find available GPUs
int * getFreeGpuList(int *numFree) {
int * gpuList;
int nDevices;
int i, j = 0, count = 0;
cudaGetDeviceCount(&nDevices);
gpuList = (int *) malloc(nDevices * sizeof(int));
for (i = 0; i < nDevices; ++i) {
cudaSetDevice(i);
size_t freeMem;
size_t totalMem;
cudaMemGetInfo(&freeMem, &totalMem);
if (freeMem > .9 * totalMem) {
gpuList[j] = i;
count++;
j++;
}
}
*numFree = count;
return gpuList;
}
The problem is that cudaMemGetInfo occupies some memory (~150MB in my case) in each GPU. This code is a part of a bigger program that runs for a long time, and I often run several processes at the same time, so in the end the memory occupied by this function is significant. Could you please let me know how I can free the GPU memory occupied by cudaMemGetInfo? Thanks!
Based on some insight from talonmies above that cudaSetDevice creates a context and occupies some memory in the device, I found out that cudaDeviceReset can "explicitly destroys and cleans up all resources associated with the current device in the current process" without affecting other processes on the same device.
Update Nov 26: If one wants to query GPU info, it's better using the NVML library. In my experience, it is much faster and does not take up memory for simple memory and name queryings.

CUDA: writing to shared memory increses kernel time execution a lot

I am trying to reduce 65536 elements array (calculate sum of elements in it) with help of CUDA. Kernel looks like following (please, ignore *dev_distanceFloats and index arguments for now)
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
}
ant it launched as one block with 256 threads:
kernel_calcSum <<<1, 256 >>>(dev_spFloats1, dev_distanceFloats, index);
So far, so good, each of 256 threads takes 256 elements from global memory and calculates it's sum in local variable mySum. Kernel execution time is about 45 milliseconds.
Next step is to introduce shared memory among those 256 threads in block (to calculate sum of mySum), so kernel becomes as following:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = mySum;
}
I just added writing to shared memory, but execution time increases from 45 milliseconds to 258 milliseconds (I am cheking this with help of NVidia Visual Profiler 5.0.0).
I realize that there are 8 bank conflicts for each thread when writing to sdata variable (I am on GTX670 which have capability 3.0 with 32 banks). As an experiment - I tried to reduce of threads to 32 when launching kernel - but time still 258 milliseconds.
Question 1: why writing to shared memory takes so long in my case ?
Question 2: is there any tool, which show in details kinda "execution plan" (timings for memory access, conflicts, etc) ?
Thanks for your suggestions.
Update:
playing with kernel - I set sdata to some constant for each thread:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = 111;
}
and timings are back to 48 millisec.
So, changing
sdata[tid] = mySum;
to
sdata[tid] = 111;
made this.
Is this compiler optimization (may be it just removed this line?) or by some reason copying from local memory (register?) to shared takes long?
Both of your kernels do not do anything, because they do not write out results to memory that would still be accessible after the kernel finishes.
In the first case, the compiler is clever enough to notice this and optimize away the whole calculation.
In the second case where shared memory is involved, the compiler does not notice this as the flow of information through shared memory would be harder to track. It thus leaves the calculation in.
Pass in a pointer to global memory (as you already do) and write out results via this pointer.
Shared memory is not the right thing for this. What you need are warp atomic operations, to sum up within a warp, then communicate the intermediary results between warps. There's example code demonstrating this shipping with CUDA.
Summing up elements is one of those tasks where massive parallization won't help much and the GPU in fact can be outperformed by a CPU.

Improving CUDA SDK matrixMul by the prefetching intrinsic

Here is the part of CUDA SDK (2.3) matrixMultiply kernel:
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep) {
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
int XI=wA * ty + tx;
int XII=wB * ty + tx;
////////////////////
// PREFETCH BLOCK //
////////////////////
AS(ty, tx) = A[a + XI];
BS(ty, tx) = B[b + XII];
__syncthreads();
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(ty, k) * BS(k, tx);
__syncthreads();
}
This version of matrix multiply brings a tile into shared memory and performs the calculation at the shared memory bandwidth. I want to improve the performance by prefetching the data of next iteration into L1 cache. I use the prefetch intrinsic as suggested here and inserted the following commands into the PREFETCH BLOCK above:
long long int k,kk;
k=((long long int)A+aStep); if(k<=aEnd) prefetch_l1(k+XI);
kk=((long long int)B+bStep); if(kk<=aEnd) prefetch_l1(kk+XII);
After test, two versions (with or without prefetching) perform very similar (average of 3 runs):
without prefetching: 6434.866211 (ms)
with prefetching: 6480.041016 (ms)
Question:
I expect to see some speedup out of the prefetching but I'm confused with the results. Any body has any justification why these two implementations perform very close? Maybe I am performing a wrong prefetching.
Thank you in advance.
Further informations:
GPU: Tesla C2050
CUDA version: 4.0
inline __device__ void prefetch_l1 (unsigned int addr)
{
asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}
Prefetching (on any architecture) is only of benefit if:
you have memory bandwidth to spare and
you can initiate the prefetch at the correct time, i.e. sufficiently far ahead of time before the data is actually needed and at a time when spare memory bandwidth is available.
If you can't meet these criteria then prefetching is not going to help, and may even do more harm than good.

Parallel Reduction in CUDA for calculating primes

I have a code to calculate primes which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
check = 1;
for (j = 2; j <= sqrt_limit; j++)
{
if ( !(j&1) && (i&(j-1)) == 0 )
{
check = 0;
break;
}
if ( j&1 && i%j == 0 )
{
check = 0;
break;
}
}
if (check)
pcount++;
}
I am trying to port it to GPU, and I would want to reduce the count as I did for the OpenMP example above. Following is my code, which apart from giving incorrect results is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
__shared__ int s_flags[NTHREADS];
if (gid > sqrootN && gid < N)
s_flags[tid] = flags[gid];
else
return;
__syncthreads();
s_flags[tid] = 1;
for (j = 2; j <= sqrootN; j++)
{
if ( gid%j == 0 )
{
s_flags[tid] = 0;
break;
}
}
//reduce
for(unsigned int s=1; s < blockDim.x; s*=2)
{
if( tid % (2*s) == 0 )
{
s_flags[tid] += s_flags[tid + s];
}
__syncthreads();
}
//write results of this block to the global memory
if (tid == 0)
o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast, I think the bottleneck is the for loop, and I am not sure how to replace it. And next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array, I have marked the primes from 2 to sqroot(N), in this kernel I am calculating primes from sqroot(N) to N, but I would need to check whether each number in {sqroot(N),N} is divisible by primes in {2,sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestion, I modified my code (I understand about the comment on syncthreads now better); I realized that I do not need the flags array and just the global indexes work in my case. What concerns me at this point is the slowness of the code (more than correctness) that could be attributed to the for loop. Also, after a certain data size (100000), the kernel was producing incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member in the NVidia forum pointed out that that may be because my data size is not of a power of 2).
So there are still three (may be related) questions -
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
volatile __shared__ int s_flags[NTHREADS];
s_flags[tid] = 1;
for (unsigned int j=2; j<=sqrootN; j++)
{
if ( gid % j == 0 )
s_flags[tid] = 0;
}
__syncthreads();
//reduce
reduce(s_flags, tid, o_flags);
}
While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX is bar.sync and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a
warp are active. Thus, if any thread in a warp executes a bar
instruction, it is as if all the threads in the warp have executed the
bar instruction. All threads in the warp are stalled until the barrier
completes, and the arrival count for the barrier is incremented by the
warp size (not the number of active threads in the warp). In
conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the condition identically (the
warp does not diverge). Since barriers are executed on a per-warp
basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction, this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and not recommended. I can recall reading some material suggesting that Sieve of Atkin was a useful approach to fast prime generation on GPUs.

CUDA: Shared memory allocation with overlapping borders

Is there an easy way (google hasn't delivered...) to allocate per-block shared memory regions from a single input array such that there can be an overlap?
The simple example is string searching; Saw I want to dice up the input text, have each thread in each block search for a pattern starting from text[thread_id], but want the data assigned to each block to overlap by the pattern length so matching cases that fall across the border are still found.
I.e the total memory size allocated to shared memory on each block is
(blocksize+patternlength)*sizeof(char)
I'm probably missing something simple and am currently diving through the CUDA guide, but would appreciate some guidance.
UPDATE: I suspect some people have misunderstood my question (or I mis-explained it).
Say I have a dataset QWERTYUIOP, and i want to search for a 3 character match, and i dice up the dataset (arbitrarily) into 4's for each thread block; QWER TYUI OPxx
This is simple enough to accomplish but the algorithm fails if the 3 character match is actually looking for IOP.
In this case, what I want is for each block to have in shared memory:
QWERTY TYUIOP OPxxxx
ie each block gets assigned the blocksize+patternlength-1 characters so no memory border issues occur.
Hope that explains things better.
Since #jmilloy is being persistent... :P
//VERSION 1: Simple
__global__ void gpuSearchSimple(char *T, int lenT, char *P, int lenP, int *pFound)
{
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (T[startIndex+i] != P[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
//VERSION 2: Texture
__global__ void gpuSearchTexture(int lenT, int lenP, int *pFound)
{
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (tex1Dfetch(texT,startIndex+i) != tex1Dfetch(texP,i)) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
//Version 3: Shared
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
extern __shared__ char shaP[];
for (int i=0;threadIdx.x+i<lenP; i+=blockDim.x){
shaP[threadIdx.x+i]= tex1Dfetch(texP,threadIdx.x+i);
}
__syncthreads();
//At this point shaP is populated with the pattern
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
// only continue if an earlier instance hasn't already been found
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (tex1Dfetch(texT,startIndex+i) != shaP[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
What I would like to have done is to put the text into shared memory chunks, as described in the rest of the question, instead of keeping the text in texture memory for the later versions.
I am not sure that question makes all that much sense. You can dynamically size a shared allocation memory at runtime like this:
__global__ void kernel()
{
extern __shared__ int buffer[];
....
}
kernel<<< gridsize, blocksize, buffersize >>>();
but the contents of the buffer are undefined at the beginning of the kernel. You will have to devise a scheme in the kernel to load from global memory with the overlap that you want to ensure that your pattern matching will work as you want it to.
No. Shared memory is shared between threads in a block, and is ONLY accessible to the block it is assigned to. You cannot have shared memory that is available to two different blocks.
As far as I know, shared memory actually resides on the multiprocessors, and a thread can only access the shared memory from the multiprocessor that it is running on. So this is a physical limitation. (I guess if two blocks reside on one mp, a thread from one block may be able to unpredictably access the shared memory that was allocated to the other block).
Remember that you need to explicitly copy the data from global memory to shared memory. It is a simple matter to copy overlapping regions of the string to non-overlapping shared memory.
I think getting your data where you need it is the majority of the work required in developing CUDA programs. My guidance is that you start with a version that solves the problem without using any shared memory first. In order for that to work, you will solve your overlapping problem and the shared memory implementation will be easy!
edit 2
after answer was marked as correct
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
extern __shared__ char* shared;
char* shaP = &shared[0];
char* shaT = &shared[lenP];
//copy pattern into shaP in parallel
if(threadIdx.x < lenP)
shaP[threadIdx.x] = tex1Dfetch(texP,threadIdx.x);
//determine texT start and length for this block
blockStartIndex = blockIdx.x * gridDim.x/lenT;
lenS = gridDim.x/lenT + lenP - 1;
//copy text into shaT in parallel
shaT[threadIdx.x] = tex1Dfetch(texT,blockStartIndex + threadIdx.x);
if(threadIdx.x < lenP)
shaP[blockDim.x + threadIdx.x] = text1Dfetch(texT,blockStartIndex + blockDim.x + threadIdx.x)
__syncthreads();
//We have one pattern in shaP for each thread in the block
//We have the necessary portion of the text (with overlaps) in shaT
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (shaT[threadIdx.x+i] != shaP[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, blockStartIndex + threadIdx.x);
}
key notes:
we only need one copy of the pattern in shared memory per block - they can all use it
shared memory needed per block is lenP + lenS (where lenS is your blocksize + patternlength)
the kernel assumes that gridDim.x * blockDim.x = lenT (the same as version 1)
we can copy into shared memory in parallel (no need for for loops if you have enough threads)
Overlapping shared memory is not good, the thread will have to synchronize each time they want to access the same address in shared memory (although in architecture >= 2.0 this has been mitigated).
The simplest idea that comes into my mind is to duplicate the portion of the text that you want to be overlapped.
Instead of reading from the global memory in exact chuncks:
AAAA BBBB CCCC DDDD EEEE
Read with overlapping:
AAAA BBBB CCCC CCCC DDDD EEEEE