CUDA shared memory: does a gap-access pattern penalize performance? - cuda

I am dealing with a CUDA shared memory access pattern and I am not sure whether it is fine or carries some sort of performance penalty.
Suppose I have 512 integers in shared memory:
__shared__ int snums[516];
and half as many threads, that is, 256 threads.
The kernel works as follows:
(1) The block of 256 threads first applies a function f(x) to the even locations of snums[], then (2) it applies f(x) to the odd locations of snums[]. Function f(x) acts on the local neighborhood of the given number x, then changes x to a new value. There is a __syncthreads() in between (1) and (2).
Clearly, while I am doing (1), there are 32-bit gaps in shared memory because the odd locations are not being accessed. The same occurs in (2): there will be gaps at the even locations of snums[].
From what I read in the CUDA documentation, memory bank conflicts should occur when threads access the same locations, but it does not talk about gaps.
Will there be any problem with the banks that could incur a performance penalty?

I guess you meant:
__shared__ int snums[512];
Will there be any bank conflict and performance penalty?
Assuming at some point your code does something like:
int a = snums[2*threadIdx.x]; // this would access every even location
the above line of code would generate an access pattern with 2-way bank conflicts. A 2-way bank conflict means the above line of code takes approximately twice as long to execute as the optimal, conflict-free line of code (depicted below).
If we were to focus only on the above line of code, the obvious approach to eliminating the bank conflict would be to re-order the storage pattern in shared memory so that all of the data items previously stored at snums[0], snums[2], snums[4] ... are now stored at snums[0], snums[1], snums[2] ... thus effectively moving the "even" items to the beginning of the array and the "odd" items to the end of the array. That would allow an access like so:
int a = snums[threadIdx.x]; // no bank conflicts
However you have stated that a calculation neighborhood is important:
Function f(x) acts on the local neighborhood of the given number x,...
So this sort of reorganization might require some special indexing arithmetic.
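For illustration, here is a minimal sketch of that split layout (the kernel, the name snums_packed, and the constant HALF are made up for the example; it assumes a 512-element input and a 256-thread block):
#define N    512
#define HALF (N / 2)

__global__ void splitLayoutSketch(const int *in)
{
    __shared__ int snums_packed[N];   // [0, HALF) holds the even items, [HALF, N) the odd items

    // Pack: thread t stores one even and one odd element into the split layout.
    snums_packed[threadIdx.x]        = in[2 * threadIdx.x];       // even items
    snums_packed[HALF + threadIdx.x] = in[2 * threadIdx.x + 1];   // odd items
    __syncthreads();

    // Phase (1): consecutive threads hit consecutive banks -- no conflict.
    int a = snums_packed[threadIdx.x];
    // Phase (2): same stride-1 pattern on the odd half -- also conflict-free.
    int b = snums_packed[HALF + threadIdx.x];
    (void)a; (void)b;   // the real f(x) neighborhood logic would go here
}
Mapping neighbors back needs the extra indexing arithmetic mentioned above: original index 2*i lives at packed index i, while original index 2*i+1 lives at packed index HALF + i.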
On newer architectures, shared memory bank conflicts don't occur when threads access the same location but do occur if they access locations in the same bank (that are not the same location). The bank is simply the lowest order bits of the 32-bit index address:
snums[0] : bank 0
snums[1] : bank 1
snums[2] : bank 2
...
snums[32] : bank 0
snums[33] : bank 1
...
(the above assumes 32-bit bank mode)
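As a quick sanity check (a sketch, not part of the original answer; assumes the 32-bit bank mode above), the bank of an int element is just its word index modulo 32:
// Bank of a 4-byte shared-memory element, assuming 32 banks of 4 bytes each.
__host__ __device__ inline int bankOfWord(int wordIndex)
{
    return wordIndex & 31;   // equivalent to wordIndex % 32
}
Applying this to snums[2*threadIdx.x] for one warp (threadIdx.x = 0..31) yields banks 0, 2, 4, ..., 30, 0, 2, ..., 30: every even-numbered bank is requested by two threads, which is exactly the 2-way conflict described above.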
This answer may also be of interest

Related

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for (int i = 0; i < big_number; i++) {
    auto thread_idx = determine_thread_index_into_its_own_array();
    T value = calculate_value();
    write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is easy-peasy: just place all of thread i's data at shared memory addresses i, 32+i, 64+i, 96+i, etc. This puts all of thread i's data in the same bank, distinct from the other lanes' banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler microarchitectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp in which each lane reads a single 64-bit value (since a single transaction provides at most 32 bits to each lane).
This is achievable in practice with the placement pattern analogous to the one the OP described for 32-bit data: for a T* arr, have lane i read its idx'th element as arr[idx * 32 + i] (see the sketch below). This compiles so that two transactions occur:
The lower 16 lanes obtain their data from the first 32*4 bytes of arr (utilizing all banks)
The higher 16 lanes obtain their data from the next 32*4 bytes of arr (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately. That means it can do better than the simplistic "break up T into halves" idea the earlier answer proposed.
(This answer is based on #RobertCrovella's comments.)
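A minimal sketch of that interleaved layout (hypothetical kernel; N is an arbitrary per-lane element count chosen for the sketch; assumes a single warp):
#include <cstdint>

constexpr int N = 4;   // elements per lane (arbitrary for the sketch)

__global__ void interleaved64(uint64_t *out)
{
    __shared__ uint64_t arr[N * 32];
    const int lane = threadIdx.x & 31;

    // Lane i's idx'th element lives at arr[idx * 32 + i].
    for (int idx = 0; idx < N; ++idx)
        arr[idx * 32 + lane] = uint64_t(lane) * N + idx;   // placeholder values
    __syncwarp();

    // For each idx, the warp touches 32 consecutive 64-bit words, which the
    // hardware services in two conflict-free 128-byte transactions.
    uint64_t acc = 0;
    for (int idx = 0; idx < N; ++idx)
        acc += arr[idx * 32 + lane];
    out[lane] = acc;
}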
On Kepler GPUs, this had a simple solution: Just change the bank size! Kepler supported setting the shared memory bank size to 8 instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
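For reference, that Kepler-era switch is a single runtime call (it still compiles on later architectures, where the setting simply has no effect and is now deprecated):
// Ask the device to use 8-byte-wide shared memory banks (effective on Kepler only).
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);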
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N values, each consecutive pair being the low and the high 32-bits of a T.
To access a 64-bit value, two half-T accesses are made, and the T is composed with something like:
uint64_t joined =
    (uint64_t{reinterpret_cast<uint32_t&>(upper_half)} << 32) |
     uint64_t{reinterpret_cast<uint32_t&>(lower_half)};
auto& my_t_value = reinterpret_cast<T&>(joined);
and the same in reverse when writing.
As the comments suggest, it is better to use 64-bit accesses, as described in this answer.

Which atomics targeting pattern is better for a GPU warp: All-the-same or all-distinct?

Suppose we have:
A single warp (of 32 threads)
Each thread t has 32 int values val_t,0 ... val_t,31.
Each value val_t,i needs to be added, atomically, to a variable dest_i, which resides in (Option 1) global device memory or (Option 2) shared block memory.
Which access pattern would be faster to perform these additions?
All-the-same:
All threads atomic-add val_t,1 to dest_1.
All threads atomic-add val_t,2 to dest_2.
etc.
All-distinct:
Each thread, with index t, atomic-adds val_t,t to dest_t.
Each thread, with index t, atomic-adds val_t,(t+1) mod 32 to dest_(t+1) mod 32.
etc.
In other words, is it faster when all threads of a warp issue their atomic adds to the same location at once, or when no two atomic writes target the same location? I can think of hardware which would carry out either option faster; I want to know what's actually implemented.
Thoughts:
It's possible for a GPU to have hardware which bunches together multiple atomic ops from the same warp to a single destination, so that they actually count as just one, or at least can be scheduled together, and thus all threads will proceed to the next instruction at the same time, not waiting for the last atomic op to conclude after all the rest are done.
Notes:
This question focuses on NVidia hardware with CUDA but I'd appreciate answers regarding AMD and other GPUs.
Never mind how the threads get them. Assume that they're in registers and there's no spillage, or that they're the result of some arithmetic operation done in-registers. Forget about any memory accesses to get them.
This is how I understand your problem:
You have a 32x32 matrix of int:
Val0'0,  Val1'0,  ..., Val31'0
Val0'1,  Val1'1,  ..., Val31'1
.
.
Val0'31, Val1'31, ..., Val31'31
And you need to sum each row:
Val0'0 + Val1'0 + ... Val31'0 = dest0
Val0'1 + Val1'1 + ... Val31'1 = dest1 etc.
The problem is that your row values are distributed across different threads.
For a single warp, the cleanest way to approach this would be for each thread to share its values through shared memory (a 32x32 shared memory array). After a thread sync, thread i sums the i'th row and writes the result to dest_i, which can reside in global or shared memory (depending on your application).
This way the computational work (32 rows of 31 additions each) is divided evenly between the threads in the warp AND you don't need atomic ops (performance killers) at all.
In my experience, atomic operations can and usually should be avoided by distributing the work differently between the threads.
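A minimal sketch of that shared-memory transpose idea (hypothetical kernel; assumes a single warp, and the placeholder values stand in for the real val_t,i already held in registers):
__global__ void rowSumsViaSharedMemory(int *dest)   // dest holds 32 results
{
    __shared__ int vals[32][32 + 1];    // +1 padding column avoids bank conflicts

    const int t = threadIdx.x;          // lane index, 0..31

    // Each thread publishes its 32 values; column t belongs to thread t.
    for (int i = 0; i < 32; ++i)
        vals[i][t] = t * 32 + i;        // placeholder for the real val_t,i
    __syncthreads();

    // Thread t now owns row t and sums it -- no atomics needed.
    int sum = 0;
    for (int j = 0; j < 32; ++j)
        sum += vals[t][j];
    dest[t] = sum;
}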

3 questions about alignment

The discussion is restricted to compute capability 2.x
Question 1
The size of a curandState is 48 bytes (measured by sizeof()). When an array of curandStates is allocated, is each element somehow padded (for example, to 64 bytes)? Or are they just placed contiguously in memory?
Question 2
The OP of Passing structs to CUDA kernels states that "the align part was unnecessary". But without alignment, access to that structure will be divided into two consecutive accesses to a and b. Right?
Question 3
struct Position
{
    double x, y, z;
};
Suppose each thread is accessing the structure above:
int globalThreadID = blockIdx.x * blockDim.x + threadIdx.x;
Position positionRegister = positionGlobal[globalThreadID];
To optimize memory access, should I simply use three separate double variables x, y, z to replace the structure?
Thanks for your time!
(1) They are placed contiguously in memory.
(2) If the array is in global memory, each memory transaction is 128 bytes, aligned to 128 bytes. You get two transactions only if a and b happen to span a 128-byte boundary.
(3) Performance can often be improved by using a struct of arrays instead of an array of structs. This just means that you pack all your x values together in one array, then all the y values, and so on. This makes sense when you look at what happens when all 32 threads in a warp reach the point where, for instance, x is needed: by having all the values packed together, all the threads in the warp can be serviced with as few transactions as possible. Since a global memory transaction is 128 bytes, a single transaction can service all the threads if the value is a 32-bit word. The code example you gave might cause the compiler to keep the values in registers until they are needed.
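For illustration, a sketch of the struct-of-arrays idea applied to the Position example (the names PositionsSoA and readPositionsSoA are hypothetical):
// Struct of arrays: all x values packed together, then all y, then all z.
struct PositionsSoA
{
    double *x;
    double *y;
    double *z;
};

__global__ void readPositionsSoA(PositionsSoA pos, double *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Consecutive threads read consecutive doubles: a warp's 32 x-values span
    // 256 contiguous bytes, i.e. just two 128-byte transactions per component.
    double x = pos.x[gid];
    double y = pos.y[gid];
    double z = pos.z[gid];

    out[gid] = x + y + z;   // placeholder computation
}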

How to properly coalesce reads from global memory into shared memory with elements of type short or char (assuming one thread per element)?

I have a question about coalesced global memory loads in CUDA. Currently I need to be able to execute on a CUDA device with compute capability 1.1 or 1.3.
I am writing a CUDA kernel function which reads an array of type T from global memory into shared memory, does some computation, and then will write out an array of type T back to global memory. I am using the shared memory because the computation for each output element actually depends not only on the corresponding input element, but also on the nearby input elements. I only want to load each input element once, hence I want to cache the input elements in shared memory.
My plan is to have each thread read one element into shared memory, then __syncthreads() before beginning the computation. In this scenario, each thread loads, computes, and stores one element (although the computation depends on elements loaded into shared memory by other threads).
For this question I want to focus on the read from global memory into shared memory.
Assuming that there are N elements in the array, I have configured CUDA to execute a total of N threads. For the case where sizeof(T) == 4, this should coalesce nicely according to my understanding of CUDA, since thread K will read word K (where K is the thread index).
However, in the case where sizeof(T) < 4, for example if T=unsigned char or if T=short, then I think there may be a problem. In this case, my (naive) plan is:
Compute numElementsPerWord = 4 / sizeof(T)
if (K % numElementsPerWord == 0), then have thread K read the next full 32-bit word
store the 32-bit word in shared memory
after the shared memory has been populated (and __syncthreads() called), each thread K can work on computing output element K
My concern is that it will not coalesce because (for example, in the case where T=short)
Thread 0 reads word 0 from global memory
Thread 1 does not read
Thread 2 reads word 1 from global memory
Thread 3 does not read
etc...
In other words, thread K reads word K / numElementsPerWord. This would seem to not coalesce properly.
An alternative approach that I considered was:
Launch with number of threads = (N + 3) / 4, such that each thread will be responsible for loading and processing (4/sizeof(T)) elements (each thread processes one 32-bit word - possibly 1, 2, or 4 elements depending on sizeof(T)). However I am concerned that this approach will not be as fast as possible since each thread must then do twice (if T=short) or even quadruple (if T=unsigned char) the amount of processing.
Can someone please tell me if my assumption about my plan is correct: i.e.: it will not coalesce properly?
Can you please comment on my alternative approach?
Can you recommend a more optimal approach that properly coalesces?
You are correct, you have to do loads of at least 32 bits in size to get coalescing, and the scheme you describe (having every other thread do a load) will not coalesce. Just shift the offset right by 2 bits and have each thread do a contiguous 32-bit load, and use conditional code to inhibit execution for threads that would operate on out-of-range addresses.
Since you are targeting SM 1.x, note also that (1) for coalescing to happen, the address accessed by thread 0 of a given warp (a group of 32 threads) must be 64-, 128- or 256-byte aligned for 4-, 8- and 16-byte operands, respectively, and (2) once your data is in shared memory, you may want to unroll your loop by 2x (for short) or 4x (for char) so adjacent threads reference adjacent 32-bit words, to avoid shared memory bank conflicts.
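A minimal sketch of that scheme for T = short (hypothetical kernel; assumes a single block of N threads, N even, a 4-byte-aligned input pointer, and N*sizeof(short) bytes of dynamic shared memory passed at launch):
__global__ void loadShortsCoalesced(const short *in, short *out, int N)
{
    extern __shared__ short s[];        // N shorts, allocated at launch time

    const int tid      = threadIdx.x;   // one thread per element
    const int numWords = N / 2;         // two shorts per 32-bit word

    // Only the first N/2 threads issue loads, but those loads hit consecutive
    // 32-bit words, so each half-warp's accesses coalesce on SM 1.x.
    if (tid < numWords)
        ((int *)s)[tid] = ((const int *)in)[tid];
    __syncthreads();

    // Every thread now works on its own element from shared memory.
    if (tid < N)
        out[tid] = s[tid] + 1;          // placeholder computation
}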

CUDA finding the max value in given array

I tried to develop a small CUDA program to find the max value in a given array:
int input_data[0...50] = 1,2,3,4,5....,50
max_value is initialized with the first value, input_data[0].
The final answer is stored in result[0].
The kernel is giving 0 as the max value. I don't know what the problem is.
I executed it with 1 block of 50 threads.
__device__ int lock = 0;

__global__ void max(float *input_data, float *result)
{
    float max_value = input_data[0];
    int tid = threadIdx.x;
    if (input_data[tid] > max_value)
    {
        do {} while (atomicCAS(&lock, 0, 1));
        max_value = input_data[tid];
        __threadfence();
        lock = 0;
    }
    __syncthreads();
    result[0] = max_value;   // final result of max value
}
Even though there are in-built functions, just I am practicing small problems.
You are trying to set up a "critical section", but this approach on CUDA can lead to a hang of your whole program - try to avoid it whenever possible.
Why does your code hang?
Your kernel (__global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute in lockstep. So the warp will spin in your do{} while(atomicCAS(&lock,0,1)) until every one of its threads has obtained the lock; but since the lock can be held by only one thread at a time, the thread that holds it never advances to release it while its warp-mates keep spinning. This leads to a hang.
Alternative solution
What you need is a "parallel reduction algorithm". You can start reading here:
Parallel prefix sum (Wikipedia)
Parallel Reduction (CUDA website)
NVIDIA's Guide to Reduction
Your code has a potential race. I'm not sure whether you defined the max_value variable in shared memory or not, but both ways are wrong.
1) If max_value is just a local variable, then each thread holds a local copy of it, which is not the actual maximum value (it is just the maximum of input_data[0] and input_data[tid]). In the last line of code, all threads write their own max_value to result[0], which results in undefined behavior.
2) If max_value is a shared variable, 49 threads will enter the if-statement block, and they will try to update max_value one at a time using the lock. But the order of execution among the 49 threads is not defined, so some threads may overwrite the actual maximum value with smaller values. You would need to compare against the maximum value again within the critical section.
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
There are different access scopes for variables. When you declare a variable with __device__, it is placed in GPU global memory and is accessible by all threads in the grid. __shared__ places the variable in the block's shared memory, accessible only by the threads of that block. Finally, if you use no qualifier at all, as with float max_value, the variable is placed in thread registers (or local memory) and can be accessed only by that thread. In your code, each thread has its own local max_value and knows nothing about the copies in other threads.
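For reference, a minimal sketch of the single-block reduction approach the answers recommend (hypothetical kernel, not a fix of the code above; assumes a power-of-two block size of at most 64 threads):
#include <cfloat>   // FLT_MAX

__global__ void blockMax(const float *input, float *result, int n)
{
    __shared__ float smax[64];                       // one slot per thread (blockDim.x <= 64 assumed)

    int tid = threadIdx.x;
    smax[tid] = (tid < n) ? input[tid] : -FLT_MAX;   // pad missing elements with -FLT_MAX
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            smax[tid] = fmaxf(smax[tid], smax[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        result[0] = smax[0];                         // block-wide maximum
}
Launched as blockMax<<<1, 64>>>(input, result, 50) for the 50-element example, every thread participates in the reduction and no locks or atomics are needed.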