How to properly coalesce reads from global memory into shared memory with elements of type short or char (assuming one thread per element)? - cuda

I have a questions about coalesced global memory loads in CUDA. Currently I need to be able to execute on a CUDA device with compute capability CUDA 1.1 or 1.3.
I am writing a CUDA kernel function which reads an array of type T from global memory into shared memory, does some computation, and then will write out an array of type T back to global memory. I am using the shared memory because the computation for each output element actually depends not only on the corresponding input element, but also on the nearby input elements. I only want to load each input element once, hence I want to cache the input elements in shared memory.
My plan is to have each thread read one element into shared memory, then __syncthreads() before beginning the computation. In this scenario, each thread loads, computes, and stores one element (although the computation depends on elements loaded into shared memory by other threads).
For this question I want to focus on the read from global memory into shared memory.
Assuming that there are N elements in the array, I have configured CUDA to execute a total of N threads. For the case where sizeof(T) == 4, this should coalesce nicely according to my understanding of CUDA, since thread K will read word K (where K is the thread index).
However, in the case where sizeof(T) < 4, for example if T=unsigned char or if T=short, then I think there may be a problem. In this case, my (naive) plan is:
Compute numElementsPerWord = 4 / sizeof(T)
if(K % numElementsPerWord == 0), then read have thread K read the next full 32-bit word
store the 32 bit word in shared memory
after the shared memory has been populated, (and __syncthreads() called) then each thread K can process work on computing output element K
My concern is that it will not coalesce because (for example, in the case where T=short)
Thread 0 reads word 0 from global memory
Thread 1 does not read
Thread 2 reads word 1 from global memory
Thread 3 does not read
etc...
In other words, thread K reads word (K/sizeof(T)). This would seem to not coalesce properly.
An alternative approach that I considered was:
Launch with number of threads = (N + 3) / 4, such that each thread will be responsible for loading and processing (4/sizeof(T)) elements (each thread processes one 32-bit word - possibly 1, 2, or 4 elements depending on sizeof(T)). However I am concerned that this approach will not be as fast as possible since each thread must then do twice (if T=short) or even quadruple (if T=unsigned char) the amount of processing.
Can someone please tell me if my assumption about my plan is correct: i.e.: it will not coalesce properly?
Can you please comment on my alternative approach?
Can you recommend a more optimal approach that properly coalesces?

You are correct, you have to do loads of at least 32 bits in size to get coalescing, and the scheme you describe (having every other thread do a load) will not coalesce. Just shift the offset right by 2 bits and have each thread do a contiguous 32-bit load, and use conditional code to inhibit execution for threads that would operate on out-of-range addresses.
Since you are targeting SM 1.x, note also that 1) in order for coalescing to happen, thread 0 of a given warp (collections of 32 threads) must be 64-, 128- or 256-byte aligned for 4-, 8- and 16-byte operands, respectively, and 2) once your data is in shared memory, you may want to unroll your loop by 2x (for short) or 4x (for char) so adjacent threads reference adjacent 32-bit words, to avoid shared memory bank conflicts.

Related

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for(int i = 0; i < big_number; i++) {
auto thread_idx = determine_thread_index_into_its_own_array();
T value = calculate_value();
write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is is easy-peasy: Just place all of thread i's data in shared memory addresses i, 32+i, 64+i, 96+i etc. This puts all of i's data in the same bank, that's also distinct from the other lane's banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler micro-architectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value (as a single transaction provides 32 bits to each lane at most).
This is is achievable in practice by the analogous placement pattern OP described for 32-bit data. That is, for T* arr, have lane i read the idx'th element as T[idx + i * 32]. This will compile so that two transactions occur:
The lower 16 lanes obtain their data from the first 32*4 bytes in T (utilizing all banks)
The higher 16 obtain their data from the successive 32*4 bytes in T (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately. That means it can do better than the simplistic "break up T into halves" idea the earlier answer proposed.
(This answer is based on #RobertCrovella's comments.)
On Kepler GPUs, this had a simple solution: Just change the bank size! Kepler supported setting the shared memory bank size to 8 instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N values, each consecutive pair being the low and the high 32-bits of a T.
To access a 64-bit values, 2 half-T accesses are made, and the T is composed with something like `
uint64_t joined =
reinterpret_cast<uint32_t&>(&upper_half) << 32 +
reinterpret_cast<uint32_t&>(&lower_half);
auto& my_t_value = reinterpret_cast<T&>(&joined);
and the same in reverse when writing.
As comments suggest, it is better to make 64-bit access, as described in this answer.

CUDA: Why is striding faster than contiguous memory access? [duplicate]

What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced?
Which is correct and why?
It's likely that this information applies only to compute capabality 1.x, or cuda 2.0. More recent architectures and cuda 3.0 have more sophisticated global memory access and in fact "coalesced global loads" are not even profiled for these chips.
Also, this logic can be applied to shared memory to avoid bank conflicts.
A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is oversimple, but the correct way to do it is just have consecutive threads access consecutive memory addresses.
So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below
0 1 2 3
4 5 6 7
8 9 a b
could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)
0 1 2 3 4 5 6 7 8 9 a b
Suppose you need to access element once, and say you have four threads. Which threads will be used for which element? Probably either
thread 0: 0, 1, 2
thread 1: 3, 4, 5
thread 2: 6, 7, 8
thread 3: 9, a, b
or
thread 0: 0, 4, 8
thread 1: 1, 5, 9
thread 2: 2, 6, a
thread 3: 3, 7, b
Which is better? Which will result in coalesced reads, and which will not?
Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. The second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.
Memory coalescing is a technique which allows optimal usage of the global memory bandwidth.
That is, when parallel threads running the same instruction access to consecutive locations in the global memory, the most favorable access pattern is achieved.
The example in Figure above helps explain the coalesced arrangement:
In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v j i. Each thread in GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks and every thread in GPU has a unique id which can be defined as indx=bd*bx+tx, where bd represents block dimension, bx denotes the block index and tx is the thread index in each block.
Vertical arrows demonstrate the case that parallel threads access to the first components of each vector, i.e. addresses 0, m, 2m... of the memory. As shown in Fig. (a), in this case the memory access is not consecutive. By zeroing the gap between these addresses (red arrows shown in figure above), the memory access becomes coalesced.
However, the problem gets slightly tricky here, since the allowed size of residing threads per GPU block is limited to bd. Therefore coalesced data arrangement can be done by storing the first elements of the first bd vectors in consecutive order, followed by first elements of the second bd vectors and so on. The rest of vectors elements are stored in a similar fashion, as shown in Fig. (b). If n (number of vectors) is not a factor of bd, it is needed to pad the remaining data in the last block with some trivial value, e.g. 0.
In the linear data storage in Fig. (a), component i (0 ≤ i < m) of vector indx
(0 ≤ indx < n) is addressed by m × indx +i; the same component in the coalesced
storage pattern in Fig. (b) is addressed as
(m × bd) ixC + bd × ixB + ixA,
where ixC = floor[(m.indx + j )/(m.bd)]= bx, ixB = j and ixA = mod(indx,bd) = tx.
In summary, in the example of storing a number of vectors with size m, linear indexing is mapped to coalesced indexing according to:
m.indx +i −→ m.bd.bx +i .bd +tx
This data rearrangement can lead to a significant higher memory bandwidth of GPU global memory.
source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).
If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request(or coalesced) by the hardware. In the matrix example, matrix elements in row are arranged linearly, followed by the next row, and so on.
For e.g 2x2 matrix and 2 threads in a block, memory locations are arranged as:
(0,0) (0,1) (1,0) (1,1)
In row access, thread1 accesses (0,0) and (1,0) which cannot be coalesced.
In column access, thread1 accesses (0,0) and (0,1) which can be coalesced because they are adjacent.
The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must be accessing the memory in sequence, and the words being accessed should >=32 bits. Additionally, the base address being accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64- and 128-bit accesses, respectively.
Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.
Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)

Which atomics targetting pattern is better for a GPU warp: All-the-same or all-distinct?

Suppose we have:
A single warp (of 32 threads)
Each thread t which has 32 int values val t,0...val t,31,
Each value val t,i needs to be added, atomically, to a variable desti, which resides in (Option 1) global device memory (Option 2) shared block memory.
Which access pattern would be faster to perform these additions:
All threads atomic-add val t,1 to dest 1.
All threads atomic-add val t,2 to dest 2.
etc.
Each thread, with index t, writes val t,t to dest t
Each thread, with index t, writes val t, (t+1) mod 32 to dest (t+1) mod 32
etc.
In other words, is it faster when all threads of a warp make an atomic write in the same cycle, or is it better that no atomic writes coincide? I can think of hardware which carries out either of the options faster, I want to know what's actually implemented.
Thoughts:
It's possible for a GPU to have hardware which bunches together multiple atomic ops from the same warp to a single destination, so that they actually count as just one, or at least can be scheduled together, and thus all threads will proceed to the next instruction at the same time, not waiting for the last atomic op to conclude after all the rest are done.
Notes:
This question focuses on NVidia hardware with CUDA but I'd appreciate answers regarding AMD and other GPUs.
Never mind how the threads get them. Assume that they're in registers and there's no spillage, or that they're the result of some arithmetic operation done in-registers. Forget about any memory accesses to get them.
This is how I understand your problem:
You have a 32x32 matrix of int:
Val0'0, Val1'0,....Val31'0
Val1'0, Val1'1,....Val31'1
.
.
Val31'0, Val31'1,...,Val31'31
And you need to sum each row:
Val0'0 + Val1'0 + ... Val31'0 = dest0
Val0'1 + Val1'1 + ... Val31'1 = dest1 etc.
The problem is that your rows values are distributed between different threads.
For a single warp, the cleanest way to approach this would be for each thread to share it's values using shared memory (in a 32x32 shared memory array). After thread sync, thread i sums the i'th row, and writes the results to dest(i) which can reside in global or shared memory (depends on you application).
This way the computation work (31x31) additions are divided evenly between the threads in the warp AND you don't need atomic ops (performance killers) at all.
From my experience atomic operation usually can and should be avoided by a different distribution of the work between the threads.

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered
memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the data is observed being
written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the read instructions appear in
the program for instructions that are independent of each other
However, do we have a guarantee that the (dependent) memory operations as seen from the single thread are actually consistent? If I do - say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is what optimizations (done either by compiler or hardware) are allowed?
It could happen --- for example --- that store instruction is non-blocking and the load instruction that follows somehow is managed by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide simply stating, that you cannot predict in which order the threads is executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
Here the point being, that if you have several threads dooing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
__device__ void sum_all(float *x, float *result, int size N){
x[threadId.x] = threadId.x;
result[threadId.x] = 0;
for(int i = 0; i < N; i++)
result[threadId.x] += x[threadID.x];
}
Here we have some dumb function, which SHOULD fill a shared array (x) with the numbers from m ... n (read from one number to another number), and then sum up the numbers already put into the array and store the result in another array.
Given that you your lowest indexed thread is enumerated thread #0, you would expect that the first time your code runs this code x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
I am not sure, if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically what you are asking is, if running your code on a GFX will change the functionality of your code.
The cores of a GPU are generally the same, as that of a CPU, just with less control-arithmetics, a smaller instructionset and typically only supporting single-precision.
In a CUDA-GPU there is 1 program counter for each Warp (section of 32 synchronous cores). Like a CPU, the program counter increases by magnitude of one address element after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this can not be changed.
Branches and jumps can only be introduced by the software running on the core, and hence is determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler
So in short - Your code will always be executed in the order it is ordered in the memory, no matter if it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understood you're basically asking whether memory dependencies and alias analysis information are being respected in the CUDA compiler.
The answer to that question is, assuming that the CUDA compiler is free of bugs, yes because as Robert noted the CUDA compiler uses LLVM under the hood and two basic modules (which, at the moment, I really don't think they could be excluded by the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use live-analysis on variables (even out of the block scope) to avoid dangerous optimizations (e.g. you can't write in a live variable before its next read, the data may still be useful).
I don't know the compiler internals but assuming (as any other reasonably trusted compiler) that it will do its best to be bug-free, the analysis that take place in there should really not bother you at all and assure you that at least in theory what you just presented as an example (i.e. the dependent-load faster than the store) cannot happen.
What guarantee you that? Nothing but the fact that the company is giving a compiler to use, and there are disclaimers in case it doesn't for exceptional cases :)
Also: aside from the compiler topic, the instruction execution is also dependent on the hardware specification. In this case, a SIMT hardware instruction issuing unit
cfr. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information

In CUDA, what is memory coalescing, and how is it achieved?

What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced?
Which is correct and why?
It's likely that this information applies only to compute capabality 1.x, or cuda 2.0. More recent architectures and cuda 3.0 have more sophisticated global memory access and in fact "coalesced global loads" are not even profiled for these chips.
Also, this logic can be applied to shared memory to avoid bank conflicts.
A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is oversimple, but the correct way to do it is just have consecutive threads access consecutive memory addresses.
So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below
0 1 2 3
4 5 6 7
8 9 a b
could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)
0 1 2 3 4 5 6 7 8 9 a b
Suppose you need to access element once, and say you have four threads. Which threads will be used for which element? Probably either
thread 0: 0, 1, 2
thread 1: 3, 4, 5
thread 2: 6, 7, 8
thread 3: 9, a, b
or
thread 0: 0, 4, 8
thread 1: 1, 5, 9
thread 2: 2, 6, a
thread 3: 3, 7, b
Which is better? Which will result in coalesced reads, and which will not?
Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. The second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.
Memory coalescing is a technique which allows optimal usage of the global memory bandwidth.
That is, when parallel threads running the same instruction access to consecutive locations in the global memory, the most favorable access pattern is achieved.
The example in Figure above helps explain the coalesced arrangement:
In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v j i. Each thread in GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks and every thread in GPU has a unique id which can be defined as indx=bd*bx+tx, where bd represents block dimension, bx denotes the block index and tx is the thread index in each block.
Vertical arrows demonstrate the case that parallel threads access to the first components of each vector, i.e. addresses 0, m, 2m... of the memory. As shown in Fig. (a), in this case the memory access is not consecutive. By zeroing the gap between these addresses (red arrows shown in figure above), the memory access becomes coalesced.
However, the problem gets slightly tricky here, since the allowed size of residing threads per GPU block is limited to bd. Therefore coalesced data arrangement can be done by storing the first elements of the first bd vectors in consecutive order, followed by first elements of the second bd vectors and so on. The rest of vectors elements are stored in a similar fashion, as shown in Fig. (b). If n (number of vectors) is not a factor of bd, it is needed to pad the remaining data in the last block with some trivial value, e.g. 0.
In the linear data storage in Fig. (a), component i (0 ≤ i < m) of vector indx
(0 ≤ indx < n) is addressed by m × indx +i; the same component in the coalesced
storage pattern in Fig. (b) is addressed as
(m × bd) ixC + bd × ixB + ixA,
where ixC = floor[(m.indx + j )/(m.bd)]= bx, ixB = j and ixA = mod(indx,bd) = tx.
In summary, in the example of storing a number of vectors with size m, linear indexing is mapped to coalesced indexing according to:
m.indx +i −→ m.bd.bx +i .bd +tx
This data rearrangement can lead to a significant higher memory bandwidth of GPU global memory.
source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).
If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request(or coalesced) by the hardware. In the matrix example, matrix elements in row are arranged linearly, followed by the next row, and so on.
For e.g 2x2 matrix and 2 threads in a block, memory locations are arranged as:
(0,0) (0,1) (1,0) (1,1)
In row access, thread1 accesses (0,0) and (1,0) which cannot be coalesced.
In column access, thread1 accesses (0,0) and (0,1) which can be coalesced because they are adjacent.
The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must be accessing the memory in sequence, and the words being accessed should >=32 bits. Additionally, the base address being accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64- and 128-bit accesses, respectively.
Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.
Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)