Bank conflicts in compute capability 2.x devices - CUDA

What is a bank conflict in devices with compute capability 2.x? As I understand the CUDA C Programming Guide, on 2.x devices, if two threads access the same 32-bit word in the same shared memory bank, it does not cause a bank conflict; instead, the word is broadcast. When two threads write to the same 32-bit word in the same shared memory bank, only one thread's write succeeds.
Since on-chip memory is 64 KB (48 KB for shared memory and 16 KB for L1, or vice versa), and it is organized in 32 banks, I am assuming that each bank consists of 2 KB. So I think that bank conflicts will arise if two threads access two different 32-bit words in the same shared memory bank. Is this correct?

Your description is correct. There are many access patterns that can generate bank conflicts, but here's a simple and common example: strided access.
__shared__ int smem[512];
int tid = threadIdx.x;
int x = smem[tid * 2]; // 2-way bank conflicts
int y = smem[tid * 4]; // 4-way bank conflicts
int z = smem[tid * 8]; // 8-way bank conflicts
// etc.
Bank ID = index % 32, so if you look at the pattern of addresses in the x, y, and z accesses, you can see that in each warp of 32 threads, for x, 2 threads will access each bank, for y, 4 threads will access each bank, and for z, 8 threads will access each bank.
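To see where those counts come from, here is a small host-side sketch (not from the original answer; the function name is made up) that computes the worst-case number of threads of a 32-thread warp hitting the same bank for a given element stride:

#include <stdio.h>

/* Sketch: for a given element stride into an array of 32-bit words,
   count the worst-case number of threads in a 32-thread warp that
   map to the same shared memory bank (bank = word index % 32). */
int conflict_degree(int stride)
{
    int hits[32] = {0};
    for (int tid = 0; tid < 32; tid++)
        hits[(tid * stride) % 32]++;
    int worst = 0;
    for (int b = 0; b < 32; b++)
        if (hits[b] > worst) worst = hits[b];
    return worst;   /* 1 = conflict-free, k = k-way conflict */
}

int main(void)
{
    for (int s = 1; s <= 8; s *= 2)
        printf("stride %d -> %d-way\n", s, conflict_degree(s));
    return 0;
}

Running it prints 1-way for stride 1 and 2-, 4- and 8-way for strides 2, 4 and 8, matching the comments in the snippet above.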

Related

CUDA: Why is striding faster than contiguous memory access? [duplicate]

What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced?
Which is correct and why?
It's likely that this information applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access and in fact "coalesced global loads" are not even profiled for these chips.
Also, this logic can be applied to shared memory to avoid bank conflicts.
A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is an oversimplification, but the correct way to do it is just to have consecutive threads access consecutive memory addresses.
So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below
0 1 2 3
4 5 6 7
8 9 a b
could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)
0 1 2 3 4 5 6 7 8 9 a b
Suppose you need to access each element once, and say you have four threads. Which threads will be used for which elements? Probably either
thread 0: 0, 1, 2
thread 1: 3, 4, 5
thread 2: 6, 7, 8
thread 3: 9, a, b
or
thread 0: 0, 4, 8
thread 1: 1, 5, 9
thread 2: 2, 6, a
thread 3: 3, 7, b
Which is better? Which will result in coalesced reads, and which will not?
Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. The second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
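To make the two options concrete, here is a minimal sketch (the kernel names and the sum operation are made up for illustration) for the 3x4 row-major matrix above, launched with a single block of 4 and 3 threads respectively:

// Sketch: 3 rows x 4 columns stored row-major, one thread per column.
// At each loop iteration the 4 threads read 4 consecutive words: coalesced.
__global__ void sum_columns_coalesced(const float *mat, float *colsum)
{
    int c = threadIdx.x;               // threads 0..3 -> columns 0..3
    float s = 0.0f;
    for (int r = 0; r < 3; r++)
        s += mat[r * 4 + c];
    colsum[c] = s;
}

// For contrast, one thread per row: at each iteration the 3 threads
// read words that are 4 elements apart, so the reads are not coalesced.
__global__ void sum_rows_uncoalesced(const float *mat, float *rowsum)
{
    int r = threadIdx.x;               // threads 0..2 -> rows 0..2
    float s = 0.0f;
    for (int c = 0; c < 4; c++)
        s += mat[r * 4 + c];
    rowsum[r] = s;
}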
The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.
Memory coalescing is a technique which allows optimal usage of the global memory bandwidth.
That is, when parallel threads running the same instruction access consecutive locations in global memory, the most favorable access pattern is achieved.
The example in the figure above helps explain the coalesced arrangement:
In Fig. (a), n vectors of length m are stored in a linear fashion. Component i of vector j is denoted by v_i^j. Each thread in the GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks, and every thread has a unique global id which can be defined as indx = bd * bx + tx, where bd represents the block dimension, bx denotes the block index and tx is the thread index within each block.
Vertical arrows demonstrate the case in which parallel threads access the first components of each vector, i.e. addresses 0, m, 2m, ... of the memory. As shown in Fig. (a), in this case the memory accesses are not consecutive. By zeroing the gap between these addresses (red arrows shown in the figure), the memory access becomes coalesced.
However, the problem gets slightly tricky here, since the allowed number of resident threads per GPU block is limited to bd. Therefore the coalesced data arrangement can be done by storing the first elements of the first bd vectors in consecutive order, followed by the first elements of the second bd vectors, and so on. The rest of the vector elements are stored in a similar fashion, as shown in Fig. (b). If n (the number of vectors) is not a multiple of bd, the remaining data in the last block needs to be padded with some trivial value, e.g. 0.
In the linear data storage of Fig. (a), component i (0 ≤ i < m) of vector indx (0 ≤ indx < n) is addressed by m × indx + i; the same component in the coalesced storage pattern of Fig. (b) is addressed as
(m × bd) × ixC + bd × ixB + ixA,
where ixC = floor[(m × indx + i) / (m × bd)] = bx, ixB = i and ixA = mod(indx, bd) = tx.
In summary, in the example of storing a number of vectors of size m, linear indexing maps to coalesced indexing according to:
m × indx + i  →  m × bd × bx + i × bd + tx
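As a rough sketch of how a kernel could address data stored in this coalesced layout (the kernel name and the summing operation are illustrative, using the bd/bx/tx notation above and assuming n is a multiple of bd):

// Sketch: each thread processes one m-length vector stored in the coalesced
// layout, where component i of the thread's vector lives at m*bd*bx + i*bd + tx.
__global__ void sum_vectors_coalesced(const float *data, float *out, int m)
{
    int bd = blockDim.x;        // block dimension
    int bx = blockIdx.x;        // block index
    int tx = threadIdx.x;       // thread index within the block
    int indx = bd * bx + tx;    // global vector index

    float acc = 0.0f;
    for (int i = 0; i < m; i++)
        acc += data[m * bd * bx + i * bd + tx];  // consecutive tx -> consecutive addresses
    out[indx] = acc;
}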
This data rearrangement can lead to significantly higher effective bandwidth of the GPU global memory.
source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).
If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request (i.e., coalesced) by the hardware. In the matrix example, the elements of a row are arranged linearly in memory, followed by the next row, and so on.
For example, for a 2x2 matrix and 2 threads in a block, the memory locations are arranged as:
(0,0) (0,1) (1,0) (1,1)
In row access (each thread processing one row), at a given instruction thread 0 accesses (0,0) and thread 1 accesses (1,0); these are not adjacent in memory, so the accesses cannot be coalesced.
In column access (each thread processing one column), thread 0 accesses (0,0) and thread 1 accesses (0,1); these are adjacent in memory, so the accesses can be coalesced.
The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: the threads in the warp must access memory in sequence, and the words being accessed should be >= 32 bits. Additionally, the base address accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64- and 128-bit accesses, respectively.
Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.
Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)

Does the shader unit calculate the exponent?

http://us.hardware.info/reviews/5419/nvidia-geforce-gtx-titan-z-sli-review-incl-tones-tizair-system
says that "GTX Titan-Z" has 5760 Shader units. Also here is written that "GTX Titan-Z" has 2x GK110 GPU.
CUDA exp() expf() and __expf() mentions that it is possible to calculate the exponent in CUDA.
Let's say I have an array of 500 000 000 (five hundred million) doubles. I want to calculate the exponent of each value in the array. Who knows what to expect: will the 5760 shader units be able to calculate exp, or can this task be done only by the two GK110 GPUs? The difference in performance is drastic, so I need to be sure that if I rewrite my app with CUDA, it will not work slower.
In other words, can I make 5760 threads calculate 500 000 000 exponents?
GTX Titan Z is a dual GPU device. Each of the two GK110 GPUs on the card is attached via a 384-bit memory interface to its own 6 GB of high-speed memory. The theoretical bandwidth of each memory is 336 GB/sec. The particular GK110 variant used in the GTX Titan Z is comprised of fifteen clusters of execution units called SMX. Each SMX in turn is comprised of 192 single-precision floating-point units, 64 double-precision floating point units, and various other units.
Each double-precision unit in GK110 can execute one FMA (fused multiply-add), or one FMUL, or one FADD per clock cycle. At a base clock of 705 MHz, the maximum total number of DP operations that can be executed by each of the GK110 GPUs on Titan Z per second is therefore 705e6 * 15 * 64 = 676.8e9. Assuming all operations are FMAs, that equates to 1.3536 double-precision TFLOPS. Since the card uses two GPUs, the total DP performance of a GTX Titan Z is thus 2.7072 TFLOPS.
Like CPUs, GPUs provide general-purpose computation via various integer and floating-point units. GPUs also provide special function units (called MUFU = multifunction unit on GK110) that can compute rough single-precision approximations to some frequently used functions such as reciprocal, reciprocal square root, sine, cosine, exponential base 2, and logarithm base 2. As far as exponentiation is concerned, the standard single-precision math function exp2f() is the only function that maps more or less directly to a MUFU instruction (MUFU.EX2). Depending on the compilation mode, there is a thin wrapper around this hardware instruction since the hardware does not support denormal operands in the special function units.
All other exponentiation in CUDA is performed via software subroutines. The standard single-precision function expf() is a fairly heavy-weight wrapper around the hardware's exp2 capability. The double-precision exp() function is a pure software routine based on minimax polynomial approximation. The complete source code for it is visible in the CUDA header file math_functions_dbl_ptx3.h (in CUDA 6.5, the DP exp() code starts at line 1706 in that file). As you can see, the computation involves primarily double-precision floating-point operations, as well as integer and some single-precision floating-point operations. You can also look at the machine code by disassembling a binary executable that calls exp() with cuobjdump --dump-sass.
In terms of performance, in CUDA 6.5 the double precision exp() function has a throughput on the order of 25e9 function calls per second on a Tesla K20 (1.170 DP TFLOPS). Since each call to DP exp() consumes an 8-byte source operand and produces an 8-byte result, this equates to roughly 400 GB/sec of memory bandwidth. Since each GK110 on a Titan Z provides about 15% more performance than the GK110 on a Tesla K20, the throughput and bandwidth requirements increase accordingly. Since the required bandwidth exceeds the theoretical memory bandwidth of the GPU, code that simply applies DP exp() to an array will be completely bound by memory bandwidth.
The number of functional units in the GPU and the number of threads executing have no relationship to the number of array elements that can be processed, but they can have an impact on the performance of such processing. The mapping of array elements to threads can be freely chosen by the programmer. The number of array elements that can be processed in one go is a function of the size of the GPU's memory. Note that not all of the raw memory on the device is available for user code, as the CUDA software stack needs some memory for its own use, typically around 100 MB or so. An exemplary mapping for applying DP exp() to an array is shown in this code snippet:
__global__ void exp_kernel (const double * __restrict__ src,
                            double * __restrict__ dst, int len)
{
    // grid-stride loop: each thread handles multiple array elements
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = exp (src[i]);
    }
}

#define ARRAY_LENGTH      (500000000)
#define THREADS_PER_BLOCK (256)

int main (void) {
    // ...
    int len = ARRAY_LENGTH;
    dim3 dimBlock(THREADS_PER_BLOCK);
    int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    double *d_a = 0, *d_b = 0;
    cudaMalloc((void**)&d_a, sizeof(d_a[0]) * len);
    cudaMalloc((void**)&d_b, sizeof(d_b[0]) * len);
    // ...
    exp_kernel<<<dimGrid,dimBlock>>>(d_a, d_b, len);
    // ...
}

CUDA shared memory broadcast and __syncthreads behavior

I am experiencing a strange issue, well at least to me it looks strange, and I was hoping someone might be able to shed some light on it. I have a CUDA kernel which relies on shared memory for fast local accesses. To the limits of my knowledge, if all the threads within a half-warp access the same shared memory bank then the value will be broadcast to the threads in the warp. Also, accesses from multiple warps to the same bank do not cause bank conflicts; they will just be serialized. Keeping this in mind, I have created a small kernel to test this out (after encountering issues in my original kernel). Here's the snippet:
#define NUM_VALUES 16
#define NUM_LOOPS 1024

__global__ void shared_memory_test(float *output)
{
    // Create some shared memory
    __shared__ int dm_delays[NUM_VALUES];

    // Loop over NUM_LOOPS
    float accumulator = 0;
    for(unsigned c = 0; c < NUM_LOOPS; c++)
    {
        // Force shared memory update
        for(int d = threadIdx.x; d < NUM_VALUES; d++)
            dm_delays[d] = c * d;

        // __syncthreads();

        for(int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
    }

    // Store accumulated value to global memory
    for(unsigned d = 0; d < NUM_VALUES; d++)
        output[d] = accumulator;
}
I've run this with a block dimension of 16 (half a warp, not terribly efficient but it's just for testing purposes). All the threads should be addressing the same shared memory bank, so there should be no conflicts. However, the opposite seems to be true. I'm using Parallel Nsight on Visual Studio 2010 for this testing.
What is even more mysterious to me is the fact that if I uncomment the __syncthreads call in the outer loop then the number of bank conflicts increases dramatically.
Just some number to give you an idea (this is for a grid containing one block with 16 threads, so a single half-warp, NUM_VALUES = 16, NUM_LOOPS = 1024):
without __syncthreads: 4 bank conflicts
with __syncthreads : 4,096 bank conflicts
I'm running this on a GTX 670, compiled for compute capability 3.0.
Thank you in advance
UPDATE: It was pointed out that without __syncthreads the NUM_LOOPS reads in the outer loop were being optimised away by the compiler since the values of dm_delays never change. Now I get a constant 4,096 bank conflicts in both cases, which still doesn't play well with the broadcast behavior for shared memory.
Since the value of dm_delays does not change, this may be a case where the compiler optimizes away the 1024 reads to shared memory if the __syncthreads is not present. With the __syncthreads there, it may assume that the value could be changed by another thread, and so it reads the value over and over again.
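One way to test this explanation (a sketch, not part of the original exchange) is to declare the shared array volatile, which forces the compiler to issue every shared-memory load instead of optimizing the reads away, with or without __syncthreads:

// Sketch: same test kernel, with the shared array declared volatile so the
// compiler must perform all NUM_LOOPS * NUM_VALUES shared-memory reads.
// NUM_VALUES and NUM_LOOPS are the same macros as in the question.
__global__ void shared_memory_test_volatile(float *output)
{
    __shared__ volatile int dm_delays[NUM_VALUES];

    float accumulator = 0;
    for (unsigned c = 0; c < NUM_LOOPS; c++)
    {
        for (int d = threadIdx.x; d < NUM_VALUES; d++)
            dm_delays[d] = c * d;

        for (int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
    }

    for (unsigned d = 0; d < NUM_VALUES; d++)
        output[d] = accumulator;
}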

CUDA best memory access layouts: global memory coalescence and shared memory bank conflicts

I have some doubts regarding the most convenient global and shared memory access layouts in CUDA.
GLOBAL MEMORY
1) How are the memory addresses (0,0), (0,1), (1,0) and (1,1) arranged in CPU memory and GPU memory? In other words, in what order are they stored?
2) Which is the row index and which the column index in (m, n) ?
3) Is global memory coalescence achieved by accessing elements in column major order or row major order ?
SHARED MEMORY
1) How do bank conflicts arise or not arise? Please let me know using examples/cases.
2) What is the command to configure the shared memory / L1 split out of the total 64 KB, and where should that command be placed?
Much of your question has already been answered in the comments above. I just want to provide some rules that can be useful to you, and in general to future readers, concerning coalesced memory accesses, some examples on shared memory bank conflicts, and some rules on avoiding shared memory bank conflicts.
COALESCED MEMORY ACCESSES
1D array - 1D thread grid
gmem[blockDim.x * blockIdx.x + threadIdx.x]
2D array - 2D thread grid
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
gmem[y][x] or gmem[y * elementPitch + x]
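A minimal sketch of the 2D pattern in a complete kernel (the kernel name and the scaling operation are made up for illustration; it assumes the grid exactly covers the array):

// Sketch: 2D thread grid over a row-major 2D array. Within a warp,
// consecutive threadIdx.x values access consecutive addresses: coalesced.
__global__ void scale2D(float *gmem, float a)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    int elementPitch = blockDim.x * gridDim.x;       // elements per row

    gmem[y * elementPitch + x] = a * gmem[y * elementPitch + x];
}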
SHARED MEMORY BANK CONFLICTS
To achieve high bandwidth, shared memory is divided into independent banks. In this way, shared memory can serve simultaneous accesses by the threads. Each Streaming Multiprocessor (SM) has shared memory organized in 32 memory banks. Each bank has a bandwidth of 32 bits per two clock cycles and hosts words of four bytes (32 bits): successive 32-bit word addresses are assigned to successive banks.
A bank conflict occurs when two different threads of the same warp access different words in the same bank. Bank conflicts adversely impact performance since they force the hardware to serialize the accesses to shared memory. Note that there is no conflict if different threads access any bytes within the same word. Note also that there are no bank conflicts between threads belonging to different warps.
Fast accesses
If all threads of a warp access different banks, there is no bank conflict;
If all threads of a warp access an identical address for a fetch operation, there is no bank conflict (broadcast).
Slow accesses
32 threads access 32 different words in the same bank, so that all the accesses are serialized;
Generally speaking, the cost of accessing shared memory is proportional to the maximum number of simultaneous accesses to a single bank.
Example 1
smem[4]: accesses bank #4 (physically, the fifth one – first row)
smem[31]: accesses bank #31 (physically, the last one – first row)
smem[50]: accesses bank #18 (physically, the 19th one – second row)
smem[128]: accesses bank #0 (physically, the first one – fifth row)
smem[178]: accesses bank #18 (physically, the 19th one – sixth row)
If the third thread in a warp accesses smem[50] and the eighth thread in the warp accesses smem[178], then you have a two-way bank conflict and the two transactions are serialized.
Example 2
Consider the following type of accesses
__shared__ float smem[256];
smem[b + s * threadIdx.x]
To have a bank conflict between two threads t1 and t2 of the same warp, the following conditions must hold
b + s * t2 = b + s * t1 + 32 * k, with k positive integer
0 <= t2 - t1 < 32
The above means
32 * k = s * (t2 - t1)
0 <= t2 - t1 < 32
These two conditions cannot both be satisfied for two distinct threads t1 and t2 (that is, no bank conflict arises) if s is odd.
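A common application of this rule (a sketch, not part of the original answer) is padding a shared-memory tile by one element per row, so that column-wise accesses have an odd stride of 33 instead of 32. The classic case is a tiled matrix transpose; the names here are illustrative and the input is assumed to be a square matrix whose side is a multiple of 32, processed by 32x32 thread blocks:

#define TILE 32

// Sketch: the +1 padding makes the stride between consecutive threads
// reading down a tile column equal to 33 (odd), so the 32 threads of a
// warp hit 32 different banks and no conflict occurs.
__global__ void transpose_padded(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];    // one extra column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;      // transposed block offset
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}

Without the +1 padding, the read tile[threadIdx.x][threadIdx.y] would have stride s = 32, and every thread of the warp would hit the same bank (a 32-way conflict).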
Example 3
From Example 2, the following access
smem[b + threadIdx.x]
leads to no conflicts if smem is of a 32-bit data type. But also
extern __shared__ char smem[];
foo = smem[baseIndex + threadIdx.x];
and
extern __shared__ short smem[];
foo = smem[baseIndex + threadIdx.x];
lead to no bank conflicts either, since each thread accesses a single byte or a two-byte half of a 32-bit word, and different threads accessing different bytes of the same word do not conflict.

What is a bank conflict? (Doing Cuda/OpenCL programming)

I have been reading the programming guide for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. Can anybody help me understand it? I have no preference if the help is in the context of CUDA/OpenCL or just bank conflicts in general in computer science.
For NVIDIA (and AMD, for that matter) GPUs the local memory is divided into memory banks. Each bank can only address one dataset at a time, so if a half-warp tries to load/store data from/to the same bank the access has to be serialized (this is a bank conflict). For GT200 GPUs there are 16 banks (32 banks for Fermi); AMD GPUs have 16 or 32 banks (57xx or higher: 32, everything below: 16). The banks are interleaved with a granularity of 32 bits (so bytes 0-3 are in bank 1, 4-7 in bank 2, ..., 64-67 in bank 1 again, and so on). For a better visualization it basically looks like this:
Bank | 1 | 2 | 3 |...
Address | 0 1 2 3 | 4 5 6 7 | 8 9 10 11 |...
Address | 64 65 66 67 | 68 69 70 71 | 72 73 74 75 |...
...
So if each thread in a half-warp accesses successive 32-bit values there are no bank conflicts. An exception from this rule (every thread must access its own bank) are broadcasts:
If all threads access the same address, the value is only read once and broadcast to all threads (for GT200 it has to be all threads in the half-warp accessing the same address; iirc Fermi and AMD GPUs can do this for any number of threads accessing the same value).
The shared memory that can be accessed in parallel is divided into modules (also called banks). If two requested memory locations (addresses) fall in the same bank, you get a bank conflict, during which the access is done serially, losing the advantages of parallel access.
In simple words, a bank conflict is a case where a memory access pattern fails to distribute I/O across the banks available in the memory system. The following example elaborates on the concept:
Let us suppose we have a two-dimensional 512x512 array of integers and our DRAM or memory system has 512 banks in it. By default the array data will be laid out so that arr[0][0] goes to bank 0, arr[0][1] goes to bank 1, arr[0][2] to bank 2, ..., arr[0][511] goes to bank 511. To generalize, arr[x][y] occupies bank number y. Now if some code (as shown below) starts accessing data in column-major fashion, i.e. changing x while keeping y constant, then the end result will be that all consecutive memory accesses hit the same bank, hence a bank conflict.
int arr[512][512];

for (int j = 0; j < 512; j++)            // outer loop over columns
    for (int i = 0; i < 512; i++)        // inner loop over rows
        arr[i][j] = 2 * arr[i][j];       // column-major processing
Such problems are usually avoided by compilers by buffering the array or by using a prime number of elements in the array.
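In this particular example, a simpler fix (a sketch of the same loop with the order swapped) is to traverse the array in row-major order, so that consecutive accesses fall in consecutive banks:

int arr[512][512];

for (int i = 0; i < 512; i++)            // outer loop over rows
    for (int j = 0; j < 512; j++)        // inner loop over columns
        arr[i][j] = 2 * arr[i][j];       // row-major processing: consecutive banks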
(CUDA Bank Conflict)
I hope this will help..
This is a very good explanation:
http://www.youtube.com/watch?v=CZgM3DEBplE
http://en.wikipedia.org/wiki/Memory_bank
and
http://mprc.pku.cn/mentors/training/ISCAreading/1989/p380-weiss/p380-weiss.pdf
From this page, you can find details about memory banks.
But it is a little different from what is said by @Grizzly.
On that page, the banks are laid out like this:
bank    |    1       |    2       |    3       | ...
address | 0, 3, 6... | 1, 4, 7... | 2, 5, 8... | ...
Hope this helps.