What is a bank conflict? (Doing CUDA/OpenCL programming)

I have been reading the programming guide for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. Can anybody help me understand it? I have no preference if the help is in the context of CUDA/OpenCL or just bank conflicts in general in computer science.

For NVIDIA (and AMD, for that matter) GPUs the local memory is divided into memory banks. Each bank can only serve one address at a time, so if a half-warp tries to load/store data from/to the same bank, the accesses have to be serialized (this is a bank conflict). For GT200 GPUs there are 16 banks (32 banks for Fermi); AMD GPUs have 16 or 32 banks (57xx or higher: 32, everything below: 16). The banks are interleaved with a granularity of 32 bits (so bytes 0-3 are in bank 1, 4-7 in bank 2, ..., 64-67 in bank 1 again, and so on). For a better visualization it basically looks like this:
Bank    |      1      |      2      |      3      | ...
Address |  0  1  2  3 |  4  5  6  7 |  8  9 10 11 | ...
Address | 64 65 66 67 | 68 69 70 71 | 72 73 74 75 | ...
...
So if each thread in a half-warp accesses successive 32-bit values there are no bank conflicts. An exception to this rule (every thread must access its own bank) is a broadcast:
If all threads access the same address, the value is only read once and broadcast to all threads (for GT200 it has to be all threads in the half-warp accessing the same address; IIRC Fermi and AMD GPUs can do this for any number of threads accessing the same value).
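As a rough CUDA illustration (a minimal sketch; the kernel name, array size, and strides are made up for the example), the first read below is conflict-free, the second is a broadcast, and the third is a strided access that gets serialized:
__global__ void bank_examples(int *out)   // launch with up to 64 threads so tid * 16 stays in bounds
{
    __shared__ int smem[1024];
    int tid = threadIdx.x;

    smem[tid] = tid;          // stride-1 store: thread i writes bank i % 32 (i % 16 on GT200), no conflict
    __syncthreads();

    int a = smem[tid];        // conflict-free: successive 32-bit words, one word per bank
    int b = smem[0];          // broadcast: all threads read the same word, served once
    int c = smem[tid * 16];   // strided: on a 32-bank GPU only banks 0 and 16 are hit -> 16-way conflict

    out[tid] = a + b + c;     // keep the reads live
}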

The shared memory that can be accessed in parallel is divided into modules (also called banks). If two memory accesses target locations (addresses) in the same bank, you get a bank conflict: the accesses are serialized, losing the advantage of parallel access.

In simple words, a bank conflict occurs when a memory access pattern fails to distribute I/O across the banks available in the memory system. The following example illustrates the concept:
Suppose we have a two-dimensional 512x512 array of integers and our DRAM or memory system has 512 banks in it. By default the array data will be laid out so that arr[0][0] goes to bank 0, arr[0][1] goes to bank 1, arr[0][2] to bank 2, ..., arr[0][511] goes to bank 511. To generalize, arr[x][y] occupies bank number y. Now if some code (as shown below) starts accessing the data in column-major fashion, i.e. changing x while keeping y constant, then every consecutive memory access will hit the same bank, hence a bank conflict.
int arr[512][512];
int i, j;
for (j = 0; j < 512; j++)          // outer loop over columns
    for (i = 0; i < 512; i++)      // inner loop over rows
        arr[i][j] = 2 * arr[i][j]; // column-major traversal: every access hits the same bank
Such problems are usually avoided by compilers by buffering the array or by using a prime number of elements in the array, so that consecutive accesses map to different banks.
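If the compiler does not do this for you, two common manual fixes are sketched below (illustrative only; they reuse arr, i and j from the snippet above, and the padding value is just an example):
// Fix 1: swap the loops so the traversal is row-major and consecutive
// accesses walk across the banks instead of hammering one of them.
for (i = 0; i < 512; i++)
    for (j = 0; j < 512; j++)
        arr[i][j] = 2 * arr[i][j];

// Fix 2: pad the row length so that arr_padded[i][j] and arr_padded[i+1][j]
// land in different banks even when a column is walked.
int arr_padded[512][513];           // 513 is not a multiple of the 512-bank width
for (j = 0; j < 512; j++)
    for (i = 0; i < 512; i++)
        arr_padded[i][j] = 2 * arr_padded[i][j];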

This video, "CUDA Bank Conflict", gives a very good explanation; I hope it helps:
http://www.youtube.com/watch?v=CZgM3DEBplE

http://en.wikipedia.org/wiki/Memory_bank
and
http://mprc.pku.cn/mentors/training/ISCAreading/1989/p380-weiss/p380-weiss.pdf
From these pages you can find the details about memory banks, but note that the layout there is a little different from what @Grizzly describes. On that page the banks are interleaved per element:
bank    |    1      |    2      |    3      | ...
address | 0, 3, 6...| 1, 4, 7...| 2, 5, 8...| ...
Hope this helps.

Related

CUDA memory bank conflict

I would like to be sure that I understand bank conflicts in shared memory correctly.
I have 32 portions of data.
Each portion consists of 128 integers.
|0, 1, 2, ..., 125, 126, 127| ... |3968, 3969, 3970, ..., 4093, 4094, 4095|
Each thread in a warp accesses only its own portion.
Thread 0 accesses position 0 (index 0) in portion 0
Thread 1 accesses position 0 (index 128) in portion 1
Thread 31 accesses position 0 (index 3968) in portion 31
Does that mean I have a 32-way conflict here?
If yes, then if I stretch the portions to 129 elements, each thread will access a unique bank. Am I right?
Yes, you will have 32-way bank conflicts. For the purposes of bank conflicts, it may help to visualize shared memory as a two-dimensional array whose width is 32 elements (e.g. 32 int or float quantities). Each column in this 2D array is a "bank".
Overlay your storage pattern on that. When you do so, you will see that your stated access pattern results in all threads in the warp requesting items from column 0.
Yes, the usual "trick" here is to pad the storage by 1 element per "row" (in your case this could be one element per "portion"). That should eliminate bank conflicts for your stated access pattern.

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for(int i = 0; i < big_number; i++) {
    auto thread_idx = determine_thread_index_into_its_own_array();
    T value = calculate_value();
    write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is easy-peasy: just place all of thread i's data at shared memory addresses i, 32+i, 64+i, 96+i, etc. This puts all of i's data in the same bank, which is also distinct from the other lanes' banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler microarchitectures (up to Volta), the best we can theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value per lane (a single transaction provides at most 32 bits to each lane).
This is achievable in practice with the placement pattern analogous to the one the OP described for 32-bit data. That is, for T* arr, have lane i read the idx'th element as arr[idx * 32 + i]. This will compile so that two transactions occur:
The lower 16 lanes obtain their data from the first 32*4 bytes of arr (utilizing all banks)
The higher 16 lanes obtain their data from the next 32*4 bytes of arr (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately, which means it can do better than the simplistic "break up T into halves" idea the other answer proposed.
(This answer is based on @RobertCrovella's comments.)
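A minimal sketch of that layout, taking T = double for concreteness (kernel name and fill values are made up; launch with 32 * n_per_thread * sizeof(double) bytes of dynamic shared memory):
__global__ void interleaved_64bit(double *out, int n_per_thread)
{
    // Lane i's idx-th element lives at arr[idx * 32 + i], same interleaving as the 32-bit case.
    extern __shared__ double arr[];                 // 32 * n_per_thread doubles

    int lane = threadIdx.x % 32;

    for (int idx = 0; idx < n_per_thread; idx++)
        arr[idx * 32 + lane] = lane + idx;          // write own slot, no extra conflicts
    __syncthreads();

    double acc = 0.0;
    for (int idx = 0; idx < n_per_thread; idx++)
        acc += arr[idx * 32 + lane];                // each warp-wide read costs 2 transactions
    out[threadIdx.x] = acc;
}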
On Kepler GPUs this had a simple solution: just change the bank size! Kepler supported setting the shared memory bank size to 8 bytes instead of 4, dynamically (cudaDeviceSetSharedMemConfig()). But alas, that feature is not available on later microarchitectures (e.g. Maxwell, Pascal).
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N 32-bit values, each consecutive pair being the low and the high 32 bits of a T.
To access a 64-bit value, two half-T accesses are made, and the T is recomposed with something like:
uint64_t joined =
    (static_cast<uint64_t>(reinterpret_cast<uint32_t&>(upper_half)) << 32) |
    reinterpret_cast<uint32_t&>(lower_half);
auto& my_t_value = reinterpret_cast<T&>(joined);
and the same in reverse when writing.
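A rough sketch of that split storage (the kernel, the template parameter N and the fill values are made up for illustration):
#include <cstdint>

template <int N>
__global__ void split_halves_kernel(uint64_t *out)
{
    // Two 32-bit arrays instead of one 64-bit array, so every half-access
    // follows the conflict-free 32-bit interleaving described by the OP.
    __shared__ uint32_t lo_halves[32 * N];
    __shared__ uint32_t hi_halves[32 * N];

    int lane = threadIdx.x % 32;

    for (int idx = 0; idx < N; idx++) {
        uint64_t value = (static_cast<uint64_t>(lane) << 32) | idx;   // placeholder data
        lo_halves[idx * 32 + lane] = static_cast<uint32_t>(value);
        hi_halves[idx * 32 + lane] = static_cast<uint32_t>(value >> 32);
    }
    __syncthreads();

    uint64_t acc = 0;
    for (int idx = 0; idx < N; idx++) {
        uint32_t lower_half = lo_halves[idx * 32 + lane];
        uint32_t upper_half = hi_halves[idx * 32 + lane];
        acc += (static_cast<uint64_t>(upper_half) << 32) | lower_half;
    }
    out[threadIdx.x] = acc;
}
// Usage (illustrative): split_halves_kernel<4><<<1, 32>>>(d_out);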
As the comments suggest, it is better to make 64-bit accesses, as described in the other answer.

How to get 100% GPU usage using CUDA

I wonder how I can generate high load on a GPU, step by step.
What I'm trying to do is a program which puts maximum load on one MP (multiprocessor), then on another, until reaching the total number of MPs.
It would be similar to executing a "while true" on every single core of a CPU, but I'm not sure whether the same paradigm would work on a GPU with CUDA.
Can you help me?
If you want to do a stress-test/power-consumption test, you'll need to pick the workload. The highest power consumption with compute-only code is most likely obtained with some synthetic benchmark that feeds the GPU the optimal mix and sequence of operations. Otherwise, BLAS level 3 is probably quite close to optimal.
Putting load on only a certain number of multiprocessors will require that you tweak the workload to limit the block-level parallelism.
Briefly, this is what I'd do:
Pick a code that is well-optimized and known to utilize the GPU to a great extent (high IPC, high power consumption, etc.). Have a look around on the CUDA developer forums; you should be able to find hand-tuned BLAS code or something alike.
Change the code to force it to run on a given number of multiprocessors. This will require that you tune the number of blocks and threads to produce exactly the right amount of load for the number of processors you want to utilize (see the sketch after this list).
Profile: the profiler counters can show you the number of instructions per multiprocessor, which gives you a check that you are indeed running only on the desired number of processors, as well as other counters that can indicate how efficiently the code is running.
Measure: if you have a Tesla or Quadro you get power readings out of the box. Otherwise, try the nvml fix. Without a power measurement it will be hard for you to know how far you are from the TDP and especially whether the GPU is throttling.
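Here is a very rough sketch of that block-count tuning (the kernel, its FMA loop and the launch parameters are made up; a hand-tuned BLAS-like kernel will draw considerably more power):
// Dependent FMAs keep the ALUs of each occupied multiprocessor busy.
__global__ void burn_kernel(float *out, long long iters)
{
    float x = threadIdx.x * 0.001f;
    for (long long i = 0; i < iters; i++)
        x = fmaf(x, 1.0000001f, 0.0000001f);
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;   // prevents dead-code elimination
}

// Host side (illustrative): with only a few resident blocks, the scheduler
// usually places one block per multiprocessor, so the block count roughly
// controls how many multiprocessors are kept busy.
// burn_kernel<<<num_sms_to_load, 1024>>>(d_out, 1LL << 30);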
Some of my benchmarks carry out the same calculations via CUDA, OpenMP and programmed multithreading. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. A range of data sizes is also used. [Free] benchmarks, source code and results for Linux are available via my page:
http://www.roylongbottom.org.uk/linux%20benchmarks.htm
I also provide Windows varieties. Following are some CUDA results, showing a maximum speed of 412 GFLOPS using a GeForce GTX 650. On the quad-core/8-thread Core i7, OpenMP produced up to 91 GFLOPS, and multithreading produced up to 93 GFLOPS using SSE instructions and 178 GFLOPS with AVX 1. See also the section on Burn-In and Reliability Apps, where the most demanding CUDA test is run for a period to show temperature gains, at the same time as CPU stress tests.
Core i7 4820K 3.9 GHz Turbo Boost GeForce GTX 650
Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Tue Dec 30 22:50:52 2014
CUDA devices found
Device 0: GeForce GTX 650 with 2 Processors 16 cores
Global Memory 999 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test             4-Byte Words   Ops/Wd   Repeat Passes   Seconds    MFLOPS   First Results     All Same
Data in & out 100000 2 2500 0.837552 597 0.9295383095741 Yes
Data out only 100000 2 2500 0.389646 1283 0.9295383095741 Yes
Calculate only 100000 2 2500 0.085709 5834 0.9295383095741 Yes
Data in & out 1000000 2 250 0.441478 1133 0.9925497770309 Yes
Data out only 1000000 2 250 0.229017 2183 0.9925497770309 Yes
Calculate only 1000000 2 250 0.051727 9666 0.9925497770309 Yes
Data in & out 10000000 2 25 0.369060 1355 0.9992496371269 Yes
Data out only 10000000 2 25 0.201172 2485 0.9992496371269 Yes
Calculate only 10000000 2 25 0.048027 10411 0.9992496371269 Yes
Data in & out 100000 8 2500 0.708377 2823 0.9571172595024 Yes
Data out only 100000 8 2500 0.388206 5152 0.9571172595024 Yes
Calculate only 100000 8 2500 0.092254 21679 0.9571172595024 Yes
Data in & out 1000000 8 250 0.478644 4178 0.9955183267593 Yes
Data out only 1000000 8 250 0.231182 8651 0.9955183267593 Yes
Calculate only 1000000 8 250 0.053854 37138 0.9955183267593 Yes
Data in & out 10000000 8 25 0.370669 5396 0.9995489120483 Yes
Data out only 10000000 8 25 0.202392 9882 0.9995489120483 Yes
Calculate only 10000000 8 25 0.049263 40599 0.9995489120483 Yes
Data in & out 100000 32 2500 0.725027 11034 0.8902152180672 Yes
Data out only 100000 32 2500 0.407579 19628 0.8902152180672 Yes
Calculate only 100000 32 2500 0.113188 70679 0.8902152180672 Yes
Data in & out 1000000 32 250 0.497855 16069 0.9880878329277 Yes
Data out only 1000000 32 250 0.261461 30597 0.9880878329277 Yes
Calculate only 1000000 32 250 0.060132 133042 0.9880878329277 Yes
Data in & out 10000000 32 25 0.375882 21283 0.9987964630127 Yes
Data out only 10000000 32 25 0.207640 38528 0.9987964630127 Yes
Calculate only 10000000 32 25 0.054718 146204 0.9987964630127 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.018107 27613 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.007775 64308 0.9992496371269 Yes
Calculate 10000000 8 25 0.025103 79671 0.9995489120483 Yes
Shared Memory 10000000 8 25 0.008724 229241 0.9995489120483 Yes
Calculate 10000000 32 25 0.036397 219797 0.9987964630127 Yes
Shared Memory 10000000 32 25 0.019414 412070 0.9987964630127 Yes

CUDA shared memory, does gap-access pattern penalizes performance?

I am dealing with a CUDA shared memory access pattern and I am not sure whether it is good or incurs some sort of performance penalty.
Suppose I have 512 integer numbers in shared memory:
__shared__ int snums[516];
and half as many threads, that is, 256 threads.
The kernel works as follows:
(1) The block of 256 threads first applies a function f(x) to the even locations of snums[], then (2) it applies f(x) to the odd locations of snums[]. Function f(x) acts on the local neighborhood of the given number x, then changes x to a new value. There is a __syncthreads() in between (1) and (2).
Clearly, while I am doing (1), there are 32-bit gaps in the shared memory accesses because the odd locations are not being accessed. The same occurs in (2); there will be gaps at the even locations of snums[].
From what I read in the CUDA documentation, memory bank conflicts should occur when threads access the same locations. But they do not talk about gaps.
Will there be any problem with the banks that could incur a performance penalty?
I guess you meant:
__shared__ int snums[512];
Will there be any bank conflict and performance penalty?
Assuming at some point your code does something like:
int a = snums[2*threadIdx.x]; // this would access every even location
the above line of code would generate an access pattern with 2-way bank conflicts. 2-way bank conflicts mean the above line of code takes approximately twice as long to execute as the optimal no-bank-conflict line of code (depicted below).
If we were to focus only on the above line of code, the obvious approach to eliminating the bank conflict would be to re-order the storage pattern in shared memory so that all of the data items previously stored at snums[0], snums[2], snums[4] ... are now stored at snums[0], snums[1], snums[2] ... thus effectively moving the "even" items to the beginning of the array and the "odd" items to the end of the array. That would allow an access like so:
int a = snums[threadIdx.x]; // no bank conflicts
However you have stated that a calculation neighborhood is important:
Function f(x) acts on the local neighborhood of the given number x,...
So this sort of reorganization might require some special indexing arithmetic; a sketch of one possible remapping follows.
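As a minimal sketch of that indexing arithmetic (the remap() helper, the placeholder for f(x) and the launch assumptions are made up; launched with 256 threads per block):
// Even original indices are packed into snums[0..255], odd ones into snums[256..511],
// so each phase reads a contiguous, conflict-free range; remap() keeps the original
// neighborhood addressable.
__device__ __forceinline__ int remap(int orig_idx)
{
    return (orig_idx % 2 == 0) ? (orig_idx / 2) : (256 + orig_idx / 2);
}

__global__ void even_odd_kernel(const int *in, int *out)
{
    __shared__ int snums[512];
    int tid = threadIdx.x;                        // 256 threads

    snums[remap(2 * tid)]     = in[2 * tid];      // stride-1 stores, no conflicts
    snums[remap(2 * tid + 1)] = in[2 * tid + 1];
    __syncthreads();

    // Phase (1): the even locations are now snums[0..255] -> stride-1, conflict-free.
    snums[tid] = snums[tid] + 1;                  // placeholder for f(x)
    __syncthreads();

    // Phase (2): the odd locations are snums[256..511] -> also stride-1.
    snums[256 + tid] = snums[256 + tid] + 1;      // placeholder for f(x)
    __syncthreads();

    out[2 * tid]     = snums[remap(2 * tid)];
    out[2 * tid + 1] = snums[remap(2 * tid + 1)];
}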
On newer architectures, shared memory bank conflicts don't occur when threads access the same location, but they do occur if threads access different locations in the same bank. The bank is simply determined by the low-order bits of the 32-bit word index:
snums[0] : bank 0
snums[1] : bank 1
snums[2] : bank 2
...
snums[32] : bank 0
snums[33] : bank 1
...
(the above assumes 32-bit bank mode)
This answer may also be of interest

Bank conflicts in 2.x devices

What is a bank conflict in devices with compute capability 2.x? As I understand the CUDA C Programming Guide, if two threads in a 2.x device access the same 32-bit word in the same shared memory bank, it does not cause a bank conflict; instead, the word is broadcast. When two threads write to the same 32-bit word in the same shared memory bank, only one thread succeeds.
Since on-chip memory is 64 KB (48 KB for shared memory and 16 KB for L1, or vice versa), and it is organized in 32 banks, I am assuming that each bank consists of 2 KB. So I think that bank conflicts will arise if two threads access two different 32-bit words in the same shared memory bank. Is this correct?
Your description is correct. There are many access patterns that can generate bank conflicts, but here's a simple and common example: strided access.
__shared__ int smem[512];
int tid = threadIdx.x;
int x = smem[tid * 2]; // 2-way bank conflicts
int y = smem[tid * 4]; // 4-way bank conflicts
int z = smem[tid * 8]; // 8-way bank conflicts
// etc.
Bank ID = index % 32, so if you look at the pattern of addresses in the x, y, and z accesses, you can see that within each warp of 32 threads, for x, 2 threads will access each bank; for y, 4 threads will access each bank; and for z, 8 threads will access each bank.
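A small host-side check (purely illustrative) that counts how many lanes of one warp land in each bank for a given stride, reproducing the 2-, 4- and 8-way figures above:
#include <stdio.h>

static int bank_of(int word_index) { return word_index % 32; }   // bank = 32-bit word index % 32

int main(void)
{
    int strides[] = {1, 2, 4, 8};
    for (int s = 0; s < 4; s++) {
        int counts[32] = {0};
        for (int tid = 0; tid < 32; tid++)        // one warp
            counts[bank_of(tid * strides[s])]++;
        int max_way = 0;
        for (int b = 0; b < 32; b++)
            if (counts[b] > max_way) max_way = counts[b];
        printf("stride %d -> %d thread(s) per bank\n", strides[s], max_way);
    }
    return 0;   // prints 1, 2, 4 and 8 respectively
}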