I'm in the process of writing some N-body simulation code with short-ranged interactions in CUDA, targeted toward Volta and Turing series cards. I plan on using shared memory, but it's not quite clear to me how to avoid bank conflicts when doing so. Since my interactions are local, I was planning on sorting my particle data into local groups that I can send to each SM's shared memory (not yet worrying about particles that have a neighbor being worked on by another SM). In order to get good performance (avoid bank conflicts), is it sufficient that each thread reads/writes from/to a different address of shared memory, even if each thread accesses that memory non-sequentially, without penalty?
All of the information I see seems to only mention that accesses be coalesced for the copy from global memory to shared memory, but I don't see anything about whether threads in a warp (or the whole SM) care about coalescence within shared memory.
Is it sufficient that each thread reads/writes from/to a different address of shared memory, even if each thread accesses that memory non-sequentially, without penalty?
Bank conflicts are only possible between threads in a single warp that are performing a shared memory access, and then only on a per-(issued-)instruction basis. The instructions I am talking about here are SASS (GPU assembly code) instructions, but they should nevertheless be directly identifiable from shared memory references in CUDA C++ source code.
There is no such thing as a bank conflict:
between threads in different warps
between shared memory accesses arising from different (issued) instructions
A given thread may access shared memory in any pattern, with no concern or possibility of shared memory bank conflicts due to its own activity. Bank conflicts only arise between 2 or more threads in a single warp, as a result of a particular shared memory instruction or access issued warp-wide.
Furthermore, it is not sufficient that each thread reads/writes from/to a different address. Roughly speaking, for a given issued instruction (i.e. a given access), each thread in the warp must read from a different bank, or else it must read from an address that is the same as another address in the warp (broadcast).
Let's assume that we are referring to 32-bit banks, and an arrangement of 32 banks.
Shared memory can readily be imagined as a 2D arrangement:
Addr Bank
v 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 96 97 98 ...
We see that addresses/indexes/offsets/locations 0, 32, 64, 96, etc. are in the same bank. Addresses 1, 33, 65, 97, etc. are in the same bank, and so on, for each of the 32 banks. Banks are like columns of locations when the addresses of shared memory are visualized in this 2D arrangement.
The requirement for non-bank-conflicted access for a given instruction (load or store) issued to a warp is:
no 2 threads in the warp may access locations in the same bank/column.
a special case exists if the locations in the same column are actually the same location. This invokes the broadcast rule and does not lead to bank conflicts.
And to repeat some statements above in a slightly different way:
If I have a loop in CUDA code, there is no possibility for bank conflicts to arise between separate iterations of that loop.
If I have two separate lines of CUDA C++ code, there is no possibility for bank conflicts to arise between those two separate lines of CUDA C++ code.
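As a concrete illustration of these rules (a minimal sketch; the array names and sizes are mine, not from your code), here is how a warp of 32 threads can access a shared array with and without bank conflicts, assuming 32 banks of 32-bit words:

__global__ void bank_demo()
{
    __shared__ float tile[32][32];    // each row of 32 floats spans all 32 banks
    int x = threadIdx.x;              // lane index within the warp (0..31)
    int y = threadIdx.y;

    float a = tile[y][x];             // consecutive lanes hit consecutive banks: no conflict
    float b = tile[x][y];             // consecutive lanes are 32 words apart: all lanes hit the
                                      // same bank/column -> 32-way bank conflict

    __shared__ float padded[32][33];  // common fix: pad the inner dimension by one word
    float c = padded[x][y];           // lanes are now 33 words apart and land in distinct banks
    (void)a; (void)b; (void)c;        // values unused; this only illustrates the access patterns
}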
My single-page web application runs into a memory leak with Chromium on Linux. My app fetches HTML markup and embeds it into the main page via .innerHTML. The remote HTML snippets may reference JPGs or PNGs. The memory leak does not occur in the JS heap; this can be seen in the table below.
Column A shows the size of the JS heap as reported by performance.memory.usedJSHeapSize. (I am running Chromium with the --enable-precise-memory-info flag to get precise values.)
Column B shows the total amount of memory occupied by Chromium as shown by top (only the start and end values were sampled).
Column C shows the amount of "available memory" in Linux as reported by /proc/meminfo.
Step  A (usedJSHeapSize, bytes)  B (top total)  C (available)
01 1337628 234.5 MB 522964 KB
02 1372198 499404 KB
03 1500304 499864 KB
04 1568540 485476 KB
05 1651320 478048 KB
06 1718846 489684 KB GC
07 1300169 450240 KB
08 1394914 475624 KB
09 1462320 472540 KB
10 1516964 471064 KB
11 1644589 459604 KB GC
12 1287521 441532 KB
13 1446901 449220 KB
14 1580417 436504 KB
15 1690518 457488 KB
16 1772467 444216 KB GC
17 1261924 418896 KB
18 1329657 439252 KB
19 1403951 436028 KB
20 1498403 434292 KB
21 1607942 429272 KB GC
22 1298138 403828 KB
23 1402844 412368 KB
24 1498350 412560 KB
25 1570854 409912 KB
26 1639122 419268 KB
27 1715667 399460 KB GC
28 1327934 379188 KB
29 1438188 417764 KB
30 1499364 401160 KB
31 1646557 406020 KB
32 1720947 402000 KB GC
33 1369626 283.3 MB 378324 KB
While the JS heap only varies between 1.3 MB and 1.8 MB during my 33-step test, Chromium's memory (as reported by top) grows by 48.8 MB (from 234.5 MB to 283.3 MB). And according to /proc/meminfo, the "available memory" even shrinks by 145 MB (from 522964 KB to 378324 KB in column C) over the same period. I assume Chromium is occupying a good amount of cache memory outside of the reported 283.3 MB. Note that I am invoking GC manually 6 times via the developer tools.
Before running the test, I have stopped all unnecessary services and killed all unneeded processes. No other code is doing any essential work in parallel. There are no browser extensions and no other open tabs.
The memory leak appears to be in native memory and probably involves the images being shown; they never seem to be freed. This issue appears to be similar to 1 (bugs.webkit.org). I have applied all the usual suggestions as listed here 2. If I keep the app running, the amount of memory occupied by Chromium grows unbounded until everything comes to a crawl or until the Linux OOM killer strikes. The one thing I can do (before it is too late) is to switch the browser to a different URL. This releases all native memory, and I can then return to my app and continue with a completely fresh memory situation. But this is not a real solution.
Q: Can the native memory caches be freed in a more programmatic way?
I have tried some of the online references, as well as Unix time formats etc., but none of these seem to work. See the examples below.
Running MySQL 5.5.5 on Ubuntu, InnoDB engine.
Nothing is custom; this uses the built-in datetime type.
Here are some examples with the 6-byte hex string and the decoded value below. We are looking for the decoding algorithm, i.e. how to turn the 6-byte hex string into the correct date/time. The algorithm must work correctly on the examples below. The rightmost byte seems to indicate the difference in seconds correctly for small differences in time between records, i.e. we show an example with a 14-second difference.
Full records, nicely highlighted and formatted, are in this Word doc (link to a formatted Word document with the examples):
https://www.dropbox.com/s/zsqy9o2rw1h0e09/mysql%20datetime%20examples%20.docx?dl=0
Contact frank%simrex.com re. reward (replace % with #).
The hex strings and decoded date/time pairs below were pulled from a healthy file running MySQL.
12 51 72 78 B9 46 ... 2014-10-22 16:53:18
12 51 72 78 B9 54 ... 2014-10-22 16:53:32
12 51 72 78 BA 13 ... 2014-10-22 16:55:23
12 51 72 78 CC 27 ... 2014-10-22 17:01:51
Here you go:
select str_to_date(conv(replace('12 51 72 78 CC 27',' ', ''), 16, 10), '%Y%m%d%H%i%s')
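In other words, the 6 bytes are simply the decimal timestamp YYYYMMDDHHMMSS stored as a (big-endian) hexadecimal integer, so the query strips the spaces, converts base 16 to base 10 with conv(), and parses the result with str_to_date(). Checking against the first example:

0x12517278B946 = 20141022165318  ->  2014-10-22 16:53:18

This also explains why the rightmost byte tracks small differences in seconds between records.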
I wonder how I can generate high load on a GPU, step by step.
What I'm trying to do is a program which puts the maximum load on one MP (multiprocessor), then on another, until reaching the total number of MPs.
It would be similar to executing a "while true" on every single core of a CPU, but I'm not sure whether the same paradigm would work on a GPU with CUDA.
Can you help me?
If you want to do a stress-test/power-consumption test, you'll need to pick the workload. The highest power consumption with compute-only code you'll most likely get with some synthetic benchmark that feeds the GPU with the optimal mix and sequence of operations. Otherwise, BLAS level 3 is probably quite close to optimal.
Putting load only on a certain number of multi-processors will require that you tweak the workload to limit the block-level parallelism.
Briefly, this is what I'd do:
Pick a code that is well-optimized and known to utilize the GPU to a great extent (high IPC, high power consumption, etc.). Have a look around on the CUDA developer forums; you should be able to find hand-tuned BLAS code or something similar.
Change the code to force it to run on a given number of multiprocessors. This will require that you tune the number of blocks and threads to produce exactly the right amount of load for the number of processors you want to utilize (see the sketch after this list).
Profile: the profiler counters can show you the number of instructions per multiprocessor, which gives you a check that you are indeed only running on the desired number of processors, as well as other counters that can indicate how efficiently the code is running.
Measure. If you have a Tesla or Quadro you get power consumption out of the box. Otherwise, try the nvml fix. Without a power measurement it will be hard for you to know how far you are from the TDP and especially whether the GPU is throttling.
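As a rough sketch of the "limit block-level parallelism" idea mentioned above (the kernel name, iteration count and FLOP mix are assumptions for illustration, not a tuned benchmark), launching one resident block per multiprocessor lets you choose roughly how many MPs are kept busy:

// Hypothetical spin kernel: each block issues a long chain of dependent FMAs
// to keep its multiprocessor busy; 'iters' controls how long the load lasts.
__global__ void burn_kernel(float *out, long long iters)
{
    float a = threadIdx.x * 0.001f;
    for (long long i = 0; i < iters; ++i)
        a = fmaf(a, 0.9999999f, 0.5f);               // dependent fused multiply-adds
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // keep the result live so the loop isn't optimized away
}

int main()
{
    int mps_to_load = 4;    // how many multiprocessors you want to keep busy
    int threads = 1024;     // enough threads to fill one MP (tune for your GPU)
    float *out;
    cudaMalloc(&out, mps_to_load * threads * sizeof(float));
    // With one block per MP and no other work queued, the block scheduler
    // will typically place these blocks on distinct multiprocessors.
    burn_kernel<<<mps_to_load, threads>>>(out, 1LL << 28);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}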
Some of my benchmarks carry out the same calculations via CUDA, OpenMP and programmed multithreading. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. A range of data sizes is also used. [Free] benchmarks, source code and results for Linux are available via my page:
http://www.roylongbottom.org.uk/linux%20benchmarks.htm
I also provide Windows varieties. Following are some CUDA results, showing a maximum speed of 412 GFLOPS using a GeForce GTX 650. On the quad-core/8-thread Core i7, OpenMP produced up to 91 GFLOPS, and multithreading up to 93 GFLOPS using SSE instructions and 178 GFLOPS with AVX 1. See also the section on Burn-In and Reliability Apps, where the most demanding CUDA test is run for a period to show temperature gains, at the same time as CPU stress tests.
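For reference, the per-element operation described above would look roughly like this as a CUDA kernel (a sketch with assumed names, not the benchmark's actual source; the 2/8/32-operation variants vary how many of these terms are used):

__global__ void mflops_kernel(float *x, int n,
                              float a, float b, float c,
                              float d, float e, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}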
Core i7 4820K 3.9 GHz Turbo Boost GeForce GTX 650
Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Tue Dec 30 22:50:52 2014
CUDA devices found
Device 0: GeForce GTX 650 with 2 Processors 16 cores
Global Memory 999 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test 4 Byte Ops Repeat Seconds MFLOPS First All
Words /Wd Passes Results Same
Data in & out 100000 2 2500 0.837552 597 0.9295383095741 Yes
Data out only 100000 2 2500 0.389646 1283 0.9295383095741 Yes
Calculate only 100000 2 2500 0.085709 5834 0.9295383095741 Yes
Data in & out 1000000 2 250 0.441478 1133 0.9925497770309 Yes
Data out only 1000000 2 250 0.229017 2183 0.9925497770309 Yes
Calculate only 1000000 2 250 0.051727 9666 0.9925497770309 Yes
Data in & out 10000000 2 25 0.369060 1355 0.9992496371269 Yes
Data out only 10000000 2 25 0.201172 2485 0.9992496371269 Yes
Calculate only 10000000 2 25 0.048027 10411 0.9992496371269 Yes
Data in & out 100000 8 2500 0.708377 2823 0.9571172595024 Yes
Data out only 100000 8 2500 0.388206 5152 0.9571172595024 Yes
Calculate only 100000 8 2500 0.092254 21679 0.9571172595024 Yes
Data in & out 1000000 8 250 0.478644 4178 0.9955183267593 Yes
Data out only 1000000 8 250 0.231182 8651 0.9955183267593 Yes
Calculate only 1000000 8 250 0.053854 37138 0.9955183267593 Yes
Data in & out 10000000 8 25 0.370669 5396 0.9995489120483 Yes
Data out only 10000000 8 25 0.202392 9882 0.9995489120483 Yes
Calculate only 10000000 8 25 0.049263 40599 0.9995489120483 Yes
Data in & out 100000 32 2500 0.725027 11034 0.8902152180672 Yes
Data out only 100000 32 2500 0.407579 19628 0.8902152180672 Yes
Calculate only 100000 32 2500 0.113188 70679 0.8902152180672 Yes
Data in & out 1000000 32 250 0.497855 16069 0.9880878329277 Yes
Data out only 1000000 32 250 0.261461 30597 0.9880878329277 Yes
Calculate only 1000000 32 250 0.060132 133042 0.9880878329277 Yes
Data in & out 10000000 32 25 0.375882 21283 0.9987964630127 Yes
Data out only 10000000 32 25 0.207640 38528 0.9987964630127 Yes
Calculate only 10000000 32 25 0.054718 146204 0.9987964630127 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.018107 27613 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.007775 64308 0.9992496371269 Yes
Calculate 10000000 8 25 0.025103 79671 0.9995489120483 Yes
Shared Memory 10000000 8 25 0.008724 229241 0.9995489120483 Yes
Calculate 10000000 32 25 0.036397 219797 0.9987964630127 Yes
Shared Memory 10000000 32 25 0.019414 412070 0.9987964630127 Yes
I have been reading the programming guide for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. Can anybody help me understand it? I have no preference if the help is in the context of CUDA/OpenCL or just bank conflicts in general in computer science.
For NVIDIA (and AMD, for that matter) GPUs the local memory is divided into memory banks. Each bank can only address one dataset at a time, so if a half-warp tries to load/store data from/to the same bank, the access has to be serialized (this is a bank conflict). For GT200 GPUs there are 16 banks (32 banks for Fermi), and 16 or 32 banks for AMD GPUs (57xx or higher: 32, everything below: 16), which are interleaved with a granularity of 32 bits (so bytes 0-3 are in bank 1, bytes 4-7 in bank 2, ..., bytes 64-67 in bank 1 again, and so on). For a better visualization, it basically looks like this:
Bank | 1 | 2 | 3 |...
Address | 0 1 2 3 | 4 5 6 7 | 8 9 10 11 |...
Address | 64 65 66 67 | 68 69 70 71 | 72 73 74 75 |...
...
So if each thread in a half-warp accesses successive 32-bit values, there are no bank conflicts. An exception to this rule (every thread must access its own bank) is broadcasts:
If all threads access the same address, the value is only read once and broadcast to all threads (for GT200 it has to be all threads in the half-warp accessing the same address; IIRC Fermi and AMD GPUs can do this for any number of threads accessing the same value).
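For instance (a minimal sketch with made-up names), within one half-warp/warp the following patterns behave differently:

__global__ void access_patterns()
{
    __shared__ int s[256];
    int same    = s[0];               // every thread reads word 0: broadcast, no conflict
    int linear  = s[threadIdx.x];     // successive threads read successive words: no conflict
    int strided = s[threadIdx.x * 2]; // stride of 2 words: pairs of threads share a bank -> 2-way conflict
    (void)same; (void)linear; (void)strided;  // values unused; only the patterns matter here
}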
The shared memory that can be accessed in parallel is divided into modules (also called banks). If two memory locations (addresses) occur in the same bank, then you get a bank conflict during which the access is done serially, losing the advantages of parallel access.
In simple words, a bank conflict is a case where a memory access pattern fails to distribute I/O across the banks available in the memory system. The following example elaborates on the concept:
Let us suppose we have a two-dimensional 512x512 array of integers, and our DRAM or memory system has 512 banks in it. By default, the array data will be laid out so that arr[0][0] goes to bank 0, arr[0][1] goes to bank 1, arr[0][2] to bank 2, ..., arr[0][511] goes to bank 511. To generalize, arr[x][y] occupies bank number y. Now if some code (as shown below) starts accessing data in column-major fashion, i.e. changing x while keeping y constant, then the end result will be that every consecutive memory access hits the same bank, hence a bank conflict.
int arr[512][512];
for (int j = 0; j < 512; j++)        // outer loop over columns
    for (int i = 0; i < 512; i++)    // inner loop over rows
        arr[i][j] = 2 * arr[i][j];   // column-major access: every reference hits bank j
Such problems are usually avoided by compilers by buffering the array or by using a prime number of elements in the array.
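Another way to see it (a sketch under the same hypothetical 512-bank layout): swapping the loop order restores row-major access, so consecutive references walk across the banks instead of repeatedly hitting bank j:

for (int i = 0; i < 512; i++)        // outer loop over rows
    for (int j = 0; j < 512; j++)    // inner loop over columns
        arr[i][j] = 2 * arr[i][j];   // consecutive references go to banks j, j+1, ...: no conflict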
(CUDA Bank Conflict)
I hope this will help.
This is a very good explanation:
http://www.youtube.com/watch?v=CZgM3DEBplE
http://en.wikipedia.org/wiki/Memory_bank
and
http://mprc.pku.cn/mentors/training/ISCAreading/1989/p380-weiss/p380-weiss.pdf
From this page you can find the details about memory banks.
But it is a little different from what is said by @Grizzly.
On this page, the banks look like this:
bank    |    1    |    2    |    3    | ...
address | 0, 3, 6 | 1, 4, 7 | 2, 5, 8 | ...
Hope this helps.
I have a CUDA loop where a variable cumulative_value stores an accumulation in double:
double cumulative_value = 0.0;
loop(...)
{
    // ...
    double valueY = computeValueY();
    // ...
    cumulative_value += valueY;
}
This code is compiled with different SDKs and run on two machines:
M1: Tesla M2075, CUDA 5.0
M2: Tesla M2075, CUDA 7.5
At step 10, the results are different. The values for this addition (double-precision representation in hexadecimal) are:
0x 41 0d d3 17 34 79 27 4d => cumulative_value
+ 0x 40 b6 60 1d 78 6f 09 b0 => valueY
-------------------------------------------------------
=
0x 41 0e 86 18 20 3c 9f 9b (for M1)
0x 41 0e 86 18 20 3c 9f 9a (for M2)
The rounding mode is not specified as far as I can see in the PTX file (the instruction is just add.f64), but M1 seems to use round-to-plus-infinity and M2 another mode.
If I force M2 to use one of the 4 rounding modes (__dadd_XX()) for this instruction, cumulative_value is always different from M1's, even before step 10.
But if I force M1 and M2 to use the same rounding mode, the results are the same, but not equal to M1's results before the modification.
My aim is to get the M1 (CUDA 5.0) results on the M2 machine (CUDA 7.5), but I don't understand the default rounding mode behavior at runtime. I am wondering if the rounding mode is dynamic at runtime when not specified. Do you have an idea?
After another PTX analysis, in my case valueY is computed with an FMA instruction by CUDA 5.0, while the CUDA 7.5 compiler uses separate MUL and ADD instructions. The CUDA documentation explains that there is only one rounding step when a single FMA instruction is used, while there are two rounding steps with MUL and ADD. Thank you very much for helping me :)
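To make the difference concrete, here is a sketch of the two code-generation choices using the documented double-precision intrinsics (the function names and operands are mine; computeValueY stands in for whatever the kernel actually does):

// One rounding step: multiply and add fused (what the CUDA 5.0 build apparently generated)
__device__ double valueY_fused(double x, double a, double b)
{
    return __fma_rn(x, a, b);               // fused multiply-add, rounded once (round-to-nearest)
}

// Two rounding steps: multiply rounded, then add rounded (the behavior observed with CUDA 7.5)
__device__ double valueY_separate(double x, double a, double b)
{
    return __dadd_rn(__dmul_rn(x, a), b);
}

Compiling with nvcc -fmad=false stops the compiler from contracting multiply/add pairs into FMAs, which is one way to make both toolchains produce the twice-rounded result; to force the once-rounded (fused) result instead, the fma would have to be written explicitly in the source, e.g. with __fma_rn or fma().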