I wonder how I can generate high load on a GPU, step by step.
What I'm trying to write is a program that puts maximum load on one multiprocessor (MP), then on another, and so on until it has loaded the total number of MPs.
It would be similar to running a "while true" loop on every single core of a CPU, but I'm not sure whether the same paradigm works on a GPU with CUDA.
Can you help me?
If you want to do a stress test or power consumption test, you'll need to pick the workload. The highest power consumption with compute-only code will most likely come from a synthetic benchmark that feeds the GPU the optimal mix and sequence of operations. Otherwise, BLAS level 3 is probably quite close to optimal.
Putting load only on a certain number of multi-processors will require that you tweak the workload to limit the block-level parallelism.
Briefly, this is what I'd do:
Pick a code that is well-optimized and known to utilize the GPU to a great extent (high IPC, high power consumption, etc.). Have a look around on the CUDA developer forums; you should be able to find hand-tuned BLAS code or something similar.
Change the code to force it to run on a given number of multi-processors. This will require that you tune the number of blocks and threads to produce exactly the right amount of load for the number of processors you want to utilize.
Profile: the profiler counters can show you the number of instructions per multiprocessor, which lets you check that you are indeed running only on the desired number of processors, as well as other counters that indicate how efficiently the code is running.
Measure. If you have a Tesla or Quadro you get power consumption readings out of the box. Otherwise, try the nvml fix. Without a power measurement it will be hard to know how far you are from the TDP and especially whether the GPU is throttling.
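To illustrate the tuning step in a very rough way, here is a minimal sketch (my own, not hand-tuned BLAS code) of a compute-heavy kernel whose grid size loosely controls how many multiprocessors are kept busy. It assumes the block scheduler places each launched block on a different SM, which typically holds while the block count does not exceed the SM count but is not guaranteed; the names and iteration count are arbitrary:

#include <cstdio>

// Dependent FMA chain: keeps the ALUs of whichever SMs host these blocks busy.
__global__ void burn(float *out, long long iters)
{
    float a = threadIdx.x * 1e-6f;
    const float b = 1.000001f, c = 0.999999f;
    for (long long i = 0; i < iters; ++i) {
        a = a * b + c;
        a = a * c + b;
    }
    if (a == 0.0f) out[blockIdx.x] = a;   // practically never true; prevents dead-code elimination
}

int main()
{
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    int targetSMs = 2;                    // how many multiprocessors to load
    printf("Device has %d SMs, loading %d of them\n", smCount, targetSMs);

    float *out;
    cudaMalloc(&out, smCount * sizeof(float));
    burn<<<targetSMs, 1024>>>(out, 1LL << 22);   // one 1024-thread block per targeted SM
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}

This is only a spin-loop style load; for maximum power draw you would replace the FMA chain with the kind of tuned BLAS-like code described above.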
Some of my benchmarks carry out the same calculations via CUDA, OpenMP and programmed multithreading. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds/subtracts and multiplies on each data element. A range of data sizes is also used. [Free] benchmarks, source codes and results for Linux are available via my page:
http://www.roylongbottom.org.uk/linux%20benchmarks.htm
I also provide Windows versions. Following are some CUDA results, showing a maximum speed of 412 GFLOPS using a GeForce GTX 650. On the quad-core/8-thread Core i7, OpenMP produced up to 91 GFLOPS, and multithreading produced up to 93 GFLOPS using SSE instructions and 178 GFLOPS with AVX 1. See also the section on Burn-In and Reliability Apps, where the most demanding CUDA test is run for a period to show temperature gains, at the same time as CPU stress tests.
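As a rough sketch of the kind of kernel involved (my own reconstruction, not the benchmark's actual source), the "Calculate only" test with 8 operations per word could look something like this grid-stride loop:

// Applies x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f to every
// element: 5 adds/subtracts + 3 multiplies = 8 FLOPs per word per pass.
__global__ void ops8(float *x, int n,
                     float a, float b, float c, float d, float e, float f)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float v = x[i];
        x[i] = (v + a) * b - (v + c) * d + (v + e) * f;
    }
}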
Core i7 4820K 3.9 GHz Turbo Boost GeForce GTX 650
Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Tue Dec 30 22:50:52 2014
CUDA devices found
Device 0: GeForce GTX 650 with 2 Processors 16 cores
Global Memory 999 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test             4-Byte Words   Ops/Wd   Repeat Passes   Seconds   MFLOPS   First Results    All Same
Data in & out 100000 2 2500 0.837552 597 0.9295383095741 Yes
Data out only 100000 2 2500 0.389646 1283 0.9295383095741 Yes
Calculate only 100000 2 2500 0.085709 5834 0.9295383095741 Yes
Data in & out 1000000 2 250 0.441478 1133 0.9925497770309 Yes
Data out only 1000000 2 250 0.229017 2183 0.9925497770309 Yes
Calculate only 1000000 2 250 0.051727 9666 0.9925497770309 Yes
Data in & out 10000000 2 25 0.369060 1355 0.9992496371269 Yes
Data out only 10000000 2 25 0.201172 2485 0.9992496371269 Yes
Calculate only 10000000 2 25 0.048027 10411 0.9992496371269 Yes
Data in & out 100000 8 2500 0.708377 2823 0.9571172595024 Yes
Data out only 100000 8 2500 0.388206 5152 0.9571172595024 Yes
Calculate only 100000 8 2500 0.092254 21679 0.9571172595024 Yes
Data in & out 1000000 8 250 0.478644 4178 0.9955183267593 Yes
Data out only 1000000 8 250 0.231182 8651 0.9955183267593 Yes
Calculate only 1000000 8 250 0.053854 37138 0.9955183267593 Yes
Data in & out 10000000 8 25 0.370669 5396 0.9995489120483 Yes
Data out only 10000000 8 25 0.202392 9882 0.9995489120483 Yes
Calculate only 10000000 8 25 0.049263 40599 0.9995489120483 Yes
Data in & out 100000 32 2500 0.725027 11034 0.8902152180672 Yes
Data out only 100000 32 2500 0.407579 19628 0.8902152180672 Yes
Calculate only 100000 32 2500 0.113188 70679 0.8902152180672 Yes
Data in & out 1000000 32 250 0.497855 16069 0.9880878329277 Yes
Data out only 1000000 32 250 0.261461 30597 0.9880878329277 Yes
Calculate only 1000000 32 250 0.060132 133042 0.9880878329277 Yes
Data in & out 10000000 32 25 0.375882 21283 0.9987964630127 Yes
Data out only 10000000 32 25 0.207640 38528 0.9987964630127 Yes
Calculate only 10000000 32 25 0.054718 146204 0.9987964630127 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.018107 27613 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.007775 64308 0.9992496371269 Yes
Calculate 10000000 8 25 0.025103 79671 0.9995489120483 Yes
Shared Memory 10000000 8 25 0.008724 229241 0.9995489120483 Yes
Calculate 10000000 32 25 0.036397 219797 0.9987964630127 Yes
Shared Memory 10000000 32 25 0.019414 412070 0.9987964630127 Yes
In GCP Cloud SQL for MySQL, the Network throughput (MB/s) field shows
250 of 2000
Is 250 the average value and 2000 the maximum value attainable?
If you click on the question mark (to the left of your red square) it will lead you to this doc. You will see that the limiting factor is 16 Gbps.
Converting units, 16 Gbps / 8 bits per byte = 2 GB/s, i.e. 2000 MB/s.
If you change your machine type to high-mem with 8 or 16 vCPUs, you will see the cap at 2000 MB/s. So I suspect 250 MB/s is the value allocated to the machine type you chose, the 1-vCPU one.
I'm in the process of writing some N-body simulation code with short-ranged interactions in CUDA, targeted toward Volta and Turing series cards. I plan on using shared memory, but it's not quite clear to me how to avoid bank conflicts when doing so. Since my interactions are local, I was planning on sorting my particle data into local groups that I can send to each SM's shared memory (not yet worrying about particles that have a neighbor being worked on from another SM). In order to get good performance (avoid bank conflicts), is it sufficient only that each thread reads/writes from/to a different address of shared memory, but each thread may access that memory non-sequentially without penalty?
All of the information I see seems to only mention that memory be coalesced for the copy from global memory to shared memory, but I don't see anything about whether threads in a warp (or the whole SM) care about coalescence within shared memory.
In order to get good performance (avoid bank conflicts), is it sufficient only that each thread reads/writes from/to a different address of shared memory, but each thread may access that memory non-sequentially without penalty?
Bank conflicts are only possible between threads in a single warp that are performing a shared memory access, and then only possible on a per-(issued-)instruction basis. The instructions I am talking about here are SASS (GPU assembly code) instructions, but they should nevertheless be directly identifiable from shared memory references in CUDA C++ source code.
There is no such thing as bank conflicts:
between threads in different warps
between shared memory accesses arising from different (issued) instructions
A given thread may access shared memory in any pattern, with no concern about or possibility of bank conflicts due to its own activity. Bank conflicts only arise as a result of 2 or more threads in a single warp, for a particular shared memory instruction or access issued warp-wide.
Furthermore, it is not sufficient that each thread reads/writes from/to a different address. For a given issued instruction (i.e. a given access), roughly speaking, each thread in the warp must read from a different bank, or else it must read from an address that is the same as another address in the warp (broadcast).
Let's assume that we are referring to 32-bit banks, and an arrangement of 32 banks.
Shared memory can readily be imagined as a 2D arrangement:
Addr Bank
v 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 96 97 98 ...
We see that addresses/indexes/offsets/locations 0, 32, 64, 96 etc. are in the same bank. Addresses 1, 33, 65, 97, etc. are in the same bank, and so on, for each of the 32 banks. Banks are like columns of locations when the addresses of shared memory are visualized in this 2D arrangement.
The requirement for non-bank-conflicted access for a given instruction (load or store) issued to a warp is:
no 2 threads in the warp may access locations in the same bank/column.
a special case exists if the locations in the same column are actually the same location. This invokes the broadcast rule and does not lead to bank conflicts.
And to repeat some statements above in a slightly different way:
If I have a loop in CUDA code, there is no possibility for bank conflicts to arise between separate iterations of that loop.
If I have two separate lines of CUDA C++ code, there is no possibility for bank conflicts to arise between those two separate lines of CUDA C++ code.
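To make the bank/column picture concrete, here is a minimal sketch (my own illustration, not part of the answer above) contrasting a conflict-free and a conflicted access pattern, assuming 32 banks of 32-bit words:

// Launch as bank_demo<<<1, 32>>>(out) with out holding at least 32 floats.
__global__ void bank_demo(float *out)
{
    __shared__ float tile[32][32];    // row stride 32 words: column accesses conflict
    __shared__ float padded[32][33];  // +1 word of padding per row removes the conflicts

    int t = threadIdx.x;              // 0..31 within the warp

    // Row access: thread t touches tile[0][t], i.e. 32 consecutive words in
    // 32 different banks -- no conflict.
    tile[0][t] = (float)t;
    padded[0][t] = (float)t;
    __syncthreads();

    // Column access: thread t touches tile[t][0]. All 32 addresses are
    // multiples of 32 words, so they all fall in bank 0: a 32-way conflict
    // for this single issued instruction.
    float a = tile[t][0];

    // With the padded row stride of 33 words, padded[t][0] sits at word
    // offset 33*t, which maps to bank (33*t) % 32 = t -- no conflict.
    float b = padded[t][0];

    out[t] = a + b;
}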
Q1: Why is the information I get from NVIDIA Control Panel -> System Information different from the information given by the deviceQuery example in the CUDA SDK?
system information:
cuda cores 384 cores
memory data rate 1800MHz
device query output:
cuda cores = 3 MP x 192 SP/MP = 576 cuda cores
memory clock rate 900MHz
Q2: How can I calculate the GFLOPS of my GPU using the deviceQuery data?
The most commonly used formula I found was the one mentioned here, which suggests using the number of mul-add units and the number of mul units, which I don't know:
Max GFLOPS = Cores x SIMDs x ([mul-add] x 2 + [mul] x 1) x clock speed
Q1: It tells you right there, just above that line:
MapSMtoCores for SM 5.0 is undefined. Default to use 192 Cores/SM
Maxwell, the architecture behind the GeForce 840M, uses 128 "cores" per "SMM", and the 840M has 3 SMMs:
3 * 128 = 384
Q2: "Cores" * frequency * 2 (because each core can do a multiply+add)
I'm having difficulty understanding the sm_cta_launched counter in the CUDA profiler. I'm launching 128 blocks and my launch bounds are __launch_bounds__(192, 8), but the profiler shows 133 for a particular run. I profiled the app several times and it is around 133 every time. What does this counter indicate? Using a Tesla C2075, Linux 32-bit.
The NVIDIA GPUs have Performance Monitor Units in multiple locations of the chip. On Fermi devices the sm_cta_launched signal is collected by the GPC monitor not the SM monitor. The Fermi GPC Performance Monitor is limited to observing 1 SM per GPC. The C2075 has 4 GPCs and 14 SMs. A C2075 could have a configuration of 2 GPCs with 4 SMs and 2 GPCs with 3 SMs. The CUDA profiler will collect the counter for 1 SM per GPC and multiply the result by the number of SMs in the GPC. The final value can be higher or lower than the expected value. For example:
GPC  SMs  Counter  Value (= Counter x SMs)
0 4 8 32
1 4 8 32
2 3 11 33
3 3 12 36
---------------------------
133
In the document Compute Command Line Profiler this information is specified under the countermodeaggregate option.
countermodeaggregate
If this option is selected then aggregate counter values will be
output. For a SM counter the counter value is the sum of the counter
values from all SMs. For l1*, tex*, sm_cta_launched,
uncached_global_load_transaction and global_store_transaction counters
the counter value is collected for 1 SM from each GPC and it is
extrapolated for all SMs. This option is supported only for CUDA
devices with compute capability 2.0 or higher.
A more accurate value can be obtained from warps_launched, which is collected per SM, using the formula:
warps_per_block = (threadblocksizeX * threadblocksizeY * threadblocksizeZ
                   + WARP_SIZE - 1) / WARP_SIZE
thread_blocks_launched = warps_launched / warps_per_block
where WARP_SIZE is 32 on all current devices.
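For example, using the numbers from the question above: with 192-thread blocks, warps_per_block = (192 + 31) / 32 = 6, so the 128 launched blocks account for 128 x 6 = 768 warps, and thread_blocks_launched = 768 / 6 = 128.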
NOTE: This approach will not be correct for Dynamic Parallelism.
Some CUDA library functions are also implemented using kernels internally, so it is not surprising the total number of blocks executed is slightly higher than what you explicitly launched yourself.
I have been reading the programming guide for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. Can anybody help me understand it? I have no preference if the help is in the context of CUDA/OpenCL or just bank conflicts in general in computer science.
For NVIDIA (and AMD, for that matter) GPUs the local memory is divided into memory banks. Each bank can only address one dataset at a time, so if a half-warp tries to load/store data from/to the same bank, the accesses have to be serialized (this is a bank conflict). For GT200 GPUs there are 16 banks (32 banks for Fermi); AMD GPUs have 16 or 32 banks (57xx or higher: 32, everything below: 16). The banks are interleaved with a granularity of 32 bits (so bytes 0-3 are in bank 1, 4-7 in bank 2, ..., 64-67 in bank 1 again, and so on). For a better visualization it basically looks like this:
Bank | 1 | 2 | 3 |...
Address | 0 1 2 3 | 4 5 6 7 | 8 9 10 11 |...
Address | 64 65 66 67 | 68 69 70 71 | 72 73 74 75 |...
...
So if each thread in a half-warp accesses successive 32-bit values there are no bank conflicts. An exception to this rule (every thread must access its own bank) is broadcast:
If all threads access the same address, the value is only read once and broadcast to all threads (for GT200 it has to be all threads in the half-warp accessing the same address; IIRC Fermi and AMD GPUs can do this for any number of threads accessing the same value).
The shared memory that can be accessed in parallel is divided into modules (also called banks). If two requested memory locations (addresses) fall in the same bank, you get a bank conflict and the accesses are serialized, losing the advantage of parallel access.
In simple words, a bank conflict is a case where a memory access pattern fails to distribute I/O across the banks available in the memory system. The following example elaborates the concept:
Let us suppose we have a two-dimensional 512x512 array of integers and our DRAM or memory system has 512 banks in it. By default the array data will be laid out so that arr[0][0] goes to bank 0, arr[0][1] goes to bank 1, arr[0][2] to bank 2, ..., and arr[0][511] goes to bank 511. To generalize, arr[x][y] occupies bank number y. Now if some code (as shown below) starts accessing data in column-major fashion, i.e. changing x while keeping y constant, then the end result will be that all consecutive memory accesses hit the same bank--hence a bank conflict.
int arr[512][512];
for (int j = 0; j < 512; j++)          // outer loop over columns
    for (int i = 0; i < 512; i++)      // inner loop over rows
        arr[i][j] = 2 * arr[i][j];     // column-major access: every access hits bank j
Such problems are usually avoided by compilers by buffering the array or by using a prime number of elements in the array.
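As a sketch of one common fix (my own addition, not part of the answer above), interchanging the loops makes consecutive accesses walk across the banks instead of hammering a single one:

int arr[512][512];
for (int i = 0; i < 512; i++)          // outer loop over rows
    for (int j = 0; j < 512; j++)      // inner loop over columns
        arr[i][j] = 2 * arr[i][j];     // row-major access: consecutive accesses hit banks j, j+1, ...

Padding the array (e.g. 512x513) achieves a similar effect by breaking the regular mapping between columns and banks.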
(CUDA Bank Conflict)
I hope this will help.
This is a very good explanation ...
http://www.youtube.com/watch?v=CZgM3DEBplE
http://en.wikipedia.org/wiki/Memory_bank
and
http://mprc.pku.cn/mentors/training/ISCAreading/1989/p380-weiss/p380-weiss.pdf
From these pages you can find the details about memory banks,
but it is a little different from what is said by @Grizzly.
On that page, the banks are laid out like this:
bank    | 1          | 2          | 3          |
address | 0, 3, 6... | 1, 4, 7... | 2, 5, 8... |
Hope this helps.