note "When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions".
but I have some questions.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    i = 3 * i;
    b[i] = a[i] + a[i + 1] + a[i + 2];
}
Can the three accesses (a[i], a[i + 1], a[i + 2]) be executed with only one instruction? (I mean, is it a coalesced access?)
Or does coalescing only happen across the different threads of a warp, i.e., never within a single thread?
I have read a similar question:
From non coalesced access to coalesced memory access CUDA
but I still don't understand, so: is this non-coalesced memory access?
2.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    b[i] = a[i] + a[i + 10] + a[i + 12]; // assuming no out-of-bounds indexing
}
This looks like it may be non-coalesced access, so I changed the code to:
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ double shareM[3 * BLOCK_SIZE];
    // each thread uses its own three slots in shared memory
    shareM[3 * threadIdx.x]     = a[i];
    shareM[3 * threadIdx.x + 1] = a[i + 10];
    shareM[3 * threadIdx.x + 2] = a[i + 12];
    b[i] = shareM[3 * threadIdx.x] + shareM[3 * threadIdx.x + 1] + shareM[3 * threadIdx.x + 2];
}
I have seen it said that "coalesced access does not matter with shared memory",
but does that mean the way below is coalesced access within one thread?
shareM[3 * threadIdx.x]     = a[i];
shareM[3 * threadIdx.x + 1] = a[i + 10];
shareM[3 * threadIdx.x + 2] = a[i + 12];
Or does the shared memory coalesced access only exist across different threads, like the following example?
thread0:
shareM[0] = a[3]
thread1:
shareM[4] = a[23]
thread2:
shareM[7] = a[56]
3. I don't understand the statement "coalesced access does not matter with shared memory".
Does it mean that loading data from global memory into local (or register) memory is slower than loading it from global memory into shared memory?
If so, why don't we use shared memory as a staging area (just one 8-byte shared memory slot per thread would be enough)?
Thank you.
Can the three accesses (a[i], a[i + 1], a[i + 2]) be executed with only one instruction? (I mean, is it a coalesced access?)
When working with GPU kernels, it's better to think about everything in a parallel way. Every instruction is executed by a group of 32 threads, a.k.a. a warp, so there are actually not just three accesses (the word "access" is a bit vague here; I assume you mean array accesses) but 32 x 3 = 96 accesses in total. A more precise way to say it is that there are three array accesses per thread.
According to [1-3], coalesced access is a behavior defined at the level of a warp:
When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
So we need to consider each of these three array accesses separately. Let's rewrite the code as:
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    i = 3 * i;
    double ai  = a[i];     // <1>
    double ai1 = a[i + 1]; // <2>
    double ai2 = a[i + 2]; // <3>
    b[i] = ai + ai1 + ai2;
}
It is sufficient to consider only the first warp, with thread IDs ranging from 0 to 31.
<1>: Each thread in the warp allocates a double variable called ai in its registers and wants to read a value from a at index i. Note that the original i is in [0, 31] and is then multiplied by 3, so the warp accesses a[0], a[3], ..., a[93]. Since a is a double array (i.e. every entry is 8 bytes), the warp needs 32 * 8 = 256 bytes in total, which is two 128-byte segments that could be served by two 128-byte memory transactions. According to [4]:
If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently: Two memory requests, one for each half-warp, if the size is 8 bytes, Four memory requests, one for each quarter-warp, if the size is 16 bytes.
the minimum number of memory requests needed to load these 256 bytes from global memory into registers is 2. If a could be accessed that way, the access pattern would be coalesced. But apparently the pattern used in <1> is not; it looks like the diagram below:
<1>
t0 + t31
+---+---+---+-------------+----------------------+
| | | | ...... |
v v v v v
+---+-------+----+--------+-------+--------+-----+--+-
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
The 32 threads in the warp access memory scattered across six 128-byte segments. In cached mode, at least six 128-byte memory transactions are needed. That's 768 bytes transferred in total, but only 256 bytes are useful, so the bus utilization is about 1/3.
<2>: This is very similar to <1>, but with an offset of 1 from the start:
<2>
t0 + t31
+---+---+---+-------------+----------------------+
| | | | ...... |
v v v v v
++---+---+---+---+--------+-------+--------+------+-+-
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
<3>: This is very similar to <1>, but with an offset of 2 from the start:
<3>
t0 + t31
+---+---+---+-------------+----------------------+
| | | | ...... |
v v v v v
+-+---+---+---+--+--------+-------+--------+-------++-
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
I think you get the idea by now and are probably thinking: why not load these 768 bytes from global memory in one pass, since every one of them is used exactly once? However, recall that each thread has its own private registers and these registers cannot communicate with each other ([5]), so this cannot be done with registers alone; that is where shared memory comes in.
(warp1) (warp2) (warp3)
+ + +
| | |
t0 | t31 | t0 | t31
+-+-+-+---+-+-+-+---------+---------+-+-+-+++-+-+-+-+
| | | | | | | | | ...... | | | | | | | | |
v v v v v v v v v v v v v v v v v v
+-+-+-+---+-+-+-++--------+-------+-+-+-+-+++-+-+-+---
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
Does it mean that loading data from global memory into local (or register) memory is slower than loading it from global memory into shared memory? If so, why don't we use shared memory as a staging area (just one 8-byte shared memory slot per thread would be enough)?
AFAICT, you cannot directly transfer data from global memory to shared memory; the loads still go through each thread's registers on the way there.
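To make that concrete, here is a minimal sketch of the staging idea from the last diagram (my own illustration, not code from the question; the kernel name, BLOCK_SIZE, and the assumption that the grid exactly covers the data are all mine): each block first loads its 3 * BLOCK_SIZE consecutive doubles of a into shared memory with coalesced accesses, and only then does each thread gather its three values.
#define BLOCK_SIZE 128  // assumed block size, a multiple of the warp size

__global__ void add_staged(const double *a, double *b)
{
    // Each thread produces one output element b[i] = a[3i] + a[3i+1] + a[3i+2],
    // so each block consumes 3 * BLOCK_SIZE consecutive doubles of a.
    __shared__ double tile[3 * BLOCK_SIZE];

    int i    = blockDim.x * blockIdx.x + threadIdx.x;
    int base = 3 * blockDim.x * blockIdx.x;   // first element this block needs

    // Coalesced loads: in each of the three passes, consecutive threads read
    // consecutive addresses of a, so a warp touches one or two 128-byte segments.
    for (int k = threadIdx.x; k < 3 * blockDim.x; k += blockDim.x)
        tile[k] = a[base + k];

    __syncthreads();  // make the staged data visible to all threads in the block

    // Per-thread gather from shared memory (bank conflicts are a separate concern there).
    int t = 3 * threadIdx.x;
    b[i] = tile[t] + tile[t + 1] + tile[t + 2];
}
With this layout, each of the three load passes is coalesced exactly as in the final diagram above, while the strided per-thread gather happens in shared memory, where the global-memory coalescing rules no longer apply.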
References:
[1]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-memory-throughput
[2]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
[3]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0__examples-of-global-memory-accesses
[4]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0
[5]. I lied; there is a way to do this by using the __shfl intrinsics.
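As a side note on footnote [5]: a minimal sketch of what the warp shuffle intrinsics allow (my own example, not part of the original answer) is reading another lane's register value directly, without going through shared memory:
__global__ void shuffle_demo(const double *a, double *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // assume the grid covers the array
    double v = a[idx];

    // Every lane reads the register value held by lane 0 of its warp.
    double from_lane0 = __shfl_sync(0xffffffffu, v, 0);

    // Every lane reads the value held by the next lane; lanes at the end of
    // the warp simply get their own value back.
    double from_next = __shfl_down_sync(0xffffffffu, v, 1);

    out[idx] = from_lane0 + from_next;
}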
Related
I've read as a beginner that using a 2D block of threads is the simplest way to deal with a 2D dataset. I am trying to implement the following matrix operations in sequence:
Swap elements at odd and even positions of each row in the matrix
1 2            2 1
3 4  becomes   4 3
Reflect the elements of the matrix across the principal diagonal
2 1            2 4
4 3  becomes   1 3
To implement this, I wrote the following kernel:
__global__ void swap_and_reflect(float *d_input, float *d_output, int M, int N)
{
    int j = threadIdx.x;
    int i = threadIdx.y;

    for(int t=0; t<M*N; t++)
        d_output[t] = d_input[t];

    float temp = 0.0;
    if (j%2 == 0){
        temp = d_output[j];
        d_output[j] = d_output[j+1];
        d_output[j+1] = temp;
    }
    __syncthreads(); // Wait for swap to complete

    if (i!=j){
        temp = d_output[i];
        d_output[i] = d_output[j];
        d_output[j] = temp;
    }
}
The reflection does not happen as expected. At this point I find myself confused about how the 2D structure of the executing threads relates to the 2D structure of the matrix itself.
Could you please correct my understanding of the multi-dimensional arrangement of threads and how it correlates to the dimensionality of the data itself? I believe this is the reason why I have the reflection part of it incorrect.
Any pointers/resources that could help me visualize/understand this correctly would be of immense help.
Thank you for reading.
The thread indices in your hypothetical 2x2 block are laid out in (x,y) pairs as
(0,0) (0,1)
(1,0) (1,1)
and the ordering is
thread ID (x,y) pair
--------- ----------
0 (0,0)
1 (1,0)
2 (0,1)
3 (1,1)
You need to choose an ordering for your array in memory and then modify your kernel accordingly, for example:
if (i!=j){
    temp = d_output[i+2*j];
    d_output[i+2*j] = d_output[j+2*i];
    d_output[j+2*i] = temp;
}
let column_mean_partial = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
if threadIdx.x = 0 then
    means.[col] <- column_mean_partial
    column_mean_shared := column_mean_partial
__syncthreads()
let column_mean = !column_mean_shared
The above code snippet calculates the mean of every column of a 2D matrix. As only thread 0 of a block has the full value, I store it in shared memory (column_mean_shared), call __syncthreads(), and then broadcast it to all the threads in the block, since they need that value in order to calculate the variance.
Would there be a better way to broadcast the value or is the above efficient enough already?
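In plain CUDA C++, the broadcast pattern described above corresponds roughly to the following sketch (an illustration of the technique, not the asker's actual code; it assumes one block per column, a column-major layout, and CUB's BlockReduce as the block-wide reduction):
#include <cub/cub.cuh>

template <int BLOCK_SIZE>
__global__ void column_mean_kernel(const float *x, float *means, int num_rows, int num_cols)
{
    typedef cub::BlockReduce<float, BLOCK_SIZE> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    __shared__ float column_mean_shared;   // slot used to broadcast the result

    for (int col = blockIdx.x; col < num_cols; col += gridDim.x) {
        float acc = 0.0f;
        for (int row = threadIdx.x; row < num_rows; row += blockDim.x)
            acc += x[row + col * num_rows];          // column-major layout

        // Only thread 0 receives the valid reduction result from CUB.
        float partial = BlockReduce(temp_storage).Sum(acc) / num_rows;
        if (threadIdx.x == 0) {
            means[col] = partial;
            column_mean_shared = partial;            // stage it for the other threads
        }
        __syncthreads();                             // broadcast point
        float column_mean = column_mean_shared;      // every thread reads the mean
        // ... a variance pass would use column_mean here ...
        __syncthreads();   // needed before the shared slot and temp_storage are reused
        (void)column_mean; // silence the unused-variable warning in this trimmed sketch
    }
}
The two __syncthreads() calls are what make the broadcast safe: the first publishes the value to the block, the second prevents the shared slot and the reduction's temporary storage from being reused before everyone has read them.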
I hadn't expected much when I posted this question, but it turns out that for large matrices such as 128x10000, for example, there is a much better way. I wrote a warpReduce kernel with a block size of 32, which allows it to do the whole reduction using shuffle xor.
For a 128x100000 matrix over 100 iterations, the first version that used 64 blocks per grid (and 32 threads per block) took 0.5s. For the CUB row reduce it took 0.25s.
When I increased the blocks per grid to 256, I got a nearly 4x speedup to about 1.5s. At 384 blocks per grid it takes 1.1s, and increasing the number of blocks further does not seem to improve performance.
For the problem sizes that I am interested in, the improvements are not nearly as dramatic.
For the 128x1024 and 128x512 cases, 10000 iterations:
For 1024: 1s vs 0.82s in favor of warpReduce.
For 512: 0.914s vs 0.873s in favor of warpReduce.
For small matrices, any speedups from parallelism are eaten away by kernel launch times it seems.
For 256: 0.94s vs 0.78s in favor of warpReduce.
For 160: 0.88s vs 0.77s in favor of warpReduce.
It was tested using a GTX 970.
It is likely that the figures would have been different for Kepler and earlier Nvidia cards, as Maxwell raised the limit on resident blocks per SM from 16 to 32, which improves multiprocessor occupancy.
I am satisfied with this, as the performance improvement is nice, and I actually had not been able to write a kernel that uses shared memory correctly before reaching for the CUB block reduce. I forgot how painful CUDA can be sometimes. It is amazing that the version that uses no shared memory is so competitive.
Here are the two modules that I tested with:
type rowModule(target) =
    inherit GPUModule(target)

    let grid_size = 64
    let block_size = 128
    let blockReducer = BlockReduce.RakingCommutativeOnly<float32>(dim3(block_size,1,1), worker.Device.Arch)

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // Point col to where the column starts in the array.
        let mutable col = blockIdx.x
        let temp_storage = blockReducer.TempStorage.AllocateShared()
        let column_mean_shared = __shared__.Variable()
        while col < num_cols do
            // row is the row index
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index
                row <- row + blockDim.x
            let column_mean_partial = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then
                means.[col] <- column_mean_partial
                column_mean_shared := column_mean_partial
            __syncthreads()
            let column_mean = !column_mean_shared
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                // Accumulate the variances.
                acc <- acc + (x.[idx]-column_mean)*(x.[idx]-column_mean)
                // Increment the row index
                row <- row + blockDim.x
            let variance_sum = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then stds.[col] <- sqrt(variance_sum)
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
type rowWarpModule(target) =
    inherit GPUModule(target)

    let grid_size = 384
    let block_size = 32

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // Point col to where the column starts in the array.
        let mutable col = blockIdx.x
        while col < num_cols do
            // row is the row index
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index
                row <- row + blockDim.x
            let inline butterflyWarpReduce (value:float32) =
                let v1 = value + __shfl_xor value 16 32
                let v2 = v1 + __shfl_xor v1 8 32
                let v3 = v2 + __shfl_xor v2 4 32
                let v4 = v3 + __shfl_xor v3 2 32
                v4 + __shfl_xor v4 1 32
            let column_mean = (butterflyWarpReduce acc) / (float32 num_rows)
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                // Accumulate the variances.
                acc <- acc + (x.[idx]-column_mean)*(x.[idx]-column_mean)
                // Increment the row index
                row <- row + blockDim.x
            let variance_sum = (butterflyWarpReduce acc) / (float32 num_rows)
            if threadIdx.x = 0 then
                stds.[col] <- sqrt(variance_sum)
                means.[col] <- column_mean
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
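For readers more comfortable with CUDA C++ than Alea/F#, the butterfly reduction in rowWarpModule corresponds roughly to this sketch (using the modern *_sync shuffle intrinsics, which the original Alea code predates):
// XOR-butterfly reduction across one 32-thread warp: after the five steps,
// every lane holds the sum of all 32 input values.
__device__ float butterflyWarpReduce(float value)
{
    const unsigned full_mask = 0xffffffffu;
    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_xor_sync(full_mask, value, offset, 32);
    return value;
}

// Usage inside a 32-thread block (one warp), mirroring the F# kernel:
//   float sum  = butterflyWarpReduce(acc);
//   float mean = sum / (float)num_rows;   // every lane gets the same mean
Because every lane ends up with the full sum, no shared memory or __syncthreads() is needed, which is exactly why the 32-thread-per-block version above can avoid shared memory entirely.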
I am very new to parallel programming and Stack Overflow. I am working on a matrix multiplication implementation using CUDA. I am using column-order float arrays as matrix representations.
The algorithm I developed is a bit unique and goes as follows. Given an n x m matrix A and an m x k matrix B, I launch an n x k grid of blocks with m threads in each block. Essentially, I launch a block for every entry in the resulting matrix, with each thread computing one multiplication for that entry. For example,
1 0 0     0 1 2
0 1 0  *  3 4 5
0 0 1     6 7 8
For the first entry in the resulting matrix I would launch each thread with
thread 0 computing 1 * 3
thread 1 computing 0 * 0
thread 2 computing 0 * 1
With each thread adding to a 0-initialized matrix.
Right now, I am not getting a correct answer. I am getting this over and over again
0 0 2
0 0 5
0 0 8
My kernel function is below. Could this be a thread synchronization problem or am I screwing up array indexing or something?
/* #param d_A: Column order matrix
 * #param d_B: Column order matrix
 * #param d_result: 0-initialized matrix that kernels write to
 * #param dim_A: dimensionality of A (number of rows)
 * #param dim_B: dimensionality of B (number of rows)
 */
__global__ void dot(float *d_A, float *d_B, float *d_result, int dim_A, int dim_B) {
    int n = blockIdx.x;
    int k = blockIdx.y;
    int m = threadIdx.x;

    float a = d_A[(m * dim_A) + n];
    float b = d_B[(k * dim_B) + m];
    //d_result[(k * dim_A) + n] += (a * b);

    __syncthreads();
    float temp = d_result[(k * dim_A) + n];
    __syncthreads();
    temp = temp + (a * b);
    __syncthreads();
    d_result[(k * dim_A) + n] = temp;
    __syncthreads();
}
The whole idea of using __syncthreads() is wrong in this case. This API call has block scope.
1. __syncthreads();
2. float temp = d_result[(k*dim_A) + n];
3. __syncthreads();
4. temp = temp + (a * b);
5. __syncthreads();
6. d_result[(k*dim_A) + n] = temp;
7. __syncthreads();
The local variable float temp has thread scope, so putting this synchronization barrier around it is pointless.
The pointer d_result is a global memory pointer, and using this synchronization barrier for it is also pointless. Note that there is no barrier available (and maybe there never will be) that synchronizes threads across the entire grid.
Typically __syncthreads() is needed when shared memory is used for the computation, and in this case you may well want to use shared memory. Here you can see an example of how to use shared memory and __syncthreads() properly, and here you have an example of matrix multiplication with shared memory.
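Since the answer points to shared-memory matrix multiplication, here is a compact sketch of that standard tiled pattern (my own illustration of the technique being referred to, not the exact linked example); it shows the kind of situation where __syncthreads() genuinely matters:
#define TILE 16  // tile width, an assumption for this sketch

// C = A * B for square N x N row-major matrices, N assumed to be a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread stages one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // the whole tile must be loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone must finish reading before the tile is overwritten
    }
    C[row * N + col] = acc;
}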
I'm trying to do matrix multiplication in CUDA. My implementation is different from the CUDA example.
The CUDA example (from the CUDA samples) performs matrix multiplication by multiplying each value in the row of the first matrix by each value in the column of the second matrix, then summing the products and storing the result in an output vector at the index of the row from the first matrix.
My implementation multiplies each value in the column of the first matrix by the single value of the row of the second matrix, where the row index = column index. It then has an output vector in global memory that has each of its indices updated.
The cuda example implementation can have a single thread update each index in the output vector, whereas my implementation can have multiple threads updating each index.
The results that I get show only some of the values. For example, if I had it do 4 iterations of updates, it would only do 2 or 1.
I think that the threads might be interfering with each other since they're all trying to write to the same indices of the vector in global memory. So maybe, while one thread is writing to an index, the other might not be able to insert its value and update the index?
Just wondering if this assessment makes sense.
For example. To multiply the following two matrices:
[3 0 0 2 [1 [a
3 0 0 2 x 2 = b
3 0 0 0 3 c
0 1 1 0] 4] d]
The CUDA sample does matrix multiplication in the following way, using 4 threads, where a, b, c, d are stored in global memory:
Thread 0: 3*1 + 0*2 + 0*3 + 2*4 = a
Thread 1: 3*1 + 0*2 + 0*3 + 2*4 = b
Thread 2: 3*1 + 0*2 + 0*3 + 0*4 = c
Thread 3: 0*1 + 1*2 + 1*3 + 0*4 = d
My implementation looks like this:
a = b = c = d = 0
Thread 0:
3*1 += a
3*1 += b
3*1 += c
0*1 += d
Thread 1:
0*2 += a
0*2 += b
0*2 += c
1*2 += d
Thread 2:
0*3 += a
0*3 += b
0*3 += c
1*3 += d
Thread 3:
2*4 += a
2*4 += b
0*4 += c
0*4 += d
So at one time all four threads could be trying to update one of the indices.
In order to fix this issue, I used atomicAdd to do the += operation. When a thread performs the operation 3*1 += a (for example), it does three things.
It gets the previous value of a
It updates the value by doing 3*1 + previous value of a
It then stores the new value into a
Using atomicAdd guarantees that these three steps happen without interruption from other threads. If atomicAdd is not used, thread 0 could read the previous value of a, and while thread 0 is updating it, thread 1 could read the same previous value and perform its own update; one of the two additions would then be lost.
If a += 3*1 is used instead of atomicAdd(&a, 3*1), it is possible for thread 1 to interfere and change the value before thread 0 finishes what it is doing. That creates a race condition.
atomicAdd is a += operation. You would use it like this (note that the target must live in global or shared memory, not in a thread-local variable):
__global__ void kernel(int *a){
    atomicAdd(a, 3*1); // has the same effect as *a += 3*1, but is safe when many threads do it
}
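Applied to the scheme described in the question, where each thread handles one column of the matrix times one element of the vector, a hedged sketch might look like this (the kernel name, the matrix-vector shape, and the launch configuration are my assumptions, not the asker's actual code):
// y = M * v for an n x n column-major matrix M: thread j adds M[:,j] * v[j]
// into the output vector. Many threads update the same y[i], hence atomicAdd.
__global__ void matvec_outer(const float *M, const float *v, float *y, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column handled by this thread
    if (j >= n) return;

    float vj = v[j];
    for (int i = 0; i < n; ++i)
        atomicAdd(&y[i], M[j * n + i] * vj);  // y must be zero-initialized beforehand
}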
Let A be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume the Tesla or Fermi architecture. I don't want to delve into the 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...
In math, this is thought of as a "balls in bins" problem: 32 balls are randomly dropped into 32 bins. You can enumerate the possible patterns and calculate their probabilities to determine the distribution. A naive approach will not work, though, as the number of patterns is huge: 63!/(32!·31!) is "almost" a quintillion.
It is possible to tackle though if you build up the solution recursively and use conditional probabilities.
Look for a paper called "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.
In the following, we start at leftmost bucket and calculate the probabilities for each number of balls that could have fallen into it. Then we move one to the right and determine the conditional probabilities of each number of balls that could be in that bucket given the number of balls and buckets already used.
Apologies for the VBA code, but VBA was all I had available when motivated to answer :).
Function nCr#(ByVal n#, ByVal r#)
    Static combin#()
    Static size#
    Dim i#, j#

    If n = r Then
        nCr = 1
        Exit Function
    End If

    If n > size Then
        ReDim combin(0 To n, 0 To n)
        combin(0, 0) = 1
        For i = 1 To n
            combin(i, 0) = 1
            For j = 1 To i
                combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
            Next
        Next
        size = n
    End If

    nCr = combin(n, r)
End Function

Function p_binom#(n#, r#, p#)
    p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function

Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
                              bucket#, total_buckets#, bucket_capacity#)
    If balls > bucket_capacity Then
        p_next_bucket_balls = 0
    Else
        p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
    End If
End Function

Function p_capped_buckets#(n#, cap#)
    Dim p_prior, p_update
    Dim bucket#, balls#, prior_balls#

    ReDim p_prior(0 To n)
    ReDim p_update(0 To n)
    p_prior(0) = 1

    For bucket = 1 To n
        For balls = 0 To n
            p_update(balls) = 0
            For prior_balls = 0 To balls
                p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
                    p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
            Next
        Next
        p_prior = p_update
    Next

    p_capped_buckets = p_update(n)
End Function

Function expected_max_buckets#(n#)
    Dim cap#

    For cap = 0 To n
        expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
    Next
End Function

Sub test32()
    Dim p_cumm#(0 To 32)
    Dim cap#

    For cap# = 0 To 32
        p_cumm(cap) = p_capped_buckets(32, cap)
    Next
    For cap = 1 To 32
        Debug.Print " ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
    Next
End Sub
For 32 balls and buckets, I get an expected maximum number of balls in the buckets of about 3.532941.
Output to compare to ahmad's:
1 0.000000
2 0.029273
3 0.516311
4 0.361736
5 0.079307
6 0.011800
7 0.001417
8 0.000143
9 0.000012
10 0.000001
11 0.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 0.000000
17 0.000000
18 0.000000
19 0.000000
20 0.000000
21 0.000000
22 0.000000
23 0.000000
24 0.000000
25 0.000000
26 0.000000
27 0.000000
28 0.000000
29 0.000000
30 0.000000
31 0.000000
32 0.000000
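For readers who would rather not read VBA, here is my own C++ sketch of the same bucket-by-bucket recursion (a translation of the code above as I understand it, not the original author's program):
// p_capped_buckets(n, cap) is the probability that no bucket receives more than
// `cap` of the n balls; E[max] = sum over cap of (1 - p_capped_buckets(n, cap)).
#include <cstdio>
#include <vector>
#include <cmath>

static double n_choose_r(int n, int r) {
    double result = 1.0;
    for (int k = 1; k <= r; ++k) result *= (double)(n - r + k) / k;
    return result;
}

static double p_binom(int n, int r, double p) {
    return n_choose_r(n, r) * std::pow(p, r) * std::pow(1.0 - p, n - r);
}

static double p_capped_buckets(int n, int cap) {
    std::vector<double> p_prior(n + 1, 0.0), p_update(n + 1, 0.0);
    p_prior[0] = 1.0;  // before the first bucket, zero balls have been placed
    for (int bucket = 1; bucket <= n; ++bucket) {
        for (int balls = 0; balls <= n; ++balls) {
            p_update[balls] = 0.0;
            for (int prior = 0; prior <= balls; ++prior) {
                int in_this_bucket = balls - prior;
                if (in_this_bucket > cap) continue;  // would violate the cap
                // Given `prior` balls already placed, the count falling into this
                // bucket is Binomial(n - prior, 1 / (remaining buckets)).
                p_update[balls] += p_prior[prior] *
                    p_binom(n - prior, in_this_bucket, 1.0 / (n - bucket + 1));
            }
        }
        p_prior = p_update;
    }
    return p_update[n];
}

int main() {
    const int n = 32;
    double expected_max = 0.0;
    for (int cap = 0; cap <= n; ++cap)
        expected_max += 1.0 - p_capped_buckets(n, cap);
    // Should come out near the 3.532941 quoted above.
    std::printf("Expected maximum bucket load for n=%d: %f\n", n, expected_max);
    return 0;
}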
I'll try a math answer, although I don't have it quite right yet.
You basically want to know, given random 32-bit word indexing within a warp into an aligned __shared__ array, "what is the expected value of the maximum number of addresses within a warp that map to a single bank?"
If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound on that number of O(log n / log log n) for hashing n items into n buckets. (The math is pretty hairy!).
For n = 32, that works out to about 2.788 (using natural log). That’s fine, but here I modified ahmad's program a bit to empirically calculate the expected maximum (also simplified the code and modified names and such for clarity and fixed some bugs).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>

#define NBANK    32
#define WARPSIZE 32
#define NSAMPLE  100000

int main(){
    int i=0, j=0;
    int *bank         = (int*)malloc(sizeof(int)*NBANK);
    int *randomNumber = (int*)malloc(sizeof(int)*WARPSIZE);
    int *maxCount     = (int*)malloc(sizeof(int)*(NBANK+1));
    memset(maxCount, 0, sizeof(int)*(NBANK+1));

    for (int i=0; i<NSAMPLE; ++i) {
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j] = rand()%NBANK;
        }

        // check the bank conflict
        memset(bank, 0, sizeof(int)*NBANK);
        int max_bank_conflict = 0;
        for(j=0; j<WARPSIZE; j++){
            bank[randomNumber[j]]++;
        }
        for(j=0; j<WARPSIZE; j++)
            max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);

        // store statistic
        maxCount[max_bank_conflict]++;
    }

    // report statistic
    printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
    float expected = 0;
    for(i=1; i<NBANK+1; i++) {
        float prob = maxCount[i]/(float)NSAMPLE;
        printf("%02d -> %6.4f\n", i, prob);
        expected += prob * i;
    }
    printf("Expected maximum bank conflict degree = %6.4f\n", expected);
    return 0;
}
Using the percentages found in the program as probabilities, the expected maximum value is the sum of products sum(i * probability(i)), for i from 1 to 32. I compute the expected value to be 3.529 (matches ahmad's data). It’s not super far off, but the 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess there’s a constant factor left out. But that's currently as far as I've gotten.
Open questions: Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32? It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.
This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is effectively random.
I assume Fermi's 32-bank shared memory, where each consecutive 4-byte word is stored in a consecutive bank. Using the following code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NBANK    32
#define N        7823
#define WARPSIZE 32
#define NSAMPLE  10000

int main(){
    srand( time(NULL) );
    int i=0, j=0;
    int *conflictCheck  = NULL;
    int *randomNumber   = NULL;
    int *statisticCheck = NULL;
    conflictCheck  = (int*)malloc(sizeof(int)*NBANK);
    randomNumber   = (int*)malloc(sizeof(int)*WARPSIZE);
    statisticCheck = (int*)malloc(sizeof(int)*(NBANK+1));
    memset(statisticCheck, 0, sizeof(int)*(NBANK+1)); // zero the counters

    while(i<NSAMPLE){
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j] = rand()%NBANK;
        }

        // check the bank conflict
        memset(conflictCheck, 0, sizeof(int)*NBANK);
        int max_bank_conflict = 0;
        for(j=0; j<WARPSIZE; j++){
            conflictCheck[randomNumber[j]]++;
            max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]] ? conflictCheck[randomNumber[j]] : max_bank_conflict;
        }

        // store statistic
        statisticCheck[max_bank_conflict]++;

        // next iter
        i++;
    }

    // report statistic
    printf("Over %d random shared memory accesses, the following percentages of bank conflicts were found\n", NSAMPLE);
    for(i=0; i<NBANK+1; i++){
        printf("%d -> %6.4f\n", i, statisticCheck[i]/(float)NSAMPLE);
    }
    return 0;
}
I got the following output:
Over 10000 random shared memory accesses, the following percentages of bank conflicts were found
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000
We can conclude that a 3- or 4-way conflict is most likely with random access. You can tune the run with different values of N (number of elements in the array), NBANK (number of banks in shared memory), WARPSIZE (warp size of the machine), and NSAMPLE (number of random shared memory accesses generated to evaluate the model).