I have a 1D array "A" which is composed of many arrays "a", like this:
I'm implementing code to sum up non-consecutive segments (that is, to sum the numbers in the same-colored segments of each array "a" in "A"), as follows:
Any ideas on how to do that efficiently with Thrust?
Thank you very much
Note: The pictures represent only one array "a". The big array "A" contains many arrays "a".
In the general case, where the ordering of the data and grouping by segments is not known in advance, the general suggestion is to use thrust::sort_by_key to group like segments together, and then use thrust::reduce_by_key to sum the segments. Examples are given here.
However, if the input data segments follow a known repeating pattern, such as is suggested here, we can eliminate the sorting step by using a thrust::permutation_iterator to "gather" the like segments together, as the input to thrust::reduce_by_key.
Using the example data in the question, the hard part of this is to create the permutation iterator. For that, and using the specific number of segment types (3), segment lengths (3) and number of segments per segment type (3) given in the question, we need a map "vector" (i.e. iterator) for our permutation iterator that has the following sequence:
0 1 2 9 10 11 18 19 20 3 4 5 12 13 14 21 22 23 ...
This sequence would then "map" or rearrange the input array, so that all like segments are grouped together. I'm sure there are various ways to create such a sequence, but the approach I chose is as follows. We will start with the standard counting iterator sequence, and then apply a transform functor to it (using make_transform_iterator), so that we create the above sequence. I chose to do it using the following method, arranged in a stepwise sequence showing the components that are added together:
counting iterator (_1):                        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 ...
---------------------------------------------------------------------------------------------------
((_1/seg_len)%seg_types)*(seg_len*seg_types):  0  0  0  9  9  9 18 18 18  0  0  0  9  9  9 18 18 18 ...
_1%seg_len:                                    0  1  2  0  1  2  0  1  2  0  1  2  0  1  2  0  1  2 ...
_1/(seg_len*seg_types)*seg_len:                0  0  0  0  0  0  0  0  0  3  3  3  3  3  3  3  3  3 ...
Sum:                                           0  1  2  9 10 11 18 19 20  3  4  5 12 13 14 21 22 23 ...
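To check the index arithmetic without a GPU, the three terms can be sketched as a plain host-side C++ function. This is only an illustration: `gather_index` and `sum_by_type` are hypothetical helper names, not Thrust calls, and the sketch assumes (as in the question's example) that the number of segments per type equals `seg_types`, so one array "a" holds `seg_len * seg_types * seg_types` elements.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Thrust): source index for gathered
// position i, built from the three terms in the table above.
int gather_index(int i, int seg_len, int seg_types) {
    assert(seg_len > 0 && seg_types > 0);
    const int group = seg_len * seg_types;        // 9 in the example
    return ((i / seg_len) % seg_types) * group    // which repetition of the pattern
         + i % seg_len                            // offset inside the segment
         + (i / group) * seg_len;                 // which segment type
}

// Host-side stand-in for the reduce step: sums each run of
// (n / seg_types) gathered elements, i.e. one sum per segment type.
std::vector<int> sum_by_type(const std::vector<int>& data,
                             int seg_len, int seg_types) {
    const int n = static_cast<int>(data.size());
    // Assumption: exactly one array "a" with seg_types segments per type.
    assert(n == seg_len * seg_types * seg_types);
    const int per_type = n / seg_types;
    std::vector<int> sums(seg_types, 0);
    for (int i = 0; i < n; ++i)
        sums[i / per_type] += data[gather_index(i, seg_len, seg_types)];
    return sums;
}
```

Calling `gather_index(i, 3, 3)` for `i` from 0 to 17 reproduces the "Sum" row, and `sum_by_type` applied to the question's 27 values groups them by segment type in the same way the permutation iterator does.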
Here is a fully worked example:
$ cat t457.cu
#include <thrust/reduce.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
typedef int dtype;
const int seg_len = 3;
const int seg_types = 3;
using namespace thrust::placeholders;
int main(){
  dtype data[] = {10,16,14,2,4,4,1,2,1,8,2,10,3,1,6,8,0,2,9,1,0,3,5,2,3,2,1};
  // desired map sequence: 0 1 2 9 10 11 18 19 20 3 4 5 12 13 14 21 22 23 ...
  // ((_1/seg_len)%seg_types)*(seg_len*seg_types) + _1%seg_len + (_1/(seg_len*seg_types))*seg_len
  int ads = sizeof(data)/sizeof(data[0]);
  int num_groups = ads/(seg_len*seg_types); // ads is expected to be whole-number divisible by seg_len*seg_types
  int ds = num_groups*(seg_len*seg_types); // handle the case when it is not
  thrust::device_vector<dtype> d_data(data, data+ds);
  thrust::device_vector<dtype> d_result(seg_types);
  // keys: _1/(ds/seg_types) labels each run of ds/seg_types gathered elements
  // values: d_data read through the permutation (map) sequence
  thrust::reduce_by_key(
    thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/(ds/seg_types)),
    thrust::make_transform_iterator(thrust::counting_iterator<int>(ds), _1/(ds/seg_types)),
    thrust::make_permutation_iterator(
      d_data.begin(),
      thrust::make_transform_iterator(
        thrust::counting_iterator<int>(0),
        ((_1/seg_len)%seg_types)*(seg_len*seg_types) + _1%seg_len + (_1/(seg_len*seg_types)*seg_len))),
    thrust::make_discard_iterator(),
    d_result.begin());
  thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<dtype>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -o t457 t457.cu
$ ./t457
70,30,20,
$
Let's say I've got a matrix with n columns, and I've got n different functions.
Is it possible to apply i-th function per each element in i-th column efficiently, that is without using loop?
For example for the following variables:
funs = @(x) [x, cos(x), x.^2]
A = [1 0 1
2 0 2
3 0 3
4 0 4] ;
I would like to obtain the following result:
B = [1 1 1
2 1 4
3 1 9
4 1 16] ;
without looping through columns...
I am trying to calculate the Hamming weight of a vector in Matlab.
function Hamming_weight (vet_dec)
Ham_Weight = sum(dec2bin(vet_dec) == '1')
endfunction
The vector is:
Hamming_weight ([208 15 217 252 128 35 50 252 209 120 97 140 235 220 32 251])
However, this gives the following result, which is not what I want:
Ham_Weight =
10 10 9 9 9 5 5 7
I would be very grateful if you could help me please.
You are summing over the wrong dimension!
sum(dec2bin(vet_dec) == '1',2).'
ans =
3 4 5 6 1 3 3 6 4 4 3 3 6 5 1 7
dec2bin(vet_dec) creates a matrix like this:
11010000
00001111
11011001
11111100
10000000
00100011
00110010
11111100
11010001
01111000
01100001
10001100
11101011
11011100
00100000
11111011
As you can see, you're interested in the sum of each row, not each column. Use the second input argument to sum(x, 2), which specifies the dimension you want to sum along.
Note that this approach is horribly slow, as you can see from this question.
EDIT
For this to be a valid and meaningful MATLAB function, you must change your function definition a bit.
function ham_weight = hamming_weight(vector) % Return the variable ham_weight
ham_weight = sum(dec2bin(vector) == '1', 2).'; % Don't transpose if
% you want a column vector
end % endfunction is not a MATLAB command.
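For comparison outside MATLAB, here is a minimal C++ sketch of the same per-element bit count that avoids the string conversion entirely. It swaps in a different technique (Kernighan's bit-clearing loop) rather than the `dec2bin` approach above, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Count set bits with Kernighan's trick: v &= v - 1 clears the lowest
// set bit, so the loop runs once per set bit.
int hamming_weight(std::uint32_t v) {
    int count = 0;
    while (v != 0) {
        v &= v - 1;
        ++count;
    }
    assert(count <= 32);  // a 32-bit value has at most 32 set bits
    return count;
}

// Per-element Hamming weight, analogous to summing each row of dec2bin.
std::vector<int> hamming_weights(const std::vector<std::uint32_t>& values) {
    std::vector<int> out;
    out.reserve(values.size());
    for (std::uint32_t v : values) out.push_back(hamming_weight(v));
    return out;
}
```

Applied to the vector from the question, `hamming_weights` gives one count per element, matching the row-wise sums shown above.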
I have an array like this:
0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
I want every non-zero element to expand outward, one element at a time, until it meets the expansion of another non-zero element. The result should look like this:
1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8
Is there any way to do this using thrust?
Is there any way to do this using thrust?
Yes, here is one possible approach.
For each position in the sequence, compute 2 distances. The first is the distance to the nearest non-zero value in the left direction, and the second is the distance to the nearest non-zero value in the right direction. If the position itself is non-zero, both left and right distances will be computed as zero. Our basic engine for this will be segmented inclusive scans, one computed in the left to right direction (to compute the distance from the left for each zero segment), and the other computed in the reverse direction (to compute the distance from the right for each zero segment). Using your example:
a vector:     0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
a left dist:  ? ? ? 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2
a right dist: 3 2 1 0 4 3 2 1 0 2 1 0 3 2 1 0 ? ?
Note that in each distance computation, we must special-case one end if that end does not happen to begin with a non-zero value (because the distance from that direction is "undefined"). We will special case those ? distances by assigning them large values, the reason for which will become evident in the next step.
We now will create a "map" vector, which, for each output position, allows us to select an element from the original input vector that belongs in that output position. This map vector is computed by taking the lesser of the two computed distances, and adjusting the index either from the left or the right, by that distance:
output index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
a left dist:   ?  ?  ?  0  1  2  3  4  0  1  2  0  1  2  3  0  1  2
a right dist:  3  2  1  0  4  3  2  1  0  2  1  0  3  2  1  0  ?  ?
map vector:    3  3  3  3  3  3  8  8  8  8 11 11 11 11 15 15 15 15
For the map vector computation, if a left dist > a right dist then we take the output index and add a right dist to it, to produce the map vector element at that position. Otherwise, we take the output index and subtract a left dist from it. Note that the special-case ? entries above should be considered to be "arbitrarily large" for this computation. This is simulated in the code by using a large integer (1<<30).
Once we have the map vector, it's a trivial matter to use it to do a mapped copy from input to output vectors:
a vector:    0  0  0  1  0  0  0  0  5  0  0  3  0  0  0  8  0  0
map vector:  3  3  3  3  3  3  8  8  8  8 11 11 11 11 15 15 15 15
out vector:  1  1  1  1  1  1  5  5  5  5  3  3  3  3  8  8  8  8
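The three steps (two distance passes, map computation, mapped copy) can be sketched sequentially in host-side C++ to make the logic easy to check by hand. This is not the Thrust implementation: `expand_nonzeros` is a hypothetical name, the loops stand in for the segmented scans, and `BIG` plays the role of the "?" entries.

```cpp
#include <cassert>
#include <vector>

// Sequential sketch of the approach: left/right distances, then a map
// vector, then a gather. Assumes the input has at least one non-zero.
std::vector<int> expand_nonzeros(const std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    const int BIG = 1 << 30;  // stands in for the "?" (undefined) distances
    std::vector<int> left(n), right(n), out(n);
    // Left-to-right pass: distance to the nearest non-zero on the left.
    int d = BIG;
    for (int i = 0; i < n; ++i) {
        d = a[i] ? 0 : (d >= BIG ? BIG : d + 1);
        left[i] = d;
    }
    // Right-to-left pass: distance to the nearest non-zero on the right.
    d = BIG;
    for (int i = n - 1; i >= 0; --i) {
        d = a[i] ? 0 : (d >= BIG ? BIG : d + 1);
        right[i] = d;
    }
    // Map computation and mapped copy; ties go to the left neighbour,
    // matching the "left dist > right dist" rule described above.
    for (int i = 0; i < n; ++i) {
        assert(left[i] < BIG || right[i] < BIG);
        const int src = (left[i] > right[i]) ? i + right[i] : i - left[i];
        out[i] = a[src];
    }
    return out;
}
```

Flipping the comparison to `left[i] >= right[i]` would resolve ties toward the right neighbour instead, which is the other equally valid outcome for positions midway between two non-zero values.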
Here is a fully worked example:
$ cat t610.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/scan.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/tuple.h>
#include <iostream>
#include <iterator>
#define IVAL (1<<30)
// used to create input vector for prefix sums (distance vector computation)
struct is_zero {
template <typename T>
__host__ __device__
T operator() (T val) {
return (val) ? 0:1;
}
};
// inc and dec help with special casing of left and right ends
struct inc {
template <typename T>
__host__ __device__
T operator() (T val) {
return val+IVAL;
}
};
struct dec {
template <typename T>
__host__ __device__
T operator() (T val) {
return val-IVAL;
}
};
// this functor is lifted from thrust example code
// and is used to enable segmented scans based on flag delimiters
// BinaryPredicate for the head flag segment representation
// equivalent to thrust::not2(thrust::project2nd<int,int>()));
template <typename HeadFlagType>
struct head_flag_predicate : public thrust::binary_function<HeadFlagType,HeadFlagType,bool>
{
__host__ __device__
bool operator()(HeadFlagType left, HeadFlagType right) const
{
return !right;
}
};
// distance tuple ordering is left (0), then right (1)
struct map_functor
{
template <typename T>
__host__ __device__
int operator() (T dist){
int leftdist = thrust::get<0>(dist);
int rightdist = thrust::get<1>(dist);
int idx = thrust::get<2>(dist);
return (leftdist > rightdist) ? (idx+rightdist):(idx-leftdist);
}
};
int main(){
int h_a[] = { 0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 3, 0, 0, 0, 8, 0, 0 };
int n = sizeof(h_a)/sizeof(h_a[0]);
thrust::device_vector<int> a(h_a, h_a+n);
thrust::device_vector<int> az(n);
thrust::device_vector<int> asl(n);
thrust::device_vector<int> asr(n);
thrust::transform(a.begin(), a.end(), az.begin(), is_zero());
// set up distance from the left vector (asl)
thrust::transform_if(az.begin(), az.begin()+1, a.begin(), az.begin(),inc(), is_zero());
thrust::transform(a.begin(), a.begin()+1, a.begin(), inc());
thrust::inclusive_scan_by_key(a.begin(), a.end(), az.begin(), asl.begin(), head_flag_predicate<int>());
thrust::transform(a.begin(), a.begin()+1, a.begin(), dec());
thrust::transform_if(az.begin(), az.begin()+1, a.begin(), az.begin(), dec(), is_zero());
// set up distance from the right vector (asr)
thrust::device_vector<int> ra(n);
thrust::sequence(ra.begin(), ra.end(), n-1, -1);
thrust::transform_if(az.end()-1, az.end(), a.end()-1, az.end()-1, inc(), is_zero());
thrust::transform(a.end()-1, a.end(), a.end()-1, inc());
thrust::inclusive_scan_by_key(thrust::make_permutation_iterator(a.begin(), ra.begin()), thrust::make_permutation_iterator(a.begin(), ra.end()), thrust::make_permutation_iterator(az.begin(), ra.begin()), thrust::make_permutation_iterator(asr.begin(), ra.begin()), head_flag_predicate<int>());
thrust::transform(a.end()-1, a.end(), a.end()-1, dec());
// create combined map vector
thrust::device_vector<int> map(n);
thrust::counting_iterator<int> idxbegin(0);
thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(asl.begin(), asr.begin(), idxbegin)), thrust::make_zip_iterator(thrust::make_tuple(asl.end(), asr.end(), idxbegin+n)), map.begin(), map_functor());
// use map to create output
thrust::device_vector<int> result(n);
thrust::copy(thrust::make_permutation_iterator(a.begin(), map.begin()), thrust::make_permutation_iterator(a.begin(), map.end()), result.begin());
// display results
std::cout << "Input vector:" << std::endl;
thrust::copy(a.begin(), a.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
std::cout << "Output vector:" << std::endl;
thrust::copy(result.begin(), result.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
}
$ nvcc -arch=sm_20 -o t610 t610.cu
$ ./t610
Input vector:
0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
Output vector:
1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8
$
Notes:
The above implementation probably has areas that can be improved on, particularly with respect to fusion of operations. However, for understanding purposes, I think fusion makes the code a bit harder to read.
I have really only tested it on the particular example you gave. There may be bugs that you will uncover. My purpose is not to give you a black-box library function that you use but don't understand, but rather to teach you how to write your own code that does what you want.
The "ambiguity" pointed out by JackOLantern is still present in your problem statement. I have obscured it by choosing my map functor behavior to mimic the output you indicated as desired, but simply by creating an equally valid but opposite realization of the map functor (using "if a left dist < a right dist then ..." instead) I can cause the result between 3 and 8 to take the other possible outcome/state. Your comment that "if there is an ambiguity, whoever reaches the position first fill its value to that space" makes no sense to me, unless by that you mean "I don't care which outcome you provide." There is no concept of a particular thread reaching a particular point first. Threads (and blocks) can execute in any order, and this order can change from device to device, and run to run.
From online documentation:
cudaError_t cudaMemset (void * devPtr, int value, size_t count )
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
Parameters:
devPtr - Pointer to device memory
value - Value to set for each byte of specified memory
count - Size in bytes to set
This description doesn't appear to be correct as:
int *dJunk;
cudaMalloc((void**)&dJunk, 32*sizeof(int));
cudaMemset(dJunk, 0x12, 32);
will set all 32 integers to 0x12, not 0x12121212. (Int vs Byte)
The description talks about setting bytes. Count and Value are described in terms of bytes. Notice count is of type size_t, and value is of type int. i.e. Set a byte-size to an int-value.
cudaMemset() is not mentioned in the prog guide.
I have to assume the behavior I am seeing is correct, and the documentation is bad.
Is there a better documentation source out there? (Where?)
Are other types supported? i.e. Would float *dJunk; work? Others?
The documentation is correct, and your interpretation of what cudaMemset does is wrong. The function really does set byte values. Your example sets the first 32 bytes to 0x12, not all 32 integers to 0x12, viz:
#include <cstdio>
int main(void)
{
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);
    int *dJunk;
    cudaMalloc((void**)&dJunk, sz);
    cudaMemset(dJunk, 0, sz);     // zero all 128 bytes
    cudaMemset(dJunk, 0x12, 32);  // set only the first 32 bytes (8 ints) to 0x12
    int *Junk = new int[n];
    cudaMemcpy(Junk, dJunk, sz, cudaMemcpyDeviceToHost);
    for(int i=0; i<n; i++) {
        fprintf(stdout, "%d %x\n", i, Junk[i]);
    }
    delete[] Junk;
    cudaFree(dJunk);
    cudaDeviceReset();
    return 0;
}
produces
$ nvcc memset.cu
$ ./a.out
0 12121212
1 12121212
2 12121212
3 12121212
4 12121212
5 12121212
6 12121212
7 12121212
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
i.e., all 128 bytes are set to 0, then the first 32 bytes are set to 0x12, exactly as described by the documentation.
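The same byte-wise behaviour can be reproduced on the host with `std::memset`, which also takes an `int` value and a byte count. The sketch below is only an analogy for checking the arithmetic without a GPU; `fill_bytes` is a hypothetical helper, and it uses `std::uint32_t` so that each word is exactly 4 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side analogue: zero a buffer of 32-bit words, then set its first
// `count` BYTES (not words) to the byte `value`.
std::vector<std::uint32_t> fill_bytes(std::size_t n_words, int value,
                                      std::size_t count) {
    assert(n_words > 0);
    assert(count <= n_words * sizeof(std::uint32_t));
    std::vector<std::uint32_t> buf(n_words, 0u);
    std::memset(buf.data(), value, count);
    return buf;
}
```

With `fill_bytes(32, 0x12, 32)`, the first 8 words come back as 0x12121212 and the remaining 24 stay 0, mirroring the device output above. To fill all 32 words, the count would need to be `32 * sizeof(std::uint32_t)`, i.e. 128 bytes.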
Let A be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume Tesla or Fermi architecture. I don't want to delve into the 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...
In math, this is thought of as a "balls in bins" problem: 32 balls are randomly dropped into 32 bins. You can enumerate the possible patterns and calculate their probabilities to determine the distribution. A naive approach will not work, though, as the number of patterns is huge: 63!/(32! * 31!) is "almost" a quintillion.
It is possible to tackle, though, if you build up the solution recursively and use conditional probabilities.
Look for a paper called "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.
In the following, we start at leftmost bucket and calculate the probabilities for each number of balls that could have fallen into it. Then we move one to the right and determine the conditional probabilities of each number of balls that could be in that bucket given the number of balls and buckets already used.
Apologies for the VBA code, but VBA was all I had available when motivated to answer :).
Function nCr#(ByVal n#, ByVal r#)
Static combin#()
Static size#
Dim i#, j#
If n = r Then
nCr = 1
Exit Function
End If
If n > size Then
ReDim combin(0 To n, 0 To n)
combin(0, 0) = 1
For i = 1 To n
combin(i, 0) = 1
For j = 1 To i
combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
Next
Next
size = n
End If
nCr = combin(n, r)
End Function
Function p_binom#(n#, r#, p#)
p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function
Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
bucket#, total_buckets#, bucket_capacity#)
If balls > bucket_capacity Then
p_next_bucket_balls = 0
Else
p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
End If
End Function
Function p_capped_buckets#(n#, cap#)
Dim p_prior, p_update
Dim bucket#, balls#, prior_balls#
ReDim p_prior(0 To n)
ReDim p_update(0 To n)
p_prior(0) = 1
For bucket = 1 To n
For balls = 0 To n
p_update(balls) = 0
For prior_balls = 0 To balls
p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
Next
Next
p_prior = p_update
Next
p_capped_buckets = p_update(n)
End Function
Function expected_max_buckets#(n#)
Dim cap#
For cap = 0 To n
expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
Next
End Function
Sub test32()
Dim p_cumm#(0 To 32)
Dim cap#
For cap# = 0 To 32
p_cumm(cap) = p_capped_buckets(32, cap)
Next
For cap = 1 To 32
Debug.Print " ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
Next
End Sub
For 32 balls and buckets, I get an expected maximum number of balls in the buckets of about 3.532941.
Output to compare to ahmad's:
1 0.000000
2 0.029273
3 0.516311
4 0.361736
5 0.079307
6 0.011800
7 0.001417
8 0.000143
9 0.000012
10 0.000001
11 0.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 0.000000
17 0.000000
18 0.000000
19 0.000000
20 0.000000
21 0.000000
22 0.000000
23 0.000000
24 0.000000
25 0.000000
26 0.000000
27 0.000000
28 0.000000
29 0.000000
30 0.000000
31 0.000000
32 0.000000
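For readers who prefer C++ to VBA, the same bucket-by-bucket recursion can be sketched as follows. This is an illustrative port, not the original author's code: the function names are made up, binomial coefficients are held in doubles, and the number of balls is assumed equal to the number of buckets (`n`).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// P(no bucket holds more than `cap` balls) for n balls in n buckets,
// processing buckets left to right with conditional binomial probabilities.
double prob_max_at_most(int n, int cap) {
    assert(n > 0 && cap >= 0);
    // Pascal's triangle in doubles: C[i][j] = "i choose j".
    std::vector<std::vector<double>> C(n + 1, std::vector<double>(n + 1, 0.0));
    for (int i = 0; i <= n; ++i) {
        C[i][0] = 1.0;
        for (int j = 1; j <= i; ++j) C[i][j] = C[i - 1][j - 1] + C[i - 1][j];
    }
    // prior[u] = P(u balls used by the buckets seen so far, none over cap).
    std::vector<double> prior(n + 1, 0.0), next(n + 1, 0.0);
    prior[0] = 1.0;
    for (int bucket = 1; bucket <= n; ++bucket) {
        // Chance that a still-unplaced ball lands in this bucket.
        const double p = 1.0 / (n - bucket + 1);
        for (int used = 0; used <= n; ++used) {
            double acc = 0.0;
            for (int before = 0; before <= used; ++before) {
                const int k = used - before;  // balls in this bucket
                if (k > cap) continue;
                const int left = n - before;  // balls not yet placed
                acc += prior[before] * C[left][k]
                     * std::pow(p, k) * std::pow(1.0 - p, left - k);
            }
            next[used] = acc;
        }
        prior = next;
    }
    return prior[n];
}

// E[max] = sum over cap of P(max > cap).
double expected_max_degree(int n) {
    double e = 0.0;
    for (int cap = 0; cap < n; ++cap) e += 1.0 - prob_max_at_most(n, cap);
    return e;
}
```

Differences of `prob_max_at_most(32, cap)` for consecutive `cap` values give the distribution listed above, and `expected_max_degree(32)` should land near the 3.53 figure quoted for 32 balls and buckets.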
I'll try a math answer, although I don't have it quite right yet.
You basically want to know, given random 32-bit word indexing within a warp into an aligned __shared__ array, "what is the expected value of the maximum number of addresses within a warp that map to a single bank?"
If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound on that number of O(log n / log log n) for hashing n items into n buckets. (The math is pretty hairy!).
For n = 32, that works out to about 2.788 (using natural log). That’s fine, but here I modified ahmad's program a bit to empirically calculate the expected maximum (also simplified the code and modified names and such for clarity and fixed some bugs).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#define NBANK 32
#define WARPSIZE 32
#define NSAMPLE 100000
int main(){
int i=0,j=0;
int *bank=(int*)malloc(sizeof(int)*NBANK);
int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
int *maxCount=(int*)malloc(sizeof(int)*(NBANK+1));
memset(maxCount, 0, sizeof(int)*(NBANK+1));
for (int i=0; i<NSAMPLE; ++i) {
// generate a sample warp shared memory access
for(j=0; j<WARPSIZE; j++){
randomNumber[j]=rand()%NBANK;
}
// check the bank conflict
memset(bank, 0, sizeof(int)*NBANK);
int max_bank_conflict=0;
for(j=0; j<WARPSIZE; j++){
bank[randomNumber[j]]++;
}
for(j=0; j<NBANK; j++) // scan the per-bank counters (bank[] has NBANK entries)
max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);
// store statistic
maxCount[max_bank_conflict]++;
}
// report statistic
printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
float expected = 0;
for(i=1; i<NBANK+1; i++) {
float prob = maxCount[i]/(float)NSAMPLE;
printf("%02d -> %6.4f\n", i, prob);
expected += prob * i;
}
printf("Expected maximum bank conflict degree = %6.4f\n", expected);
return 0;
}
Using the percentages found in the program as probabilities, the expected maximum value is the sum of products sum(i * probability(i)), for i from 1 to 32. I compute the expected value to be 3.529 (matches ahmad's data). It’s not super far off, but the 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess there’s a constant factor left out. But that's currently as far as I've gotten.
Open questions: Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32? It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.
This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is effectively random.
I assume Fermi's 32-bank shared memory, where successive 4-byte words are stored in successive banks. Using the following code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define NBANK 32
#define N 7823
#define WARPSIZE 32
#define NSAMPLE 10000
int main(){
srand ( time(NULL) );
int i=0,j=0;
int *conflictCheck=NULL;
int *randomNumber=NULL;
int *statisticCheck=NULL;
conflictCheck=(int*)malloc(sizeof(int)*NBANK);
randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
statisticCheck=(int*)calloc(NBANK+1, sizeof(int)); // zero-initialized histogram
while(i<NSAMPLE){
// generate a sample warp shared memory access
for(j=0; j<WARPSIZE; j++){
randomNumber[j]=rand()%NBANK;
}
// check the bank conflict
memset(conflictCheck, 0, sizeof(int)*NBANK);
int max_bank_conflict=0;
for(j=0; j<WARPSIZE; j++){
conflictCheck[randomNumber[j]]++;
max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]? conflictCheck[randomNumber[j]]: max_bank_conflict;
}
// store statistic
statisticCheck[max_bank_conflict]++;
// next iter
i++;
}
// report statistic
printf("Over %d random shared memory access, there found following precentages of bank conflicts\n", NSAMPLE);
for(i=0; i<NBANK+1; i++){
//
printf("%d -> %6.4f\n",i,statisticCheck[i]/(float)NSAMPLE);
}
return 0;
}
I got the following output:
Over 10000 random shared memory access, there found following precentages of bank conflicts
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000
We can conclude that a 3-way or 4-way conflict is the most likely outcome with random access. You can tune the run with different values of N (number of elements in the array), NBANK (number of banks in shared memory), WARPSIZE (warp size of the machine), and NSAMPLE (number of random shared memory accesses generated to evaluate the model).