thrust equivalent of cilk::reducer_list_append - thrust

I have a list of n intervals or domains. I would like to subdivide in parallel each interval into k parts making a new list (unordered). However, most of the subdivision won't pass certain criteria and shouldn't be added to the new list.
cilk::reducer_list_append extends the idea of parallel reduction to forming a list with push_back. This way I can collect in parallel only valid sub-intervals.
What is the thrust way of accomplishing the task? I suspect one way would be to form a large nxk list, then use parallel filter and stream compaction? But I really hope there is a reduction list append operation, because nxk can be very large indeed.

I am new to this forum but maybe you find some of these useful..
If you are not fixed upon Thrust, you can also have a look at Arrayfire.
I learned about it quite recently and it's free for that sorts of problems.
For example, with arrayfire you can evaluate selection criterion for each interval
in parallel using gfor construct, ie. consider:
// # of intervals n and # of subintervals k
const int n = 10, k = 5;
// this array represets original intervals
array A = seq(n); // A = 0,1,2,...,n-1
// for each interval A[i], subI[i] counts # of subintervals
array subI = zeros(n);
gfor(array i, n) { // in parallel for all intervals
// here evaluate your predicate for interval's subdivision
array pred = A(i)*A(i) + 1234;
subI(i) = pred % (k + 1);
}
//array acc = accum(subI);
int n_total = sum<float>(subI); // compute the total # of intervals
// this array keeps intervals after subdivision
array B = zeros(n_total);
std::cout << "total # of subintervals: " << n_total << "\n";
print(A);
print(subI);
gfor(array i, n_total) {
// populate the array of new intervals
B(i) = ...
}
print(B);
of course, it depends on a way how your intervals are represented and
which criterion you use for subdivision..

Related

find nearest non-zero element in another vector in CUDA

There is a M x N matrix A and B.(Actual size of matrix is 512 x 4096)
In each row of A, the points to be processed are set to 1.
And each row of B contains values obtained through a specific operation.
Based on each row, I am going to do an operation to get the value of B that is closest to the point of 1 in A.
The example is shown in the figure below, and the code I wrote in MATLAB was also written down.
Here's how I thought of it:
Pick the non-zero element index of A with thrust. And for each element, the closest value is fetched from the corresponding row of B by for-loop.
(If there are several non-zero elements in A, it is expected to be slow.)
I want to make good use of the power of the GPU for this operation, do you have any more efficient ideas?
[idxY,idxX] = find(A == 1);
for Point = 1:length(idxY)
pointBuf = find(B(:,idxY(Point)) == 1); // find non-zero elements in Row of B
if ~isempty(pointBuf) // there are non-zero elements in Row of B
[MinValue, MinIndex] = min(abs(pointBuf - idxY(Point)));
C(idxY(Point),idxX(Point)) = B(pointBuf(MinIndex(1)),RangeInd(Point)); // Get closest point in B
else
C(DopInd(Point),RangeInd(Point)) = 0; // if there is no non-zero elements in Row of B, just set to 0
end
end
Just as reference a solution, which shifts left and right by 4095. It has similarities with bubble sort variants, which bubble up and down at the same time.
Advantage is that it does not depend on the position of the non-null elements in B and can be easily parallelized between threads.
But the inner loop, which translates to 2 SASS instructions is still just too slow (too often called): The program takes 26ms on my notebook.
It would do so in the best and the absolute worst case of the input matrices.
Parts and methods of it probably can be reused, as it shows some CUDA programming methods.
So more or less for reference, in the end not a final (fast enough) solution:
__global__ void calcmatrix(bool* A, double* B, double* C)
{
// calculate row number
int row = blockDim.x * gridDim.y + threadIdx.y;
if (row >= 512)
return;
// store index of valid double from B, this is moved up and down
// those indices are for the current thread. Each thread is responsible for 128 consecutive columns.
int indices[128];
// prefill the indices with their own number (as if every double from B is valid)
#pragma unroll
for (int i = 0; i < 128; i++)
indices[i] = threadIdx.x * 128 + i;
// Store zero flags (4 * 32 bits) for our 128 elements
unsigned int myzeroflags[4];
// For efficiently loading data from memory, we distribute the data in another way: thread 0 gets columns 0, 32, 64, 96, ...; thread 1 gets columns 1, 33, 65, 97, ...; thread 2 gets columns 2, 34, 66, 98, ...; and so on
#pragma unroll
for (int i = 0; i < 128; i++) {
// load value from B
double in = B[row * 4096 + i * 32 + threadIdx.x];
// compare to zero (!in) and combine all bool results from the 32 threads (__ballot_sync))
unsigned int zeroflag = __ballot_sync(0xFFFFFFFF, !in);
// store the ones, which belong to us
if (threadIdx.x == i / 4)
myzeroflags[i & 3] = zeroflag;
}
// go through our zero flags and set those indices to -1 (there is already a valid index "0", so we use a negative number to signify invalid)
#pragma unroll
for (int i = 0; i < 4; i++)
#pragma unroll
for (int j = 0; j < 32; j++)
if (myzeroflags[i] & (1 << j))
indices[i * 32 + j] = -1;
// main loop, do 4095 times
#pragma unroll 1
for (int i = 0; i < 4095; i++) {
// move all elements to the left (if the index there is invalid)
// send index over thread boundaries
int fromright = __shfl_down_sync(0xFFFFFFFF, indices[0], 1, 32);
#pragma unroll
// if left index is -1, set it to one index to the right
for (int j = 0; j < 127; j++)
if (indices[j] == -1)
indices[j] = indices[j + 1];
// move over thread boundaries (except for the rightmost thread)
if (threadIdx.x != 31 && indices[127] == -1)
indices[127] = fromright;
// move to the right in the same way as to the left
int fromleft = __shfl_up_sync(0xFFFFFFFF, indices[127], 1, 32);
#pragma unroll
for (int j = 127; j > 0; j--)
if (indices[j] == -1)
indices[j] = indices[j - 1];
if (threadIdx.x != 0 && indices[0] == -1)
indices[0] = fromleft;
}
// for the other distribution of elements for memory accesses, we have to redistribute the indices to the correct threads
// To not have bank conflicts, we define the shared memory array with 33 instead of 32 elements in the last dimension, but use only 32. With this method we can put threadIdx.x into the last and previous to last dimension without bank conflicts
__shared__ short2 distribidx[8][32][33];
int indices2[128];
// Redistribute first half; the index can go from 0..4095 (and also theoreticially -1, if there was no non-null element in this row). This fits into a short, convert for faster transfer
#pragma unroll
for (int i = 0; i < 32; i++)
distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i]), static_cast<short>(indices[i + 32]) };
__syncwarp();
#pragma unroll
for (int i = 0; i < 32; i++) {
short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
indices2[4 * i + 0] = idxback.x;
indices2[4 * i + 1] = idxback.y;
}
__syncwarp();
// Redistribute second half
#pragma unroll
for (int i = 0; i < 32; i++)
distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i + 64]), static_cast<short>(indices[i + 96]) };
__syncwarp();
#pragma unroll
for (int i = 0; i < 32; i++) {
short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
indices2[4 * i + 2] = idxback.x;
indices2[4 * i + 3] = idxback.y;
}
// Do final calculation
#pragma unroll
for (int i = 0; i < 128; i++) {
// Default value is zero
double result = 0;
// Read only, if A is true and indices2 is valid
if (A[row * 4096 + i * 32 + threadIdx.x] && indices2[i] != -1)
// Read B with calculated index (this read is not optimized/coalesced, because the indices can be wild, but hopefully was or can be cached)
result = B[row * 4096 + indices2[i]];
// Store result in C
C[row * 4096 + i * 32 + threadIdx.x] = result;
}
}
int main()
{
bool* A;
double* B;
double* C;
cudaMalloc(&A, 2 * 512 * 4096);
cudaMalloc(&B, 8 * 512 * 4096);
cudaMalloc(&C, 8 * 512 * 4096);
// called in this fashion
calcmatrix<<<(512 + 7) / 8, dim3(32, 8)>>>(A, B, C);
return 0;
}
This problem is really far from being simple to implement efficiently on a GPU. The main reason is that GPUs are designed to efficiently execute SIMD-friendly algorithm while this problem can hardly be solve in a SIMD friendly way.
The naive solution you propose will be very inefficient due to the many small kernels to execute (starting a kernel is expensive and Thrust tends to run them synchronously by default AFAIK), not to mention the amount of parallelism of each kernel would be far too small for any modern GPU. I expect this solution to be slower than a naive CPU implementation.
First things first, one need to find an efficient algorithm. The proposed solution runs in O(n m²) where n is the number of row and m the number of columns. That being said, the solution should be fast (ie. close to O(n m)) if most values are non-zero which is not the case in the example.
A more efficient solution is to first iterate over the B matrix and find the location of all the non-zero items so to put it in an array L. Then you can iterate over A, track the non-zero values and search for the closest index of L matching to the location of the current item in A. If the number of items in L is big for the target row (eg. >50), you can use a binary search so to find the location faster (since items of L are sorted). This solution runs in O(n m log m) time.
An even better solution is to iterate simultaneously over A and L like a merge algorithm. Indeed, the indices of A and the items of B are both sorted so the binary search is not even needed. When the index of the current non-zero item of A is bigger than the current item of L you can iterate to the next value of L (and memorize the last value of L discarded needed to compute the closest value). This algorithm runs in O(n m) (optimal). An efficient CPU implementation consists in computing chunks of raw in each many threads.
On a GPU, things are more complex since all the previously provided algorithm are not SIMD-friendly. Computing a row in an SIMD-friendly way turns out to be complex and generally inefficient (the overhead can be higher than the serial algorithm on a CPU). One possible solution would be to compute rows in parallel (1 thread per row) and transpose the matrix block per block in shared-memory so to perform SIMD-friendly memory accesses after that (assuming there is enough space). The non-zero values of A and B certainly needs to be extracted first so to avoid thread divergence as much as possible. This solution works only if the number of non-zero is relatively uniform between the lines (otherwise I doubt a GPU can actually be helpful). Note the overhead of the transposition can be significant compared to the computation. Thus, I am not sure it will be faster than a CPU based solution. In fact, if data lies on the CPU memory, then just transferring data to the GPU will certainly be more expensive than computing the result on a CPU in parallel.

How to denormalize values to do the Harmonic Product Spectrum

I'm trying to execute the HPS algorithm and the results are not right. (48000Hz, 16bits)
I've applied to a buffer with the recorded frequency several splits, then a Hanning window, and finally the FFT.
I've obtained a peak in each FFT, that correspond with the frequency I am using, or an octave of it. But when i do the HPS, the results of the fundamental frequency are 0, because the numbers of the array where I make the sum(multiply) are too small, more than my peak in the original FFT.
This is the code of the HPS:
int i_max_h = 0;
double m_max_h = miniBuffer[0];
//m_max is the value of the peak in the original time domain array
m_max_h = m_max;
//array for the sum
double sum [] = new double[miniBuffer.length];
int fund_freq = 0;
//It could be divide by 3, but I'm not going over 500Hz, so it should works
for(int k = 0; k < 24000/48 ; k++)
{
//HPS down sampling and multiply
sum[k] = miniBuffer[k] * miniBuffer[2*k] * miniBuffer[3*k];
// find fundamental frequency (maximum value in plot)
if( sum[k] > m_max_h && k > 0 )
{
m_max_h = sum[k];
i_max_h = k;
}
}
//This should get the fundamental freq. from sum
fund_freq = (i_max_h * Fs / 24000);
System.out.print("Fundamental Freq.: ");
System.out.println(fund_freq);
System.out.println("");
The original HPS code is HERE
I don't know why the sum have little values, when it should be bigger than the previous, and the peak of the sum too. I've applied a RealFordward FFT, maybe there is a problem with the -1 to 1 range, that makes my sum decrease when I multiply it.
Any idea how to fix it, to do the HPS?
How could i do the inverse normalize?
The problem was that I was trying to get a higher value of amplitude on the sum array (the HPS array), and my set of values are normalize since I apply the FFT algorithm to them.
This is the solution I've created, multiplying the individual values of the sum array by 10 before make the multiply.
The number 10 is a coefficient that I have selected, but it could be wrong in some high frequencies cases, this coefficient could be another higher number.
'''
for(int k = 0; k < 24000/48 ; k++)
{
sum[k] = ((miniBuffer[k]*10) * (miniBuffer[2*k]*10) * (miniBuffer[3*k]*10));
// find fundamental frequency (maximum value in plot)
if( sum[k] > m_max_h && k > 0 )
{
m_max_h = sum[k];
i_max_h = k;
}
}
'''
The range of the frequencies is 24000/48 = 500, so it's between 0 and 499 Hz, more than I need in a bass.
If the split of the full array is less than 24000, i should decrease the number 48, and this is admissible, because the down sampled arrays are 24000/3 and 24000/2, so this value could decrease to 3, and it should work well.

Free-Pascal Implementation of the Sieve of Eratosthenes

My teacher gave me an assignment like this:
Using the number n given, find the largest prime number p with p<=n and n<=10^9.
I tried doing this by using the following function:
Const amax=1000000000
Var i,j,n:longint;
a:array [1..amax] of boolean;
Function lp(n:longint):longint;
Var max:longint;
Begin
For i:=1 to n do a[i]:=true;
For i:=2 to round(sqrt(n)) do
If (a[i]=true) then
For j:=1 to n div i do
If (i*i+(j-1)*i<=n) then
a[i*i+(j-1)*i]:=false;
max:=0;
i:=n;
While max=0 do
Begin
If a[i]=true then max:=i;
i:=i-1;
End;
lp:=max;
End;
This code worked flawlessly for numbers such as 1 million, but when i tried n=10^9, the program took a long time to print the output. So here's my question: Are there any ways to improve my code for lower delay? Or maybe a different code?
The most important aspect here is that the greatest prime that is not greater than n must be fairly close to n. A quick look at The Gaps Between Primes (at The Prime Pages - always worth a look for everything to do with primes) shows that for 32-bit numbers the gaps between primes cannot be greater than 335. This means that the greatest prime not greater than n must be in the range [n - 335, n]. In other words, at most 336 candidates need to be checked - for example via trial division - and this is bound to be lots faster than sieving a billion numbers.
Trial division is a reasonable choice for tasks of this kind, because the range to be scanned is so small. In my answer to Prime sieve implementation (using trial division) in C++ I analysed a couple of ways for speeding it up.
The Sieve of Eratosthenes is also a good choice, it just needs to be modified to sieve only the range of interest instead of all numbers from 1 to n. This is called a 'windowed sieve' because it sieves only a window. Since the window will most likely not contain all the primes up to the square root of n (i.e. all the primes that could be potential least prime factors of composites in the range to be scanned) it is best to sieve the factor primes via a separate, simple Sieve of Eratosthenes.
First I'm showing a simple rendition of normal (non-windowed) sieve, as a baseline for comparing the windowed code to. I'm using C# in order to show the algorithm more clearly than would be possible with Pascal.
List<uint> small_primes_up_to (uint n)
{
if (n == uint.MaxValue)
throw new ArgumentOutOfRangeException("n", "n must be less than UINT32_MAX");
var eliminated = new bool[n + 1]; // +1 because indexed by numbers
eliminated[0] = true;
eliminated[1] = true;
for (uint i = 2, sqrt_n = (uint)Math.Sqrt(n); i <= sqrt_n; ++i)
if (!eliminated[i])
for (uint j = i * i; j <= n; j += i)
eliminated[j] = true;
return remaining_unmarked_numbers(eliminated, 0);
}
The fuction has 'small' in its name because it is not really suited for sieving big ranges; I use similar code (with a few bells and whistles) only for sieving the small factor primes needed by more advanced sieves.
The code for extracting the sieved primes is equally simple:
List<uint> remaining_unmarked_numbers (bool[] eliminated, uint sieve_base)
{
var result = new List<uint>();
for (uint i = 0, e = (uint)eliminated.Length; i < e; ++i)
if (!eliminated[i])
result.Add(sieve_base + i);
return result;
}
Now, the windowed version. One difference is that the potential least factor primes need to be sieved separately (by the function just shown) as explained earlier. Another difference is that the starting point of the crossing-off sequence for a given prime may lie outside the range to be sieved. If the starting point lies before the start of the window then a bit of modulo magic is necessary to find the first 'hop' that lands in the window. From then on everything proceeds as usual.
List<uint> primes_between (uint m, uint n)
{
m = Math.Max(m, 2);
if (m > n)
return new List<uint>(); // empty range -> no primes
// index overflow in the inner loop unless `(sieve_bits - 1) + stride <= UINT32_MAX`
if (n - m > uint.MaxValue - 65521) // highest prime not greater than sqrt(UINT32_MAX)
throw new ArgumentOutOfRangeException("n", "(n - m) must be <= UINT32_MAX - 65521");
uint sieve_bits = n - m + 1;
var eliminated = new bool[sieve_bits];
foreach (uint prime in small_primes_up_to((uint)Math.Sqrt(n)))
{
uint start = prime * prime, stride = prime;
if (start >= m)
start -= m;
else
start = (stride - 1) - (m - start - 1) % stride;
for (uint j = start; j < sieve_bits; j += stride)
eliminated[j] = true;
}
return remaining_unmarked_numbers(eliminated, m);
}
The two '-1' terms in the modulo calculation may seem strange, but they bias the logic down by 1 to eliminate the inconvenient case stride - foo % stride == stride that would need to be mapped to 0.
With this, the greatest prime not exceeding n could be computed like this:
uint greatest_prime_not_exceeding (uint n)
{
return primes_between(n - Math.Min(n, 335), n).Last();
}
This takes less than a millisecond all told, including the sieving of the factor primes and so on, even though the code contains no optimisations whatsoever. A good overview of applicable optimisations can be found in my answer to prime number summing still slow after using sieve; with the techniques shown there the whole range up to 10^9 can be sieved in about half a second.

Generating reversible permutations over a set

I want to traverse all the elements in the set Q = [0, 2^16) in a non sequential manner. To do so I need a function f(x) Q --> Q which gives the order in which the set will be sorted. for example:
f(0) = 2345
f(1) = 4364
f(2) = 24
(...)
To recover the order I would need the inverse function f'(x) Q --> Q which would output:
f(2345) = 0
f(4364) = 1
f(24) = 2
(...)
The function must be bijective, for each element of Q the function uniquely maps to another element of Q.
How can I generate such a function or are there any know functions that do this?
EDIT: In the following answer, f(x) is "what comes after x", not "what goes in position x". For example, if your first number is 5, then f(5) is the next element, not f(1). In retrospect, you probably thought of f(x) as "what goes in position x". The function defined in this answer is much weaker if used as "what goes in position x".
Linear congruential generators fit your needs.
A linear congruential generator is defined by the equation
f(x) = a*x+c (mod m)
for some constants a, c, and m. In this case, m = 65536.
An LCG has full period (the property you want) if the following properties hold:
c and m are relatively prime.
a-1 is divisible by all prime factors of m.
If m is a multiple of 4, a-1 is a multiple of 4.
We'll go with a = 5, c = 1.
To invert an LCG, we solve for f(x) in terms of x:
x = (a^-1)*(f(x) - c) (mod m)
We can find the inverse of 5 mod 65536 by the extended Euclidean algorithm, or since we just need this one computation, we can plug it into Wolfram Alpha. The result is 52429.
Thus, we have
f(x) = (5*x + 1) % 65536
f^-1(x) = (52429 * (x - 1)) % 65536
There's many approaches to solving this.
Since your set size is small, the requirement for generating the function and its inverse can simply be done via memory lookup. So once you choose your permutation, you can store the forward and reverse directions in lookup tables.
One approach to creating a permutation is mapping out all elements in an array and then randomly swapping them "enough" times. C code:
int f[PERM_SIZE], inv_f[PERM_SIZE];
int i;
// start out with identity permutation
for (i=0; i < PERM_SIZE; ++i) {
f[i] = i;
inv_f[i] = i;
}
// seed your random number generator
srand(SEED);
// look "enough" times, where we choose "enough" = size of array
for (i=0; i < PERM_SIZE; ++i) {
int j, k;
j = rand()%PERM_SIZE;
k = rand()%PERM_SIZE;
swap( &f[i], &f[j] );
}
// create inverse of f
for (i=0; i < PERM_SIZE; ++i)
inv_f[f[i]] = i;
Enjoy

Cummulative array summation using OpenCL

I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular doble loop (for every point in Table1{ for every point in Table2{...} } and then do the calculation for every pair of points in paralell.
The euclidean distance is then split in 3 parts:
1. take the difference between each dimension in the points
2. square that difference (still for every dimension)
3. sum all the values obtained in 2.
4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)
Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3. of the procedure described above, line 49 of the code below).
As test data I'm using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,...,127,128; (p1)
129,130,131,...,255,256; (p2)
DescriptorList2: 000,001,002,...,126,127; (p1)
128,129,130,...,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
Right now I'm getting random numbers that vary with every run.
I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.
#define BLOCK_SIZE 128
typedef struct
{
//How large each point is
int length;
//How many points in every list
int num_elements;
//Pointer to the elements of the descriptor (stored as a raw array)
__global float *elements;
} DescriptorList;
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//temporary array to store the difference between each dimension of 2 points
float dif_acum[BLOCK_SIZE];
//counter to track the iterations of the inner loop
int loop = 0;
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the i-th descriptor. Returns a DescriptorList with just the i-th
//descriptor in DescriptorList A
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory.
//returns one element of the only descriptor in DescriptorList tmpA
//and index featA
As[featA] = GetElement(tmpA, 0, featA);
//wait for all the threads to finish copying before continuing
barrier(CLK_LOCAL_MEM_FENCE);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current points
dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
//wait again
barrier(CLK_LOCAL_MEM_FENCE);
//square value of the difference in dif_acum and store in C
//which is where the results should be stored at the end.
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
loop += 1;
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}
Your problem lies in these lines of code:
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
All threads in your workgroup (well, actually all your threads, but lets come to to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:
The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be called after other threads have already executed the next line
Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that writeback is (I think one of the threads succeeds in writing back, while the others fail, but I'm not completely sure), but its wrong either way.
Now lets fix this:
While we might be able to get this to work on global memory using atomics, it won't be fast, so lets accumulate in local memory:
local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
{
if ((featA % (2*i)) == 0)
accum[featA] += accum[featA + i];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
Of course you can reuse other local buffers for this, but I think the point is clear (btw: Are you sure that dif_acum will be created in local memory, because I think I read somewhere that this wouldn't be put in local memory, which would make preloading A into local memory kind of pointless).
Some other points about this code:
Your code is seems to be geared to using only on workgroup (you aren't using either groupid nor global id to see which items to work on), for optimal performance you might want to use more then that.
Might be personal preferance, but I to me it seems better to use get_local_size(0) for the workgroupsize than to use a Define (since you might change it in the host code without realizing you should have changed your opencl code to)
The barriers in your code are all unnecessary, since no thread accesses an element in local memory which is written by another thread. Therefore you don't need to use local memory for this.
Considering the last bullet you could simply do:
float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
This would make the code (not considering the first two bullets):
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
int loop = 0;
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
DescriptorList tmpA = GetDescriptor(A, i);
float As = GetElement(tmpA, 0, featA);
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
{
if ((featA % (2*i)) == 0)
accum[featA] += accum[featA + i];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
barrier(CLK_LOCAL_MEM_FENCE);
loop += 1;
}
}
}
Thanks to Grizzly, I have now a working kernel. Some things I needed to modify based in the answer of Grizzly:
I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.
if(featA > BLOCK_SIZE){return;}
When copying the first descriptor to local (shared) memory (i.g. to Bs), the index has to be specified since the function GetElement returns just one element per call (I skipped that on my question).
Bs[featA] = GetElement(tmpA, 0, featA);
Then, the SCAN loop needed a little tweaking because the buffer is being overwritten after each iteration and one cannot control which thread access the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and that way, prevent inconsistencies throughout that loop.
dif_acum[featA] = accum[featA];
There are also some boundary control in the SCAN loop to reliably determine the terms to be added together.
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
That's it. I still wonder how can I make use of group_size to eliminate that BLOCK_SIZE definition and if there are better policies I can adopt regarding thread usage.
So the code looks finally like this:
__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//global counter to store final differences
int loop = 0;
//auxiliary buffer to store temporary data
local float dif_acum[BLOCK_SIZE];
//discard the threads that are not going to be used.
if(featA > BLOCK_SIZE){
return;
}
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the gpidA-th descriptor
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory
Bs[featA] = GetElement(tmpA, 0, featA);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current descriptors
dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];
//square the values in dif_acum
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
//copy the values of accum to keep consistency once the scan procedure starts. Mostly important for the first element. Two buffers are necesarry because the scan procedure would override values that are then further read if one buffer is being used instead.
dif_acum[featA] = accum[featA];
//Compute the accumulated sum (a.k.a. scan)
for(int j = 1; j < BLOCK_SIZE; j *= 2){
int next_addend = featA-(j/2);
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
dif_acum[featA] = accum[featA] + accum[next_addend];
}
barrier(CLK_LOCAL_MEM_FENCE);
//copy As to accum
accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA);
barrier(CLK_LOCAL_MEM_FENCE);
}
//tell one of the threads to write the result of the scan in the array containing the results.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}