thrust::unique_by_key eating up last element - cuda

Please consider the below simple code:
thrust::device_vector<int> positions(6);
thrust::sequence(positions.begin(), positions.end());
thrust::pair<thrust::device_vector<int>::iterator, thrust::device_vector<int>::iterator > end;
//copyListOfNgramCounteachdoc contains: 0,1,1,1,1,3
end.first = copyListOfNgramCounteachdoc.begin();
end.second = positions.begin();
for (int i = 0; i < numDocs; i++) {
    end = thrust::unique_by_key(end.first, end.first + 3, end.second);
}
int length = end.first - copyListOfNgramCounteachdoc.begin();
cout << "the value of end -s is: " << length;
for (int i = 0; i < length; i++) {
    cout << copyListOfNgramCounteachdoc[i];
}
I expected the output of this code to be 0,1,1,3; however, the output is 0,1,1. Can anyone let me know what I am missing? Note: the contents of copyListOfNgramCounteachdoc are 0,1,1,1,1,3, and its type is thrust::device_vector<int>.
EDIT:
end.first = storeNcCounts.begin();
end.second = storeCompactedPositions.begin();
int indexToWriteForIndexesarr = 0;
for (int i = 0; i < numDocs; i++) {
    iter = end.first;
    end = thrust::unique_by_key_copy(
        copyListOfNgramCounteachdoc.begin() + (i * numUniqueNgrams),
        copyListOfNgramCounteachdoc.begin() + (i * numUniqueNgrams) + numUniqueNgrams,
        positions.begin() + (i * numUniqueNgrams),
        end.first,
        end.second);
    int numElementsCopied = (end.first - iter);
    endIndex = beginIndex + numElementsCopied - 1;
    storeBeginIndexEndIndexSCNCtoRead[indexToWriteForIndexesarr++] = beginIndex;
    storeBeginIndexEndIndexSCNCtoRead[indexToWriteForIndexesarr++] = endIndex;
    beginIndex = endIndex + 1;
}

I think what you want to use in this case is thrust::unique_by_key_copy, but read on.
The problem is that unique_by_key does not update your input array unless it has to. In the case of the first call, it can return a sequence of unique keys just by dropping the duplicate 1 -- that is, by adjusting where the returned iterator points, without actually compacting the input array.
You can see what is happening if you replace your loop with this one:
end.first = copyListOfNgramCounteachdoc.begin();
end.second = positions.begin();
thrust::device_vector<int>::iterator iter;
for (int i = 0; i < numDocs; i++) {
    cout << "before ";
    for (iter = end.first; iter != end.first + 3; iter++) cout << *iter;
    end = thrust::unique_by_key(end.first, end.first + 3, end.second);
    cout << " after ";
    for (iter = copyListOfNgramCounteachdoc.begin(); iter != end.first; iter++) cout << *iter;
    cout << endl;
    for (int j = 0; j < 6; j++) cout << copyListOfNgramCounteachdoc[j];
    cout << endl;
}
For this code I get this output:
before 011 after 01
011223
before 122 after 0112
011223
You can see that the values in copyListOfNgramCounteachdoc are not changing. This is valid behavior. If you had used unique_by_key_copy instead of unique_by_key, Thrust would have been forced to actually compact the values in order to guarantee uniqueness, but in this case, since there are only two unique values in each subrange, there is no need. The docs say:
The return value is an iterator new_last such that no two consecutive elements in the range [first, new_last) are equal. The iterators in the range [new_last, last) are all still dereferenceable, but the elements that they point to are unspecified. unique is stable, meaning that the relative order of elements that are not removed is unchanged.
If you use unique_by_key_copy, then Thrust will be forced to copy the unique keys and values (with obvious cost implications), and you should see the behavior you were expecting.
BTW, if you can do this in a single call to unique_by_key rather than doing them in a loop, I suggest that you do so.
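To make the suggested fix concrete, here is a minimal sketch (my own example data matching the 0,1,1,1,1,3 keys above, not the asker's full program) that compacts the first 3-element segment into separate output buffers with thrust::unique_by_key_copy; because the results go to distinct output ranges, the compacted keys and values are always fully written out:
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/unique.h>
#include <thrust/pair.h>
#include <iostream>

int main() {
    // keys: 0,1,1,1,1,3   values (positions): 0,1,2,3,4,5
    int h_keys[] = {0, 1, 1, 1, 1, 3};
    thrust::device_vector<int> keys(h_keys, h_keys + 6);
    thrust::device_vector<int> positions(6);
    thrust::sequence(positions.begin(), positions.end());

    thrust::device_vector<int> out_keys(6), out_vals(6);

    // compact the first segment of 3 elements into the output buffers
    typedef thrust::device_vector<int>::iterator Iter;
    thrust::pair<Iter, Iter> new_end =
        thrust::unique_by_key_copy(keys.begin(), keys.begin() + 3,
                                   positions.begin(),
                                   out_keys.begin(), out_vals.begin());

    int len = new_end.first - out_keys.begin();
    for (int i = 0; i < len; i++)
        std::cout << out_keys[i] << " ";   // prints: 0 1
    std::cout << std::endl;
    return 0;
}
Subsequent segments can be appended by passing new_end.first and new_end.second as the output iterators of the next call, which is essentially what the EDIT above does.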

Why does my set of pair<int,int> contain duplicates?

I am puzzled by a set of pair<int,int> that appears to contain inconsistent duplicates. The first elements of the set are inserted correctly; then, after inserting a kth pair, all pairs in the set seem to be replaced with duplicate pairs of new int values.
The set is filled within a loop as follows
//the coordinates are stored in vector of pair<double,double> with dimension N :
vector<pair<double,double> > coordPairs(N);
//populate "coordPairs"
...
std::set<std::pair<int,int> > occupiedCellList;
for (int i = 0; i < coordPairs.size(); i++)
{
    double x, y;
    x = coordPairs[i].first;
    y = coordPairs[i].second;

    int row = (y - ymin) / cellSizeInMeters;
    int col = (x - xmin) / cellSizeInMeters;

    occupiedCellList.insert(make_pair(row, col));
}
Even when I use floor or trunc in the expressions for row and col, the set still contains duplicates. How can this behaviour be explained?
Thanks.
I solved my problem.
What appeared to be duplicates were not. The "error" came from the debugger in Qt Creator, which displays set elements incorrectly.
When printing the elements of the set with std::cout through a set iterator, all elements were OK.
set<pair<int,int>> s;
...
//populate elements of s
...
set<pair<int,int>>::iterator it = s.begin();
while (it != s.end())
{
    cout << it->first << ", " << it->second << endl;
    it++;
}

thrust doesn't provide the expected result using thrust::minimum

Consider the following code, where p is a pointer to memory allocated on the GPU:
thrust::device_ptr<float> pWrapper(p);
thrust::device_ptr<float> fDevPos = thrust::min_element(pWrapper, pWrapper + MAXX * MAXY, thrust::minimum<float>());
fRes = *fDevPos;
*fDicVal = fRes;
After doing the same thing on the CPU side:
float *hVec = new float[MAXX * MAXY];
cudaMemcpy(hVec, p, MAXX*MAXY*sizeof(float), cudaMemcpyDeviceToHost);
float min = 999;
int index = -1;
for (int i = 0; i < MAXX * MAXY; i++)
{
    if (min > hVec[i])
    {
        min = hVec[i];
        index = i;
    }
}
printf("index :%d a wrapper : %f, as vectorDevice : %f\n", index, fRes, min);
delete[] hVec;
I get that min != fRes. What am I doing wrong here?
thrust::min_element requires the user to supply a comparison predicate. That is, a function which answers the yes-or-no question "is x smaller than y?"
thrust::minimum is not a predicate; it answers the question "which of x or y is smaller?".
To find the smallest element using min_element, pass the thrust::less predicate:
ptr_to_smallest_value = thrust::min_element(first, last, thrust::less<T>());
Alternatively, don't pass anything. thrust::less is the default:
ptr_to_smallest_value = thrust::min_element(first, last);
If all you're interested in is the value of the smallest element (not an iterator pointing to the smallest element), you can combine thrust::minimum with thrust::reduce:
smallest_value = thrust::reduce(first, last, std::numeric_limits<T>::max(), thrust::minimum<T>());
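For reference, here is a minimal self-contained sketch (with made-up example data, not the data from the question) showing both variants side by side:
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <limits>
#include <iostream>

int main() {
    float h[] = {3.0f, -1.5f, 7.0f, 0.25f};
    thrust::device_vector<float> d(h, h + 4);

    // iterator to the smallest element (thrust::less is the default comparator)
    thrust::device_vector<float>::iterator it = thrust::min_element(d.begin(), d.end());
    std::cout << "min via min_element: " << *it << std::endl;  // -1.5

    // value of the smallest element via a reduction with thrust::minimum
    float m = thrust::reduce(d.begin(), d.end(),
                             std::numeric_limits<float>::max(),
                             thrust::minimum<float>());
    std::cout << "min via reduce: " << m << std::endl;         // -1.5
    return 0;
}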

OCR: weighted Levenshtein distance

I'm trying to create an optical character recognition system with a dictionary.
In fact I don't have an implemented dictionary yet =)
I've heard that there are simple metrics based on Levenshtein distance which take into account the different distances between different symbols. E.g. 'N' and 'H' are very close to each other, so d("THEATRE", "TNEATRE") should be less than d("THEATRE", "TOEATRE"), which is impossible using the basic Levenshtein distance.
Could you help me locate such a metric, please?
This might be what you are looking for: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance (and some working code is kindly included in the link)
Update:
http://nlp.stanford.edu/IR-book/html/htmledition/edit-distance-1.html
Here is an example (C#) where the weight of the "replace character" operation depends on the distance between the character codes:
static double WeightedLevenshtein(string b1, string b2) {
    b1 = b1.ToUpper();
    b2 = b2.ToUpper();

    double[,] matrix = new double[b1.Length + 1, b2.Length + 1];

    for (int i = 1; i <= b1.Length; i++) {
        matrix[i, 0] = i;
    }

    for (int i = 1; i <= b2.Length; i++) {
        matrix[0, i] = i;
    }

    for (int i = 1; i <= b1.Length; i++) {
        for (int j = 1; j <= b2.Length; j++) {
            double distance_replace = matrix[(i - 1), (j - 1)];
            if (b1[i - 1] != b2[j - 1]) {
                // Cost of replace
                distance_replace += Math.Abs((float)(b1[i - 1]) - b2[j - 1]) / ('Z' - 'A');
            }

            // Cost of remove = 1
            double distance_remove = matrix[(i - 1), j] + 1;
            // Cost of add = 1
            double distance_add = matrix[i, (j - 1)] + 1;

            matrix[i, j] = Math.Min(distance_replace,
                                    Math.Min(distance_add, distance_remove));
        }
    }

    return matrix[b1.Length, b2.Length];
}
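To see the effect on the example from the question: with this weighting the replace cost for the single differing position is |'N' - 'H'| / ('Z' - 'A') = 6/25 = 0.24 for d("THEATRE", "TNEATRE"), versus |'O' - 'H'| / 25 = 7/25 = 0.28 for d("THEATRE", "TOEATRE"), so the closer pair of characters does yield the smaller distance.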
You can see how it works here: http://ideone.com/RblFK
A few years too late, but the following Python package (with which I am NOT affiliated) allows arbitrary weighting of all the Levenshtein edit operations, ASCII character mappings, etc.
https://github.com/infoscout/weighted-levenshtein
pip install weighted-levenshtein
Also this one (also not affiliated):
https://github.com/luozhouyang/python-string-similarity
I've recently created a Python package that does exactly that: https://github.com/zas97/ocr_weighted_levenshtein.
In my weighted-Levenshtein implementation the distance between "THEATRE" and "TNEATRE" is 1.3, while the distance between "THEATRE" and "TOEATRE" is 1.42.
Other examples: d("O", "0") is 0.06 and d("e", "c") is 0.57.
These distances were calculated by running multiple OCRs on a synthetic dataset and doing statistics on the most common OCR errors. I hope it helps someone =)

Removal of every 'kth' person from a circle. Find the last remaining person

As part of a recent job application I was asked to code a solution to this problem.
Given,
n = number of people standing in a circle.
k = number of people to count over each time
Each person is given a unique (incrementing) id. Starting with the first person (the lowest id), they begin counting from 1 to k.
The person at k is then removed and the circle closes up. The next remaining person (following the eliminated person) resumes counting at 1. This process repeats until only one person is left, the winner.
The solution must provide:
the id of each person in the order they are removed from the circle
the id of the winner.
Performance constraints:
Use as little memory as possible.
Make the solution run as fast as possible.
I remembered doing something similar in my CS course from years ago but could not recall the details at the time of this test. I now realize it is a well known, classic problem with multiple solutions. (I will not mention it by name yet as some may just 'wikipedia' an answer).
I've already submitted my solution so I'm absolutely not looking for people to answer it for me. I will provide it a bit later once/if others have provided some answers.
My main goal for asking this question is to see how my solution compares to others given the requirements and constraints.
(Note the requirements carefully as I think they may invalidate some of the 'classic' solutions.)
Manuel Gonzalez noticed correctly that this is the general form of the famous Josephus problem.
If we are only interested in the survivor f(N,K) of a circle of size N and jumps of size K, then we can solve this with a very simple dynamic programming loop (in linear time and constant memory). Note that the ids start from 0:
int remaining(int n, int k) {
    int r = 0;
    for (int i = 2; i <= n; i++)
        r = (r + k) % i;
    return r;
}
It is based on the following recurrence relation:
f(N,K) = (f(N-1,K) + K) mod N
This relation can be explained by simulating the process of elimination, and after each elimination re-assigning new ids starting from 0. The old indices are the new ones with a circular shift of k positions. For a more detailed explanation of this formula, see http://blue.butler.edu/~phenders/InRoads/MathCounts8.pdf.
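As a worked check (my own example, 0-based ids): for N = 5 and K = 2 the recurrence gives f(1,2) = 0, f(2,2) = (0+2) mod 2 = 0, f(3,2) = (0+2) mod 3 = 2, f(4,2) = (2+2) mod 4 = 0 and f(5,2) = (0+2) mod 5 = 2, so the survivor has 0-based id 2 (the third person), which matches a direct simulation (eliminations in order: 1, 3, 0, 4).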
I know that the OP asks for all the indices of the eliminated items in their correct order. However, I believe that the above insight can be used for solving this as well.
You can do it using a boolean array.
Here is a pseudo code:
Let alive be a boolean array of size N. If alive[i] is true then the ith person is alive, else dead. Initially it is true for every 1 <= i <= N.
Let numAlive be the number of persons alive. So numAlive = N at start.
i = 1 # Counting starts from 1st person.
count = 0;
# keep looping till we've more than 1 persons.
while numAlive > 1 do
    if alive[i]
        count++
    end-if

    # time to kill?
    if count == K
        print Person i killed
        numAlive--
        alive[i] = false
        count = 0
    end-if

    i = (i % N) + 1  # Counting starts from next person.
end-while

# Find the only alive person who is the winner.
while alive[i] != true do
    i = (i % N) + 1
end-while
print Person i is the winner
The above solution is space-efficient but not time-efficient, since dead persons are still being checked.
To make it more time-efficient you can use a circular linked list. Every time you kill a person you delete a node from the list. You continue until a single node is left in the list, as in the sketch below.
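For illustration, here is a minimal C++ sketch of that circular-list idea (my own example, using std::list rather than a hand-rolled node structure; n and k are arbitrary sample values):
#include <iostream>
#include <list>

int main() {
    const int n = 10, k = 4;  // sample values: 10 people, remove every 4th

    // Build the circle of ids 1..n.
    std::list<int> circle;
    for (int id = 1; id <= n; ++id) circle.push_back(id);

    std::list<int>::iterator it = circle.begin();
    while (circle.size() > 1) {
        // Advance k-1 positions, wrapping around the end of the list.
        for (int step = 0; step < k - 1; ++step) {
            ++it;
            if (it == circle.end()) it = circle.begin();
        }
        std::cout << "Removed " << *it << "\n";
        it = circle.erase(it);                 // erase returns the next element
        if (it == circle.end()) it = circle.begin();
    }
    std::cout << "Winner is " << circle.front() << "\n";
    return 0;
}
Each removal is O(1) and each round advances at most k steps, so the whole run takes O(n*k) time and O(n) extra memory.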
The problem of determining the 'kth' person is called the Josephus Problem.
Armin Shams-Baragh from Ferdowsi University of Mashhad published some formulas for the Josephus Problem and the extended version of it.
The paper is available at: http://www.cs.man.ac.uk/~shamsbaa/Josephus.pdf
This is my solution, coded in C#. What could be improved?
public class Person
{
    public Person(int n)
    {
        Number = n;
    }

    public int Number { get; private set; }
}

static void Main(string[] args)
{
    int n = 10; int k = 4;

    var circle = new List<Person>();
    for (int i = 1; i <= n; i++)
    {
        circle.Add(new Person(i));
    }

    var index = 0;
    while (circle.Count > 1)
    {
        index = (index + k - 1) % circle.Count;
        var person = circle[index];
        circle.RemoveAt(index);
        Console.WriteLine("Removed {0}", person.Number);
    }

    Console.WriteLine("Winner is {0}", circle[0].Number);
    Console.ReadLine();
}
Essentially the same as Ash's answer, but with a custom linked list:
using System;
using System.Linq;

namespace Circle
{
    class Program
    {
        static void Main(string[] args)
        {
            Circle(20, 3);
        }

        static void Circle(int k, int n)
        {
            // circle is a linked list representing the circle.
            // Each element contains the index of the next member
            // of the circle.
            int[] circle = Enumerable.Range(1, k).ToArray();
            circle[k - 1] = 0;  // Member 0 follows member k-1

            int prev = -1;  // Used for tracking the previous member so we can delete a member from the list
            int curr = 0;   // The member we're currently inspecting

            for (int i = 0; i < k; i++)  // There are k members to remove from the circle
            {
                // Skip over n members
                for (int j = 0; j < n; j++)
                {
                    prev = curr;
                    curr = circle[curr];
                }

                Console.WriteLine(curr);
                circle[prev] = circle[curr];  // Delete the nth member
                curr = prev;  // Start counting again from the previous member
            }
        }
    }
}
Here is a solution in Clojure:
(ns kthperson.core
  (:use clojure.set))

(defn get-winner-and-losers [number-of-people hops]
  (loop [people (range 1 (inc number-of-people))
         losers []
         last-scan-start-index (dec hops)]
    (if (= 1 (count people))
      {:winner (first people) :losers losers}
      (let [people-to-filter (subvec (vec people) last-scan-start-index)
            additional-losers (take-nth hops people-to-filter)
            remaining-people (difference (set people)
                                         (set additional-losers))
            new-losers (concat losers additional-losers)
            index-of-last-removed-person (* hops (count additional-losers))]
        (recur remaining-people
               new-losers
               (mod last-scan-start-index (count people-to-filter)))))))
Explanation:
start a loop, with a collection of people 1..n
if there is only one person left, they are the winner and we return their ID, as well as the IDs of the losers (in order of them losing)
we calculate additional losers in each loop/recur by grabbing every N people in the remaining list of potential winners
a new, shorter list of potential winners is determined by removing the additional losers from the previously-calculated potential winners.
rinse & repeat (using modulus to determine where in the list of remaining people to start counting the next time round)
This is a variant of the Josephus problem.
General solutions are described here.
Solutions in Perl, Ruby, and Python are provided here. A simple solution in C using a circular doubly-linked list to represent the ring of people is provided below. None of these solutions identify each person's position as they are removed, however.
#include <stdio.h>
#include <stdlib.h>

/* remove every k-th soldier from a circle of n */
#define n 40
#define k 3

struct man {
    int pos;
    struct man *next;
    struct man *prev;
};

int main(int argc, char *argv[])
{
    /* initialize the circle of n soldiers */
    struct man *head = (struct man *) malloc(sizeof(struct man));
    struct man *curr;
    int i;

    curr = head;
    for (i = 1; i < n; ++i) {
        curr->pos = i;
        curr->next = (struct man *) malloc(sizeof(struct man));
        curr->next->prev = curr;
        curr = curr->next;
    }
    curr->pos = n;
    curr->next = head;
    curr->next->prev = curr;

    /* remove every k-th */
    while (curr->next != curr) {
        for (i = 0; i < k; ++i) {
            curr = curr->next;
        }
        curr->prev->next = curr->next;
        curr->next->prev = curr->prev;
    }

    /* announce last person standing */
    printf("Last person standing: #%d.\n", curr->pos);
    return 0;
}
Here's my answer in C#, as submitted. Feel free to criticize, laugh at, ridicule etc ;)
public static IEnumerable<int> Move(int n, int k)
{
    // Use an Iterator block to 'yield return' one item at a time.
    int children = n;
    int childrenToSkip = k - 1;

    LinkedList<int> linkedList = new LinkedList<int>();

    // Set up the linked list with children IDs
    for (int i = 0; i < children; i++)
    {
        linkedList.AddLast(i);
    }

    LinkedListNode<int> currentNode = linkedList.First;

    while (true)
    {
        // Skip over children by traversing forward
        for (int skipped = 0; skipped < childrenToSkip; skipped++)
        {
            currentNode = currentNode.Next;
            if (currentNode == null) currentNode = linkedList.First;
        }

        // Store the next node of the node to be removed.
        LinkedListNode<int> nextNode = currentNode.Next;

        // Return ID of the removed child to caller
        yield return currentNode.Value;
        linkedList.Remove(currentNode);

        // Start again from the next node
        currentNode = nextNode;
        if (currentNode == null) currentNode = linkedList.First;

        // Only one node left, the winner
        if (linkedList.Count == 1) break;
    }

    // Finally return the ID of the winner
    yield return currentNode.Value;
}

Cumulative array summation using OpenCL

I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular double loop (for every point in Table1 { for every point in Table2 { ... } }) and then do the calculation for every pair of points in parallel.
The Euclidean distance is then split into these steps:
1. take the difference between each dimension in the points
2. square that difference (still for every dimension)
3. sum all the values obtained in 2.
4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)
Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3 of the procedure described above, the lines that accumulate into C[loop] near the end of the kernel below).
As test data I'm using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,...,127,128; (p1)
129,130,131,...,255,256; (p2)
DescriptorList2: 000,001,002,...,126,127; (p1)
128,129,130,...,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
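(These expected values follow directly from the test data: between p1 of List1 and p1 of List2 every one of the 128 dimensions differs by 1, giving 128 * 1^2 = 128; between p1 of List1 and p2 of List2 each dimension differs by 127, giving 128 * 127^2 = 2064512; between p2 of List1 and p1 of List2 each dimension differs by 129, giving 128 * 129^2 = 2130048; and p2 against p2 again gives 128.)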
Right now I'm getting random numbers that vary with every run.
I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.
#define BLOCK_SIZE 128

typedef struct
{
    //How large each point is
    int length;
    //How many points in every list
    int num_elements;
    //Pointer to the elements of the descriptor (stored as a raw array)
    __global float *elements;
} DescriptorList;

__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);

    //temporary array to store the difference between each dimension of 2 points
    float dif_acum[BLOCK_SIZE];
    //counter to track the iterations of the inner loop
    int loop = 0;

    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        //take the i-th descriptor. Returns a DescriptorList with just the i-th
        //descriptor in DescriptorList A
        DescriptorList tmpA = GetDescriptor(A, i);

        //copy the current descriptor to local memory.
        //returns one element of the only descriptor in DescriptorList tmpA
        //and index featA
        As[featA] = GetElement(tmpA, 0, featA);
        //wait for all the threads to finish copying before continuing
        barrier(CLK_LOCAL_MEM_FENCE);

        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current points
            dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
            //wait again
            barrier(CLK_LOCAL_MEM_FENCE);

            //square value of the difference in dif_acum and store in C
            //which is where the results should be stored at the end.
            C[loop] = 0;
            C[loop] += dif_acum[featA]*dif_acum[featA];
            loop += 1;
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}
Your problem lies in these lines of code:
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
All threads in your workgroup (well, actually all your threads, but let's come to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:
The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be called after other threads have already executed the next line.
Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that write-back is (I think one of the threads succeeds in writing back, while the others fail), but it's wrong either way.
Now let's fix this:
While we might be able to get this to work on global memory using atomics, it won't be fast, so let's accumulate in local memory:
local float* accum;
...

accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCK_SIZE; i *= 2)
{
    if ((featA % (2*i)) == 0)
        accum[featA] += accum[featA + i];
    barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
    C[loop] = accum[0];
Of course you can reuse other local buffers for this, but I think the point is clear (btw: are you sure that dif_acum will be created in local memory? I think I read somewhere that it wouldn't be, which would make preloading A into local memory kind of pointless).
Some other points about this code:
Your code seems to be geared to using only one workgroup (you aren't using either the group id or the global id to decide which items to work on); for optimal performance you might want to use more than that.
It might be personal preference, but to me it seems better to use get_local_size(0) for the workgroup size than to use a #define (since you might change the value in the host code without realizing you also had to change your OpenCL code).
The barriers in your code are all unnecessary, since no thread accesses an element in local memory which is written by another thread. Therefore you don't need to use local memory for this.
Considering the last bullet you could simply do:
float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
This would make the code (not considering the first two bullets):
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    int loop = 0;

    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        DescriptorList tmpA = GetDescriptor(A, i);
        float As = GetElement(tmpA, 0, featA);

        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            float dif_acum = As - B.elements[k*BLOCK_SIZE + featA];
            accum[featA] = dif_acum*dif_acum;
            barrier(CLK_LOCAL_MEM_FENCE);

            for(unsigned int j = 1; j < BLOCK_SIZE; j *= 2)
            {
                if ((featA % (2*j)) == 0)
                    accum[featA] += accum[featA + j];
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            if(featA == 0)
                C[loop] = accum[0];
            barrier(CLK_LOCAL_MEM_FENCE);
            loop += 1;
        }
    }
}
Thanks to Grizzly, I now have a working kernel. There were some things I needed to modify based on Grizzly's answer:
I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.
if(featA > BLOCK_SIZE){return;}
When copying the first descriptor to local (shared) memory (i.e. to Bs), the index has to be specified since the function GetElement returns just one element per call (I skipped that in my question).
Bs[featA] = GetElement(tmpA, 0, featA);
Then, the SCAN loop needed a little tweaking because the buffer is overwritten after each iteration and one cannot control which thread accesses the data first. That is why I'm "recycling" the dif_acum buffer to store partial results and, that way, prevent inconsistencies throughout that loop.
dif_acum[featA] = accum[featA];
There is also some boundary control in the SCAN loop to reliably determine the terms to be added together.
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
That's it. I still wonder how I can make use of get_local_size to eliminate that BLOCK_SIZE definition, and whether there are better policies I can adopt regarding thread usage.
So the code looks finally like this:
__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);

    //global counter to store final differences
    int loop = 0;
    //auxiliary buffer to store temporary data
    local float dif_acum[BLOCK_SIZE];

    //discard the threads that are not going to be used.
    if(featA > BLOCK_SIZE){
        return;
    }

    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        //take the i-th descriptor
        DescriptorList tmpA = GetDescriptor(A, i);
        //copy the current descriptor to local memory
        Bs[featA] = GetElement(tmpA, 0, featA);

        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current descriptors
            dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];
            //square the values in dif_acum
            accum[featA] = dif_acum[featA]*dif_acum[featA];
            barrier(CLK_LOCAL_MEM_FENCE);

            //copy the values of accum to keep consistency once the scan procedure starts.
            //Mostly important for the first element. Two buffers are necessary because the
            //scan procedure would override values that are then further read if one buffer
            //were used instead.
            dif_acum[featA] = accum[featA];

            //Compute the accumulated sum (a.k.a. scan)
            for(int j = 1; j < BLOCK_SIZE; j *= 2){
                int next_addend = featA-(j/2);
                if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
                    dif_acum[featA] = accum[featA] + accum[next_addend];
                }
                barrier(CLK_LOCAL_MEM_FENCE);

                //copy dif_acum back to accum
                accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA);
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            //tell one of the threads to write the result of the scan in the array containing the results.
            if(featA == 0){
                C[loop] = accum[BLOCK_SIZE-1];
                loop += 1;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}
}