int main() {
char** hMat,* dArr;
hMat = new char*[10];
for (int i=0;i<10;i++) {
hMat[i] = new char[10];
}
cudaMalloc((void**)&dArr,100);
// Copy from dArr to hMat here:
}
I have an array, dArr on the GPU, and I want to copy it into a 2D array hMat on the host, where the first 10 fields in the GPU array are copied to the first row in the host matrix, and the next 10 fields are copied to the second row, and so on.
There are some functions in the documentation, namely CudaMemcpy2D and CudaMemcpy2DFromArray, but I'm not quite sure how they should be used.
Your allocation scheme (an array of pointers, separately allocated) has the potential to create a discontiguous allocation on the host. There are no cudaMemcpy operations of any type (including the ones you mention) that can target an arbitrarily discontiguous area, which your allocation scheme has the potential to create.
In a nutshell, then, your approach is troublesome. It can be made to work, but will require a loop to perform the copying -- essentially one cudaMemcpy operation per "row" of your "2D array". If you choose to do that, presumably you don't need help. It's quite straightforward.
What I will suggest is that you instead modify your host allocation to create an underlying contiguous allocation. Such a region can be handled by a single, ordinary cudaMemcpy call, but you can still treat it as a "2D array" in host code.
The basic idea is to create a single allocation of the correct overall size, then to create a set of pointers to specific places within the single allocation, where each "row" should start. You then reference into this pointer array using your initial double-pointer.
Something like this:
#include <stdio.h>
typedef char mytype;
int main(){
const int rows = 10;
const int cols = 10;
mytype **hMat = new mytype*[rows];
hMat[0] = new mytype[rows*cols];
for (int i = 1; i < rows; i++) hMat[i] = hMat[i-1]+cols;
//initialize "2D array"
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
hMat[i][j] = 0;
mytype *dArr;
cudaMalloc(&dArr, rows*cols*sizeof(mytype));
//copy to device
cudaMemcpy(dArr, hMat[0], rows*cols*sizeof(mytype), cudaMemcpyHostToDevice);
//kernel call
//copy from device
cudaMemcpy(hMat[0], dArr, rows*cols*sizeof(mytype), cudaMemcpyDeviceToHost);
return 0;
}
Related
Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each blcok either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array.
Now to write the actual values atomicExch is used, exchanging the current value in the array with the value that the block attempts to write. Because of SIMD, atomic operation and the fact that a warp executes in lockstep I would expect the array to, at any point in time, only contain 1's or 0's. But never a mix of the two.
However, while running code like this there are several cases where at some point in time the array contains of a mix of 0's and 1's. Which appears to point to the fact that atomic operations are not executed per warp, and instead scheduled using some other scheme.
From other sources I have not really found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. Since I need to write many small vectors consisting of several 32 bit integers atomically to global memory, and an atomic operation that is guaranteed to write a single vector atomically is obviously very important.
For those wondering, the code I wrote was executed on a GTX 970, compiled on compute capability 5.2, using CUDA 8.0.
The atomic instructions, like all instructions, are scheduled per warp. However there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage through the pipeline. This gives rise to the possibility for your observations.
I believe a simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where we have multiple issue per clock within an SM and even across SMs, to as additional examples.
If the vector length were short enough (16 bytes or less) then it should be possible to accomplish this ("atomic update") simply by having a thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) are attempting to update a naturally aligned location, the write should not be corrupted by other writes.
However, after discussion in the comments, it seems that OP's goal is to be able to have a warp or threadblock update a vector of some length, without interference from other warps or threadblocks. It seems to me that really what is desired is access control (so that only one warp or threadblock is updating a particular vector at a time) and OP had some code that wasn't working as desired.
This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.
What follows is an example producer-consumer code, where there are multiple threadblocks that are updating a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:
vector is empty
vector is being filled
vector is filled, ready for "consumption"
with this 3-level scheme, we can allow for ordinary access to the vector by both consumer and multiple producer workers, with a single ordinary atomic variable access mechanism. Here is an example code:
#include <assert.h>
#include <iostream>
#include <stdio.h>
const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64; // needs to be greater than the maximum number of SMs for any GPU that it will be run on
__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0}; // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;
static __device__ __inline__ int __mysmid(){
int smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
// this code won't work on a GPU with a single SM!
__global__ void kernel(){
__shared__ volatile int done, update, next_slot;
int my_block_id = atomicAdd(&block_id, 1);
int my_sm = __mysmid();
if (my_block_id == 0){
if (!threadIdx.x){
restricted_sm = my_sm;
__threadfence();
// I am "block 0" and process the vectors, checking for coherency
// "consumer"
next_slot = 0;
volatile int *vslot_control = slot_control;
volatile int *vslots = slots;
int scount = 0;
while(scount < max_act){
if (vslot_control[next_slot] == slot_full){
scount++;
int slot_val = vslots[next_slot*slot_length];
for (int i = 1; i < slot_length; i++) if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */}
observations[slot_val]++;
vslot_control[next_slot] = slot_empty;
correct++;
__threadfence();
}
next_slot++;
if (next_slot >= num_slots) next_slot = 0;
}
}}
else {
// "producer"
while (restricted_sm < 0); // wait for signaling
if (my_sm == restricted_sm) return;
next_slot = 0;
done = 0;
__syncthreads();
while (!done) {
if (!threadIdx.x){
while (atomicCAS(slot_control+next_slot, slot_empty, slot_filling) > slot_empty) {
next_slot++;
if (next_slot >= num_slots) next_slot = 0;}
// we grabbed an empty slot, fill it with my_sm
if (atomicAdd(&num_act, 1) < max_act) update = 1;
else {done = 1; update = 0;}
}
__syncthreads();
if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
__threadfence(); //enforce ordering
if ((update) && (!threadIdx.x)){
slot_control[next_slot] = 2; // mark slot full
atomicAdd(actives+my_sm, 1);}
__syncthreads();
}
}
}
int main(){
kernel<<<256, slot_length>>>();
cudaDeviceSynchronize();
cudaError_t res= cudaGetLastError();
if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
int *h_obs = new int[max_sm];
int *h_act = new int[max_sm];
int h_correct;
cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
int h_total_act = 0;
int h_total_obs = 0;
for (int i = 0; i < max_sm; i++){
std::cout << h_act[i] << "," << h_obs[i] << " ";
h_total_act += h_act[i];
h_total_obs += h_obs[i];}
std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}
I don't claim this code to be defect free for any use case. It is advanced to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on linux, on a couple different systems I tested it on. It should not be run on GPUs that have only a single SM, as one SM is reserved for the consumer, and the remaining SMs are used by the producers.
There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two arrays into shared memory. Now, the size of the arrays is given by a variable that is read into the program after the start of execution. Because of this, I cannot use that variable to define the size of the arrays, due to the fact that defining the size of shared arrays requires knowing the value at compile time. I do not want to do something like __shared__ double arr1[1000] because typing in the size by hand is useless to me as that will change depending on the input. In the same vein, I cannot use #define to create a constant for the size.
Now I can follow an example similar to what is in the manual (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared) such as
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
But this still runs into an issue. From what I've read, defining a shared array always makes the memory address the first element. That means I need to make my second array shifted over by the size of the first array, as they appear to do in this example. But the size of the first array is dependent on user input.
Another question (Cuda Shared Memory array variable) has a similar issue, and they were told to create a single array that would act as the array for both arrays and simply adjust the indices to properly match the arrays. While this does seem to do what I want, it looks very messy. Is there any way around this so that I can still maintain two independent arrays, each with sizes that are defined as input by the user?
When using dynamic shared memory with CUDA, there is one and only one pointer passed to the kernel, which defines the start of the requested/allocated area in bytes:
extern __shared__ char array[];
There is no way to handle it differently. However this does not prevent you from having two user-sized arrays. Here's a worked example:
$ cat t501.cu
#include <stdio.h>
__global__ void my_kernel(unsigned arr1_sz, unsigned arr2_sz){
extern __shared__ char array[];
double *my_ddata = (double *)array;
char *my_cdata = arr1_sz*sizeof(double) + array;
for (int i = 0; i < arr1_sz; i++) my_ddata[i] = (double) i*1.1f;
for (int i = 0; i < arr2_sz; i++) my_cdata[i] = (char) i;
printf("at offset %d, arr1: %lf, arr2: %d\n", 10, my_ddata[10], (int)my_cdata[10]);
}
int main(){
unsigned double_array_size = 256;
unsigned char_array_size = 128;
unsigned shared_mem_size = (double_array_size*sizeof(double)) + (char_array_size*sizeof(char));
my_kernel<<<1,1, shared_mem_size>>>(256, 128);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -arch=sm_20 -o t501 t501.cu
$ cuda-memcheck ./t501
========= CUDA-MEMCHECK
at offset 10, arr1: 11.000000, arr2: 10
========= ERROR SUMMARY: 0 errors
$
If you have a random arrangement of arrays of mixed data types, you'll want to either manually align your array starting points (and request enough shared memory) or else use alignment directives (and be sure to request enough shared memory), or use structures to help with alignment.
I'm using cub::DeviceScan functiona and the sample code snippet has a parameter temp_storage_bytes, which it uses to allocate memory (which, incidentally, the code snippet never frees).
The code snippet calls cub::DeviceScan functions with a pointer to NULL memory which triggers it to calculate the required amount of temporary device memory needed for the function, and then returns. The necessary temporary memory is allocated with cudaMalloc, and the function call is repeated pointing to this memory. The temporary memory is then freed with cudaFree (or probably should be).
I'm doing many repetitions of the device scan on different float arrays, but each float array is identical length.
My question is, can I assume that temp_storage_bytes will always be the same value? If so, I can then do a single cudaMalloc and a single cudaFree for many function calls.
The example is unclear on how the required memory is determined and whether it can change for a given array of a given length.
You can assume you need only one call to cub::DeviceScan::InclusiveScan to determine the amount of temporary temp_storage_bytes bytes required if you have repeated calls to cub::DeviceScan::InclusiveScan over different arrays of same length. In the example below, I'm calling several times cub::DeviceScan::InclusiveScan over different arrays of same length and using only one call to cub::DeviceScan::InclusiveScan to determine the amount of temporary size-
// Ensure printing of CUDA runtime errors to console
#define CUB_STDERR
#include <stdio.h>
#include <algorithm> // std::generate
#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>
#include <thrust\device_vector.h>
#include <thrust\host_vector.h>
void main(void)
{
// Declare, allocate, and initialize device pointers for input and output
int num_items = 7;
thrust::device_vector<int> d_in(num_items);
thrust::device_vector<int> d_out(num_items);
// Determine temporary device storage requirements for inclusive prefix sum
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in.data(), d_out.data(), num_items);
// Allocate temporary storage for inclusive prefix sum
cudaMalloc(&d_temp_storage, temp_storage_bytes);
for (int k=0; k<10; k++) {
thrust::host_vector<int> h_in(num_items);
thrust::host_vector<int> h_out(num_items,0);
std::generate(h_in.begin(), h_in.end(), rand);
d_in = h_in;
// Run inclusive prefix sum
cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in.data(), d_out.data(), num_items);
int difference = 0;
int prev = 0;
for (int i=0; i<num_items; i++) {
h_out[i] = prev + h_in[i];
prev = h_out[i];
int val = d_out[i];
printf("%i %i %i %i\n",i,difference,h_out[i],d_out[i]);
difference = difference + abs(h_out[i] - d_out[i]);
}
if (difference == 0) printf("Test passed!\n");
else printf("A problem occurred!\n");
h_in.shrink_to_fit();
h_out.shrink_to_fit();
}
getchar();
}
I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems in doing so in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).
In a few words, I put the results of my kernel in a bi-dimensional array of shorts, where the first dimension of such array is the number of threads, so that each one can write in different locations.
Here an example:
CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter
// Invoke kernel with pointer_dev as parameter. Now it should contain some results
CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel
cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // Its seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction
What am I doing wrong?
EDIT:
Apparently, there are some limitations that doesn't allow device allocated memory to be copied back to host, as stated in this (and many more) threads: link
Any workaround? I'm using CUDA Toolkit v5.0
Here we are copying a two dimensional array of integers from the device to host.
First, create a single dimensional array with size equal to size of another single dimension array (here blockSizeX).
CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
for (int i = 0; i < blockSizeX; i++)
{
hostDevicePointers[i] = new CUdeviceptr();
cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
}
Allocate device memory for the array of pointers that point to the other array, and copy array pointers from the host to the device.
CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);
Launch the kernel.
kernelLauncher.call(........, hostDevicePointersArray);
Transfer the output from the device to host.
int hostOutputData[] = new int[numberofelementsInArray * blockSizeX];
cuMemcpyDtoH(Pointer.to(hostOutputData), hostDevicePointers[i], numberofelementsInArray * blockSizeX * Sizeof.INT);
for (int j = 0; j < size; j++)
{
sum = sum + hostOutputData[j];
}
I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular doble loop (for every point in Table1{ for every point in Table2{...} } and then do the calculation for every pair of points in paralell.
The euclidean distance is then split in 3 parts:
1. take the difference between each dimension in the points
2. square that difference (still for every dimension)
3. sum all the values obtained in 2.
4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)
Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3. of the procedure described above, line 49 of the code below).
As test data I'm using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,...,127,128; (p1)
129,130,131,...,255,256; (p2)
DescriptorList2: 000,001,002,...,126,127; (p1)
128,129,130,...,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
Right now I'm getting random numbers that vary with every run.
I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.
#define BLOCK_SIZE 128
typedef struct
{
//How large each point is
int length;
//How many points in every list
int num_elements;
//Pointer to the elements of the descriptor (stored as a raw array)
__global float *elements;
} DescriptorList;
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//temporary array to store the difference between each dimension of 2 points
float dif_acum[BLOCK_SIZE];
//counter to track the iterations of the inner loop
int loop = 0;
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the i-th descriptor. Returns a DescriptorList with just the i-th
//descriptor in DescriptorList A
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory.
//returns one element of the only descriptor in DescriptorList tmpA
//and index featA
As[featA] = GetElement(tmpA, 0, featA);
//wait for all the threads to finish copying before continuing
barrier(CLK_LOCAL_MEM_FENCE);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current points
dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
//wait again
barrier(CLK_LOCAL_MEM_FENCE);
//square value of the difference in dif_acum and store in C
//which is where the results should be stored at the end.
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
loop += 1;
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}
Your problem lies in these lines of code:
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
All threads in your workgroup (well, actually all your threads, but lets come to to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:
The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be called after other threads have already executed the next line
Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that writeback is (I think one of the threads succeeds in writing back, while the others fail, but I'm not completely sure), but its wrong either way.
Now lets fix this:
While we might be able to get this to work on global memory using atomics, it won't be fast, so lets accumulate in local memory:
local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
{
if ((featA % (2*i)) == 0)
accum[featA] += accum[featA + i];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
Of course you can reuse other local buffers for this, but I think the point is clear (btw: Are you sure that dif_acum will be created in local memory, because I think I read somewhere that this wouldn't be put in local memory, which would make preloading A into local memory kind of pointless).
Some other points about this code:
Your code is seems to be geared to using only on workgroup (you aren't using either groupid nor global id to see which items to work on), for optimal performance you might want to use more then that.
Might be personal preferance, but I to me it seems better to use get_local_size(0) for the workgroupsize than to use a Define (since you might change it in the host code without realizing you should have changed your opencl code to)
The barriers in your code are all unnecessary, since no thread accesses an element in local memory which is written by another thread. Therefore you don't need to use local memory for this.
Considering the last bullet you could simply do:
float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
This would make the code (not considering the first two bullets):
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
int loop = 0;
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
DescriptorList tmpA = GetDescriptor(A, i);
float As = GetElement(tmpA, 0, featA);
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCKSIZE; i *= 2)
{
if ((featA % (2*i)) == 0)
accum[featA] += accum[featA + i];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
barrier(CLK_LOCAL_MEM_FENCE);
loop += 1;
}
}
}
Thanks to Grizzly, I have now a working kernel. Some things I needed to modify based in the answer of Grizzly:
I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.
if(featA > BLOCK_SIZE){return;}
When copying the first descriptor to local (shared) memory (i.g. to Bs), the index has to be specified since the function GetElement returns just one element per call (I skipped that on my question).
Bs[featA] = GetElement(tmpA, 0, featA);
Then, the SCAN loop needed a little tweaking because the buffer is being overwritten after each iteration and one cannot control which thread access the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and that way, prevent inconsistencies throughout that loop.
dif_acum[featA] = accum[featA];
There are also some boundary control in the SCAN loop to reliably determine the terms to be added together.
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
That's it. I still wonder how can I make use of group_size to eliminate that BLOCK_SIZE definition and if there are better policies I can adopt regarding thread usage.
So the code looks finally like this:
__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//global counter to store final differences
int loop = 0;
//auxiliary buffer to store temporary data
local float dif_acum[BLOCK_SIZE];
//discard the threads that are not going to be used.
if(featA > BLOCK_SIZE){
return;
}
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the gpidA-th descriptor
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory
Bs[featA] = GetElement(tmpA, 0, featA);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current descriptors
dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];
//square the values in dif_acum
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
//copy the values of accum to keep consistency once the scan procedure starts. Mostly important for the first element. Two buffers are necesarry because the scan procedure would override values that are then further read if one buffer is being used instead.
dif_acum[featA] = accum[featA];
//Compute the accumulated sum (a.k.a. scan)
for(int j = 1; j < BLOCK_SIZE; j *= 2){
int next_addend = featA-(j/2);
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
dif_acum[featA] = accum[featA] + accum[next_addend];
}
barrier(CLK_LOCAL_MEM_FENCE);
//copy As to accum
accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA);
barrier(CLK_LOCAL_MEM_FENCE);
}
//tell one of the threads to write the result of the scan in the array containing the results.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}