Cumulative array summation using OpenCL - cuda

I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular double loop (for every point in Table1 { for every point in Table2 { ... } }) and then do the calculation for every pair of points in parallel.
The Euclidean distance computation is then split into the following steps (a small CPU reference sketch follows the list):
1. take the difference between each dimension in the points
2. square that difference (still for every dimension)
3. sum all the values obtained in 2.
4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)
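As a plain CPU reference for steps 1 to 3 (the square root of step 4 omitted, as noted above), something like the following sketch can be used to validate the kernel output; the function and variable names are just for illustration:
#include <vector>
#include <cstddef>
// CPU reference: squared Euclidean distance from every point in list1
// to every point in list2 (steps 1-3; the square root of step 4 is omitted).
// Points are stored consecutively, `dim` floats per point.
std::vector<float> pairwiseSquaredDistances(const std::vector<float>& list1,
                                            const std::vector<float>& list2,
                                            std::size_t dim)
{
    const std::size_t n1 = list1.size() / dim;
    const std::size_t n2 = list2.size() / dim;
    std::vector<float> result(n1 * n2, 0.0f);
    for (std::size_t i = 0; i < n1; ++i) {          // every point in list1
        for (std::size_t k = 0; k < n2; ++k) {      // every point in list2
            float acc = 0.0f;
            for (std::size_t d = 0; d < dim; ++d) { // steps 1 and 2
                float diff = list1[i * dim + d] - list2[k * dim + d];
                acc += diff * diff;                 // step 3
            }
            result[i * n2 + k] = acc;
        }
    }
    return result;
}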
Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3. of the procedure described above, line 49 of the code below).
As test data I'm using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,...,127,128; (p1)
129,130,131,...,255,256; (p2)
DescriptorList2: 000,001,002,...,126,127; (p1)
128,129,130,...,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
Right now I'm getting random numbers that vary with every run.
I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.
#define BLOCK_SIZE 128
typedef struct
{
    //How large each point is
    int length;
    //How many points in every list
    int num_elements;
    //Pointer to the elements of the descriptor (stored as a raw array)
    __global float *elements;
} DescriptorList;
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    //temporary array to store the difference between each dimension of 2 points
    float dif_acum[BLOCK_SIZE];
    //counter to track the iterations of the inner loop
    int loop = 0;
    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        //take the i-th descriptor. Returns a DescriptorList with just the i-th
        //descriptor in DescriptorList A
        DescriptorList tmpA = GetDescriptor(A, i);
        //copy the current descriptor to local memory.
        //returns one element of the only descriptor in DescriptorList tmpA
        //and index featA
        As[featA] = GetElement(tmpA, 0, featA);
        //wait for all the threads to finish copying before continuing
        barrier(CLK_LOCAL_MEM_FENCE);
        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current points
            dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
            //wait again
            barrier(CLK_LOCAL_MEM_FENCE);
            //square value of the difference in dif_acum and store in C
            //which is where the results should be stored at the end.
            C[loop] = 0;
            C[loop] += dif_acum[featA]*dif_acum[featA];
            loop += 1;
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}

Your problem lies in these lines of code:
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
All threads in your workgroup (well, actually all your threads, but let's come to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:
The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be called after other threads have already executed the next line.
Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that writeback is (I think one of the threads succeeds in writing back, while the others fail), but it's wrong either way.
Now let's fix this:
While we might be able to get this to work on global memory using atomics, it won't be fast, so let's accumulate in local memory:
local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCK_SIZE; i *= 2)
{
    if ((featA % (2*i)) == 0)
        accum[featA] += accum[featA + i];
    barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
    C[loop] = accum[0];
Of course you can reuse other local buffers for this, but I think the point is clear (btw: are you sure that dif_acum will be created in local memory? I think I read somewhere that such per-thread arrays are not put in local memory, which would make preloading A into local memory kind of pointless).
Some other points about this code:
Your code seems to be geared towards using only one workgroup (you aren't using either the group id or the global id to decide which items to work on); for optimal performance you might want to use more than that.
Might be personal preference, but to me it seems better to use get_local_size(0) for the workgroup size than to use a #define (since you might change it in the host code without realizing you should have changed your OpenCL code too).
The barriers in your code are all unnecessary, since no thread accesses an element in local memory which is written by another thread. Therefore you don't need to use local memory for this.
Considering the last bullet you could simply do:
float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
This would make the code (not considering the first two bullets):
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    int loop = 0;
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        DescriptorList tmpA = GetDescriptor(A, i);
        float As = GetElement(tmpA, 0, featA);
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
            accum[featA] = dif_acum*dif_acum;
            barrier(CLK_LOCAL_MEM_FENCE);
            for(unsigned int i = 1; i < BLOCK_SIZE; i *= 2)
            {
                if ((featA % (2*i)) == 0)
                    accum[featA] += accum[featA + i];
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if(featA == 0)
                C[loop] = accum[0];
            barrier(CLK_LOCAL_MEM_FENCE);
            loop += 1;
        }
    }
}

Thanks to Grizzly, I now have a working kernel. Some things I needed to modify based on Grizzly's answer:
I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.
if(featA > BLOCK_SIZE){return;}
When copying the first descriptor to local (shared) memory (i.e. to Bs), the index has to be specified since the function GetElement returns just one element per call (I skipped that in my question).
Bs[featA] = GetElement(tmpA, 0, featA);
Then, the SCAN loop needed a little tweaking because the buffer is overwritten after each iteration and one cannot control which thread accesses the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and, that way, prevent inconsistencies throughout that loop.
dif_acum[featA] = accum[featA];
There is also some boundary control in the SCAN loop to reliably determine the terms to be added together.
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
That's it. I still wonder how I can make use of the group size to eliminate that BLOCK_SIZE definition and whether there are better policies I can adopt regarding thread usage.
So the code looks finally like this:
__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    //global counter to store final differences
    int loop = 0;
    //auxiliary buffer to store temporary data
    local float dif_acum[BLOCK_SIZE];
    //discard the threads that are not going to be used.
    if(featA > BLOCK_SIZE){
        return;
    }
    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        //take the i-th descriptor
        DescriptorList tmpA = GetDescriptor(A, i);
        //copy the current descriptor to local memory
        Bs[featA] = GetElement(tmpA, 0, featA);
        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current descriptors
            dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];
            //square the values in dif_acum
            accum[featA] = dif_acum[featA]*dif_acum[featA];
            barrier(CLK_LOCAL_MEM_FENCE);
            //copy the values of accum to keep consistency once the scan procedure starts.
            //Mostly important for the first element. Two buffers are necessary because the
            //scan procedure would override values that are then further read if only one
            //buffer were used.
            dif_acum[featA] = accum[featA];
            //Compute the accumulated sum (a.k.a. scan)
            for(int j = 1; j < BLOCK_SIZE; j *= 2){
                int next_addend = featA-(j/2);
                if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
                    dif_acum[featA] = accum[featA] + accum[next_addend];
                }
                barrier(CLK_LOCAL_MEM_FENCE);
                //copy dif_acum back to accum
                accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA);
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            //tell one of the threads to write the result of the scan in the array containing the results.
            if(featA == 0){
                C[loop] = accum[BLOCK_SIZE-1];
                loop += 1;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}

Related

find nearest non-zero element in another vector in CUDA

There are two M x N matrices, A and B (the actual size of each matrix is 512 x 4096).
In each row of A, the points to be processed are set to 1.
And each row of B contains values obtained through a specific operation.
Working row by row, I want to do an operation that gets the value of B that is closest to each point set to 1 in A.
The example is shown in the figure below, and the MATLAB code I wrote is also included.
Here's how I thought of it:
Pick the non-zero element indices of A with thrust, and for each element fetch the closest value from the corresponding row of B with a for loop.
(If there are several non-zero elements in A, it is expected to be slow.)
I want to make good use of the power of the GPU for this operation, do you have any more efficient ideas?
[idxY,idxX] = find(A == 1);
for Point = 1:length(idxY)
    pointBuf = find(B(:,idxY(Point)) == 1); % find non-zero elements in Row of B
    if ~isempty(pointBuf) % there are non-zero elements in Row of B
        [MinValue, MinIndex] = min(abs(pointBuf - idxY(Point)));
        C(idxY(Point),idxX(Point)) = B(pointBuf(MinIndex(1)),RangeInd(Point)); % Get closest point in B
    else
        C(DopInd(Point),RangeInd(Point)) = 0; % if there are no non-zero elements in Row of B, just set to 0
    end
end
Just as a reference, here is a solution which shifts left and right by 4095. It has similarities with bubble sort variants, which bubble up and down at the same time.
The advantage is that it does not depend on the position of the non-null elements in B and can be easily parallelized between threads.
But the inner loop, which translates to 2 SASS instructions, is still just too slow (it is called too often): the program takes 26 ms on my notebook.
It would do so in the best and the absolute worst case of the input matrices.
Parts and methods of it probably can be reused, as it shows some CUDA programming methods.
So more or less for reference, in the end not a final (fast enough) solution:
__global__ void calcmatrix(bool* A, double* B, double* C)
{
// calculate row number
int row = blockIdx.x * blockDim.y + threadIdx.y;
if (row >= 512)
return;
// store index of valid double from B, this is moved up and down
// those indices are for the current thread. Each thread is responsible for 128 consecutive columns.
int indices[128];
// prefill the indices with their own number (as if every double from B is valid)
#pragma unroll
for (int i = 0; i < 128; i++)
indices[i] = threadIdx.x * 128 + i;
// Store zero flags (4 * 32 bits) for our 128 elements
unsigned int myzeroflags[4];
// For efficiently loading data from memory, we distribute the data in another way: thread 0 gets columns 0, 32, 64, 96, ...; thread 1 gets columns 1, 33, 65, 97, ...; thread 2 gets columns 2, 34, 66, 98, ...; and so on
#pragma unroll
for (int i = 0; i < 128; i++) {
// load value from B
double in = B[row * 4096 + i * 32 + threadIdx.x];
// compare to zero (!in) and combine all bool results from the 32 threads (__ballot_sync))
unsigned int zeroflag = __ballot_sync(0xFFFFFFFF, !in);
// store the ones, which belong to us
if (threadIdx.x == i / 4)
myzeroflags[i & 3] = zeroflag;
}
// go through our zero flags and set those indices to -1 (there is already a valid index "0", so we use a negative number to signify invalid)
#pragma unroll
for (int i = 0; i < 4; i++)
#pragma unroll
for (int j = 0; j < 32; j++)
if (myzeroflags[i] & (1 << j))
indices[i * 32 + j] = -1;
// main loop, do 4095 times
#pragma unroll 1
for (int i = 0; i < 4095; i++) {
// move all elements to the left (if the index there is invalid)
// send index over thread boundaries
int fromright = __shfl_down_sync(0xFFFFFFFF, indices[0], 1, 32);
#pragma unroll
// if left index is -1, set it to one index to the right
for (int j = 0; j < 127; j++)
if (indices[j] == -1)
indices[j] = indices[j + 1];
// move over thread boundaries (except for the rightmost thread)
if (threadIdx.x != 31 && indices[127] == -1)
indices[127] = fromright;
// move to the right in the same way as to the left
int fromleft = __shfl_up_sync(0xFFFFFFFF, indices[127], 1, 32);
#pragma unroll
for (int j = 127; j > 0; j--)
if (indices[j] == -1)
indices[j] = indices[j - 1];
if (threadIdx.x != 0 && indices[0] == -1)
indices[0] = fromleft;
}
// for the other distribution of elements for memory accesses, we have to redistribute the indices to the correct threads
// To not have bank conflicts, we define the shared memory array with 33 instead of 32 elements in the last dimension, but use only 32. With this method we can put threadIdx.x into the last and previous to last dimension without bank conflicts
__shared__ short2 distribidx[8][32][33];
int indices2[128];
// Redistribute first half; the index can go from 0..4095 (and also theoreticially -1, if there was no non-null element in this row). This fits into a short, convert for faster transfer
#pragma unroll
for (int i = 0; i < 32; i++)
distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i]), static_cast<short>(indices[i + 32]) };
__syncwarp();
#pragma unroll
for (int i = 0; i < 32; i++) {
short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
indices2[4 * i + 0] = idxback.x;
indices2[4 * i + 1] = idxback.y;
}
__syncwarp();
// Redistribute second half
#pragma unroll
for (int i = 0; i < 32; i++)
distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i + 64]), static_cast<short>(indices[i + 96]) };
__syncwarp();
#pragma unroll
for (int i = 0; i < 32; i++) {
short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
indices2[4 * i + 2] = idxback.x;
indices2[4 * i + 3] = idxback.y;
}
// Do final calculation
#pragma unroll
for (int i = 0; i < 128; i++) {
// Default value is zero
double result = 0;
// Read only, if A is true and indices2 is valid
if (A[row * 4096 + i * 32 + threadIdx.x] && indices2[i] != -1)
// Read B with calculated index (this read is not optimized/coalesced, because the indices can be wild, but hopefully was or can be cached)
result = B[row * 4096 + indices2[i]];
// Store result in C
C[row * 4096 + i * 32 + threadIdx.x] = result;
}
}
int main()
{
bool* A;
double* B;
double* C;
cudaMalloc(&A, 2 * 512 * 4096);
cudaMalloc(&B, 8 * 512 * 4096);
cudaMalloc(&C, 8 * 512 * 4096);
// called in this fashion
calcmatrix<<<(512 + 7) / 8, dim3(32, 8)>>>(A, B, C);
return 0;
}
This problem is really far from being simple to implement efficiently on a GPU. The main reason is that GPUs are designed to efficiently execute SIMD-friendly algorithms, while this problem can hardly be solved in a SIMD-friendly way.
The naive solution you propose will be very inefficient due to the many small kernels to execute (starting a kernel is expensive and Thrust tends to run them synchronously by default AFAIK), not to mention the amount of parallelism of each kernel would be far too small for any modern GPU. I expect this solution to be slower than a naive CPU implementation.
First things first, one needs to find an efficient algorithm. The proposed solution runs in O(n m²) where n is the number of rows and m the number of columns. That being said, the solution should be fast (i.e. close to O(n m)) if most values are non-zero, which is not the case in the example.
A more efficient solution is to first iterate over the B matrix and find the locations of all the non-zero items so as to put them in an array L. Then you can iterate over A, track the non-zero values and search for the index of L closest to the location of the current item in A. If the number of items in L is big for the target row (e.g. >50), you can use a binary search so as to find the location faster (since the items of L are sorted). This solution runs in O(n m log m) time.
An even better solution is to iterate simultaneously over A and L, like in a merge algorithm. Indeed, the indices of A and the items of L are both sorted, so the binary search is not even needed. When the index of the current non-zero item of A is bigger than the current item of L, you can step to the next value of L (and memorize the last discarded value of L, which is needed to compute the closest value). This algorithm runs in O(n m) (optimal). An efficient CPU implementation consists in computing chunks of rows, each in one of many threads.
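As a concrete CPU-side illustration of that merge pass for one row, under the assumption of dense bool/double rows as in the question and with made-up names:
#include <vector>
// One row of the O(m) merge pass: for every flagged column of a (a[p] == true),
// write the value of b at the nearest non-zero column of b into c, else 0.
// All names here are illustrative, not taken from the original code.
void nearestNonZeroRow(const std::vector<bool>& a,
                       const std::vector<double>& b,
                       std::vector<double>& c)
{
    const int m = static_cast<int>(b.size());
    // L: sorted column indices of the non-zero entries of b
    std::vector<int> L;
    for (int j = 0; j < m; ++j)
        if (b[j] != 0.0)
            L.push_back(j);
    int k = 0;  // first element of L that is >= current column p
    for (int p = 0; p < m; ++p) {
        while (k < (int)L.size() && L[k] < p)
            ++k;                               // advance the merge pointer, O(m) overall
        if (!a[p] || L.empty()) { c[p] = 0.0; continue; }
        // candidates: last discarded index (L[k-1]) and current index (L[k])
        int best;
        if (k == (int)L.size())   best = L[k - 1];
        else if (k == 0)          best = L[k];
        else                      best = (p - L[k - 1] <= L[k] - p) ? L[k - 1] : L[k];
        c[p] = b[best];
    }
}
The outer loop over rows (or a parallel for over rows) then gives the O(n m) total described above.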
On a GPU, things are more complex since all the previously described algorithms are not SIMD-friendly. Computing a row in a SIMD-friendly way turns out to be complex and generally inefficient (the overhead can be higher than the serial algorithm on a CPU). One possible solution would be to compute rows in parallel (1 thread per row) and transpose the matrix block by block in shared memory so as to perform SIMD-friendly memory accesses after that (assuming there is enough space). The non-zero values of A and B certainly need to be extracted first so as to avoid thread divergence as much as possible. This solution works only if the number of non-zeros is relatively uniform between the rows (otherwise I doubt a GPU can actually be helpful). Note the overhead of the transposition can be significant compared to the computation. Thus, I am not sure it will be faster than a CPU-based solution. In fact, if the data lies in CPU memory, then just transferring the data to the GPU will certainly be more expensive than computing the result on a CPU in parallel.
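To illustrate what "transpose the matrix block per block in shared memory" typically means in practice, here is a generic textbook-style tile transpose (not code from this thread); the +1 padding and the coalesced read/write pattern are the point:
#define TILE 32
// Transpose one TILE x TILE tile of a rows x cols matrix through shared memory.
// The +1 padding avoids shared-memory bank conflicts on the transposed reads.
__global__ void transposeTile(const double *in, double *out, int rows, int cols)
{
    __shared__ double tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();
    int tx = blockIdx.y * TILE + threadIdx.x;  // column in the output (= input row)
    int ty = blockIdx.x * TILE + threadIdx.y;  // row in the output (= input column)
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}
For 512 x 4096 doubles it would be launched with dim3(TILE, TILE) threads over a dim3(cols/TILE, rows/TILE) grid, i.e. dim3(32, 32) and dim3(128, 16).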

CUDA: Copy 1D array from GPU to 2D array on host

int main() {
    char** hMat,* dArr;
    hMat = new char*[10];
    for (int i=0;i<10;i++) {
        hMat[i] = new char[10];
    }
    cudaMalloc((void**)&dArr,100);
    // Copy from dArr to hMat here:
}
I have an array, dArr on the GPU, and I want to copy it into a 2D array hMat on the host, where the first 10 fields in the GPU array are copied to the first row in the host matrix, and the next 10 fields are copied to the second row, and so on.
There are some functions in the documentation, namely cudaMemcpy2D and cudaMemcpy2DFromArray, but I'm not quite sure how they should be used.
Your allocation scheme (an array of pointers, each row separately allocated) has the potential to create a discontiguous allocation on the host. No cudaMemcpy operation of any type (including the ones you mention) can target such an arbitrarily discontiguous area.
In a nutshell, then, your approach is troublesome. It can be made to work, but will require a loop to perform the copying -- essentially one cudaMemcpy operation per "row" of your "2D array". If you choose to do that, presumably you don't need help. It's quite straightforward.
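For completeness, the per-row loop would look something like this (a sketch assuming the allocation from the question, with the device array holding the 10x10 elements row after row):
// Hypothetical per-row copy, one cudaMemcpy per host row (slow but workable)
for (int i = 0; i < 10; i++) {
    cudaMemcpy(hMat[i],            // destination: i-th separately allocated host row
               dArr + i * 10,      // source: i-th group of 10 chars in the device array
               10 * sizeof(char),
               cudaMemcpyDeviceToHost);
}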
What I will suggest is that you instead modify your host allocation to create an underlying contiguous allocation. Such a region can be handled by a single, ordinary cudaMemcpy call, but you can still treat it as a "2D array" in host code.
The basic idea is to create a single allocation of the correct overall size, then to create a set of pointers to specific places within the single allocation, where each "row" should start. You then reference into this pointer array using your initial double-pointer.
Something like this:
#include <stdio.h>
typedef char mytype;
int main(){
    const int rows = 10;
    const int cols = 10;
    mytype **hMat = new mytype*[rows];
    hMat[0] = new mytype[rows*cols];
    for (int i = 1; i < rows; i++) hMat[i] = hMat[i-1]+cols;
    //initialize "2D array"
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            hMat[i][j] = 0;
    mytype *dArr;
    cudaMalloc(&dArr, rows*cols*sizeof(mytype));
    //copy to device
    cudaMemcpy(dArr, hMat[0], rows*cols*sizeof(mytype), cudaMemcpyHostToDevice);
    //kernel call
    //copy from device
    cudaMemcpy(hMat[0], dArr, rows*cols*sizeof(mytype), cudaMemcpyDeviceToHost);
    return 0;
}
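Since the question asked about cudaMemcpy2D: that call is intended for pitched (padded-row) allocations rather than for gathering separately allocated host rows. A sketch of how it is typically paired with cudaMallocPitch, reusing the contiguous host layout from the code above:
// Device allocation with padded rows; pitch is the actual row stride in bytes
mytype *dPitched;
size_t pitch;
cudaMallocPitch((void**)&dPitched, &pitch, cols * sizeof(mytype), rows);
// Host -> device: source stride is the packed row width, destination stride is the pitch
cudaMemcpy2D(dPitched, pitch,
             hMat[0], cols * sizeof(mytype),
             cols * sizeof(mytype), rows,
             cudaMemcpyHostToDevice);
// Device -> host: the strides swap roles
cudaMemcpy2D(hMat[0], cols * sizeof(mytype),
             dPitched, pitch,
             cols * sizeof(mytype), rows,
             cudaMemcpyDeviceToHost);
A kernel working on dPitched would then index each row with the returned pitch (in bytes) rather than with cols.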

CUDA binary search implementation

I am trying to speed up the CPU binary search. Unfortunately, the GPU version is always much slower than the CPU version. Perhaps the problem is not suitable for the GPU, or am I doing something wrong?
CPU version (approx. 0.6ms):
using sorted array of length 2000 and do binary search for specific value
...
Lookup ( search[j], search_array, array_length, m );
...
int Lookup ( int search, int* arr, int length, int& m )
{
    int l(0), r(length-1);
    while ( l <= r )
    {
        m = (l+r)/2;
        if ( search < arr[m] )
            r = m-1;
        else if ( search > arr[m] )
            l = m+1;
        else
        {
            return index[m];
        }
    }
    if ( arr[m] >= search )
        return m;
    return (m+1);
}
GPU version (approx. 20ms):
using sorted array of length 2000 and do binary search for specific value
....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....
__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val )
{
const int num_threads = blockDim.x * gridDim.x;
const int thread = blockIdx.x * blockDim.x + threadIdx.x;
int set_size = array_length;
ret_val[0] = -1; // return value
ret_val[1] = 0; // offset
while(set_size != 0)
{
// Get the offset of the array, initially set to 0
int offset = ret_val[1];
// I think this is necessary in case a thread gets ahead, and resets offset before it's read
// This isn't necessary for the unit tests to pass, but I still like it here
__syncthreads();
// Get the next index to check
int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
// If the index is outside the bounds of the array then lets not check it
if (index_to_check < array_length)
{
// If the next index is outside the bounds of the array, then set it to maximum array size
int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
if (next_index_to_check >= array_length)
{
next_index_to_check = array_length - 1;
}
// If we're at the mid section of the array reset the offset to this index
if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
{
ret_val[1] = index_to_check;
}
else if (search == arr[index_to_check])
{
// Set the return var if we hit it
ret_val[0] = index_to_check;
}
}
// Since this is a p-ary search divide by our total threads to get the next set size
set_size = set_size / num_threads;
// Sync up so no threads jump ahead and get a bad offset
__syncthreads();
}
}
Even if I try bigger arrays, the time ratio is not any better.
You have way too many divergent branches in your code so you're essentially serializing the entire process on the GPU. You want to break up the work so that all the threads in the same warp take the same path in the branch. See page 47 of the CUDA Best Practices Guide.
I must admit I'm not entirely sure what your kernel does, but am I right in assuming that you are looking for just one index that satisfies your search criteria? If so, then have a look at the reduction sample that comes with CUDA for some pointers on how to structure and optimize such a query. (What you are doing is essentially trying to reduce to the index closest to your query.)
Some quick pointers though:
You are performing an awful lot of reads and writes to global memory, which is incredibly slow. Try using shared memory instead.
Secondly, remember that __syncthreads() only syncs threads in the same block, so your reads/writes to global memory won't necessarily get synced across all threads (though the latency of your global memory writes may actually make it appear as if they do).
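One pattern that usually does pay off on the GPU is batching many independent queries and giving each thread its own ordinary binary search, so every thread does useful work and divergence only costs a few iterations per warp. This is a generic sketch with made-up names, not code from the thread:
// One thread per query: each thread runs a plain binary search on the sorted
// array and writes a result for its own query value.
__global__ void batchedLowerBound(const int *arr, int length,
                                  const int *queries, int *results, int num_queries)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= num_queries)
        return;
    int search = queries[q];
    int l = 0, r = length - 1, m = 0;
    while (l <= r)
    {
        m = (l + r) / 2;
        if (search < arr[m])
            r = m - 1;
        else if (search > arr[m])
            l = m + 1;
        else
            break;                              // exact match found at position m
    }
    // exact match: store m; miss: store the insertion position, like the CPU version
    results[q] = (arr[m] >= search) ? m : m + 1;
}
Launched e.g. as batchedLowerBound<<<(num_queries + 255) / 256, 256>>>(dev_arr, array_length, dev_queries, dev_results, num_queries), a single launch answers thousands of queries instead of dedicating 16 x 64 threads to one.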

How to synchronize global memory between multiple kernel launches?

I want to launch the following kernel multiple times in a FOR LOOP (pseudo):
__global__ void kernel(t_dev is input array in global mem) {
__shared__ PREC tt[BLOCK_DIM];
if (thid < m) {
tt[thid] = t_dev.data[ii]; // MEM READ!
}
... // MODIFY
__syncthreads();
if (thid < m) {
t_dev.data[thid] = tt[thid]; // MEM WRITE!
}
__threadfence(); // or __syncthreads(); //// NECESSARY!! but why?
}
What I do conceptually is: I read in values from t_dev, modify them, and write them out to global mem again! And then I start the same kernel again!!
Why do I apparently need the __threadfence or __syncthreads,
otherwise the results get wrong, because the memory writes are not finished when the same kernel starts again? That's what happens here; my GTX580 has device overlap enabled.
But why are global mem writes not finished when the next kernel starts... is this because of the device overlap, or is it always like that? I thought that when we launch kernel after kernel, mem writes/reads are finished after each kernel... :-)
Thanks for your answers!
SOME CODE :
for(int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++){
proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
mu_dev,x_new_dev,T_dev,x_old_dev,d_dev,
t_dev,
kernelAIdx,
pConvergedFlag_dev,
m_absTOL,m_relTOL);
proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
t_dev,
T_dev,
x_new_dev,
kernelAIdx
);
}
These are the two kernels which are in the loop; t_dev and x_new_dev are moved from Step A to Step B.
Kernel A looks as follows:
template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
__global__ void sorProxContactOrdered_1threads_StepA_kernel(
utilCuda::Matrix<PREC> mu_dev,
utilCuda::Matrix<PREC> y_dev,
utilCuda::Matrix<PREC> T_dev,
utilCuda::Matrix<PREC> x_old_dev,
utilCuda::Matrix<PREC> d_dev,
utilCuda::Matrix<PREC> t_dev,
int kernelAIdx,
int maxNContacts,
bool * convergedFlag_dev,
PREC _absTOL, PREC _relTOL){
//__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY
// Assumend 1 Block, with THREADS_PER_BLOCK Threads and Column Major Matrix T_dev
int thid = threadIdx.x;
int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
int i = kernelAIdx * BLOCK_DIM;
int ii = i + thid;
//First copy x_old_dev in shared
__shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if its in the limit!!
__shared__ PREC tt[BLOCK_DIM];
if(thid < m){
xx[thid] = x_old_dev.data[ii];
tt[thid] = t_dev.data[ii];
}
__syncthreads();
PREC absTOL = _absTOL;
PREC relTOL = _relTOL;
int jj;
//PREC T_iijj;
//Offset the T_dev_ptr to the start of the Block
PREC * T_dev_ptr = PtrElem_ColM(T_dev,i,i);
PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];
__syncthreads();
for(int j_t = 0; j_t < m ; j_t+=PROX_PACKAGE_SIZE){
//Select the number of threads we need!
// Here we process one [m x PROX_PACKAGE_SIZE] Block
// First Normal Direction ==========================================================
jj = i + j_t;
__syncthreads();
if( ii == jj ){ // select thread on the diagonal ...
PREC x_new_n = (d_dev.data[ii] + tt[thid]);
//Prox Normal!
if(x_new_n <= 0.0){
x_new_n = 0.0;
}
/* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
xx[thid] = x_new_n;
tt[thid] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
// Select only m threads!
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t];
}
// ====================================================================================
// wee need to syncronize here because one threads finished lambda_t2 with shared mem tt, which is updated from another thread!
__syncthreads();
// Second Tangential Direction ==========================================================
jj++;
__syncthreads();
if( ii == jj ){ // select thread on diagonal, one thread finishs T1 and T2 directions.
// Prox tangential
PREC lambda_T1 = (d_dev.data[ii] + tt[thid]);
PREC lambda_T2 = (d_dev.data[ii+1] + tt[thid+1]);
PREC radius = (*mu_dev_ptr) * xx[thid-1];
PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);
if(absvalue > radius){
lambda_T1 = (lambda_T1 * radius ) / absvalue;
lambda_T2 = (lambda_T2 * radius ) / absvalue;
}
/*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}
if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
//Write the two values back!
xx[thid] = lambda_T1;
tt[thid] = 0.0;
xx[thid+1] = lambda_T2;
tt[thid+1] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
}
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
}
// ====================================================================================
__syncthreads();
// move T_dev_ptr 1 column
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
// move mu_ptr to nex contact
__syncthreads();
mu_dev_ptr = &mu_dev_ptr[1];
__syncthreads();
}
__syncthreads();
// Write back the results, don't need to synchronize because
// do it anyway to be safe for testing first!
if(thid < m){
y_dev.data[ii] = xx[thid]; // THIS IS UPDATED IN KERNEL B
t_dev.data[ii] = tt[thid]; // THIS IS UPDATED IN KERNEL B
}
//__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORK!
}
I compare the solution at the end with the CPU, and HERE I put a syncthreads everywhere I can, only to be safe for a start! (This code does Gauss-Seidel stuff.)
But it does not work at all without the THREADFENCE at the END or at the BEGINNING, where it does not make sense...
Sorry for so much code, but probably you can guess where the problem comes from, because I am a bit at my wits' end with explaining why this happens.
We checked the algorithm several times; there is no memory error (reported from Nsight) or
other issue, everything works fine... Kernel A is launched with ONE block only!
If you launch successive instances of the kernel into the same stream, each kernel launch executes in order with respect to the kernel instances before and after it. The programming model guarantees it. CUDA only permits simultaneous kernel execution for kernels launched into different streams of the same context, and even then overlapping kernel execution only happens if the scheduler determines that sufficient resources are available to do so.
Neither __threadfence nor __syncthreads will have the effect you seem to be thinking about - __threadfence works only at the scope of all active threads and __syncthreads is an intra-block barrier operation. If you really want kernel to kernel synchronization, you need to use one of the host side synchronization calls, like cudaThreadSynchronize (pre CUDA 4.0) or cudaDeviceSynchronize (cuda 4.0 and later), or the per-stream equivalent if you are using streams.
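To make that ordering concrete, here is a minimal host-side sketch (kernelA, kernelB and runLoop are hypothetical stand-ins for the wrapped kernels in the question, not the actual wrappers). Launches into the same default stream already execute one after another, and an explicit synchronization is only needed when the host itself has to wait for the results:
// Placeholder kernels: each reads, modifies and writes t_dev
// (t_dev must hold at least as many floats as threads launched).
__global__ void kernelA(float *t_dev) { t_dev[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *t_dev) { t_dev[threadIdx.x] *= 2.0f; }

void runLoop(float *t_dev, int loops)
{
    for (int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++) {
        // Same (default) stream: kernel B will not start before this iteration's
        // kernel A has finished, and the next iteration's kernel A will not start
        // before this iteration's kernel B has finished. No __threadfence needed.
        kernelA<<<1, 256>>>(t_dev);
        kernelB<<<1, 256>>>(t_dev);
    }
    // Only needed when the host itself must observe the results, e.g. before
    // comparing against the CPU version (a synchronous cudaMemcpy would also wait).
    cudaDeviceSynchronize();
}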
While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.
Writes to global memory, with an exception of atomic functions, are not guaranteed to be immediately visible by other threads (from the same, or from different blocks). By putting __threadfence() you halt the current thread until the writes are in fact visible. This might be important in particular when you are using global memory with a cache (the Fermi series).
One thing to note: kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with your current one, but will launch as soon as the current one finishes, essentially hiding the latency caused by the CPU->GPU communication.
Using cudaThreadSynchronize halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Do note that using synchronous memory access (e.g. cudaMemcpy, without the "Async" suffix) essentially behaves like cudaThreadSynchronize too.

Parallel reduction and find index on CUDA

I have an array of 20K values and I am reducing it over 50 blocks with 400 threads each. num_blocks = 50 and block_size = 400.
My code looks like this:
getmax <<< num_blocks,block_size >>> (d_in, d_out1, d_indices);
__global__ void getmax(float *in1, float *out1, int *index)
{
// Declare arrays to be in shared memory.
__shared__ float max[threads];
int nTotalThreads = blockDim.x; // Total number of active threads
float temp;
float max_val;
int max_index;
int arrayIndex;
// Calculate which element this thread reads from memory
arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
max[threadIdx.x] = in1[arrayIndex];
max_val = max[threadIdx.x];
max_index = blockDim.x*blockIdx.x + threadIdx.x;
__syncthreads();
while(nTotalThreads > 1)
{
int halfPoint = (nTotalThreads >> 1);
if (threadIdx.x < halfPoint)
{
temp = max[threadIdx.x + halfPoint];
if (temp > max[threadIdx.x])
{
max[threadIdx.x] = temp;
max_val = max[threadIdx.x];
}
}
__syncthreads();
nTotalThreads = (nTotalThreads >> 1); // divide by two.
}
if (threadIdx.x == 0)
{
out1[num_blocks*blockIdx.y + blockIdx.x] = max[threadIdx.x];
}
if(max[blockIdx.x] == max_val )
{
index[blockIdx.x] = max_index;
}
}
The problem/issue here is that at some point “nTotalThreads” is not exactly a power of 2, resulting in a garbage value for the index. The array out1 gives me the maximum value in each block, which is correct and validated. But the value of the index is wrong. For example: the max value in the first block occurs at index=40, but the kernel gives the value of the index as 15. Similarly, the max value in the second block is at 440, but the kernel gives 416.
Any suggestions??
It should be easy to ensure that nTotalThreads is always a power of 2.
Make the first reduction a special case that gets nTotalThreads to a power of 2. E.g., since you start with 400 threads in a block, do the first reduction with 256 threads. Threads 0-199 will reduce from two values, and threads 200-255 just won't have to do a reduction in this initial step. From then on out you'd be fine.
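A sketch of that idea (my own illustration, not the poster's exact kernel): round the active count down to a power of two, fold the remaining tail in as a special first step, and carry the index along with the value so the argmax stays consistent:
#include <cfloat>
// Shared-memory argmax reduction for a block size that need not be a power of two.
// One block reduces blockDim.x elements starting at in + blockIdx.x * blockDim.x.
__global__ void argmaxPerBlock(const float *in, float *blockMax, int *blockArg, int n)
{
    extern __shared__ float sval[];               // blockDim.x floats
    int *sidx = (int *)&sval[blockDim.x];         // blockDim.x ints behind them
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    sval[tid] = (gid < n) ? in[gid] : -FLT_MAX;
    sidx[tid] = gid;
    __syncthreads();
    // Largest power of two not greater than blockDim.x (e.g. 256 for 400 threads).
    int active = 1;
    while (active * 2 <= blockDim.x) active *= 2;
    // Special first step: fold the "tail" (threads active..blockDim.x-1) into the front.
    if (tid < blockDim.x - active && sval[tid + active] > sval[tid]) {
        sval[tid] = sval[tid + active];
        sidx[tid] = sidx[tid + active];
    }
    __syncthreads();
    // From here on the count is a power of two, so the usual halving works.
    for (int half = active / 2; half > 0; half /= 2) {
        if (tid < half && sval[tid + half] > sval[tid]) {
            sval[tid] = sval[tid + half];
            sidx[tid] = sidx[tid + half];
        }
        __syncthreads();
    }
    if (tid == 0) {
        blockMax[blockIdx.x] = sval[0];
        blockArg[blockIdx.x] = sidx[0];
    }
}
It would be launched with dynamic shared memory, e.g. argmaxPerBlock<<<num_blocks, block_size, block_size * (sizeof(float) + sizeof(int))>>>(d_in, d_out1, d_indices, 20000);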
Are you sure you really need to handle the 'issue' of “nTotalThreads” not being exactly a power of 2?
It makes the code less readable and I think it can interfere with the performance too.
Anyway if you substitute
nTotalThreads = (nTotalThreads >> 1);
with
nTotalThreads = (nTotalThreads +1 ) >> 1;
it should solve one bug concerning this 'issue'.
Francesco
Second Jeff's suggestion.
Take a look at the CUDA Thrust Library's reduce function. This is demonstrated to have 95+% efficiency compared with heavily hand-tuned kernels and is pretty flexible and easy to use.
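For the max-plus-index case specifically, Thrust's max_element (rather than a plain reduce) already returns an iterator to the maximum, from which the index follows; a minimal sketch (d_in is assumed to be a device pointer to the data):
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>
// Find the value and position of the maximum of n floats with Thrust.
void findMax(const float *d_in, int n, float &maxVal, int &maxIdx)
{
    thrust::device_ptr<const float> begin(d_in);
    thrust::device_ptr<const float> result = thrust::max_element(begin, begin + n);
    maxIdx = static_cast<int>(result - begin);  // index of the maximum
    maxVal = *result;                           // copies the value back to the host
}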
Check my kernel. You can write your block results to an array (which can be in global memory) and get the final result in global memory.
And see how I call it in host code:
sumSeries<<<dim3(blockCount),dim3(threadsPerBlock)>>>(deviceSum,threadsPerBlock*blockCount);