Issues with CUDA streams

I am running CUBLAS v2.0 on different streams on a single GPU (Tesla C2050) by subdividing the input matrices (A[x/num_of_streams*y]B[xy] = C[x/num_of_streams*y]), but it takes more time when I use CUDA streams than when I do not. Here is the code snippet:
//plan is a struct containing the matrix dimensions and stream numbers
//The streams should run in parallel; at most 16 streams can run concurrently
//Copy A - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyA_in_streams (&plan[i]);
//Copy B - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyB_in_streams (&plan[i]);
//Create handles - serial
for(i = 0; i < nstreams; i++)
    handle[i] = create_handle();
//Run kernels - first doing a cublasSetStream(handle, plan->stream) before running cublasDgemm...
for(i = 0; i < nstreams; i++)
    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);
//Destroy handles - serial
for(i = 0; i < nstreams; i++)
    destroy_handle (handle[i]);
//Copy C - cudaMemcpyAsync
for(i = 0; i < nstreams; i++)
    cudgemm_copyC_in_streams (&plan[i]);
//EDIT: Function body
//The other two copy functions are exactly the same as this
void cudgemm_copyA_in_streams(TGPUplan *plan)
{
cudasafe(cudaMemcpyAsync(plan->Ad_Data, plan->Ah_Data, (plan->Acols * plan->Arows * sizeof(double)), cudaMemcpyHostToDevice, plan->stream) );
}
//Create handle
cublasHandle_t create_handle ()
{
cublasHandle_t handle;
checkError(cublasCreate(&handle), "cublasCreate() error!\n");
return handle;
}
//Destroy handle
void destroy_handle (cublasHandle_t handle)
{
checkError(cublasDestroy(handle), "cublasDestroy() error!\n");
}
//Kernel
void cudgemm_kernel_in_streams(TGPUplan *plan, cublasHandle_t handle, const double alpha, const double beta)
{
cublasStatus_t ret;
cublasSetStream(handle, plan->stream);
ret = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, plan->Arows, plan->Ccols, plan->Acols, &alpha, plan->Ad_Data, plan->Arows, plan->Bd_Data, plan->Brows, &beta, plan->Cd_Data, plan->Crows);
checkError(ret, "cublas Dgemm returned an error!\n");
}
So I am bouncing back and forth between streams and assigning work, expecting a better execution time, but the more streams I use, the longer the program takes compared to the version that does not use streams. Where am I going wrong?
Cross post to Nvidia forums - http://forums.nvidia.com/index.php?showtopic=209420
EDIT:
I modified my program as follows:
//copy data
for(i = 0; i < nstreams; i++)
{
cudgemm_copyA_in_streams (&plan[i]);
cudgemm_copyB_in_streams (&plan[i]);
}
//Run kernel and copy back
for(i = 0; i < nstreams; i++)
{
cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);
cudgemm_copyC_in_streams (&plan[i]);
}
When I profile my program for a matrix order of 6144, I get the following info:
Kernel time = 42.75 % of total GPU time
Memory copy time = 28.9 % of total GPU time
Kernel taking maximum time = fermiDgemm_v2_kernel_val (42.8% of total GPU time)
Memory copy taking maximum time = memcpyHtoDasync (21.7% of total GPU time)
Total overlap time in GPU = 65268.3 micro sec. (3.6% of total GPU time)
When I time the above loop, I get a time of 0.000284s, versus 1.703289s for the version that does not use streams (in that version I also time the two sequential memory copies, the kernel invocation and the remaining memcpy).
Since I am not using any synchronization constructs, I suspect I am printing the time before the computation actually finishes (I find it difficult to believe that the improvement is really that large).

I suggest two changes:
1) move your cuBLAS handle creation/destruction to outside the copies and kernel invocations. It's possible it is breaking concurrency by doing an unneeded context synchronize.
2) do the memcpy's together in one loop over the streams. That way, the B copy of stream 0 does not do any extra synchronization to wait until the A memcpy has been completed. i.e. do this:
for(i = 0; i < nstreams; i++) {
cudgemm_copyA_in_streams (&plan[i]);
cudgemm_copyB_in_streams (&plan[i]);
}
not this:
for(i = 0; i < nstreams; i++)
cudgemm_copyA_in_streams (&plan[i]);
for(i = 0; i < nstreams; i++)
cudgemm_copyB_in_streams (&plan[i]);
Don't be surprised if you are unable to get a speedup of more than 40% or so from overlapping transfers and computation. Streams deliver the biggest benefits on workloads that spend equal time transferring and processing data, and very few workloads fall into that category.
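To see why equal transfer and compute times are the best case, here is a minimal host-only C++ sketch of an idealized overlap model. It assumes (a simplification, not something measured in this thread) that copies and kernels overlap perfectly, so the overlapped run can never be shorter than the longer of the two phases. The function name `ideal_overlap_saving` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Idealized model: without streams the run takes copy + compute;
// with perfect overlap it cannot take less than max(copy, compute).
// Returns the largest fraction of the serial run time that overlap could hide.
// Both inputs must use the same unit (seconds, or percent of GPU time).
double ideal_overlap_saving(double copy_time, double compute_time) {
    assert(copy_time >= 0.0 && compute_time >= 0.0);
    const double serial = copy_time + compute_time;
    if (serial == 0.0) return 0.0;
    const double overlapped = std::max(copy_time, compute_time);
    return (serial - overlapped) / serial;
}
```

Under this model, equal copy and compute times give the upper bound of 0.5. Plugging in the profile split from the question (28.9 for copies, 42.75 for kernels) gives a bound of roughly 0.40 of the serial time; real runs will fall short of that because of pipeline start-up and stream management overhead.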

I would also suggest checking the size of the copies. Using several streams only pays off
when the time to transfer one block of memory is comparable to the time needed to compute on it.
If the transfer time is small compared to the computation time, adding streams only adds management overhead.
Use the Visual Profiler to see how long the various steps take, and plot the timings for different input sizes.

Related

Overlapping transfers and kernel executions in CUDA with two loops

I want to overlap data transfers and kernel executions in a form like this:
int numStreams = 3;
int size = 10;
for(int i = 0; i < size; i++) {
    cuMemcpyHtoDAsync( _bufferIn1,
                       _host_memoryIn1,
                       _size * sizeof(T),
                       cuda_stream[i % numStreams]);
    cuMemcpyHtoDAsync( _bufferIn2,
                       _host_memoryIn2,
                       _size * sizeof(T),
                       cuda_stream[i % numStreams]);
    cuLaunchKernel( _kernel,
                    gs.x(), gs.y(), gs.z(),
                    bs.x(), bs.y(), bs.z(),
                    _memory_size,
                    cuda_stream[i % numStreams],
                    _kernel_arguments,
                    0
    );
    // record the event in the same stream as the kernel it marks
    cuEventRecord(events[i], cuda_stream[i % numStreams]);
}
for(int i = 0; i < size; i++) {
cuEventSynchronize(events[i]);
cuMemcpyDtoHAsync( _host_memoryOut,
_bufferOut,
_size * sizeof(T),
cuda_stream[i % numStreams]);
}
Is overlapping possible in this case? Currently only the HtoD-transfers overlap with the kernel executions. The first DtoH-transfer is executed after the last kernel execution.
Overlap is only possible between operations issued to different streams. CUDA operations in the same stream execute sequentially, in the order the host issued them, so a device-to-host copy placed at the end of a stream runs only after everything queued earlier on that stream has completed. Here the last kernel and the first copy are both on stream 0, so the copy has to wait for that kernel to finish. And because the second loop synchronizes on an event in each iteration, the copies on the other streams (streams 1 and 2) have not even been issued at that point.

CUDA stream is blocked when launching many kernels (>1000)

I found that a CUDA stream blocks when I launch lots of kernels (more than 1000). Is there any configuration that I can change?
In my experiments, I launch a small kernel 10000 times. The kernel runs for a short time (about 190us). The first 1000 kernels launch very fast: it takes 4~5us to launch a kernel. After that, launching becomes slow: it takes about 190us to launch a new kernel. The CUDA stream seems to wait for a previous kernel to complete, and the buffer holds about 1000 kernels.
When I create 3 streams, each stream can launch 1000 kernels asynchronously.
I want to make this buffer bigger. I tried to set cudaLimitDevRuntimePendingLaunchCount, but it does not work. Is there any way?
#include <stdio.h>
#include <stdlib.h> // malloc, exit
#include "cuda_runtime.h"
#define CUDACHECK(cmd) do { \
cudaError_t e = cmd; \
if (e != cudaSuccess) { \
printf("Failed: Cuda error %s:%d '%s'\n", \
__FILE__,__LINE__,cudaGetErrorString(e)); \
exit(EXIT_FAILURE); \
} \
} while (0)
// a dummy kernel for test
__global__ void add(float *a, int n) {
int id = threadIdx.x + blockIdx.x * blockDim.x;
for (int i = 0; i < n; i++) {
a[id] = sqrt(a[id] + 1);
}
}
int main(int argc, char* argv[])
{
// managing 1 devices
int nDev = 1;
int nStream = 1;
int size = 32*1024*1024;
// allocating and initializing device buffers
float** buffer = (float**)malloc(nDev * sizeof(float*));
cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev*nStream);
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
// CUDACHECK(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 10000));
CUDACHECK(cudaMalloc(buffer + i, size * sizeof(float)));
CUDACHECK(cudaMemset(buffer[i], 1, size * sizeof(float)));
for (int j = 0; j < nStream; j++) {
CUDACHECK(cudaStreamCreate(s+i*nStream+j));
}
}
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
for (int j=0; j < 10000; j++) {
for (int k=0; k < nStream; k++) {
add<<<32, 1024, 0, s[i*nStream+k]>>>(buffer[i], 1000);
}
}
}
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
cudaDeviceSynchronize();
}
// free device buffers
for (int i = 0; i < nDev; ++i) {
CUDACHECK(cudaSetDevice(i));
CUDACHECK(cudaFree(buffer[i]));
}
printf("Success \n");
return 0;
}
Here are the nvprof results:
When I create 3 streams, the first 3000 kernels launch quickly and then launching becomes slow.
When I create 1 stream, the first 1000 kernels launch quickly and then launching becomes slow.
The behavior you are witnessing is expected behavior. If you search on the cuda tag for "queue" or "launch queue" you will find many other questions that refer to it. CUDA has a queue (apparently per-stream) that kernel launches go into. As long as the outstanding launch count is less than the queue depth, the launch process will be asynchronous.
However when the outstanding (i.e. uncompleted) launches exceed the queue depth, the launch process changes to a kind of synchronous behavior (although not synchronous in the usual sense). Specifically, when the outstanding number of kernel launches exceeds the queue depth, the launch process will block the CPU thread that is performing the next launch, until a launch slot opens in the queue (effectively means a kernel has retired at the other end of the queue).
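The blocking behavior described above can be illustrated with a small host-only C++ model. This is only a sketch under stated assumptions: a single stream, a fixed queue depth, a fixed cost per launch call and a fixed kernel duration. The function name `simulate_launches` and its parameters are hypothetical, and the queue depth is a model input, not a value CUDA exposes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of one stream with a bounded launch queue.
// Returns the host time (in microseconds) at which each launch call returns.
std::vector<double> simulate_launches(int n, int depth, double launch_us, double kernel_us) {
    assert(n >= 0 && depth > 0);
    std::vector<double> returned(n), finish(n);
    double host = 0.0;
    for (int k = 0; k < n; ++k) {
        // Queue full: the host blocks until kernel k - depth has retired.
        if (k >= depth) host = std::max(host, finish[k - depth]);
        host += launch_us;               // cost of the launch call itself
        returned[k] = host;
        // Kernels in one stream run back to back, in launch order.
        const double start = (k == 0) ? host : std::max(host, finish[k - 1]);
        finish[k] = start + kernel_us;
    }
    return returned;
}
```

Calling `simulate_launches(10000, 1000, 5.0, 190.0)` with the figures reported in the question reproduces the observed pattern: early launches return 5 µs apart, and once the queue is full each launch returns 190 µs after the previous one, because it has to wait for one kernel to retire.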
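You have no visibility into this (no way to query the number of slots open in the queue) nor any way to view or control the queue depth. Note that cudaLimitDevRuntimePendingLaunchCount applies to kernels launched from device code (dynamic parallelism), not to launches issued by the host, which is why setting it has no effect here. Most of the information I'm reciting here is obtained by inspection; it is not formally published in CUDA documentation that I am aware of.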
As already discussed in the comments, one possible approach to alleviate your concern around launches in a multi-device scenario is to launch breadth-first rather than depth-first. By this I mean that you should modify your launch loops so that you launch a kernel to device 0, then device 1, then device 2, etc. before launching the next kernel on device 0. This will give you the optimum performance in the sense that all GPUs will be engaged with processing, as early as possible in the launch sequence.
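The difference between the two launch orders can be sketched in plain host C++. The function `launch_order` is a hypothetical helper that only produces the sequence of (device, kernel index) pairs; in a real program each pair would correspond to a cudaSetDevice call followed by a kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the (device, kernel_index) pairs in the order they would be issued.
std::vector<std::pair<int, int>> launch_order(int n_dev, int n_kernels, bool breadth_first) {
    assert(n_dev > 0 && n_kernels >= 0);
    std::vector<std::pair<int, int>> order;
    order.reserve(static_cast<std::size_t>(n_dev) * n_kernels);
    if (breadth_first) {
        // One kernel on every device before moving to the next kernel.
        for (int j = 0; j < n_kernels; ++j)
            for (int d = 0; d < n_dev; ++d)
                order.emplace_back(d, j);
    } else {
        // Depth-first: all kernels for device 0, then all for device 1, ...
        for (int d = 0; d < n_dev; ++d)
            for (int j = 0; j < n_kernels; ++j)
                order.emplace_back(d, j);
    }
    return order;
}
```

With depth-first ordering, a full launch queue on device 0 stalls the host before device 1 has received any work; with breadth-first ordering every device already has a kernel queued by the time the host first blocks.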
If you'd like to see changes in CUDA behavior or documentation, the general suggestion is to become a registered developer at developer.nvidia.com, then log into your account there and file a bug, using the bug filing process accessible by clicking on your account name in the upper right hand corner.

CUDA reduction using registers

I need to calculate N signals' mean values using reduction. The input is a 1D array of size MN, where M is the length of each signal.
Originally I had additional shared memory to first copy the data and do the reduction on each signal. However, the original data is corrupted.
My program tries to minimize shared memory use, so I was wondering how I can use registers to do a sum reduction on N signals. I have N threads and a shared memory array (float) s_m[N*M], where elements 0...M-1 are the first signal, and so on.
Do I need N registers (or one) to store the mean value of N different signals? (I know how to do it with sequential addition using multi-threaded programming and 1 register.) The next step is to subtract from every value in the input the mean of its corresponding signal.
Your problem is very small (N = 32 and M < 128). However, some guidelines:
Assuming you are reducing across M values for each of the N signals:
If N is very large (more than tens of thousands), just do the reductions over M sequentially in each thread.
If N is < 10s of thousands, consider using one warp or one thread block to perform each of the N reductions.
If N is very small but M is very large, consider using multiple thread blocks per each of the N reductions.
If N is very small and M is very small (as your numbers are), only consider using the GPU for the reductions if the computations that generate and / or consume the input / output of the reductions are also running on the GPU.
Based on my understanding of the question, I say that you don't need N registers to store the mean value of N different signals.
If you already have N threads [Given that each thread do reduction on only one signal], then you don't need N registers to store the reduction of one signal. All you need is one register to store the mean value.
// N and M are assumed to be compile-time constants
dim3 threads (N,1);
reduction<<<1,threads>>>(signals); // one block of N threads; signals is the [N*M] array
__global__ void reduction (float *signals)
{
    int id = threadIdx.x; // one thread per signal
    float meanValue = 0.0f;
    for(int i = 0; i < M; i++)
        meanValue += signals[id*M +i]; // accumulate the sum in a single register
    meanValue = meanValue/M;
    // Then do the subtraction
    for(int i = 0; i < M; i++)
        signals[id*M +i] -= meanValue;
}
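When checking a kernel like this one, a host-side reference is useful for comparison. The following is a minimal C++ sketch of the same per-signal mean subtraction, assuming the signals are stored one after another in a flat array; the function name `subtract_signal_means` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// signals holds n_signals signals of length m, stored one after another.
// Subtracts each signal's own mean from every sample of that signal, in place.
void subtract_signal_means(std::vector<float>& signals, int n_signals, int m) {
    assert(n_signals >= 0 && m > 0);
    assert(signals.size() == static_cast<std::size_t>(n_signals) * m);
    for (int s = 0; s < n_signals; ++s) {
        float sum = 0.0f;                        // one accumulator per signal
        for (int i = 0; i < m; ++i) sum += signals[s * m + i];
        const float mean = sum / m;
        for (int i = 0; i < m; ++i) signals[s * m + i] -= mean;
    }
}
```

The outer loop over `s` plays the role of the thread index in the kernel: each signal needs only one running sum, which is why one register per thread is enough.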
If you need a kind of global reduction over the mean values of all N signals, then you need 2 registers (one to store the local mean and another to store the global mean) and shared memory:
dim3 threads (N,1);
reduction<<<1,threads>>>(signals); // one block of N threads; signals is the [N*M] array
__global__ void reduction (float *signals)
{
__shared__ float means[N]; // shared value
int id = threadIdx.x;
float meanValue = 0.0;
float globalMean = 0.0;
for(int i = 0; i < M; i++)
meanValue += signals[id*M +i];
means[id] = meanValue/M;
__syncthreads();
// do the global reduction
for(int i = 0; i < N; i++)
globalMean += means[i];
globalMean = globalMean/N;
// Then do the subtraction
for(int i = 0; i < M; i++)
signals[id*M +i] -= globalMean;
}
I hope this helps you. Any doubts, let me know.

How to synchronize global memory between multiple kernel launches?

I want to launch multiple times the following kernel in a FOR LOOP (pseudo):
__global__ void kernel(t_dev is input array in global mem) {
__shared__ PREC tt[BLOCK_DIM];
if (thid < m) {
tt[thid] = t_dev.data[ii]; // MEM READ!
}
... // MODIFY
__syncthreads();
if (thid < m) {
t_dev.data[thid] = tt[thid]; // MEM WRITE!
}
__threadfence(); // or __syncthreads(); //// NECESSARY!! but why?
}
Conceptually, I read values from t_dev, modify them, and write them back to global memory. Then I launch the same kernel again.
Why do I need the __threadfence or __syncthreads at the end?
Without it the result is wrong, apparently because the memory writes are not finished when the same kernel starts again. My GTX580 has device overlap enabled.
But why are the global memory writes not finished when the next kernel starts? Is this because of the device overlap, or is it always like that? I thought that when we launch kernel after kernel, all memory reads and writes of one kernel are finished before the next one starts.
Thanks for your answers!
SOME CODE :
for(int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++){
proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
mu_dev,x_new_dev,T_dev,x_old_dev,d_dev,
t_dev,
kernelAIdx,
pConvergedFlag_dev,
m_absTOL,m_relTOL);
proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
t_dev,
T_dev,
x_new_dev,
kernelAIdx
);
}
These are the two kernels in the loop; t_dev and x_new_dev are passed from Step A to Step B.
Kernel A looks as follows:
template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
__global__ void sorProxContactOrdered_1threads_StepA_kernel(
utilCuda::Matrix<PREC> mu_dev,
utilCuda::Matrix<PREC> y_dev,
utilCuda::Matrix<PREC> T_dev,
utilCuda::Matrix<PREC> x_old_dev,
utilCuda::Matrix<PREC> d_dev,
utilCuda::Matrix<PREC> t_dev,
int kernelAIdx,
int maxNContacts,
bool * convergedFlag_dev,
PREC _absTOL, PREC _relTOL){
//__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY
// Assumend 1 Block, with THREADS_PER_BLOCK Threads and Column Major Matrix T_dev
int thid = threadIdx.x;
int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
int i = kernelAIdx * BLOCK_DIM;
int ii = i + thid;
//First copy x_old_dev in shared
__shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if its in the limit!!
__shared__ PREC tt[BLOCK_DIM];
if(thid < m){
xx[thid] = x_old_dev.data[ii];
tt[thid] = t_dev.data[ii];
}
__syncthreads();
PREC absTOL = _absTOL;
PREC relTOL = _relTOL;
int jj;
//PREC T_iijj;
//Offset the T_dev_ptr to the start of the Block
PREC * T_dev_ptr = PtrElem_ColM(T_dev,i,i);
PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];
__syncthreads();
for(int j_t = 0; j_t < m ; j_t+=PROX_PACKAGE_SIZE){
//Select the number of threads we need!
// Here we process one [m x PROX_PACKAGE_SIZE] Block
// First Normal Direction ==========================================================
jj = i + j_t;
__syncthreads();
if( ii == jj ){ // select thread on the diagonal ...
PREC x_new_n = (d_dev.data[ii] + tt[thid]);
//Prox Normal!
if(x_new_n <= 0.0){
x_new_n = 0.0;
}
/* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
xx[thid] = x_new_n;
tt[thid] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
// Select only m threads!
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t];
}
// ====================================================================================
// wee need to syncronize here because one threads finished lambda_t2 with shared mem tt, which is updated from another thread!
__syncthreads();
// Second Tangential Direction ==========================================================
jj++;
__syncthreads();
if( ii == jj ){ // select thread on diagonal, one thread finishs T1 and T2 directions.
// Prox tangential
PREC lambda_T1 = (d_dev.data[ii] + tt[thid]);
PREC lambda_T2 = (d_dev.data[ii+1] + tt[thid+1]);
PREC radius = (*mu_dev_ptr) * xx[thid-1];
PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);
if(absvalue > radius){
lambda_T1 = (lambda_T1 * radius ) / absvalue;
lambda_T2 = (lambda_T2 * radius ) / absvalue;
}
/*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}
if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
//Write the two values back!
xx[thid] = lambda_T1;
tt[thid] = 0.0;
xx[thid+1] = lambda_T2;
tt[thid+1] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
}
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
}
// ====================================================================================
__syncthreads();
// move T_dev_ptr 1 column
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
// move mu_ptr to nex contact
__syncthreads();
mu_dev_ptr = &mu_dev_ptr[1];
__syncthreads();
}
__syncthreads();
// Write back the results; we don't need to synchronize here, but
// do it anyway to be safe for testing first!
if(thid < m){
    y_dev.data[ii] = xx[thid]; // THIS IS UPDATED IN KERNEL B
    t_dev.data[ii] = tt[thid]; // THIS IS UPDATED IN KERNEL B
}
//__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORKING!
}
I compare the solution at the end with the CPU result. For now I have put __syncthreads() everywhere I can, just to be safe (this code does Gauss-Seidel-style iterations),
but it does not work at all without the __threadfence() at the end, or at the beginning, where it does not make sense.
Sorry for so much code, but maybe you can guess where the problem comes from; I am at a loss to explain why this happens.
We checked the algorithm several times; there is no memory error (reported by Nsight) or
anything else, and everything else works fine. Kernel A is launched with ONE block only!
If you launch the successive instances of the kernel into the same stream, the launches execute strictly one after another: each instance starts only after the previous one has finished. The programming model guarantees it. CUDA only permits simultaneous kernel execution for kernels launched into different streams of the same context, and even then overlapping kernel execution only happens if the scheduler determines that sufficient resources are available to do so.
Neither __threadfence nor __syncthreads will have the effect you seem to be thinking about - __threadfence works only at the scope of all active threads and __syncthreads is an intra-block barrier operation. If you really want kernel to kernel synchronization, you need to use one of the host side synchronization calls, like cudaThreadSynchronize (pre CUDA 4.0) or cudaDeviceSynchronize (cuda 4.0 and later), or the per-stream equivalent if you are using streams.
While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.
Writes to global memory, with an exception of atomic functions, are not guaranteed to be immediately visible by other threads (from the same, or from different blocks). By putting __threadfence() you halt the current thread until the writes are in fact visible. This might be important in particular when you are using global memory with a cache (the Fermi series).
One thing to note: kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with the current one, but will launch as soon as the current one finishes, essentially hiding the latency caused by the CPU->GPU communication.
Using cudaThreadSynchronize halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Note that a synchronous memory transfer (e.g. cudaMemcpy, without the "Async" suffix) essentially behaves like cudaThreadSynchronize too.

Cumulative array summation using OpenCL

I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular double loop (for every point in Table1 { for every point in Table2 {...} }) and then do the calculation for every pair of points in parallel.
The Euclidean distance is then split into 4 steps:
1. take the difference between each dimension in the points
2. square that difference (still for every dimension)
3. sum all the values obtained in 2.
4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)
Everything works like a charm until I try to accumulate the sum of all differences (namely, step 3 of the procedure described above, the C[loop] += ... line in the code below).
As test data I'm using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,...,127,128; (p1)
129,130,131,...,255,256; (p2)
DescriptorList2: 000,001,002,...,126,127; (p1)
128,129,130,...,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
Right now I'm getting random numbers that vary with every run.
I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.
#define BLOCK_SIZE 128
typedef struct
{
//How large each point is
int length;
//How many points in every list
int num_elements;
//Pointer to the elements of the descriptor (stored as a raw array)
__global float *elements;
} DescriptorList;
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//temporary array to store the difference between each dimension of 2 points
float dif_acum[BLOCK_SIZE];
//counter to track the iterations of the inner loop
int loop = 0;
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the i-th descriptor. Returns a DescriptorList with just the i-th
//descriptor in DescriptorList A
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory.
//returns one element of the only descriptor in DescriptorList tmpA
//and index featA
As[featA] = GetElement(tmpA, 0, featA);
//wait for all the threads to finish copying before continuing
barrier(CLK_LOCAL_MEM_FENCE);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current points
dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
//wait again
barrier(CLK_LOCAL_MEM_FENCE);
//square value of the difference in dif_acum and store in C
//which is where the results should be stored at the end.
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
loop += 1;
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}
Your problem lies in these lines of code:
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
All threads in your workgroup (well, actually all your threads, but let's come to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:
The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be executed after other threads have already executed the next line.
Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that writeback is (I think one of the threads succeeds in writing back, while the others fail, but I'm not completely sure), but it's wrong either way.
Now let's fix this:
While we might be able to get this to work on global memory using atomics, it won't be fast, so let's accumulate in local memory:
local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCK_SIZE; i *= 2)
{
if ((featA % (2*i)) == 0)
accum[featA] += accum[featA + i];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
Of course you can reuse other local buffers for this, but I think the point is clear. (By the way, are you sure that dif_acum ends up in local memory? An array declared inside the kernel without the local qualifier is private to each work-item, which would make preloading A into local memory kind of pointless.)
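To check the reduction scheme independently of any OpenCL device, it can be modeled on the host. The following is a minimal C++ sketch, not the kernel itself: each pass of the outer loop stands for one barrier-separated step, and each value of `f` stands for one work-item. The names `tree_reduce` and `squared_distance` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the strided tree reduction. Within one step, the
// elements that are written (multiples of 2*s) are never the ones being
// read (f + s), so the steps are free of read/write conflicts.
float tree_reduce(std::vector<float> accum) {
    const std::size_t n = accum.size();
    assert(n > 0 && (n & (n - 1)) == 0);   // the scheme needs a power-of-two size
    for (std::size_t s = 1; s < n; s *= 2) {
        for (std::size_t f = 0; f < n; f += 2 * s)   // work-items with f % (2*s) == 0
            accum[f] += accum[f + s];
    }
    return accum[0];
}

// Squared Euclidean distance between two points of equal, power-of-two length.
float squared_distance(const std::vector<float>& p, const std::vector<float>& q) {
    assert(p.size() == q.size());
    std::vector<float> sq(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
        const float d = p[i] - q[i];
        sq[i] = d * d;
    }
    return tree_reduce(sq);
}
```

Fed with the two 128-dimensional test descriptors from the question, `squared_distance` returns 128, 2064512, 2130048 and 128 for the four point pairs, which are the values the kernel is expected to write to C.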
Some other points about this code:
Your code seems to be geared to using only one workgroup (you aren't using either the group id or the global id to decide which items to work on); for optimal performance you might want to use more than that.
It might be personal preference, but to me it seems better to use get_local_size(0) for the workgroup size than a define (since you might change it in the host code without realizing you should have changed your OpenCL code too).
The barriers in your code are all unnecessary, since no thread accesses an element in local memory that was written by another thread. Therefore you don't need to use local memory for this.
Considering the last bullet you could simply do:
float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
This would make the code (not considering the first two bullets):
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
int loop = 0;
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
DescriptorList tmpA = GetDescriptor(A, i);
float As = GetElement(tmpA, 0, featA);
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
accum[featA] = dif_acum*dif_acum; //dif_acum is a scalar here
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int s = 1; s < BLOCK_SIZE; s *= 2)
{
if ((featA % (2*s)) == 0)
accum[featA] += accum[featA + s];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
barrier(CLK_LOCAL_MEM_FENCE);
loop += 1;
}
}
}
Thanks to Grizzly, I now have a working kernel. Here is what I needed to modify, based on Grizzly's answer:
I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.
if(featA >= BLOCK_SIZE){return;}
When copying the first descriptor to local (shared) memory (i.e. to Bs), the index has to be specified, since the function GetElement returns just one element per call (I skipped that in my question).
Bs[featA] = GetElement(tmpA, 0, featA);
Then, the SCAN loop needed a little tweaking because the buffer is overwritten in each iteration and one cannot control which thread accesses the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and, that way, prevent inconsistencies throughout that loop.
dif_acum[featA] = accum[featA];
There are also some boundary control in the SCAN loop to reliably determine the terms to be added together.
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
That's it. I still wonder how I can make use of the group size to eliminate that BLOCK_SIZE definition, and whether there are better policies I can adopt regarding thread usage.
So the code looks finally like this:
__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//global counter to store final differences
int loop = 0;
//auxiliary buffer to store temporary data
local float dif_acum[BLOCK_SIZE];
//discard the threads that are not going to be used.
if(featA >= BLOCK_SIZE){
return;
}
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the gpidA-th descriptor
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory
Bs[featA] = GetElement(tmpA, 0, featA);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current descriptors
dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];
//square the values in dif_acum
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
//copy the values of accum to keep consistency once the scan procedure starts. Mostly important for the first element. Two buffers are necessary because with a single buffer the scan would overwrite values that other threads still have to read.
dif_acum[featA] = accum[featA];
//Compute the accumulated sum (a.k.a. scan)
for(int j = 1; j < BLOCK_SIZE; j *= 2){
int next_addend = featA-j; //partner element lies j positions to the left
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
dif_acum[featA] = accum[featA] + accum[next_addend];
}
barrier(CLK_LOCAL_MEM_FENCE);
//copy As to accum
accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA);
barrier(CLK_LOCAL_MEM_FENCE);
}
//tell one of the threads to write the result of the scan in the array containing the results.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}