Sharing highly irregular job among CUDA threads - cuda

I’m working on some task related to graph traversal (Viterbi algorithm)
Each time step I have a compacted set of active states, some job is done in each state, and than results are propagated through outgoing arcs to each arc’s destination state and so new active set of states is built.
The problem is that number of outgoing arcs varies very heavily , from two or three to several thousands. So compute threads are loaded very ineffectively.
I try to share the job through shared local memory queue
int tx = threaIdx.x;
extern __shared__ int smem[];
int *stateSet_s = smem; //new active set
int *arcSet_s = &(smem[Q_LEN]); //local shared queue
float *scores_s = (float*)&(smem[2*Q_LEN]);
__shared__ int arcCnt;
__shared__ int stateCnt;
if ( tx == 0 )
{
arcCnt = 0;
stateCnt = 0;
}
__syncthreads();
//load state index from compacted list of state indexes
int stateId = activeSetIn_g[gtx];
float srcCost = scores_g[ stateId ];
int startId = outputArcStartIds_g[stateId];
int nArcs = outputArcCounts_g[stateId]; //number of outgoing arcs to be propagated (2-3 to thousands)
/////////////////////////////////////////////
/// prepare arc set
/// !!!! that is the troubled code I think !!!!
/// bank conflicts? uncoalesced access?
int myPos = atomicAdd ( &arcCnt, nArcs );
while ( nArcs > 0 ) && ( myPos < Q_LEN ) )
{
scores_s[myPos] = srcCost;
arcSet_s[myPos] = startId + nArcs - 1;
myPos++;
nArcs--;
}
__syncthreads();
//////////////////////////////////////
/// parallel propagate arc set
if ( arcSet_s[tx] > 0 )
{
FstArc arc = arcs_g[ arcSet_s[tx] ];
float srcCost_ = scores_s[tx];
DoSomeJob ( &srcCost_ );
int *dst = &(transitionData_g[arc.dst]);
int old = atomicMax( dst, FloatToInt ( srcCost_ ) );
////////////////////////////////
//// new active set
if ( old == ILZERO )
{
int pos = atomicAdd ( &stateCnt, 1 );
stateSet_s[ pos ] = arc.dst;
}
}
/////////////////////////////////////////////
/// transfer new active set from smem to gmem
__syncthreads();
__shared__ int gPos;
if ( tx == 0 )
{
gPos = atomicAdd ( activeSetOutSz_g, stateCnt );
}
__syncthreads();
if ( tx < stateCnt )
{
activeSetOut_g[gPos + tx] = stateSet_s[tx];
}
__syncthreads();
But it runs very slow, I mean slower then if no active set is used (active set = all states), though active set is 10 – 15 percent of all states. Register pressure raised heavily, occupancy is low, but I don’t think anything can be done about it.
May be there are more effective ways of job sharing among threads?
A think about warp-shuffle ops on 3.0, but I have to use 2.x devices.

Usually problems with uneven workload and dynamic work creation are addressed using multiple CUDA kernel calls. This can be done by making CPU loop like the following:
//CPU pseudocode
while ( job not done) {
doYourComputationKernel();
loadBalanceKernel();
}
doYourComputationKernel() must have an heuristic to know when it is a good time to stop and send control back to CPU to balance the workload. This can be done by using a global counter for the number of idle blocks. This counter is incremented every time a block finishes its work or cannot create more work. When the number of idle blocks exceed a threshold, the work in all blocks is saved to global memory and all blocks finish.
loadBalanceKernel() should receive the global array with all saved work and another global array of work counters per block. A reduce operation on the later can calculate the total number of works. With this the number of works per block can be found. Finally, the kernel should copy the work so every block receive the same number of elements.
The loop continues until all computation is done. There's a good paper about this: http://gamma.cs.unc.edu/GPUCOL/. The idea is to balance the load of continuous collision detection which is very uneven.

Related

Does this Cuda scan kernel only work within a single block, or across multiple blocks?

I am doing a homework and have been given a Cuda kernel that performs a primitive scan operation. From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadInx.x). Is this true?
//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
//we are creating an extra space for every numElement so the size of the array needs to be 2*numElements
//cuda does not like dynamic array in shared memory so it might be necessary to explicitly state
//the size of this mememory allocation
__shared__ int temp[1024 * 2];
//instantiate variables
int id = threadIdx.x;
int pout = 0, pin = 1;
// // load input into shared memory.
// // Exclusive scan: shift right by one and set first element to 0
temp[id] = (id > 0) ? in_data[id - 1] : 0;
__syncthreads();
//for each thread, loop through each of the steps
//each step, move the next resultant addition to the thread's
//corresponding space to manipulted for the next iteration
for (int offset = 1; offset < numElements; offset <<= 1)
{
//these switch so that data can move back and fourth between the extra spaces
pout = 1 - pout;
pin = 1 - pout;
//IF: the number needs to be added to something, make sure to add those contents with the contents of
//the element offset number of elements away, then move it to its corresponding space
//ELSE: the number only needs to be dropped down, simply move those contents to its corresponding space
if (id >= offset)
{
//this element needs to be added to something; do that and copy it over
temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
}
else
{
//this element just drops down, so copy it over
temp[pout * numElements + id] = temp[pin * numElements + id];
}
__syncthreads();
}
// write output
out_data[id] = temp[pout * numElements + id];
}
I would like to modify this kernel to work across multiple blocks, I want it to be as simple as changing the int id... to int id = threadIdx.x + blockDim.x * blockIdx.x. But the shared memory is only within the block, meaning the scan kernels across blocks cannot share the proper information.
From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadInx.x). Is this true?
Not exactly. This kernel will work regardless of how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how id is calculated:
int id = threadIdx.x;
This id is independant of blockIdx, and therefore identical across blocks, no matter their number.
If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxilliary array to store the per-block sums. Then, run a similar scan on that array, calculating per-block increments. Finally, run a last kernel to add those per-block increments to the block elements. If memory serves there is a similar kernel in the CUDA SDK samples.
Since Kepler the above code could be rewritten much more efficiently, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads and may improve performance. A combination of both these improvements would allow you to get rid of shared memory and work only with registers for maximal performance.

Cost of OpenCL get_local_id()

I have a simple scan kernel, which calculates scans of several blocks in a loop. I noticed that performance somewhat rises when get_local_id() is stored inside a local variable instead of calling it inside the loop. So to summarize with code, this:
__kernel void LocalScan_v0(__global const int *p_array, int n_array_size, __global int *p_scan)
{
const int n_group_offset = get_group_id(0) * SCAN_BLOCK_SIZE;
p_array += n_group_offset;
p_scan += n_group_offset;
// calculate group offset
const int li = get_local_id(0); // *** local id cached ***
const int gn = get_num_groups(0);
__local int p_workspace[SCAN_BLOCK_SIZE];
for(int i = n_group_offset; i < n_array_size; i += SCAN_BLOCK_SIZE * gn) {
LocalScan_SingleBlock(p_array, p_scan, p_workspace, li);
p_array += SCAN_BLOCK_SIZE * gn;
p_scan += SCAN_BLOCK_SIZE * gn;
}
// process all the blocks in the array (each block size SCAN_BLOCK_SIZE)
}
Has throughput of 74 GB/s on GTX-780, while this:
__kernel void LocalScan_v0(__global const int *p_array, int n_array_size, __global int *p_scan)
{
const int n_group_offset = get_group_id(0) * SCAN_BLOCK_SIZE;
p_array += n_group_offset;
p_scan += n_group_offset;
// calculate group offset
const int gn = get_num_groups(0);
__local int p_workspace[SCAN_BLOCK_SIZE];
for(int i = n_group_offset; i < n_array_size; i += SCAN_BLOCK_SIZE * gn) {
LocalScan_SingleBlock(p_array, p_scan, p_workspace, get_local_id(0));
// *** local id polled inside the loop ***
p_array += SCAN_BLOCK_SIZE * gn;
p_scan += SCAN_BLOCK_SIZE * gn;
}
// process all the blocks in the array (each block size SCAN_BLOCK_SIZE)
}
Has only 70 GB/s on the same hardware. The only difference is whether the call to get_local_id() is inside or outside of the loop. The code in LocalScan_SingleBlock() is pretty much described in this GPU Gems article.
Now this brings some questions. I always imagined that thread id is stored inside some register, and access to it is as fast as to any thread-local variable. This doesn't seem to be the case. I always used to have habit of caching the local id in a variable with reluctance of an old "C" programmer who wouldn't call a function in a loop, had he expect it to return the same value every time, but I didn't seriously think it would make any difference.
Any ideas as to why this might be? I didn't do any checking on the compiled binary code. Does anyone have the same experience? Is it the same with threadIdx.x in CUDA? How about ATI platforms? Is this behavior described somewhere? I quickly scanned through CUDA Best Practices, but didn't find anything.
This is just a guess, but as per the Khronos page
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/get_local_id.html
get_local_id() isn't defined to return a constant value (merely size_t). That may mean that, as far as the compiler is aware, it may not be allowed to perform certain optimisations compared with a constant local_id because the return of the function value may change in the eyes of the compiler (even though it wont per-thread)

CUDA binary search implementation

I am trying to speed up the CPU binary search. Unfortunately, GPU version is always much slower than CPU version. Perhaps the problem is not suitable for GPU or am I doing something wrong ?
CPU version (approx. 0.6ms):
using sorted array of length 2000 and do binary search for specific value
...
Lookup ( search[j], search_array, array_length, m );
...
int Lookup ( int search, int* arr, int length, int& m )
{
int l(0), r(length-1);
while ( l <= r )
{
m = (l+r)/2;
if ( search < arr[m] )
r = m-1;
else if ( search > arr[m] )
l = m+1;
else
{
return index[m];
}
}
if ( arr[m] >= search )
return m;
return (m+1);
}
GPU version (approx. 20ms):
using sorted array of length 2000 and do binary search for specific value
....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....
__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val )
{
const int num_threads = blockDim.x * gridDim.x;
const int thread = blockIdx.x * blockDim.x + threadIdx.x;
int set_size = array_length;
ret_val[0] = -1; // return value
ret_val[1] = 0; // offset
while(set_size != 0)
{
// Get the offset of the array, initially set to 0
int offset = ret_val[1];
// I think this is necessary in case a thread gets ahead, and resets offset before it's read
// This isn't necessary for the unit tests to pass, but I still like it here
__syncthreads();
// Get the next index to check
int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
// If the index is outside the bounds of the array then lets not check it
if (index_to_check < array_length)
{
// If the next index is outside the bounds of the array, then set it to maximum array size
int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
if (next_index_to_check >= array_length)
{
next_index_to_check = array_length - 1;
}
// If we're at the mid section of the array reset the offset to this index
if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
{
ret_val[1] = index_to_check;
}
else if (search == arr[index_to_check])
{
// Set the return var if we hit it
ret_val[0] = index_to_check;
}
}
// Since this is a p-ary search divide by our total threads to get the next set size
set_size = set_size / num_threads;
// Sync up so no threads jump ahead and get a bad offset
__syncthreads();
}
}
Even if I try bigger arrays, the time ratio is not any better.
You have way too many divergent branches in your code so you're essentially serializing the entire process on the GPU. You want to break up the work so that all the threads in the same warp take the same path in the branch. See page 47 of the CUDA Best Practices Guide.
I'm must admit I'm not entirely sure what what your kernel does, but am I right in assuming that you are looking for just one index that satisfies your search criteria? If so then have a look at the reduction sample that comes with CUDA for some pointers on how to structure and optimize such a query. (What your are doing is essentially trying to reduce the closest index to your query)
Some quick pointers though:
You are performing an awful lot of reads and writes to global memory, which is incredibly slow. Try using shared memory instead.
Secondly remember that __syncthreads() only syncs threads in the same block, so your reads/writes to global memory won't necessarily get synced across all threads (though the latency from you global memory writes may actually make it appear as if they do)

How to synchronize global memory between multiple kernel launches?

I want to launch multiple times the following kernel in a FOR LOOP (pseudo):
__global__ void kernel(t_dev is input array in global mem) {
__shared__ PREC tt[BLOCK_DIM];
if (thid < m) {
tt[thid] = t_dev.data[ii]; // MEM READ!
}
... // MODIFY
__syncthreads();
if (thid < m) {
t_dev.data[thid] = tt[thid]; // MEM WRITE!
}
__threadfence(); // or __syncthreads(); //// NECESSARY!! but why?
}
What I do conceptually is I read in values from t_dev . modify them, and write out to global mem again! and then I start the same kernel again!!
Why do I need obviously the _threadfence or __syncthread
otherwise the result get wrong, because, memory writes are not finished when the same kernel starts again. Thats what happens here, my GTX580 has device overlap enabled,
But why are global mem writes not finished when the next kernel starts... is this because of the device overlap or because its always like that? I thought, when we launch kernel after kernel, mem write/reads are finished after one kernel... :-)
Thanks for your answers!
SOME CODE :
for(int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++){
proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
mu_dev,x_new_dev,T_dev,x_old_dev,d_dev,
t_dev,
kernelAIdx,
pConvergedFlag_dev,
m_absTOL,m_relTOL);
proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
t_dev,
T_dev,
x_new_dev,
kernelAIdx
);
}
These are thw two kernels which are in the loop, the t_dev and x_new_dev, is moved from Step A to Step B,
Kernel A looks as follows:
template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
__global__ void sorProxContactOrdered_1threads_StepA_kernel(
utilCuda::Matrix<PREC> mu_dev,
utilCuda::Matrix<PREC> y_dev,
utilCuda::Matrix<PREC> T_dev,
utilCuda::Matrix<PREC> x_old_dev,
utilCuda::Matrix<PREC> d_dev,
utilCuda::Matrix<PREC> t_dev,
int kernelAIdx,
int maxNContacts,
bool * convergedFlag_dev,
PREC _absTOL, PREC _relTOL){
//__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY
// Assumend 1 Block, with THREADS_PER_BLOCK Threads and Column Major Matrix T_dev
int thid = threadIdx.x;
int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
int i = kernelAIdx * BLOCK_DIM;
int ii = i + thid;
//First copy x_old_dev in shared
__shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if its in the limit!!
__shared__ PREC tt[BLOCK_DIM];
if(thid < m){
xx[thid] = x_old_dev.data[ii];
tt[thid] = t_dev.data[ii];
}
__syncthreads();
PREC absTOL = _absTOL;
PREC relTOL = _relTOL;
int jj;
//PREC T_iijj;
//Offset the T_dev_ptr to the start of the Block
PREC * T_dev_ptr = PtrElem_ColM(T_dev,i,i);
PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];
__syncthreads();
for(int j_t = 0; j_t < m ; j_t+=PROX_PACKAGE_SIZE){
//Select the number of threads we need!
// Here we process one [m x PROX_PACKAGE_SIZE] Block
// First Normal Direction ==========================================================
jj = i + j_t;
__syncthreads();
if( ii == jj ){ // select thread on the diagonal ...
PREC x_new_n = (d_dev.data[ii] + tt[thid]);
//Prox Normal!
if(x_new_n <= 0.0){
x_new_n = 0.0;
}
/* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
xx[thid] = x_new_n;
tt[thid] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
// Select only m threads!
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t];
}
// ====================================================================================
// wee need to syncronize here because one threads finished lambda_t2 with shared mem tt, which is updated from another thread!
__syncthreads();
// Second Tangential Direction ==========================================================
jj++;
__syncthreads();
if( ii == jj ){ // select thread on diagonal, one thread finishs T1 and T2 directions.
// Prox tangential
PREC lambda_T1 = (d_dev.data[ii] + tt[thid]);
PREC lambda_T2 = (d_dev.data[ii+1] + tt[thid+1]);
PREC radius = (*mu_dev_ptr) * xx[thid-1];
PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);
if(absvalue > radius){
lambda_T1 = (lambda_T1 * radius ) / absvalue;
lambda_T2 = (lambda_T2 * radius ) / absvalue;
}
/*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}
if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
//Write the two values back!
xx[thid] = lambda_T1;
tt[thid] = 0.0;
xx[thid+1] = lambda_T2;
tt[thid+1] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
}
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
}
// ====================================================================================
__syncthreads();
// move T_dev_ptr 1 column
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
// move mu_ptr to nex contact
__syncthreads();
mu_dev_ptr = &mu_dev_ptr[1];
__syncthreads();
}
__syncthreads();
// Write back the results, dont need to syncronize because
// do it anyway to be safe for testing first!
if(thid < m){
y_dev.data[ii] = xx[thid]; THIS IS UPDATED IN KERNEL B
t_dev.data[ii] = tt[thid]; THIS IS UPDATED IN KERNEL B
}
//__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORKING!
I compare the solution at the end with the CPU, and HERE I put everywhere I can a syncthread around only to be safe, for the start! (this code does gauss seidel stuff)
but it does not work at all without the THREAD_FENCE at the END or at the BEGINNIG where it does not make sense...
Sorry for so much code, but probably you can guess where the problem comes, frome because I am bit at my end, with explainig why this happens?
We checked the algorithm several times, there is no memory error (reported from Nsight) or
other stuff, every thing works fine... Kernel A is launched with ONE Block only!
If you launch the successive instances of the kernel into the same stream, each kernel launch is synchronous compared to the kernel instance before and after it. The programming model guarantees it. CUDA only permits simultaneous kernel execution on kernels launched into different streams of the same context, and even then overlapping kernel execution only happens if the scheduler determines that sufficient resources are available to do so.
Neither __threadfence nor __syncthreads will have the effect you seem to be thinking about - __threadfence works only at the scope of all active threads and __syncthreads is an intra-block barrier operation. If you really want kernel to kernel synchronization, you need to use one of the host side synchronization calls, like cudaThreadSynchronize (pre CUDA 4.0) or cudaDeviceSynchronize (cuda 4.0 and later), or the per-stream equivalent if you are using streams.
While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.
Writes to global memory, with an exception of atomic functions, are not guaranteed to be immediately visible by other threads (from the same, or from different blocks). By putting __threadfence() you halt the current thread until the writes are in fact visible. This might be important in particular when you are using global memory with a cache (the Fermi series).
One thing to note: Kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with your current one, but will launch as soon as the current one finishes, esentially hiding the latency caused by the CPU->GPU communication.
Using cudaThreadSynchronise halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Do note, that using synchronous memory access (e.g. cudaMemcpy, without "Async" suffix) esentially behaves like cudaThreadSynchronise too.

CUDA block synchronization differences between GTS 250 and Fermi devices

So I've been working on program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS250 which is a Compute 1.1 device. However, on a Compute 2.0 device (C2050 or C2070) the hash table is corrupt (data is incorrect and pointers are sometimes wrong).
Basically the code works fine when only one block is utilized (both devices). However, when 2 or more blocks are used, it works only on the GTS250 and not on any Fermi devices.
I understand that the warp scheduling and memory architecture between the two platforms are different and I am taking that into account when developing the code. From my understanding, using __theadfence() should make sure any global writes are committed and visible to other blocks, however, from the corrupt hash table, it appears that they are not.
I've also posted the problem on the NVIDIA CUDA developer forum and it can be found here.
Relevant code below:
__device__ void lock(int *mutex) {
while(atomicCAS(mutex, 0, 1) != 0);
}
__device__ void unlock(int *mutex) {
atomicExch(mutex, 0);
}
__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
// Find entry if it exists
unsigned int hashValue = hash(key, table.count);
lock(&globalHashLocks[hashValue]);
int bucketHead = table.entries[hashValue];
int currentLocation = bucketHead;
bool found = false;
Entry currentEntry;
while (currentLocation != -1 && !found) {
currentEntry = table.pool[currentLocation];
if (currentEntry.data.x == key) {
found = true;
} else {
currentLocation = currentEntry.next;
}
}
if (currentLocation == -1) {
// If entry does not exist, create entry
lock(globalFreeLock);
int newLocation = (*globalFirstFree)++;
__threadfence();
unlock(globalFreeLock);
Entry newEntry;
newEntry.data.x = key;
newEntry.data.y = count;
newEntry.data.z = sum;
newEntry.data.w = sumSquared;
newEntry.next = bucketHead;
// Add entry to table
table.pool[newLocation] = newEntry;
table.entries[hashValue] = newLocation;
} else {
currentEntry.data.y += count;
currentEntry.data.z += sum;
currentEntry.data.w += sumSquared;
table.pool[currentLocation] = currentEntry;
}
__threadfence();
unlock(&globalHashLocks[hashValue]);
}
As pointed out by LSChien in this post, the issue is with L1 cache coherency. While using __threadfence() will guarantee shared and global memory writes are visible to other threads, since it is not atomic, thread x in block 1 may reach a cached memory value until thread y in block 0 has executed to the threadfence instruction. Instead LSChien suggested a hack in his post of using an atomicCAS() to force the thread to read from global memory instead of a cached value. The proper way to do this is by declaring the memory as volatile, requiring that every write to that memory be visible to all other threads in the grid immediately.
__threadfence guarantees that writes to global memory are visible to other threads in the current block before returning. That is not the same as "write operation on global memory is complete"! Think caching on each multicore.