Are atomic operations in CUDA guaranteed to be scheduled per warp? - cuda

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each blcok either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array.
Now to write the actual values atomicExch is used, exchanging the current value in the array with the value that the block attempts to write. Because of SIMD, atomic operation and the fact that a warp executes in lockstep I would expect the array to, at any point in time, only contain 1's or 0's. But never a mix of the two.
However, while running code like this there are several cases where at some point in time the array contains of a mix of 0's and 1's. Which appears to point to the fact that atomic operations are not executed per warp, and instead scheduled using some other scheme.
From other sources I have not really found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. Since I need to write many small vectors consisting of several 32 bit integers atomically to global memory, and an atomic operation that is guaranteed to write a single vector atomically is obviously very important.
For those wondering, the code I wrote was executed on a GTX 970, compiled on compute capability 5.2, using CUDA 8.0.

The atomic instructions, like all instructions, are scheduled per warp. However there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage through the pipeline. This gives rise to the possibility for your observations.
I believe a simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where we have multiple issue per clock within an SM and even across SMs, to as additional examples.
If the vector length were short enough (16 bytes or less) then it should be possible to accomplish this ("atomic update") simply by having a thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) are attempting to update a naturally aligned location, the write should not be corrupted by other writes.
However, after discussion in the comments, it seems that OP's goal is to be able to have a warp or threadblock update a vector of some length, without interference from other warps or threadblocks. It seems to me that really what is desired is access control (so that only one warp or threadblock is updating a particular vector at a time) and OP had some code that wasn't working as desired.
This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.
What follows is an example producer-consumer code, where there are multiple threadblocks that are updating a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:
vector is empty
vector is being filled
vector is filled, ready for "consumption"
with this 3-level scheme, we can allow for ordinary access to the vector by both consumer and multiple producer workers, with a single ordinary atomic variable access mechanism. Here is an example code:
#include <assert.h>
#include <iostream>
#include <stdio.h>
const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64; // needs to be greater than the maximum number of SMs for any GPU that it will be run on
__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0}; // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;
static __device__ __inline__ int __mysmid(){
int smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
// this code won't work on a GPU with a single SM!
__global__ void kernel(){
__shared__ volatile int done, update, next_slot;
int my_block_id = atomicAdd(&block_id, 1);
int my_sm = __mysmid();
if (my_block_id == 0){
if (!threadIdx.x){
restricted_sm = my_sm;
__threadfence();
// I am "block 0" and process the vectors, checking for coherency
// "consumer"
next_slot = 0;
volatile int *vslot_control = slot_control;
volatile int *vslots = slots;
int scount = 0;
while(scount < max_act){
if (vslot_control[next_slot] == slot_full){
scount++;
int slot_val = vslots[next_slot*slot_length];
for (int i = 1; i < slot_length; i++) if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */}
observations[slot_val]++;
vslot_control[next_slot] = slot_empty;
correct++;
__threadfence();
}
next_slot++;
if (next_slot >= num_slots) next_slot = 0;
}
}}
else {
// "producer"
while (restricted_sm < 0); // wait for signaling
if (my_sm == restricted_sm) return;
next_slot = 0;
done = 0;
__syncthreads();
while (!done) {
if (!threadIdx.x){
while (atomicCAS(slot_control+next_slot, slot_empty, slot_filling) > slot_empty) {
next_slot++;
if (next_slot >= num_slots) next_slot = 0;}
// we grabbed an empty slot, fill it with my_sm
if (atomicAdd(&num_act, 1) < max_act) update = 1;
else {done = 1; update = 0;}
}
__syncthreads();
if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
__threadfence(); //enforce ordering
if ((update) && (!threadIdx.x)){
slot_control[next_slot] = 2; // mark slot full
atomicAdd(actives+my_sm, 1);}
__syncthreads();
}
}
}
int main(){
kernel<<<256, slot_length>>>();
cudaDeviceSynchronize();
cudaError_t res= cudaGetLastError();
if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
int *h_obs = new int[max_sm];
int *h_act = new int[max_sm];
int h_correct;
cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
int h_total_act = 0;
int h_total_obs = 0;
for (int i = 0; i < max_sm; i++){
std::cout << h_act[i] << "," << h_obs[i] << " ";
h_total_act += h_act[i];
h_total_obs += h_obs[i];}
std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}
I don't claim this code to be defect free for any use case. It is advanced to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on linux, on a couple different systems I tested it on. It should not be run on GPUs that have only a single SM, as one SM is reserved for the consumer, and the remaining SMs are used by the producers.

Related

The way to properly do multiple CUDA block synchronization

I like to do CUDA synchronization for multiple blocks. It is not for each block where __syncthreads() can easily handle it.
I saw there are exiting discussions on this topic, for example cuda block synchronization, and I like the simple solution brought up by #johan, https://stackoverflow.com/a/67252761/3188690, essentially it uses a 64 bits counter to track the synchronized blocks.
However, I wrote the following code trying to accomplish the similar job but meet a problem. Here I used the term environment so that the wkNumberEnvs of blocks within this environment shall be synchronized. It has a counter. I used atomicAdd() to count how many blocks have already been synchronized themselves, once the number of sync blocks == wkBlocksPerEnv, I know all blocks finished sync and it is free to go. However, it has a strange outcome that I am not sure why.
The problem comes from this while loop. Since the first threads of all blocks are doing the atomicAdd, there is a while loop to check until the condition meets. But I find that some blocks will be stuck into the endless loop, which I am not sure why the condition cannot be met eventually? And if I printf some messages either in *** I can print here 1 or *** I can print here 2, there is no endless loop and everything is perfect. I do not see something obvious.
const int wkBlocksPerEnv = 2;
__device__ int env_sync_block_count[wkNumberEnvs];
__device__ void syncthreads_for_env(){
// sync threads for each block so all threads in this block finished the previous tasks
__syncthreads();
// sync threads for wkBlocksPerEnv blocks for each environment
if(wkBlocksPerEnv > 1){
const int kThisEnvId = get_env_scope_block_id(blockIdx.x);
if (threadIdx.x == 0){
// incrementing env_sync_block_count by 1
atomicAdd(&env_sync_block_count[kThisEnvId], 1);
// *** I can print here 1
while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv){
// *** I can print here 2
}
// Do the next job ...
}
}
There are two potential issues with your code. Caching and block scheduling.
Caching can prevent you from observing an updated value during the while loop.
Block scheduling can cause a dead-lock if you wait for an update of a block which has not yet been scheduled. Since CUDA does not guarantee a specific order of scheduled blocks, the only way to prevent this dead-lock is to limit the number of blocks in the grid such that all blocks can run simultaneously.
Following code shows how you could synchronize multiple blocks while avoiding above issues. I adapted the code from the multi-grid synchronization given in the CUDA-sample conjugateGradientMultiDeviceCG https://github.com/NVIDIA/cuda-samples/blob/master/Samples/4_CUDA_Libraries/conjugateGradientMultiDeviceCG/conjugateGradientMultiDeviceCG.cu#L186
On pre-Volta devices, it uses volatile memory accesses. Volta and later uses acquire/release semantics.
Grid size is limited by querying device properties.
#include <cassert>
#include <cstdio>
constexpr int wkBlocksPerEnv = 13;
__device__
int getEnv(int blockId){
return blockId / wkBlocksPerEnv;
}
__device__
int getRankInEnv(int blockId){
return blockId % wkBlocksPerEnv;
}
__device__
unsigned char load_arrived(unsigned char *arrived) {
#if __CUDA_ARCH__ < 700
return *(volatile unsigned char *)arrived;
#else
unsigned int result;
asm volatile("ld.acquire.gpu.global.u8 %0, [%1];"
: "=r"(result)
: "l"(arrived)
: "memory");
return result;
#endif
}
__device__
void store_arrived(unsigned char *arrived,
unsigned char val) {
#if __CUDA_ARCH__ < 700
*(volatile unsigned char *)arrived = val;
#else
unsigned int reg_val = val;
asm volatile(
"st.release.gpu.global.u8 [%1], %0;" ::"r"(reg_val) "l"(arrived)
: "memory");
// Avoids compiler warnings from unused variable val.
(void)(reg_val = reg_val);
#endif
}
#if 0
//wrong implementation which does not synchronize. to check that kernel assert does trigger without proper synchronization
__device__
void syncthreads_for_env(unsigned char* temp){
}
#else
//temp must have at least size sizeof(unsigned char) * total_number_of_blocks in grid
__device__
void syncthreads_for_env(unsigned char* temp){
__syncthreads();
const int env = getEnv(blockIdx.x);
const int blockInEnv = getRankInEnv(blockIdx.x);
unsigned char* const mytemp = temp + env * wkBlocksPerEnv;
if(threadIdx.x == 0){
if(blockInEnv == 0){
// Leader block waits for others to join and then releases them.
// Other blocks in env can arrive in any order, so the leader have to wait for
// all others.
for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
while (load_arrived(&mytemp[i]) == 0)
;
}
for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
store_arrived(&mytemp[i], 0);
}
__threadfence();
}else{
// Other blocks in env note their arrival and wait to be released.
store_arrived(&mytemp[blockInEnv - 1], 1);
while (load_arrived(&mytemp[blockInEnv - 1]) == 1)
;
}
}
__syncthreads();
}
#endif
__global__
void kernel(unsigned char* synctemp, int* array){
const int env = getEnv(blockIdx.x);
const int blockInEnv = getRankInEnv(blockIdx.x);
if(threadIdx.x == 0){
array[blockIdx.x] = 1;
}
syncthreads_for_env(synctemp);
if(threadIdx.x == 0){
int sum = 0;
for(int i = 0; i < wkBlocksPerEnv; i++){
sum += array[env * wkBlocksPerEnv + i];
}
assert(sum == wkBlocksPerEnv);
}
}
int main(){
const int smem = 0;
const int blocksize = 128;
int deviceId = 0;
int numSMs = 0;
int maxBlocksPerSM = 0;
cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&maxBlocksPerSM,
kernel,
blocksize,
smem
);
int maxBlocks = maxBlocksPerSM * numSMs;
maxBlocks -= maxBlocks % wkBlocksPerEnv; //round down to nearest multiple of wkBlocksPerEnv
printf("wkBlocksPerEnv %d, maxBlocks: %d\n", wkBlocksPerEnv, maxBlocks);
int* d_array;
unsigned char* d_synctemp;
cudaMalloc(&d_array, sizeof(int) * maxBlocks);
cudaMalloc(&d_synctemp, sizeof(unsigned char) * maxBlocks);
cudaMemset(d_synctemp, 0, sizeof(unsigned char) * maxBlocks);
kernel<<<maxBlocks, blocksize>>>(d_synctemp, d_array);
cudaFree(d_synctemp);
cudaFree(d_array);
return 0;
}
Atomic value is going to global memory but in the while-loop you read it directly and it must be coming from the cache which will not automatically synchronize between threads (cache-coherence only handled by explicit synchronizations like threadfence). Thread gets its own synchronization but other threads may not see it.
Even if you use threadfence, the threads in same warp would be in dead-lock waiting forever if they were the first to check the value before any other thread updates it. But should work with newest GPUs supporting independent thread scheduling.
I like to do CUDA synchronization for multiple blocks.
You should learn to dis-like it. Synchronization is always costly, even when implemented just right, and inter-core synchronization all the more so.
if (threadIdx.x == 0){
// incrementing env_sync_block_count by 1
atomicAdd(&env_sync_block_count[kThisEnvId], 1);
while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv)
// OH NO!!
{
}
}
This is bad. With this code, the first warp of each block will perform repeated reads of env_sync_block_count[kThisEnvId]. First, and as #AbatorAbetor mentioned, you will face the problem of cache incoherence, causing your blocks to potentially read the wrong value from a local cache well after the global value has long changed.
Also, your blocks will hog up the multiprocessors. Blocks will stay resident and have at least one active warp, indefinitely. Who's to say the will be evicted from their multiprocessor to schedule additional blocks to execute? If I were the GPU, I wouldn't allow more and more active blocks to pile up. Even if you don't deadlock - you'll be wasting a lot of time.
Now, #AbatorAbetor's answer avoids the deadlock by limiting the grid size. And I guess that works. But unless you have a very good reason to write your kernels this way - the real solution is to just break up your algorithm into consecutive kernels (or better yet, figure out how to avoid the need to synchronize altogether).
a mid-way approach is to only have some blocks get past the point of synchronization. You could do that by not waiting except on some condition which holds for a very limited number of blocks (say you had a single workgroup - then only the blocks which got the last K possible counter values, wait).

Efficient method to check for matrix stability in CUDA

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".

Parallel Reduction in CUDA for calculating primes

I have a code to calculate primes which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
check = 1;
for (j = 2; j <= sqrt_limit; j++)
{
if ( !(j&1) && (i&(j-1)) == 0 )
{
check = 0;
break;
}
if ( j&1 && i%j == 0 )
{
check = 0;
break;
}
}
if (check)
pcount++;
}
I am trying to port it to GPU, and I would want to reduce the count as I did for the OpenMP example above. Following is my code, which apart from giving incorrect results is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
__shared__ int s_flags[NTHREADS];
if (gid > sqrootN && gid < N)
s_flags[tid] = flags[gid];
else
return;
__syncthreads();
s_flags[tid] = 1;
for (j = 2; j <= sqrootN; j++)
{
if ( gid%j == 0 )
{
s_flags[tid] = 0;
break;
}
}
//reduce
for(unsigned int s=1; s < blockDim.x; s*=2)
{
if( tid % (2*s) == 0 )
{
s_flags[tid] += s_flags[tid + s];
}
__syncthreads();
}
//write results of this block to the global memory
if (tid == 0)
o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast, I think the bottleneck is the for loop, and I am not sure how to replace it. And next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array, I have marked the primes from 2 to sqroot(N), in this kernel I am calculating primes from sqroot(N) to N, but I would need to check whether each number in {sqroot(N),N} is divisible by primes in {2,sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestion, I modified my code (I understand about the comment on syncthreads now better); I realized that I do not need the flags array and just the global indexes work in my case. What concerns me at this point is the slowness of the code (more than correctness) that could be attributed to the for loop. Also, after a certain data size (100000), the kernel was producing incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member in the NVidia forum pointed out that that may be because my data size is not of a power of 2).
So there are still three (may be related) questions -
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
volatile __shared__ int s_flags[NTHREADS];
s_flags[tid] = 1;
for (unsigned int j=2; j<=sqrootN; j++)
{
if ( gid % j == 0 )
s_flags[tid] = 0;
}
__syncthreads();
//reduce
reduce(s_flags, tid, o_flags);
}
While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX is bar.sync and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a
warp are active. Thus, if any thread in a warp executes a bar
instruction, it is as if all the threads in the warp have executed the
bar instruction. All threads in the warp are stalled until the barrier
completes, and the arrival count for the barrier is incremented by the
warp size (not the number of active threads in the warp). In
conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the condition identically (the
warp does not diverge). Since barriers are executed on a per-warp
basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction, this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and not recommended. I can recall reading some material suggesting that Sieve of Atkin was a useful approach to fast prime generation on GPUs.

CUDA: Max of array, how to prevent write collisions?

I have an array of doubles stored in GPU global memory and i need to find the maximum value in it. I have read some texts about parallel reduction, so i know that one should divide the array between blocks and make them find their "global maximum", and so on.
But they never seem to address the issue of threads trying to write to the same memory position simultaneously.
Let's say that local_max=0.0 in the beginning of a block execution. Then each thread reads their value from the input vector, decides that is larger than local_max, and then try to write their value to local_max. When all of this happens at the exact same time (atleast when inside the same warp), how can this work and end up with the actual maximum within this block?
I would think either an atomic function or some kind of lock or critical section would be needed, but i haven't seen this addressed in the answers i have found. (ex http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf )
The answer to your questions are contained in the very document you linked to, and the SDK reduction example shows concrete implementations of the reduction concept.
For completeness, here is a concrete example of a reduction kernel:
template <typename T, int BLOCKSIZE>
__global__ reduction(T *inputvals, T *outputvals, int N)
{
__shared__ volatile T data[BLOCKSIZE];
T maxval = inputvals[threadIdx.x];
for(int i=blockDim.x + threadIdx.x; i<N; i+=blockDim.x)
{
maxfunc(maxval, inputvals[i]);
}
data[threadIdx.x] = maxval;
__syncthreads();
// Here maxfunc(a,b) sets a to the minimum of a and b
if (threadIdx.x < 32) {
for(int i=32+threadIdx.x; i < BLOCKSIZE; i+= 32) {
maxfunc(data[threadIdx.x], data[i]);
}
if (threadIdx.x < 16) maxfunc(data[threadIdx.x], data[threadIdx.x+16]);
if (threadIdx.x < 8) maxfunc(data[threadIdx.x], data[threadIdx.x+8]);
if (threadIdx.x < 4) maxfunc(data[threadIdx.x], data[threadIdx.x+4]);
if (threadIdx.x < 2) maxfunc(data[threadIdx.x], data[threadIdx.x+2]);
if (threadIdx.x == 0) {
maxfunc(data[0], data[1]);
outputvals[blockIdx.x] = data[0];
}
}
}
The key point is using the synchronization that is implicit within a warp to perform the reduction in shared memory. The result is a single per-block maximum value. A second reduction pass is required to reduce the set of block maximums to the global maximum (often it is faster to o this on the host). In this example, maxvals is the "compare and set" function which could be as simple as
template<T>
__device__ void maxfunc(T & a, T & b)
{
a = (b > a) ? b : a;
}
Dont' cook your own code, use some thrust (included in version 4.0 of the Cuda sdk) :
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <iostream>
int main(void)
{
thrust::host_vector<int> h_vec(10000);
thrust::sequence(h_vec.begin(), h_vec.end());
// show hvec
thrust::copy(h_vec.begin(), h_vec.end(),
std::ostream_iterator<int>(std::cout, "\n"));
// transfer to device
thrust::device_vector<int> d_vec = h_vec;
int max_dvec_value = *thrust::max_element(d_vec.begin(), d_vec.end());
std::cout << "max value: " << max_dvec_value << "\n";
return 0;
}
And watch out that thrust::max_element returns a pointer.
Your question is clearly answered in the document you link to. I think you just need to spend some more time reading it and understanding the CUDA concepts used in it. In particular, I would focus on shared memory, the __syncthreads() method, and how to uniquely identify a thread while inside a kernel. Additionally, you should try to understand why the reduction may need to be run in 2 passes to find the global maximum.

How to synchronize global memory between multiple kernel launches?

I want to launch multiple times the following kernel in a FOR LOOP (pseudo):
__global__ void kernel(t_dev is input array in global mem) {
__shared__ PREC tt[BLOCK_DIM];
if (thid < m) {
tt[thid] = t_dev.data[ii]; // MEM READ!
}
... // MODIFY
__syncthreads();
if (thid < m) {
t_dev.data[thid] = tt[thid]; // MEM WRITE!
}
__threadfence(); // or __syncthreads(); //// NECESSARY!! but why?
}
What I do conceptually is I read in values from t_dev . modify them, and write out to global mem again! and then I start the same kernel again!!
Why do I need obviously the _threadfence or __syncthread
otherwise the result get wrong, because, memory writes are not finished when the same kernel starts again. Thats what happens here, my GTX580 has device overlap enabled,
But why are global mem writes not finished when the next kernel starts... is this because of the device overlap or because its always like that? I thought, when we launch kernel after kernel, mem write/reads are finished after one kernel... :-)
Thanks for your answers!
SOME CODE :
for(int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++){
proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
mu_dev,x_new_dev,T_dev,x_old_dev,d_dev,
t_dev,
kernelAIdx,
pConvergedFlag_dev,
m_absTOL,m_relTOL);
proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
t_dev,
T_dev,
x_new_dev,
kernelAIdx
);
}
These are thw two kernels which are in the loop, the t_dev and x_new_dev, is moved from Step A to Step B,
Kernel A looks as follows:
template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
__global__ void sorProxContactOrdered_1threads_StepA_kernel(
utilCuda::Matrix<PREC> mu_dev,
utilCuda::Matrix<PREC> y_dev,
utilCuda::Matrix<PREC> T_dev,
utilCuda::Matrix<PREC> x_old_dev,
utilCuda::Matrix<PREC> d_dev,
utilCuda::Matrix<PREC> t_dev,
int kernelAIdx,
int maxNContacts,
bool * convergedFlag_dev,
PREC _absTOL, PREC _relTOL){
//__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY
// Assumend 1 Block, with THREADS_PER_BLOCK Threads and Column Major Matrix T_dev
int thid = threadIdx.x;
int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
int i = kernelAIdx * BLOCK_DIM;
int ii = i + thid;
//First copy x_old_dev in shared
__shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if its in the limit!!
__shared__ PREC tt[BLOCK_DIM];
if(thid < m){
xx[thid] = x_old_dev.data[ii];
tt[thid] = t_dev.data[ii];
}
__syncthreads();
PREC absTOL = _absTOL;
PREC relTOL = _relTOL;
int jj;
//PREC T_iijj;
//Offset the T_dev_ptr to the start of the Block
PREC * T_dev_ptr = PtrElem_ColM(T_dev,i,i);
PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];
__syncthreads();
for(int j_t = 0; j_t < m ; j_t+=PROX_PACKAGE_SIZE){
//Select the number of threads we need!
// Here we process one [m x PROX_PACKAGE_SIZE] Block
// First Normal Direction ==========================================================
jj = i + j_t;
__syncthreads();
if( ii == jj ){ // select thread on the diagonal ...
PREC x_new_n = (d_dev.data[ii] + tt[thid]);
//Prox Normal!
if(x_new_n <= 0.0){
x_new_n = 0.0;
}
/* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
xx[thid] = x_new_n;
tt[thid] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
// Select only m threads!
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t];
}
// ====================================================================================
// wee need to syncronize here because one threads finished lambda_t2 with shared mem tt, which is updated from another thread!
__syncthreads();
// Second Tangential Direction ==========================================================
jj++;
__syncthreads();
if( ii == jj ){ // select thread on diagonal, one thread finishs T1 and T2 directions.
// Prox tangential
PREC lambda_T1 = (d_dev.data[ii] + tt[thid]);
PREC lambda_T2 = (d_dev.data[ii+1] + tt[thid+1]);
PREC radius = (*mu_dev_ptr) * xx[thid-1];
PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);
if(absvalue > radius){
lambda_T1 = (lambda_T1 * radius ) / absvalue;
lambda_T2 = (lambda_T2 * radius ) / absvalue;
}
/*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}
if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
//Write the two values back!
xx[thid] = lambda_T1;
tt[thid] = 0.0;
xx[thid+1] = lambda_T2;
tt[thid+1] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
}
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
}
// ====================================================================================
__syncthreads();
// move T_dev_ptr 1 column
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
// move mu_ptr to nex contact
__syncthreads();
mu_dev_ptr = &mu_dev_ptr[1];
__syncthreads();
}
__syncthreads();
// Write back the results, dont need to syncronize because
// do it anyway to be safe for testing first!
if(thid < m){
y_dev.data[ii] = xx[thid]; THIS IS UPDATED IN KERNEL B
t_dev.data[ii] = tt[thid]; THIS IS UPDATED IN KERNEL B
}
//__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORKING!
I compare the solution at the end with the CPU, and HERE I put everywhere I can a syncthread around only to be safe, for the start! (this code does gauss seidel stuff)
but it does not work at all without the THREAD_FENCE at the END or at the BEGINNIG where it does not make sense...
Sorry for so much code, but probably you can guess where the problem comes, frome because I am bit at my end, with explainig why this happens?
We checked the algorithm several times, there is no memory error (reported from Nsight) or
other stuff, every thing works fine... Kernel A is launched with ONE Block only!
If you launch the successive instances of the kernel into the same stream, each kernel launch is synchronous compared to the kernel instance before and after it. The programming model guarantees it. CUDA only permits simultaneous kernel execution on kernels launched into different streams of the same context, and even then overlapping kernel execution only happens if the scheduler determines that sufficient resources are available to do so.
Neither __threadfence nor __syncthreads will have the effect you seem to be thinking about - __threadfence works only at the scope of all active threads and __syncthreads is an intra-block barrier operation. If you really want kernel to kernel synchronization, you need to use one of the host side synchronization calls, like cudaThreadSynchronize (pre CUDA 4.0) or cudaDeviceSynchronize (cuda 4.0 and later), or the per-stream equivalent if you are using streams.
While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.
Writes to global memory, with an exception of atomic functions, are not guaranteed to be immediately visible by other threads (from the same, or from different blocks). By putting __threadfence() you halt the current thread until the writes are in fact visible. This might be important in particular when you are using global memory with a cache (the Fermi series).
One thing to note: Kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with your current one, but will launch as soon as the current one finishes, esentially hiding the latency caused by the CPU->GPU communication.
Using cudaThreadSynchronise halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Do note, that using synchronous memory access (e.g. cudaMemcpy, without "Async" suffix) esentially behaves like cudaThreadSynchronise too.