Parallel Reduction in CUDA for calculating primes - cuda

I have a code to calculate primes which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
check = 1;
for (j = 2; j <= sqrt_limit; j++)
{
if ( !(j&1) && (i&(j-1)) == 0 )
{
check = 0;
break;
}
if ( j&1 && i%j == 0 )
{
check = 0;
break;
}
}
if (check)
pcount++;
}
I am trying to port it to GPU, and I would want to reduce the count as I did for the OpenMP example above. Following is my code, which apart from giving incorrect results is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
__shared__ int s_flags[NTHREADS];
if (gid > sqrootN && gid < N)
s_flags[tid] = flags[gid];
else
return;
__syncthreads();
s_flags[tid] = 1;
for (j = 2; j <= sqrootN; j++)
{
if ( gid%j == 0 )
{
s_flags[tid] = 0;
break;
}
}
//reduce
for(unsigned int s=1; s < blockDim.x; s*=2)
{
if( tid % (2*s) == 0 )
{
s_flags[tid] += s_flags[tid + s];
}
__syncthreads();
}
//write results of this block to the global memory
if (tid == 0)
o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast, I think the bottleneck is the for loop, and I am not sure how to replace it. And next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array, I have marked the primes from 2 to sqroot(N), in this kernel I am calculating primes from sqroot(N) to N, but I would need to check whether each number in {sqroot(N),N} is divisible by primes in {2,sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestion, I modified my code (I understand about the comment on syncthreads now better); I realized that I do not need the flags array and just the global indexes work in my case. What concerns me at this point is the slowness of the code (more than correctness) that could be attributed to the for loop. Also, after a certain data size (100000), the kernel was producing incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member in the NVidia forum pointed out that that may be because my data size is not of a power of 2).
So there are still three (may be related) questions -
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
volatile __shared__ int s_flags[NTHREADS];
s_flags[tid] = 1;
for (unsigned int j=2; j<=sqrootN; j++)
{
if ( gid % j == 0 )
s_flags[tid] = 0;
}
__syncthreads();
//reduce
reduce(s_flags, tid, o_flags);
}

While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX is bar.sync and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a
warp are active. Thus, if any thread in a warp executes a bar
instruction, it is as if all the threads in the warp have executed the
bar instruction. All threads in the warp are stalled until the barrier
completes, and the arrival count for the barrier is incremented by the
warp size (not the number of active threads in the warp). In
conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the condition identically (the
warp does not diverge). Since barriers are executed on a per-warp
basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction, this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and not recommended. I can recall reading some material suggesting that Sieve of Atkin was a useful approach to fast prime generation on GPUs.

Related

The way to properly do multiple CUDA block synchronization

I like to do CUDA synchronization for multiple blocks. It is not for each block where __syncthreads() can easily handle it.
I saw there are exiting discussions on this topic, for example cuda block synchronization, and I like the simple solution brought up by #johan, https://stackoverflow.com/a/67252761/3188690, essentially it uses a 64 bits counter to track the synchronized blocks.
However, I wrote the following code trying to accomplish the similar job but meet a problem. Here I used the term environment so that the wkNumberEnvs of blocks within this environment shall be synchronized. It has a counter. I used atomicAdd() to count how many blocks have already been synchronized themselves, once the number of sync blocks == wkBlocksPerEnv, I know all blocks finished sync and it is free to go. However, it has a strange outcome that I am not sure why.
The problem comes from this while loop. Since the first threads of all blocks are doing the atomicAdd, there is a while loop to check until the condition meets. But I find that some blocks will be stuck into the endless loop, which I am not sure why the condition cannot be met eventually? And if I printf some messages either in *** I can print here 1 or *** I can print here 2, there is no endless loop and everything is perfect. I do not see something obvious.
const int wkBlocksPerEnv = 2;
__device__ int env_sync_block_count[wkNumberEnvs];
__device__ void syncthreads_for_env(){
// sync threads for each block so all threads in this block finished the previous tasks
__syncthreads();
// sync threads for wkBlocksPerEnv blocks for each environment
if(wkBlocksPerEnv > 1){
const int kThisEnvId = get_env_scope_block_id(blockIdx.x);
if (threadIdx.x == 0){
// incrementing env_sync_block_count by 1
atomicAdd(&env_sync_block_count[kThisEnvId], 1);
// *** I can print here 1
while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv){
// *** I can print here 2
}
// Do the next job ...
}
}
There are two potential issues with your code. Caching and block scheduling.
Caching can prevent you from observing an updated value during the while loop.
Block scheduling can cause a dead-lock if you wait for an update of a block which has not yet been scheduled. Since CUDA does not guarantee a specific order of scheduled blocks, the only way to prevent this dead-lock is to limit the number of blocks in the grid such that all blocks can run simultaneously.
Following code shows how you could synchronize multiple blocks while avoiding above issues. I adapted the code from the multi-grid synchronization given in the CUDA-sample conjugateGradientMultiDeviceCG https://github.com/NVIDIA/cuda-samples/blob/master/Samples/4_CUDA_Libraries/conjugateGradientMultiDeviceCG/conjugateGradientMultiDeviceCG.cu#L186
On pre-Volta devices, it uses volatile memory accesses. Volta and later uses acquire/release semantics.
Grid size is limited by querying device properties.
#include <cassert>
#include <cstdio>
constexpr int wkBlocksPerEnv = 13;
__device__
int getEnv(int blockId){
return blockId / wkBlocksPerEnv;
}
__device__
int getRankInEnv(int blockId){
return blockId % wkBlocksPerEnv;
}
__device__
unsigned char load_arrived(unsigned char *arrived) {
#if __CUDA_ARCH__ < 700
return *(volatile unsigned char *)arrived;
#else
unsigned int result;
asm volatile("ld.acquire.gpu.global.u8 %0, [%1];"
: "=r"(result)
: "l"(arrived)
: "memory");
return result;
#endif
}
__device__
void store_arrived(unsigned char *arrived,
unsigned char val) {
#if __CUDA_ARCH__ < 700
*(volatile unsigned char *)arrived = val;
#else
unsigned int reg_val = val;
asm volatile(
"st.release.gpu.global.u8 [%1], %0;" ::"r"(reg_val) "l"(arrived)
: "memory");
// Avoids compiler warnings from unused variable val.
(void)(reg_val = reg_val);
#endif
}
#if 0
//wrong implementation which does not synchronize. to check that kernel assert does trigger without proper synchronization
__device__
void syncthreads_for_env(unsigned char* temp){
}
#else
//temp must have at least size sizeof(unsigned char) * total_number_of_blocks in grid
__device__
void syncthreads_for_env(unsigned char* temp){
__syncthreads();
const int env = getEnv(blockIdx.x);
const int blockInEnv = getRankInEnv(blockIdx.x);
unsigned char* const mytemp = temp + env * wkBlocksPerEnv;
if(threadIdx.x == 0){
if(blockInEnv == 0){
// Leader block waits for others to join and then releases them.
// Other blocks in env can arrive in any order, so the leader have to wait for
// all others.
for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
while (load_arrived(&mytemp[i]) == 0)
;
}
for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
store_arrived(&mytemp[i], 0);
}
__threadfence();
}else{
// Other blocks in env note their arrival and wait to be released.
store_arrived(&mytemp[blockInEnv - 1], 1);
while (load_arrived(&mytemp[blockInEnv - 1]) == 1)
;
}
}
__syncthreads();
}
#endif
__global__
void kernel(unsigned char* synctemp, int* array){
const int env = getEnv(blockIdx.x);
const int blockInEnv = getRankInEnv(blockIdx.x);
if(threadIdx.x == 0){
array[blockIdx.x] = 1;
}
syncthreads_for_env(synctemp);
if(threadIdx.x == 0){
int sum = 0;
for(int i = 0; i < wkBlocksPerEnv; i++){
sum += array[env * wkBlocksPerEnv + i];
}
assert(sum == wkBlocksPerEnv);
}
}
int main(){
const int smem = 0;
const int blocksize = 128;
int deviceId = 0;
int numSMs = 0;
int maxBlocksPerSM = 0;
cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&maxBlocksPerSM,
kernel,
blocksize,
smem
);
int maxBlocks = maxBlocksPerSM * numSMs;
maxBlocks -= maxBlocks % wkBlocksPerEnv; //round down to nearest multiple of wkBlocksPerEnv
printf("wkBlocksPerEnv %d, maxBlocks: %d\n", wkBlocksPerEnv, maxBlocks);
int* d_array;
unsigned char* d_synctemp;
cudaMalloc(&d_array, sizeof(int) * maxBlocks);
cudaMalloc(&d_synctemp, sizeof(unsigned char) * maxBlocks);
cudaMemset(d_synctemp, 0, sizeof(unsigned char) * maxBlocks);
kernel<<<maxBlocks, blocksize>>>(d_synctemp, d_array);
cudaFree(d_synctemp);
cudaFree(d_array);
return 0;
}
Atomic value is going to global memory but in the while-loop you read it directly and it must be coming from the cache which will not automatically synchronize between threads (cache-coherence only handled by explicit synchronizations like threadfence). Thread gets its own synchronization but other threads may not see it.
Even if you use threadfence, the threads in same warp would be in dead-lock waiting forever if they were the first to check the value before any other thread updates it. But should work with newest GPUs supporting independent thread scheduling.
I like to do CUDA synchronization for multiple blocks.
You should learn to dis-like it. Synchronization is always costly, even when implemented just right, and inter-core synchronization all the more so.
if (threadIdx.x == 0){
// incrementing env_sync_block_count by 1
atomicAdd(&env_sync_block_count[kThisEnvId], 1);
while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv)
// OH NO!!
{
}
}
This is bad. With this code, the first warp of each block will perform repeated reads of env_sync_block_count[kThisEnvId]. First, and as #AbatorAbetor mentioned, you will face the problem of cache incoherence, causing your blocks to potentially read the wrong value from a local cache well after the global value has long changed.
Also, your blocks will hog up the multiprocessors. Blocks will stay resident and have at least one active warp, indefinitely. Who's to say the will be evicted from their multiprocessor to schedule additional blocks to execute? If I were the GPU, I wouldn't allow more and more active blocks to pile up. Even if you don't deadlock - you'll be wasting a lot of time.
Now, #AbatorAbetor's answer avoids the deadlock by limiting the grid size. And I guess that works. But unless you have a very good reason to write your kernels this way - the real solution is to just break up your algorithm into consecutive kernels (or better yet, figure out how to avoid the need to synchronize altogether).
a mid-way approach is to only have some blocks get past the point of synchronization. You could do that by not waiting except on some condition which holds for a very limited number of blocks (say you had a single workgroup - then only the blocks which got the last K possible counter values, wait).

Are atomic operations in CUDA guaranteed to be scheduled per warp?

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each blcok either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array.
Now to write the actual values atomicExch is used, exchanging the current value in the array with the value that the block attempts to write. Because of SIMD, atomic operation and the fact that a warp executes in lockstep I would expect the array to, at any point in time, only contain 1's or 0's. But never a mix of the two.
However, while running code like this there are several cases where at some point in time the array contains of a mix of 0's and 1's. Which appears to point to the fact that atomic operations are not executed per warp, and instead scheduled using some other scheme.
From other sources I have not really found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. Since I need to write many small vectors consisting of several 32 bit integers atomically to global memory, and an atomic operation that is guaranteed to write a single vector atomically is obviously very important.
For those wondering, the code I wrote was executed on a GTX 970, compiled on compute capability 5.2, using CUDA 8.0.
The atomic instructions, like all instructions, are scheduled per warp. However there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage through the pipeline. This gives rise to the possibility for your observations.
I believe a simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where we have multiple issue per clock within an SM and even across SMs, to as additional examples.
If the vector length were short enough (16 bytes or less) then it should be possible to accomplish this ("atomic update") simply by having a thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) are attempting to update a naturally aligned location, the write should not be corrupted by other writes.
However, after discussion in the comments, it seems that OP's goal is to be able to have a warp or threadblock update a vector of some length, without interference from other warps or threadblocks. It seems to me that really what is desired is access control (so that only one warp or threadblock is updating a particular vector at a time) and OP had some code that wasn't working as desired.
This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.
What follows is an example producer-consumer code, where there are multiple threadblocks that are updating a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:
vector is empty
vector is being filled
vector is filled, ready for "consumption"
with this 3-level scheme, we can allow for ordinary access to the vector by both consumer and multiple producer workers, with a single ordinary atomic variable access mechanism. Here is an example code:
#include <assert.h>
#include <iostream>
#include <stdio.h>
const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64; // needs to be greater than the maximum number of SMs for any GPU that it will be run on
__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0}; // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;
static __device__ __inline__ int __mysmid(){
int smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
// this code won't work on a GPU with a single SM!
__global__ void kernel(){
__shared__ volatile int done, update, next_slot;
int my_block_id = atomicAdd(&block_id, 1);
int my_sm = __mysmid();
if (my_block_id == 0){
if (!threadIdx.x){
restricted_sm = my_sm;
__threadfence();
// I am "block 0" and process the vectors, checking for coherency
// "consumer"
next_slot = 0;
volatile int *vslot_control = slot_control;
volatile int *vslots = slots;
int scount = 0;
while(scount < max_act){
if (vslot_control[next_slot] == slot_full){
scount++;
int slot_val = vslots[next_slot*slot_length];
for (int i = 1; i < slot_length; i++) if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */}
observations[slot_val]++;
vslot_control[next_slot] = slot_empty;
correct++;
__threadfence();
}
next_slot++;
if (next_slot >= num_slots) next_slot = 0;
}
}}
else {
// "producer"
while (restricted_sm < 0); // wait for signaling
if (my_sm == restricted_sm) return;
next_slot = 0;
done = 0;
__syncthreads();
while (!done) {
if (!threadIdx.x){
while (atomicCAS(slot_control+next_slot, slot_empty, slot_filling) > slot_empty) {
next_slot++;
if (next_slot >= num_slots) next_slot = 0;}
// we grabbed an empty slot, fill it with my_sm
if (atomicAdd(&num_act, 1) < max_act) update = 1;
else {done = 1; update = 0;}
}
__syncthreads();
if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
__threadfence(); //enforce ordering
if ((update) && (!threadIdx.x)){
slot_control[next_slot] = 2; // mark slot full
atomicAdd(actives+my_sm, 1);}
__syncthreads();
}
}
}
int main(){
kernel<<<256, slot_length>>>();
cudaDeviceSynchronize();
cudaError_t res= cudaGetLastError();
if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
int *h_obs = new int[max_sm];
int *h_act = new int[max_sm];
int h_correct;
cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
int h_total_act = 0;
int h_total_obs = 0;
for (int i = 0; i < max_sm; i++){
std::cout << h_act[i] << "," << h_obs[i] << " ";
h_total_act += h_act[i];
h_total_obs += h_obs[i];}
std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}
I don't claim this code to be defect free for any use case. It is advanced to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on linux, on a couple different systems I tested it on. It should not be run on GPUs that have only a single SM, as one SM is reserved for the consumer, and the remaining SMs are used by the producers.

How to collect individual results of the threads within a block?

In my Kernel, the threads are processing a small part of an array in global memory.
After processing I would also like to set a flag indicating that the result of the calculation is zero for all threads within a block:
__global__ void kernel( int *a, bool *blockIsNull) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int result = 0;
// {...} Here calculate result
a[tid] = result;
// some code here, but I don't know, that's my question...
if (condition)
blockIsNull[blockIdx.x] = true; // if all threads have returned result==0
}
Each individual thread owns the information. But I don't find an efficient way to collect it.
For example, I could have a counter in shared memory that is atomically incremented by each thread when result==0. So when the counter reaches blockDim.x it means that all threads have returned zero. Althought not tested, I am afraid that this solution will have a negative impact on performance (atomic functions are slow).
A zero result does not occur very often, so it is very unlikely to have zeros for all threads within a block. I would like to find a solution that has little impact on the performance in the general case.
What would be your recommendation ?
It sounds like you want to perform a block level reduction of the condition value across a block. Just about all CUDA hardware supports a set of very useful warp voting primitives. You could use the __all() warp vote to determine that each warp of threads satisfied the condition, and then use __all() again to check whether all warps satisfy the condition. In code, it might look like this:
__global__ void kernel( int *a, bool *blockIsNull) {
// assume that threads per block is <= 1024
__shared__ volatile int blockcondition[32];
int laneid = threadIdx.x % 32;
int warpid = threadIdx.x / 32;
// Set each condition value to non zero to begin
if (warpid == 0) {
blockcondition[threadIdx.x] = 1;
}
__syncthreads();
//
// your code goes here
//
// warpcondition holds the vote from each warp
int warpcondition = __all(condition);
// First thread in each warp loads the warp vote to shared memory
if (laneid == 0) {
blockcondition[warpid] = warpcondition;
}
__syncthreads();
// First warp reduces all the votes in shared memory
if (warpid == 0) {
int result = __all(blockcondition[threadIdx.x] != 0);
// first thread stores the block result to global memory
if (laneid == 0) {
blockIsNull[blockIdx.x] = (result !=0);
}
}
}
[ Huge disclaimer: written in browser, never compiled or tested, use at own risk ]
This code should (I think) work for any number of threads per block up to 1024. You could, if required, adjust the size of blockcondition to a smaller value if you were confident of an upper block size limit less than 1024. Probably the smartest way would be to use C++ templating and make the warp count limit a template parameter.

Implementing Max Reduce in Cuda

I've been learning Cuda and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel
__global__ void max_reduce(const float* const d_array,
float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (gid == 0)
*d_max = shared[tid];
}
I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.
To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
Is there something obvious I'm missing/doing wrong?
Your main conclusion in your deleted answer was correct: the kernel you have posted doesn't comprehend the fact that at the end of that kernel execution, you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX; // 1
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]); // 2
__syncthreads();
}
// what to do now?
// option 1: save block result and launch another kernel
if (tid == 0)
d_max[blockIdx.x] = shared[tid]; // 3
// option 2: use atomics
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to gid for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
As you already discovered, your last line was not correct. Each block produces it's own result, and we must combine them somehow. One method you might consider if the number of blocks launched is small (more on this later) is to use atomics. Normally we steer people away from using atomics since they are "costly" in terms of execution time. However, the other option we are faced with is saving the block result in global memory, finishing the kernel, and then possibly launching another kernel to combine the individual block results. If I have launched a large number of blocks initially (say more than 1024) then if I follow this methodology I might end up launching two additional kernels. Thus the consideration of atomics. As indicated, there is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to generate any arbitrary atomic function, and I have provided an example of that in atomicMaxf which provides an atomic max for float.
But is running 1024 or more atomic functions (one per block) the best way? Probably not.
When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX;
while (gid < elements) {
shared[tid] = max(shared[tid], d_array[gid]);
gid += gridDim.x*blockDim.x;
}
__syncthreads();
gid = (blockDim.x * blockIdx.x) + tid; // 1
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128, you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid variable to it's initial value. This is actually unnecessary and the block reduction loop code can be further simplified if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.
Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLOAT_MAX.
Again, we normally steer people way from atomic usage, but in this case, it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.
For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
Here's one that appears naive but isn't. This won't generalize to other functions like sum(), but it works great for min() and max().
__device__ const float float_min = -3.402e+38;
__global__ void maxKernel(float* d_data)
{
// compute max over all threads, store max in d_data[0]
int i = threadIdx.x;
__shared__ float max_value;
if (i == 0) max_value = float_min;
float v = d_data[i];
__syncthreads();
while (max_value < v) max_value = v;
__syncthreads();
if (i == 0) d_data[0] = max_value;
}
Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!
Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.
It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.
With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation which is usually immediately false across the warp when max_value is on the high end of the distribution. You could modify it to use multiple SM's, but that would greatly increase the total workload and add a communication cost, so not likely to help.
For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0].
I've uploaded the working file here: https://github.com/kenseehart/YAMR

Efficient method to check for matrix stability in CUDA

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".