CUDA threads appear to be out of sync

I have an issue where it appears that a single thread is trailing behind the rest, even though I'm using __syncthreads(). The following extract is taken from a large program, where I've cut out as much as I can while still reproducing my problem. What I find is that upon running this code the test4 array does not return the same value for all threads. My understanding is that, because of the TEST_FLAG variable, all threads should enter the if (TEST_FLAG == 2) condition, and therefore every element of the array test4 should be set to 43. However, what I find is that all elements return 43 except thread 0, which returns 0. It appears as if the threads are not all reaching the same __syncthreads(). I've performed numerous tests and found that removing more of the code, such as the for (l=0; l<1; ++l) loop, resolves the issue, but I do not understand why. Any help as to why my threads are not all returning the same value would be greatly appreciated.
import numpy as np
import pycuda.driver as drv
import pycuda.compiler
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
from pycuda.compiler import SourceModule
gpu_code = SourceModule("""
__global__ void test_sync(double *test4, double *test5)
{
    __shared__ double rad_loc[2], boundary[2], boundary_limb_edge[2];
    __shared__ int TEST_FLAG;
    int l;
    if (blockIdx.x != 0)
    {
        return;
    }
    if (threadIdx.x == 0)
    {
        TEST_FLAG = 2;
        boundary[0] = 1;
    }
    test4[threadIdx.x] = 0;
    test5[threadIdx.x] = 0;
    if (threadIdx.x == 0)
    {
        rad_loc[0] = 0.0;
    }
    __syncthreads();
    for (l = 0; l < 1; ++l)
    {
        __syncthreads();
        if (rad_loc[0] > 0.0)
        {
            test5[threadIdx.x] += 1;
            if ((int)boundary[0] == -1)
            {
                __syncthreads();
                continue;
            }
        }
        else
        {
            if (threadIdx.x == 0)
            {
                boundary_limb_edge[0] = 0.0;
            }
        }
        __syncthreads();
        if (TEST_FLAG == 2)
        {
            test4[threadIdx.x] = 43;
            __syncthreads();
            TEST_FLAG = 99;
        }
        __syncthreads();
        return;
    }
    return;
}
""")
test_sync = gpu_code.get_function("test_sync")
DATA_ROWS=[100,100]
blockshape_data_mags = (int(64),1, 1)
gridshape_data_mags = (int(sum(DATA_ROWS)), 1)
test4 = np.zeros([1*blockshape_data_mags[0]], np.float64)
test5 = np.zeros([1*blockshape_data_mags[0]], np.float64)
test_sync(drv.InOut(test4), drv.InOut(test5), block=blockshape_data_mags, grid=gridshape_data_mags)
print test4
print test5

As Yuuta mentioned, the behavior of __syncthreads() is undefined when it is reached from inside divergent conditional code, so having it there may or may not work as expected. You may want to rewrite your code so that __syncthreads() does not end up inside your if conditions, as sketched below.
You may check this answer and this paper for more information on __syncthreads().
It is also important to note that __syncthreads() is a block-level barrier. You cannot synchronize different blocks with it; blocks must be synchronized by separate kernel calls.
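For illustration, here is a minimal sketch of that rewrite applied to the outer branch of the loop in the question (the inner continue path would need the same treatment): the divergent work stays inside the branch, but the barrier is hoisted out so that it sits on a path every thread in the block executes.
if (rad_loc[0] > 0.0)
{
    test5[threadIdx.x] += 1;      // divergent work, but no __syncthreads() in here
}
else if (threadIdx.x == 0)
{
    boundary_limb_edge[0] = 0.0;
}
__syncthreads();                  // a single barrier that every thread reaches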

Your problem is with the statement TEST_FLAG = 99. One of the threads can execute it before thread 0 has entered the conditional block, which races with thread 0's read of the shared variable and gives you undefined behavior. If I comment out TEST_FLAG = 99, the code runs as expected.
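If the flag does need to be reset, a minimal sketch of a race-free variant of that part of the loop is to let a single thread write it, after a barrier that guarantees every thread has already read the old value:
__syncthreads();
if (TEST_FLAG == 2)
{
    test4[threadIdx.x] = 43;
}
__syncthreads();                  // all threads have read the flag and written test4
if (threadIdx.x == 0)
{
    TEST_FLAG = 99;               // single writer, no race with the reads above
}
__syncthreads();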

Related

The way to properly do multiple CUDA block synchronization

I would like to do CUDA synchronization for multiple blocks. This is not per-block synchronization, which __syncthreads() can easily handle.
I saw there are existing discussions on this topic, for example cuda block synchronization, and I like the simple solution brought up by #johan, https://stackoverflow.com/a/67252761/3188690, which essentially uses a 64-bit counter to track the synchronized blocks.
However, I wrote the following code trying to accomplish a similar job and ran into a problem. Here I use the term environment, meaning that the wkBlocksPerEnv blocks within one environment shall be synchronized. There is a counter, and I use atomicAdd() to count how many blocks have already synchronized themselves; once the number of synced blocks == wkBlocksPerEnv, I know all blocks have finished the sync and are free to go. However, it has a strange outcome that I am not sure about.
The problem comes from the while loop. Since the first thread of every block is doing the atomicAdd, there is a while loop to spin until the condition is met. But I find that some blocks get stuck in an endless loop, and I am not sure why the condition cannot eventually be met. If I printf some messages either at *** I can print here 1 or *** I can print here 2, there is no endless loop and everything is perfect. I do not see anything obvious.
const int wkBlocksPerEnv = 2;
__device__ int env_sync_block_count[wkNumberEnvs];

__device__ void syncthreads_for_env(){
    // sync threads for each block so all threads in this block finished the previous tasks
    __syncthreads();
    // sync threads for wkBlocksPerEnv blocks for each environment
    if(wkBlocksPerEnv > 1){
        const int kThisEnvId = get_env_scope_block_id(blockIdx.x);
        if (threadIdx.x == 0){
            // incrementing env_sync_block_count by 1
            atomicAdd(&env_sync_block_count[kThisEnvId], 1);
            // *** I can print here 1
            while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv){
                // *** I can print here 2
            }
            // Do the next job ...
        }
    }
}
There are two potential issues with your code: caching and block scheduling.
Caching can prevent you from observing an updated value during the while loop.
Block scheduling can cause a deadlock if you wait for an update from a block which has not yet been scheduled. Since CUDA does not guarantee a specific scheduling order of blocks, the only way to prevent this deadlock is to limit the number of blocks in the grid such that all blocks can be resident simultaneously.
The following code shows how you could synchronize multiple blocks while avoiding the above issues. I adapted the code from the multi-grid synchronization in the CUDA sample conjugateGradientMultiDeviceCG: https://github.com/NVIDIA/cuda-samples/blob/master/Samples/4_CUDA_Libraries/conjugateGradientMultiDeviceCG/conjugateGradientMultiDeviceCG.cu#L186
On pre-Volta devices, it uses volatile memory accesses. Volta and later use acquire/release semantics.
The grid size is limited by querying device properties.
#include <cassert>
#include <cstdio>

constexpr int wkBlocksPerEnv = 13;

__device__
int getEnv(int blockId){
    return blockId / wkBlocksPerEnv;
}

__device__
int getRankInEnv(int blockId){
    return blockId % wkBlocksPerEnv;
}

__device__
unsigned char load_arrived(unsigned char *arrived) {
#if __CUDA_ARCH__ < 700
    return *(volatile unsigned char *)arrived;
#else
    unsigned int result;
    asm volatile("ld.acquire.gpu.global.u8 %0, [%1];"
                 : "=r"(result)
                 : "l"(arrived)
                 : "memory");
    return result;
#endif
}

__device__
void store_arrived(unsigned char *arrived,
                   unsigned char val) {
#if __CUDA_ARCH__ < 700
    *(volatile unsigned char *)arrived = val;
#else
    unsigned int reg_val = val;
    asm volatile(
        "st.release.gpu.global.u8 [%1], %0;" ::"r"(reg_val), "l"(arrived)
        : "memory");
    // Avoids compiler warnings from unused variable val.
    (void)(reg_val = reg_val);
#endif
}

#if 0
//wrong implementation which does not synchronize. to check that kernel assert does trigger without proper synchronization
__device__
void syncthreads_for_env(unsigned char* temp){
}
#else
//temp must have at least size sizeof(unsigned char) * total_number_of_blocks in grid
__device__
void syncthreads_for_env(unsigned char* temp){
    __syncthreads();
    const int env = getEnv(blockIdx.x);
    const int blockInEnv = getRankInEnv(blockIdx.x);
    unsigned char* const mytemp = temp + env * wkBlocksPerEnv;
    if(threadIdx.x == 0){
        if(blockInEnv == 0){
            // Leader block waits for others to join and then releases them.
            // Other blocks in env can arrive in any order, so the leader has to wait for
            // all others.
            for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
                while (load_arrived(&mytemp[i]) == 0)
                    ;
            }
            for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
                store_arrived(&mytemp[i], 0);
            }
            __threadfence();
        }else{
            // Other blocks in env note their arrival and wait to be released.
            store_arrived(&mytemp[blockInEnv - 1], 1);
            while (load_arrived(&mytemp[blockInEnv - 1]) == 1)
                ;
        }
    }
    __syncthreads();
}
#endif

__global__
void kernel(unsigned char* synctemp, int* array){
    const int env = getEnv(blockIdx.x);
    const int blockInEnv = getRankInEnv(blockIdx.x);

    if(threadIdx.x == 0){
        array[blockIdx.x] = 1;
    }

    syncthreads_for_env(synctemp);

    if(threadIdx.x == 0){
        int sum = 0;
        for(int i = 0; i < wkBlocksPerEnv; i++){
            sum += array[env * wkBlocksPerEnv + i];
        }
        assert(sum == wkBlocksPerEnv);
    }
}

int main(){
    const int smem = 0;
    const int blocksize = 128;

    int deviceId = 0;
    int numSMs = 0;
    int maxBlocksPerSM = 0;

    cudaGetDevice(&deviceId);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM,
        kernel,
        blocksize,
        smem
    );

    int maxBlocks = maxBlocksPerSM * numSMs;
    maxBlocks -= maxBlocks % wkBlocksPerEnv; //round down to nearest multiple of wkBlocksPerEnv
    printf("wkBlocksPerEnv %d, maxBlocks: %d\n", wkBlocksPerEnv, maxBlocks);

    int* d_array;
    unsigned char* d_synctemp;
    cudaMalloc(&d_array, sizeof(int) * maxBlocks);
    cudaMalloc(&d_synctemp, sizeof(unsigned char) * maxBlocks);
    cudaMemset(d_synctemp, 0, sizeof(unsigned char) * maxBlocks);

    kernel<<<maxBlocks, blocksize>>>(d_synctemp, d_array);

    cudaFree(d_synctemp);
    cudaFree(d_array);
    return 0;
}
The atomic update goes to global memory, but in the while loop you read the value directly, so it may come from a cache, which is not automatically kept coherent between threads (coherence is only enforced by explicit synchronization such as __threadfence()). A thread sees its own update, but other threads may not see it.
Even if you use a threadfence, threads in the same warp could still deadlock, spinning forever, if they were the first to check the value before any other thread updates it. It should work, though, on newer GPUs that support independent thread scheduling.
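A minimal sketch of one way to read the counter so that the spin loop cannot keep seeing a stale cached value, assuming env_sync_block_count lives in global device memory as in the question; atomicAdd(ptr, 0) is used here purely as a read that always goes to the coherent copy (a cast through a volatile int* would serve the same purpose). Note that this still deadlocks if not all blocks of an environment are resident at the same time, as discussed in the other answer.
if (threadIdx.x == 0) {
    __threadfence();                                        // make this block's prior writes visible
    atomicAdd(&env_sync_block_count[kThisEnvId], 1);
    while (atomicAdd(&env_sync_block_count[kThisEnvId], 0) != wkBlocksPerEnv) {
        // spin until every block of this environment has arrived
    }
}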
I would like to do CUDA synchronization for multiple blocks.
You should learn to dislike it. Synchronization is always costly, even when implemented just right, and inter-core synchronization all the more so.
if (threadIdx.x == 0){
    // incrementing env_sync_block_count by 1
    atomicAdd(&env_sync_block_count[kThisEnvId], 1);
    while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv)
    // OH NO!!
    {
    }
}
This is bad. With this code, the first warp of each block will perform repeated reads of env_sync_block_count[kThisEnvId]. First, as #AbatorAbetor mentioned, you will face the problem of cache incoherence, causing your blocks to potentially read a stale value from a local cache well after the global value has changed.
Also, your blocks will hog the multiprocessors. Blocks will stay resident and keep at least one active warp, indefinitely. Who's to say they will be evicted from their multiprocessor so that additional blocks can be scheduled to execute? If I were the GPU, I wouldn't allow more and more active blocks to pile up. Even if you don't deadlock, you'll be wasting a lot of time.
Now, #AbatorAbetor's answer avoids the deadlock by limiting the grid size. And I guess that works. But unless you have a very good reason to write your kernels this way - the real solution is to just break up your algorithm into consecutive kernels (or better yet, figure out how to avoid the need to synchronize altogether).
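For concreteness, a hedged sketch of the consecutive-kernels alternative; phase1 and phase2 are hypothetical names standing in for the work before and after the synchronization point. The kernel boundary itself acts as the grid-wide barrier, because launches on the same stream execute in order.
__global__ void phase1(float* data, int n) { /* work before the sync point */ }
__global__ void phase2(float* data, int n) { /* work after the sync point */ }

void run(float* d_data, int n, int blocks, int threads) {
    phase1<<<blocks, threads>>>(d_data, n);
    // No explicit device-wide barrier is needed: phase2 does not start until
    // phase1 has finished, and phase1's global-memory writes are visible to it.
    phase2<<<blocks, threads>>>(d_data, n);
}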
A mid-way approach is to have only some blocks get past the synchronization point. You could do that by not waiting except on some condition that holds for a very limited number of blocks (say you had a single workgroup; then only the blocks which got the last K possible counter values wait).

Is there a race condition in the code of this Parallel Forall blogopost?

I've recently stumbled upon this blogpost in the NVIDIA devblogs:
https://devblogs.nvidia.com/parallelforall/accelerating-graph-betweenness-centrality-cuda/
I've implemented the edge-parallel code and it seems to work as intended; however, it seems to me that the code relies on a race condition that is "controlled" with __syncthreads().
This is the code (as shown in the blog):
__shared__ int current_depth;
__shared__ bool done;
if(idx == 0){
    done = false;
    current_depth = 0;
}
__syncthreads();

// Calculate the number of shortest paths and the
// distance from s (the root) to each vertex
while(!done){
    __syncthreads();
    done = true;
    __syncthreads();

    for(int k=idx; k<m; k+=blockDim.x) //For each edge...
    {
        int v = F[k];
        // If the head is in the vertex frontier, look at the tail
        if(d[v] == current_depth)
        {
            int w = C[k];
            if(d[w] == INT_MAX){
                d[w] = d[v] + 1;
                done = false;
            }
            if(d[w] == (d[v] + 1)){
                atomicAdd(&sigma[w],sigma[v]);
            }
        }
        __syncthreads();
        current_depth++;
    }
}
I think there is a race condition just at the end:
__syncthreads();
current_depth++;
I think the program is relying on the race condition so that the variable gets increased by only one, instead of by the number of threads. I don't feel like this is a good idea, but in my tests it seems to be reliable.
Is this really safe? Is there a better way to do it?
Thanks.
As the author of this blog post, I'd like to thank you for pointing out this error!
When I wrote this snippet I didn't use my verbatim edge-traversal code as that used explicit queuing to traverse the graph which makes the example more complicated without adding any pedagogical value. Instead I must have cargo-culted some old code and posted it incorrectly. It's been quite a while since I've touched this code or algorithm, but I believe the following snippet should work:
__shared__ int current_depth;
__shared__ bool done;
if(idx == 0){
    done = false;
    current_depth = 0;
}
__syncthreads();

// Calculate the number of shortest paths and the
// distance from s (the root) to each vertex
while(!done)
{
    __syncthreads();
    done = true;
    __syncthreads();

    for(int k=idx; k<m; k+=blockDim.x) //For each edge...
    {
        int v = F[k];
        // If the head is in the vertex frontier, look at the tail
        if(d[v] == current_depth)
        {
            int w = C[k];
            if(d[w] == INT_MAX){
                d[w] = d[v] + 1;
                done = false;
            }
            if(d[w] == (d[v] + 1)){
                atomicAdd(&sigma[w],sigma[v]);
            }
        }
    }

    __syncthreads(); //All threads reach here, no longer UB
    if(idx == 0){ //Only one thread should increment this shared variable
        current_depth++;
    }
}
Notes:
It looks like a similar issue exists in the node-parallel algorithm in the blog post.
You could also use a register instead of a shared variable for current_depth, in which case every thread would have to increment it (see the sketch below).
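A minimal sketch of that register variant, keeping the rest of the snippet unchanged: current_depth becomes a per-thread local variable, so there is no shared write to protect at the end of the loop.
__shared__ bool done;
if(idx == 0){
    done = false;
}
__syncthreads();
int current_depth = 0;              // per-thread copy, no shared write needed
while(!done)
{
    __syncthreads();
    done = true;
    __syncthreads();
    for(int k=idx; k<m; k+=blockDim.x)
    {
        int v = F[k];
        if(d[v] == current_depth)
        {
            int w = C[k];
            if(d[w] == INT_MAX){ d[w] = d[v] + 1; done = false; }
            if(d[w] == (d[v] + 1)){ atomicAdd(&sigma[w],sigma[v]); }
        }
    }
    __syncthreads();
    current_depth++;                // every thread increments its own register
}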
So to answer your question: no, that method is not safe. If I'm not mistaken, the blog snippet has the additional issue that current_depth should only be incremented once all vertices at the previous depth have been handled, i.e. at the conclusion of the for loop.
Finally, if you'd like the final version of my code that has been tested and used by people in the community, you can access it here: https://github.com/Adam27X/hybrid_BC

In CUDA, how do I detect that __syncthreads() was not called by all threads in the block?

I just ran into a weird and hard to reproduce problem in CUDA which turned out to involve undefined behaviour. I wanted thread 0 to set up some value in shared memory which should be used by all the threads.
__shared__ bool p;
p = false;
if (threadIdx.x == 0) p = true;
__syncthreads();
assert(p);
Now the assert(p); failed seemingly at random as I shoveled the code around and commented it out to find the issue.
I had used this construction in effectively the following undefined-behaviour context:
#include <assert.h>

__global__ void test() {
    if (threadIdx.x == 0) __syncthreads(); // call __syncthreads in thread 0 only: this is a very bad idea

    // everything below may exhibit undefined behaviour
    // If the above __syncthreads runs only in thread 0, this will fail for all threads not in the first warp
    __shared__ bool p;
    p = false;
    if (threadIdx.x == 0) p = true;
    __syncthreads();
    assert(p);
}

int main() {
    test<<<1, 32 + 1>>>(); // nothing happens if you have only one warp, so we use one more thread
    cudaDeviceSynchronize();
    return 0;
}
The earlier __syncthreads() reached by only one thread was of course hidden in some functions, so it was hard to find. On my setup (sm50, GTX 980), this kernel runs through (no deadlock as advertised...) and the assertion fails for all threads outside of the first warp.
TL;DR
Is there any standard way to detect __syncthreads() not being called by all threads in a block? Maybe some debugger setting I am missing?
I could maybe construct my own (very slow) checked__syncthreads() that detects the situation using atomics and global memory, but I'd rather have a standard solution.
You have a data race in your original code.
Thread 0 may advance up to and execute p = true, but after that a different thread that has not progressed at all may still be back at the p = false line, overwriting the result.
The easiest fix for this specific example would simply be to have ONLY thread 0 write to p, something like
__shared__ bool p;
if (threadIdx.x == 0) p = true;
__syncthreads();
assert(p);
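Regarding the TL;DR: I am not aware of a debugger switch specifically for this (cuda-memcheck's synccheck tool can report some barrier misuse), but here is a very rough sketch of the do-it-yourself checked__syncthreads() the question mentions. MAX_BLOCKS is a hypothetical upper bound on the grid size, and the check is only best-effort: if the misused barrier deadlocks instead of falling through, nothing is reported, since the misuse itself is undefined behavior.
#define MAX_BLOCKS 1024                           // assumed upper bound on gridDim.x

__device__ unsigned int arrive_count[MAX_BLOCKS]; // zero-initialized device globals

__device__ void checked_syncthreads()
{
    atomicAdd(&arrive_count[blockIdx.x], 1);      // every caller registers its arrival
    __syncthreads();
    if (threadIdx.x == 0) {
        unsigned int arrived = atomicExch(&arrive_count[blockIdx.x], 0);
        assert(arrived == blockDim.x);            // fewer arrivals: barrier not reached by all threads
    }
    __syncthreads();                              // keep others from re-incrementing before the reset
}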

How to select a GPU with CUDA?

I have a computer with 2 GPUs. I wrote a CUDA C program and I need to tell it somehow that I want to run it on just 1 of the 2 graphics cards. What is the command I need to type and how should I use it? I believe it is somehow related to cudaSetDevice but I can't really find out how to use it.
It should be pretty clear from the documentation of cudaSetDevice, but let me provide the following code snippet.
bool IsGpuAvailable()
{
    int devicesCount;
    cudaGetDeviceCount(&devicesCount);
    for(int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
    {
        cudaDeviceProp deviceProperties;
        cudaGetDeviceProperties(&deviceProperties, deviceIndex);
        if (deviceProperties.major >= 2
            && deviceProperties.minor >= 0)
        {
            cudaSetDevice(deviceIndex);
            return true;
        }
    }
    return false;
}
This is how I iterated through all available GPUs (cudaGetDeviceCount) looking for the first one with a compute capability of at least 2.0. If such a device was found, I used cudaSetDevice so that all subsequent CUDA computations were executed on that particular device. Without calling cudaSetDevice your CUDA app executes on the first GPU, i.e. the one with deviceIndex == 0, but which particular GPU that is depends on which GPU is in which PCIe slot.
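For the simplest case in the question (two GPUs, pick one), a minimal sketch is just to select the device index once, at the start of the program, before any allocations or kernel launches:
cudaSetDevice(1);   // 0 is the first GPU, 1 the second
// ... cudaMalloc, kernel launches, etc. now target that device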
EDIT:
After clarifying your question in the comments, it seems to me that it would be suitable for you to choose the device based on its name. If you are unsure about your actual GPU names, run this code, which prints the names of all your GPUs to the console:
int devicesCount;
cudaGetDeviceCount(&devicesCount);
for(int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
{
    cudaDeviceProp deviceProperties;
    cudaGetDeviceProperties(&deviceProperties, deviceIndex);
    cout << deviceProperties.name << endl;
}
After that, choose the name of the GPU that you want to use for computations, let's say it is "GTX XYZ". Call the following method from your main method; thanks to it, all CUDA kernels will be executed on the device named "GTX XYZ". You should also check the return value: true if a device with that name is found, false otherwise:
bool SetGPU()
{
    int devicesCount;
    cudaGetDeviceCount(&devicesCount);
    string desiredDeviceName = "GTX XYZ";
    for(int deviceIndex = 0; deviceIndex < devicesCount; ++deviceIndex)
    {
        cudaDeviceProp deviceProperties;
        cudaGetDeviceProperties(&deviceProperties, deviceIndex);
        if (deviceProperties.name == desiredDeviceName)
        {
            cudaSetDevice(deviceIndex);
            return true;
        }
    }
    return false;
}
Of course you have to change the value of the desiredDeviceName variable to the desired value.
Searching more carefully on the internet, I found these lines of code, which select the GPU with the most multiprocessors among all the devices installed in the PC.
int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
    int max_multiprocessors = 0, max_device = 0;
    for (device = 0; device < num_devices; device++) {
        cudaDeviceProp properties;
        cudaGetDeviceProperties(&properties, device);
        if (max_multiprocessors < properties.multiProcessorCount) {
            max_multiprocessors = properties.multiProcessorCount;
            max_device = device;
        }
    }
    cudaSetDevice(max_device);
}

Will other threads block in this code with CUDA?

I am new to CUDA programming and am seeing some strange behaviour.
I have a kernel like this:
__global__ void myKernel (uint64_t *input, int numOfBlocks, uint64_t *state) {
    int const t = blockIdx.x * blockDim.x + threadIdx.x;
    int i;
    for (i = 0; i < numOfBlocks; i++) {
        if (t < 32) {
            if (t < 8) {
                state[t] = state[t] ^ input[t];
            }
            if (t < 25) {
                deviceFunc(device_state); /* will use some printf() */
            }
        }
    }
}
I run this kernel with this parameter:
myKernel<<<1, 32>>>(input, numOfBlocks, state);
If 'numOfBlocks' is equal to 1, it works fine: I get the result I expect and the printf() calls inside deviceFunc() are in the correct order.
If 'numOfBlocks' is equal to 2, it does not work fine! The result is not what I expected and the printf() output is not in the correct order (I only use printf() from thread 0)!
So my question is now: the remaining threads (25 to 31), which are NOT calling deviceFunc(), will they wait and block at this position, or will they run on and start over with the next for-loop iteration? I always thought that every line in the kernel is synchronized within the same block.
I worked the whole day on this and finally found a solution. First, you are right that I had many RAW hazards in my deviceFunc(). I started to put a __syncthreads() after every WRITE operation, but I think this slows down my program, and I don't think __syncthreads() is the usual way to resolve them. Funnily enough, the result is still the same with and without the __syncthreads().
But the problem in my code above is that I used
input[t]
which was wrong, because I had to include 'numOfBlocks' in my index calculation:
input[(NUM_OF_XOR_THREADS * i) + t]
Now the result is correct and my problem is solved.
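For reference, a hedged sketch of how that corrected indexing fits into the kernel from the question, assuming NUM_OF_XOR_THREADS is 8 (the number of threads XOR-ing the input) and that input holds numOfBlocks consecutive 8-word blocks; deviceFunc is the asker's routine and is only forward-declared here.
#include <cstdint>

#define NUM_OF_XOR_THREADS 8                      // assumed value

__device__ void deviceFunc(uint64_t *state);      // defined elsewhere by the asker

__global__ void myKernel(uint64_t *input, int numOfBlocks, uint64_t *state) {
    int const t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < numOfBlocks; i++) {
        if (t < 8) {
            // advance into the i-th input block instead of re-reading block 0
            state[t] = state[t] ^ input[(NUM_OF_XOR_THREADS * i) + t];
        }
        if (t < 25) {
            deviceFunc(state);
        }
        __syncthreads();                          // keep iterations from overlapping (RAW hazards)
    }
}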