I am looking to change this code to prevent so much branch divergence
if (v == u) {
++c;
++u_t;
++v_t;
}
else if (v < u){
++u_t;
}
else {
++v_t;
}
Here is what I tried:
u_t++;
if(v == u){
++c;
++v_t;
}
else{
--u_t;
++v_t;
}
However, this code is giving me the wrong answer for the whole program. Am I missing something obvious here?
It all comes down to
if (v == u) ++c;
if (v <= u) ++u_t;
if (v >= u) ++v_t;
Can you optimise this further? I'm not sure you can without knowing anything about the rest of the code.
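If the goal is simply to remove the branches, one option (just a sketch, reusing the variable names from the question and relying on the fact that a comparison in C/C++/CUDA evaluates to 0 or 1) is to add the comparison results directly:
c += (v == u);   // 1 when equal, 0 otherwise
u_t += (v <= u);
v_t += (v >= u);
Whether this actually helps depends on what the compiler already does with the three ifs, so it's worth checking the generated code.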
I recently stumbled upon this blog post on the NVIDIA devblogs:
https://devblogs.nvidia.com/parallelforall/accelerating-graph-betweenness-centrality-cuda/
I've implemented the edge-parallel code and it seems to work as intended; however, it seems to me that the code relies on a race condition that is "controlled" with __syncthreads().
This is the code (as shown in the blog):
__shared__ int current_depth;
__shared__ bool done;
if(idx == 0){
done = false;
current_depth = 0;
}
__syncthreads();
// Calculate the number of shortest paths and the
// distance from s (the root) to each vertex
while(!done){
__syncthreads();
done = true;
__syncthreads();
for(int k=idx; k<m; k+=blockDim.x) //For each edge...
{
int v = F[k];
// If the head is in the vertex frontier, look at the tail
if(d[v] == current_depth)
{
int w = C[k];
if(d[w] == INT_MAX){
d[w] = d[v] + 1;
done = false;
}
if(d[w] == (d[v] + 1)){
atomicAdd(&sigma[w],sigma[v]);
}
}
__syncthreads();
current_depth++;
}
}
I think there is a race condition just at the end:
__syncthreads();
current_depth++;
I think the program is relying on the race condition so that the variable gets increased by only one instead of by the number of threads. I don't feel like this is a good idea, but in my tests it seems to be reliable.
Is this really safe? Is there a better way to do it?
Thanks.
As the author of this blog post, I'd like to thank you for pointing out this error!
When I wrote this snippet I didn't use my edge-traversal code verbatim, as that code used explicit queuing to traverse the graph, which makes the example more complicated without adding any pedagogical value. Instead, I must have cargo-culted some old code and posted it incorrectly. It's been quite a while since I've touched this code or algorithm, but I believe the following snippet should work:
__shared__ int current_depth;
__shared__ bool done;
if(idx == 0){
done = false;
current_depth = 0;
}
__syncthreads();
// Calculate the number of shortest paths and the
// distance from s (the root) to each vertex
while(!done)
{
__syncthreads();
done = true;
__syncthreads();
for(int k=idx; k<m; k+=blockDim.x) //For each edge...
{
int v = F[k];
// If the head is in the vertex frontier, look at the tail
if(d[v] == current_depth)
{
int w = C[k];
if(d[w] == INT_MAX){
d[w] = d[v] + 1;
done = false;
}
if(d[w] == (d[v] + 1)){
atomicAdd(&sigma[w],sigma[v]);
}
}
}
__syncthreads(); //All threads reach here, no longer UB
if(idx == 0){ //Only one thread should increment this shared variable
current_depth++;
}
}
Notes:
It looks like a similar issue exists in the node-parallel algorithm in the blog post.
You could also use a register instead of a shared variable for current_depth, in which case every thread would have to increment it (see the sketch below).
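For illustration, here is a minimal, untested sketch of that register variant, assuming the same d, sigma, F, C, m, idx, and blockDim as in the snippet above:
__shared__ bool done;
int current_depth = 0; // per-thread register copy, identical across the block
if(idx == 0){
    done = false;
}
__syncthreads();
while(!done)
{
    __syncthreads();
    done = true;
    __syncthreads();
    for(int k=idx; k<m; k+=blockDim.x) //For each edge...
    {
        int v = F[k];
        if(d[v] == current_depth)
        {
            int w = C[k];
            if(d[w] == INT_MAX){
                d[w] = d[v] + 1;
                done = false;
            }
            if(d[w] == (d[v] + 1)){
                atomicAdd(&sigma[w],sigma[v]);
            }
        }
    }
    __syncthreads();
    current_depth++; // every thread increments its own copy, so the values stay in lockstep
}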
So to answer your question: no, that method is not safe. If I'm not mistaken, the blog snippet has the additional issue that current_depth should only be incremented once all vertices at the previous depth have been handled, i.e. at the conclusion of the for loop.
Finally, if you'd like the final version of my code that has been tested and used by people in the community, you can access it here: https://github.com/Adam27X/hybrid_BC
I have an issue where it appears that a single thread is trailing behind the rest, even though I'm using __syncthreads(). The following extract is taken from a large program, where I've cut out as much as I can yet it still reproduces my problem. What I find is that upon running this code the test4 variable does not return the same value for all threads. My understanding is that, through the TEST_FLAG variable, all threads should enter the if (TEST_FLAG == 2) condition, and therefore every element in the array test4 should return a value of 43. However, what I find is that all elements return 43 except thread 0, which returns 0. It appears as if the threads are not all reaching the same __syncthreads(). I've performed numerous tests and I've found that removing more of the code, such as the for (l=0; l<1; ++l) loop, resolves the issue, but I do not understand why. Any help as to why my threads are not all returning the same value would be greatly appreciated.
import numpy as np
import pycuda.driver as drv
import pycuda.compiler
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
from pycuda.compiler import SourceModule
gpu_code=SourceModule("""
__global__ void test_sync(double *test4, double *test5)
{
__shared__ double rad_loc[2], boundary[2], boundary_limb_edge[2];
__shared__ int TEST_FLAG;
int l;
if (blockIdx.x != 0)
{
return;
}
if(threadIdx.x == 0)
{
TEST_FLAG = 2;
boundary[0] = 1;
}
test4[threadIdx.x] = 0;
test5[threadIdx.x] = 0;
if (threadIdx.x == 0)
{
rad_loc[0] = 0.0;
}
__syncthreads();
for (l=0; l<1; ++l)
{
__syncthreads();
if (rad_loc[0] > 0.0)
{
test5[threadIdx.x] += 1;
if ((int)boundary[0] == -1)
{
__syncthreads();
continue;
}
}
else
{
if (threadIdx.x == 0)
{
boundary_limb_edge[0] = 0.0;
}
}
__syncthreads();
if (TEST_FLAG == 2)
{
test4[threadIdx.x] = 43;
__syncthreads();
TEST_FLAG = 99;
}
__syncthreads();
return;
}
return;
}
""")
test_sync = gpu_code.get_function("test_sync")
DATA_ROWS=[100,100]
blockshape_data_mags = (int(64),1, 1)
gridshape_data_mags = (int(sum(DATA_ROWS)), 1)
test4 = np.zeros([1*blockshape_data_mags[0]], np.float64)
test5 = np.zeros([1*blockshape_data_mags[0]], np.float64)
test_sync(drv.InOut(test4), drv.InOut(test5), block=blockshape_data_mags, grid=gridshape_data_mags)
print(test4)
print(test5)
As Yuuta mentioned, __syncthreads() behavior is undefined when it is used in conditional code unless the condition evaluates identically across the entire thread block. Thus having it there may or may not work as expected. You may want to rewrite your code to avoid putting __syncthreads() inside your if conditions.
You may check this answer and this paper for more information on __syncthreads().
It is also important to note that __syncthreads() is a block-level barrier. You can't synchronize different blocks with it; blocks must be synchronized by separate kernel calls.
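For illustration only, here is a rough sketch of that pattern with made-up kernel names (phase1 and phase2 are hypothetical; kernels launched on the same stream run in order, so the second one only starts after every block of the first has finished):
__global__ void phase1(int *data) { /* ... first pass over data ... */ }
__global__ void phase2(int *data) { /* ... second pass, sees all of phase1's results ... */ }

void run_phases(int *d_data, dim3 grid, dim3 block)
{
    phase1<<<grid, block>>>(d_data); // all blocks of phase1 complete...
    phase2<<<grid, block>>>(d_data); // ...before any block of phase2 starts
    cudaDeviceSynchronize();         // wait for both kernels on the host
}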
Your problem is with the statement TEST_FLAG = 99. For one of the threads it is executed before thread 0 enters the conditional block, which gives you undefined behavior. If I comment out TEST_FLAG = 99, the code runs as expected.
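A minimal sketch of one way to avoid that race (only the flag-handling part of the kernel is shown; everything else stays as in the question's code): move the barrier out of the divergent branch and let a single thread update the shared flag:
if (TEST_FLAG == 2)
{
    test4[threadIdx.x] = 43;
}
__syncthreads();        // every thread reaches this barrier
if (threadIdx.x == 0)
{
    TEST_FLAG = 99;     // only one thread writes the shared flag
}
__syncthreads();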
Here's the code.
bool b_div(int n_dividend)
{
for (int iii = 10 ; iii>0 ; iii--)
{
int n_remainder = n_dividend%iii;
if (n_remainder != 0)
return false;
if (iii = 1)
return true;
}
}
After testing this function I made for a program, the function seems to stop at the if (n_remainder != 0) part. The function SHOULD test whether the number it takes in can be divided by all numbers from 10 down to 1 (it takes in numbers until it returns true). I know the first number this works with is 2520, but even on this number it stops at if (n_remainder != 0). So I was hoping for some advice! I'm having trouble troubleshooting it! Any links or terms I should look for would be awesome! I'm still pretty new to programming, so any help you can give for learning would rock! Thanks!
Change your last if statement to:
if (iii == 1)
return true;
Currently you have only a single equals sign, which assigns 1 to the variable iii and always evaluates as true. By using a double equals sign it will compare iii and 1.
In addition to SC Ghost's answer, you can actually also clean up your function a bit more :)
bool b_div(int n_dividend) {
for (int i = 10 ; i > 1 ; i--) {
int n_remainder = n_dividend % i;
if (n_remainder != 0) {
return false;
}
}
return true;
}
A few notes:
Anything modulo 1 is always zero, so you only need to iterate while i > 1.
You can completely remove the if (i == 1) check and just always return true after the for loop if the loop never returns false. It removes an unnecessary check.
It's more standard to name your iterator i rather than iii, and I prefer braces the way I wrote them above (this is of course completely personal preference; do as you please).
What is the difference between nested and cascaded if-else?
These two are equivalent:
if (condition1) block1
else if (condition2) block2
if (condition1) block1
else
{
if (condition2) block2
}
I presume they also compile to the same assembly, so there should be no difference.
It depends on how you arrange them. A nested if is equivalent to adding an and to each of the inner ifs:
if(A) {
if(B) {
statement1
}
else if(C) {
statement2
}
}
is equivalent to:
if(A and B) {
statement1
}
else if(A and C) {
statement2
}
My advice is to strive for readability and check your logic. You may find DeMorgan's Laws useful for re-arranging your logic.
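For instance, a tiny made-up illustration (isValid and isReady are hypothetical conditions): by DeMorgan's Laws the guard
if (!(isValid && isReady)) return;
is equivalent to
if (!isValid || !isReady) return;
so you can pick whichever form reads more naturally in context.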
Here's one that always irritates me:
if(A and B) {
statement1
statement2
}
else if(A and C) {
statement1
statement3
}
else if(not A) {
statement4
}
vs
if(A) {
statement1
if(B) {
statement2
}
else if(C) {
statement3
}
}
else if(not A) {
statement4
}
I'm just not sure which is more readable. They are logically equivalent. The first is more tabular and easier on the eye but repeats statement1; the second is more nested and a little uglier (to my eye) but does not repeat statements. Ultimately it's a judgment call because it makes no difference to the compiler.
Nested if-then-else control structures are minimized translations of complex logic rules. They are good at avoiding redundant condition checks. Their main drawback is that, in the long run, these structures can grow and make the enclosing methods too big and complex. The first step in breaking up nested if-then-else blocks is normalization. For example:
if (A) {
if (B || C) {
block 1;
} else {
if (!D) {
block 2;
}
}
} else {
block 3;
}
can be normalized to a cascaded if-then-else:
if (A && (B || C)) {
block 1;
return;
}
if (A && !B && !C && !D) {
block 2;
return;
}
if (!A) {
block 3;
}
We have eliminated the else blocks and made further extract-method refactoring easy. All three if blocks can be extracted into separate methods named after the business logic their bodies execute.
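As a rough sketch of that extraction (handleMainCase, handleFallback, and handleDisabled are invented placeholders for whatever the business logic actually does, and A, B, C, D stand for the same conditions as above):
void process() {
    if (A && (B || C)) {
        handleMainCase();   // was block 1
        return;
    }
    if (A && !B && !C && !D) {
        handleFallback();   // was block 2
        return;
    }
    if (!A) {
        handleDisabled();   // was block 3
    }
}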
How can the XOR operation (on two 32 bit ints) be implemented using only basic arithmetic operations? Do you have to do it bitwise after dividing by each power of 2 in turn, or is there a shortcut? I don't care about execution speed so much as about the simplest, shortest code.
Edit:
This is not homework, but a riddle posed on hacker.org. The point is to implement XOR on a stack-based virtual machine with very limited operations (similar to the brainfuck language, and yes: no shift or mod). Using that VM is the difficult part, though of course it is made easier by an algorithm that is short and simple.
While FryGuy's solution is clever, I'll have to go with my original idea (similar to litb's solution), because comparisons are difficult to use as well in that environment.
I would do it the simple way:
uint xor(uint a, uint b)
{
    uint ret = 0;
    uint fact = 0x80000000;
    while (fact > 0)
    {
        // Add this bit to the result if it is set in exactly one of a and b
        if ((a >= fact || b >= fact) && (a < fact || b < fact))
            ret += fact;
        // Strip the current bit from both operands before moving on
        if (a >= fact)
            a -= fact;
        if (b >= fact)
            b -= fact;
        fact /= 2;
    }
    return ret;
}
There might be an easier way, but I don't know of one.
I don't know whether this defeats the point of your question, but you can implement XOR with AND, OR, and NOT, like this:
uint xor(uint a, uint b) {
return (a | b) & ~(a & b);
}
In English, that's "a or b, but not a and b", which maps precisely to the definition of XOR.
Of course, I'm not sticking strictly to your stipulation of using only the arithmetic operators, but at least this a simple, easy-to-understand reimplementation.
I'm sorry, I only know the straightforward one off the top of my head:
uint32_t mod_op(uint32_t a, uint32_t b) {
uint32_t int_div = a / b;
return a - (b * int_div);
}
uint32_t xor_op(uint32_t a, uint32_t b) {
uint32_t n = 1u;
uint32_t result = 0u;
while(a != 0 || b != 0) {
// or just: result += n * mod_op(a - b, 2);
if(mod_op(a, 2) != mod_op(b, 2)) {
result += n;
}
a /= 2;
b /= 2;
n *= 2;
}
return result;
}
The alternative in the comment can be used instead of the if to avoid branching. But then again, the solution isn't exactly fast either, and it makes the code look stranger :)
It's easier if you have AND, because
A OR B = A + B - (A AND B)
A XOR B = A + B - 2(A AND B)
int customxor(int a, int b)
{
return a + b - 2*(a & b);
}
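As a quick sanity check of that identity (a hypothetical test harness, assuming the customxor above is in scope), you can compare it against the built-in operator on a few non-negative values:
#include <cassert>

int main()
{
    assert(customxor(0, 0) == (0 ^ 0));
    assert(customxor(5, 3) == (5 ^ 3));
    assert(customxor(12345, 54321) == (12345 ^ 54321));
    return 0;
}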