Wondering what a continue statement does in a do...while(false) loop, I mocked up a simple test-case (pseudo-code):
count = 0;
do {
    output(count);
    count++;
    if (count < 10)
        continue;
} while (false);
output('out of loop');
The output was, to my surprise:
0
out of loop
A bit confused, I changed the loop from a do...while to a for:
for (count = 0; count == 0; count++) {
    output(count);
    if (count < 10)
        continue;
}
output('out of loop');
While not functionally identical, the purpose is practically the same: make the condition hold only on the first iteration, and on later ones continue (until a certain value is reached, purely to guard against a possible infinite loop). They might not run the same number of times, but functionality isn't the important bit here.
The output was the same as before:
0
out of loop
Now, put into terms of a simple while loop:
count = 0;
while (count == 0) {
    output(count);
    count++;
    if (count < 10)
        continue;
}
output('out of loop');
Once again, same output.
This is a bit confusing, as I've always thought of the continue statement as "jump to the next iteration". So, here I ask: What does a continue statement do in each of these loops? Does it just jump to the condition?
(For what it's worth, I tested the above in JavaScript, but I believe the behaviour is language-agnostic... JS had to get at least that right.)
In a for loop, continue runs the 3rd expression of the for statement (usually some kind of increment), then the condition (2nd expression), and then the loop body if the condition is true. It does not run the rest of the current iteration of the loop.
In a while (or do-while) loop, it just re-evaluates the condition and then runs the loop body if the condition holds. It likewise does not run the rest of the current iteration.
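To make those jump targets concrete, here is a small, self-contained C++ sketch of the same behaviour the question observed in JavaScript (the loop bounds are arbitrary and purely illustrative):
#include <iostream>

int main() {
    // for loop: continue skips the rest of the body, then runs the
    // increment (i++) and re-checks the condition (i < 5).
    for (int i = 0; i < 5; i++) {
        if (i % 2 == 0)
            continue;                  // i++ runs next, then i < 5 is tested
        std::cout << i << '\n';        // only odd values reach this line
    }

    // do-while: continue jumps straight to the trailing while(...) test,
    // which is why do { ... continue; ... } while (false) exits immediately.
    int n = 0;
    do {
        n++;
        if (n < 10)
            continue;                  // jumps to the while (false) check below
        std::cout << "never printed\n";
    } while (false);
    std::cout << "n = " << n << '\n';  // prints n = 1: exactly one iteration ran
    return 0;
}
This prints 1, 3 and then n = 1.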
Your definition of the continue statement as "jump to the next iteration" is correct. It forces the program to start the next iteration by first re-evaluating the conditional expression.
The problem with your snippets is that they all exit after one iteration, because your conditional expressions are either false or count == 0, which evaluates to false after the first iteration.
Moreover, putting a continue statement at the very end of the loop body is pointless: the conditional expression will be re-evaluated either way.
It's best to think of continue as jumping to the end of the enclosing loop's body. This may help:
#include <iostream>
using namespace std;

int main() {
    int n = 0;
    do {
        cout << n << endl;
        n += 1;
        if ( n == 3 ) {
            continue;
        }
        cout << "n was not 3" << endl;
    } while( n != 3 );
}
which prints:
0
n was not 3
1
n was not 3
2
and terminates, because the continue jumps to the while() check at the end of the loop. Similar stuff happens for for() and while() loops.
continue skips to the next iteration when it is used in a loop. break exits the enclosing loop (or switch). Typically break is used to leave a loop early; in languages with labelled breaks it can exit other labelled blocks as well.
for (int i = 0; i < 1000; i++) {
    if (some_condition) {
        continue; // would skip to the next iteration
    }
    if (some_other_condition) {
        break; // exits the loop (block)
    }
    // other work
}
I'm currently learning about recursion; it's pretty hard to understand. I found a very common example of it:
function factorial(N)
    local Value
    if N == 0 then
        Value = 1
    else
        Value = N * factorial(N - 1)
    end
    return Value
end

print(factorial(3))
N == 0 is the base case, but when I changed it to N == 1, the result still remains the same (it still prints 6).
Is using the base case important? (will it break or something?)
What's the difference between using N == 0 (base case) and N == 1?
That's just a coincidence: with N == 1 as the base case you return 1 directly, which is the same value that 1 * factorial(0) would have produced, so it ends up working either way.
But consider the edge case where N = 0: if you check for N == 1, then you'd go into the else branch and calculate 0 * factorial(-1), which leads to endless recursion.
The same would happen in both versions if you just called factorial(-1) directly, which is why you should either make the base case check for N <= 0 (effectively treating every negative value as 0 and returning 1), or add another if condition and raise an error when N is negative.
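Purely for illustration (sketched in C++ here, since the pitfall itself is language-independent), the second option could look roughly like this, rejecting negative input instead of recursing forever:
#include <stdexcept>

// Hypothetical guarded factorial: throws on negative input;
// the base case covers both 0 and 1.
long long factorial(int n) {
    if (n < 0)
        throw std::invalid_argument("factorial of a negative number");
    if (n <= 1)
        return 1;
    return n * factorial(n - 1);
}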
EDIT: As pointed out in another answer, your implementation is not tail-recursive, meaning it keeps a stack frame for every recursive function call until it finishes or runs out of memory.
You can make the function tail-recursive, which allows Lua to treat it pretty much like a normal loop that could run as long as it takes to calculate its result:
local function factorial(n, acc)
    acc = acc or 1
    if n <= 0 then
        return acc
    else
        return factorial(n-1, acc*n)
    end
end

print(factorial(3))
Note though, that in the case of factorial it would take you far longer to run out of stack memory than to overflow Lua's number data type at around 21!, so making it tail-recursive is really just a matter of training yourself to write better code.
As the above answer and comments have pointed out, it is essential to have a correct base case in a recursive function; otherwise, one ends up with endless recursion.
Also, in the case of your factorial function, it is probably more efficient to use a helper function to perform the recursion, so as to take advantage of Lua's tail-call optimization. Since Lua conveniently allows local functions, you can define the helper within the scope of your factorial function.
Note that this example is not meant to handle the factorials of negative numbers.
-- Requires: n is an integer greater than or equal to 0.
-- Effects : returns the factorial of n.
function fact(n)
    -- Local function that will actually perform the recursion.
    local function fact_helper(n, i)
        -- This is the base case.
        if (i == 1) then
            return n
        end
        -- Take advantage of tail calls.
        return fact_helper(n * i, i - 1)
    end

    -- Check for edge cases, such as fact(0) and fact(1).
    if ((n == 0) or (n == 1)) then
        return 1
    end

    return fact_helper(n, n - 1)
end
I've recently stumbled upon this blogpost in the NVIDIA devblogs:
https://devblogs.nvidia.com/parallelforall/accelerating-graph-betweenness-centrality-cuda/
I've implemented the edge-parallel code and it seems to work as intended; however, it looks to me like the code relies on a race condition that is "controlled" with __syncthreads.
This is the code (as shown in the blog):
__shared__ int current_depth;
__shared__ bool done;

if(idx == 0){
    done = false;
    current_depth = 0;
}
__syncthreads();

// Calculate the number of shortest paths and the
// distance from s (the root) to each vertex
while(!done){
    __syncthreads();
    done = true;
    __syncthreads();

    for(int k=idx; k<m; k+=blockDim.x) //For each edge...
    {
        int v = F[k];
        // If the head is in the vertex frontier, look at the tail
        if(d[v] == current_depth)
        {
            int w = C[k];
            if(d[w] == INT_MAX){
                d[w] = d[v] + 1;
                done = false;
            }
            if(d[w] == (d[v] + 1)){
                atomicAdd(&sigma[w],sigma[v]);
            }
        }
        __syncthreads();
        current_depth++;
    }
}
I think there is a race condition just at the end:
__syncthreads();
current_depth++;
I think the program is relying on the race condition so the variable gets increased only by one, instead of by the number of threads. I don't feel like this is a good idea, but in my tests it seems to be reliable.
Is this really safe? Is there a better way to do it?
Thanks.
As the author of this blog post, I'd like to thank you for pointing out this error!
When I wrote this snippet I didn't use my verbatim edge-traversal code as that used explicit queuing to traverse the graph which makes the example more complicated without adding any pedagogical value. Instead I must have cargo-culted some old code and posted it incorrectly. It's been quite a while since I've touched this code or algorithm, but I believe the following snippet should work:
__shared__ int current_depth;
__shared__ bool done;

if(idx == 0){
    done = false;
    current_depth = 0;
}
__syncthreads();

// Calculate the number of shortest paths and the
// distance from s (the root) to each vertex
while(!done)
{
    __syncthreads();
    done = true;
    __syncthreads();

    for(int k=idx; k<m; k+=blockDim.x) //For each edge...
    {
        int v = F[k];
        // If the head is in the vertex frontier, look at the tail
        if(d[v] == current_depth)
        {
            int w = C[k];
            if(d[w] == INT_MAX){
                d[w] = d[v] + 1;
                done = false;
            }
            if(d[w] == (d[v] + 1)){
                atomicAdd(&sigma[w],sigma[v]);
            }
        }
    }
    __syncthreads(); // All threads reach here, no longer UB
    if(idx == 0){    // Only one thread should increment this shared variable
        current_depth++;
    }
}
Notes:
A similar issue looks to exist in the node-parallel algorithm in the blog post.
You could also use a register instead of a shared variable for current_depth, in which case every thread would have to increment it (sketched below).
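For illustration only, here is a minimal sketch of that register-based variant as a standalone kernel. The kernel name and the assumption that idx is threadIdx.x within a single block are mine; F, C, d, sigma and m are as in the snippets above, and this is not the author's verbatim code:
#include <climits>

__global__ void bc_depth_sketch(const int *F, const int *C, int *d,
                                float *sigma, int m)
{
    int idx = threadIdx.x;              // assumes a single block, as above
    __shared__ bool done;
    if(idx == 0){
        done = false;
    }
    __syncthreads();

    int current_depth = 0;              // per-thread register copy
    while(!done)
    {
        __syncthreads();
        done = true;
        __syncthreads();

        for(int k=idx; k<m; k+=blockDim.x) //For each edge...
        {
            int v = F[k];
            if(d[v] == current_depth)
            {
                int w = C[k];
                if(d[w] == INT_MAX){
                    d[w] = d[v] + 1;
                    done = false;
                }
                if(d[w] == (d[v] + 1)){
                    atomicAdd(&sigma[w],sigma[v]);
                }
            }
        }
        __syncthreads();
        current_depth++;                // no race: each thread increments its own copy
    }
}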
So to answer your question: no, that method is not safe. If I'm not mistaken, the blog snippet has the additional issue that current_depth should only be incremented once all vertices at the previous depth have been handled, which is at the conclusion of the for loop.
Finally, if you'd like the final version of my code that has been tested and used by people in the community, you can access it here: https://github.com/Adam27X/hybrid_BC
In CUDA device code, the following if-else statement will cause divergence among the threads of a warp, resulting in two passes by the SIMD hardware. Assume Vs is a location in shared memory.
if (threadIdx.x % 2) {
    Vs[threadIdx.x] = 0;
} else {
    Vs[threadIdx.x] = 1;
}
I believe there will also be two passes when we have an if statement, with no else branch. Why is this the case?
if (threadIdx.x % 2) {
    Vs[threadIdx.x] = 0;
}
Would the following if statement be completed in 3 passes?
if (threadIdx.x < 10) {
    Vs[threadIdx.x] = 0;
} else if (threadIdx.x < 20) {
    Vs[threadIdx.x] = 1;
} else {
    Vs[threadIdx.x] = 2;
}
On a GPU it could very well be the case that there is only one pass for an if-else statement: a single predicated pass. The condition simply turns on the "do nothing" bit for half the threads during the "then" block, and turns it on for the other half during the "else" block.
As @njuffa points out, however, this is dependent upon parameters such as the target architecture.
For more details, see:
Branch predication on GPU
For your first example (the if/else), a compiler might not even need a predicated pass, since it can be rewritten as
Vs[threadIdx.x] = (threadIdx.x % 2 ? 0 : 1);
and that's perfectly uniform across your warp. For your last example it really depends, but again it could theoretically be optimized by the compiler into a single unpredicated pass, and it also might be the case that you get a single predicated path, with different predication within each of the three scopes.
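For instance, that last example could in principle be written branch-free as nested selects; whether the compiler or hardware actually does something equivalent is, as noted, architecture- and compiler-dependent:
Vs[threadIdx.x] = (threadIdx.x < 10) ? 0
                : (threadIdx.x < 20) ? 1
                : 2;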
Here's the code.
bool b_div(int n_dividend)
{
    for (int iii = 10 ; iii>0 ; iii--)
    {
        int n_remainder = n_dividend%iii;
        if (n_remainder != 0)
            return false;
        if (iii = 1)
            return true;
    }
}
After testing this function I made for a program, it seems to stop at the if (n_remainder != 0) part. The function SHOULD test whether the number it takes in can be divided by all numbers from 10 to 1 (it keeps taking in numbers until it returns true). I know the first number this works for is 2520, but even on this number it stops at if (n_remainder != 0). So I was hoping for some advice; I'm having trouble troubleshooting it! Any links or keywords I should look up would be awesome. I'm still pretty new to programming, so any help you can give for learning would rock. Thanks!
Change your last if statement to:
if (iii == 1)
return true;
Currently you have only a single equals sign, which assigns 1 to the variable iii and therefore always evaluates as true. With a double equals sign it will compare iii and 1 instead.
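To see the difference in action, here is a tiny self-contained example (GCC and Clang will also typically warn about an assignment used as a condition when warnings such as -Wall are enabled):
#include <iostream>

int main() {
    int iii = 5;
    if (iii = 1)        // assignment: iii becomes 1, and the condition is that value (true)
        std::cout << "always taken; iii is now " << iii << '\n';
    if (iii == 1)       // comparison: true only when iii equals 1
        std::cout << "comparison is true here\n";
    return 0;
}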
In addition to SC Ghost's answer, you can actually also clean up your function a bit more :)
bool b_div(int n_dividend) {
    for (int i = 10 ; i > 1 ; i--) {
        int n_remainder = n_dividend % i;
        if (n_remainder != 0) {
            return false;
        }
    }
    return true;
}
A few notes:
anything modulo 1 is always zero, so you only need to iterate while i > 1
you can completely remove the if (i == 1) check and just return true after the for loop if the loop never returns false. It removes an unnecessary check per iteration.
it's more conventional to name your iterator i rather than iii, and I prefer brackets the way I wrote them above (this is of course completely personal preference, do as you please)
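As a quick sanity check of the cleaned-up version, here is a small usage sketch (repeating the helper for completeness) that searches for the smallest positive number divisible by everything from 2 to 10, which, as you noted, is 2520:
#include <iostream>

bool b_div(int n_dividend) {
    for (int i = 10; i > 1; i--) {
        if (n_dividend % i != 0) {
            return false;
        }
    }
    return true;
}

int main() {
    int n = 1;
    while (!b_div(n)) {      // keep feeding it numbers until it returns true
        ++n;
    }
    std::cout << n << '\n';  // prints 2520
    return 0;
}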
I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular double loop (for every point in Table1 { for every point in Table2 { ... } }) and then do the calculation for every pair of points in parallel.
The Euclidean distance is then split into the following steps:
1. take the difference between each dimension of the two points
2. square that difference (still for every dimension)
3. sum all the values obtained in step 2
4. take the square root of the value obtained in step 3 (this step has been omitted in this example)
Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3 of the procedure described above, i.e. the C[loop] accumulation near the end of the code below).
As test data I'm using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,...,127,128; (p1)
129,130,131,...,255,256; (p2)
DescriptorList2: 000,001,002,...,126,127; (p1)
128,129,130,...,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
Right now I'm getting random numbers that vary with every run.
I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.
#define BLOCK_SIZE 128

typedef struct
{
    //How large each point is
    int length;
    //How many points in every list
    int num_elements;
    //Pointer to the elements of the descriptor (stored as a raw array)
    __global float *elements;
} DescriptorList;

__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    //temporary array to store the difference between each dimension of 2 points
    float dif_acum[BLOCK_SIZE];
    //counter to track the iterations of the inner loop
    int loop = 0;

    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        //take the i-th descriptor. Returns a DescriptorList with just the i-th
        //descriptor in DescriptorList A
        DescriptorList tmpA = GetDescriptor(A, i);
        //copy the current descriptor to local memory.
        //returns one element of the only descriptor in DescriptorList tmpA
        //and index featA
        As[featA] = GetElement(tmpA, 0, featA);
        //wait for all the threads to finish copying before continuing
        barrier(CLK_LOCAL_MEM_FENCE);

        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current points
            dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
            //wait again
            barrier(CLK_LOCAL_MEM_FENCE);

            //square value of the difference in dif_acum and store in C
            //which is where the results should be stored at the end.
            C[loop] = 0;
            C[loop] += dif_acum[featA]*dif_acum[featA];
            loop += 1;
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}
Your problem lies in these lines of code:
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
All threads in your workgroup (well, actually all your threads, but let's come to that later) are trying to modify this array position concurrently, without any synchronization whatsoever. Several factors make this really problematic:
The workgroup is not guaranteed to work completely in parallel, meaning that for some threads C[loop] = 0 can be executed after other threads have already executed the next line.
Those that do execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that write-back is (I think one of the threads succeeds in writing back, while the others fail, but I'm not completely sure), but it's wrong either way.
Now let's fix this:
While we might be able to get this to work on global memory using atomics, it won't be fast, so let's accumulate in local memory instead:
local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCK_SIZE; i *= 2)
{
    if ((featA % (2*i)) == 0)
        accum[featA] += accum[featA + i];
    barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
    C[loop] = accum[0];
Of course you can reuse other local buffers for this, but I think the point is clear (btw: are you sure that dif_acum will be created in local memory? I think I read somewhere that it wouldn't be put in local memory, which would make preloading A into local memory kind of pointless).
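If the index arithmetic in that reduction looks opaque, here is the same halving pattern written as a plain sequential C++ loop, purely for illustration (it assumes the element count is a power of two, just as the kernel assumes for BLOCK_SIZE):
#include <cstdio>

int main() {
    const unsigned int N = 8;           // stand-in for BLOCK_SIZE
    float accum[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (unsigned int i = 1; i < N; i *= 2) {
        // In the kernel, every work-item with (featA % (2*i)) == 0 performs one of
        // these additions in parallel, separated by a barrier; here they simply
        // run one after another.
        for (unsigned int featA = 0; featA < N; featA += 2 * i) {
            accum[featA] += accum[featA + i];
        }
    }
    std::printf("%g\n", accum[0]);      // prints 36, the sum of 1..8
    return 0;
}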
Some other points about this code:
Your code seems to be geared towards using only one workgroup (you aren't using either the group id or the global id to decide which items to work on); for optimal performance you might want to use more than that.
It might be personal preference, but to me it seems better to use get_local_size(0) for the workgroup size than to use a #define (since you might change it in the host code without realizing you also had to change your OpenCL code).
The barriers in your code are all unnecessary, since no thread accesses an element in local memory which is written by another thread. Therefore you don't need to use local memory for this.
Considering the last bullet you could simply do:
float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
This would make the code (not considering the first two bullets):
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    int loop = 0;

    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        DescriptorList tmpA = GetDescriptor(A, i);
        float As = GetElement(tmpA, 0, featA);

        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
            accum[featA] = dif_acum*dif_acum;
            barrier(CLK_LOCAL_MEM_FENCE);

            for(unsigned int j = 1; j < BLOCK_SIZE; j *= 2)
            {
                if ((featA % (2*j)) == 0)
                    accum[featA] += accum[featA + j];
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            if(featA == 0)
                C[loop] = accum[0];
            barrier(CLK_LOCAL_MEM_FENCE);
            loop += 1;
        }
    }
}
Thanks to Grizzly, I now have a working kernel. Some things I needed to modify, based on Grizzly's answer:
I added an IF statement at the beginning of the routine to discard all threads that won't reference any valid position in the arrays I'm using.
if(featA > BLOCK_SIZE){return;}
When copying the first descriptor to local (shared) memory (i.e. to Bs), the index has to be specified, since the function GetElement returns just one element per call (I skipped that in my question).
Bs[featA] = GetElement(tmpA, 0, featA);
Then, the SCAN loop needed a little tweaking because the buffer is overwritten after each iteration and one cannot control which thread accesses the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and, that way, prevent inconsistencies throughout that loop.
dif_acum[featA] = accum[featA];
There is also some boundary control in the SCAN loop to reliably determine the terms to be added together.
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
Last, I thought it made sense to include the loop variable increment within the last IF statement so that only one thread modifies it.
if(featA == 0){
    C[loop] = accum[BLOCK_SIZE-1];
    loop += 1;
}
That's it. I still wonder how I can make use of group_size to eliminate that BLOCK_SIZE definition, and whether there are better policies I can adopt regarding thread usage.
So the code finally looks like this:
__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{
    int gpidA = get_global_id(0);
    int featA = get_local_id(0);
    //global counter to store final differences
    int loop = 0;
    //auxiliary buffer to store temporary data
    local float dif_acum[BLOCK_SIZE];

    //discard the threads that are not going to be used.
    if(featA > BLOCK_SIZE){
        return;
    }

    //loop over all descriptors in A
    for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
        //take the gpidA-th descriptor
        DescriptorList tmpA = GetDescriptor(A, i);
        //copy the current descriptor to local memory
        Bs[featA] = GetElement(tmpA, 0, featA);

        //loop over all the descriptors in B
        for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
            //take the difference of both current descriptors
            dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];
            //square the values in dif_acum
            accum[featA] = dif_acum[featA]*dif_acum[featA];
            barrier(CLK_LOCAL_MEM_FENCE);

            //copy the values of accum to keep consistency once the scan procedure starts.
            //Mostly important for the first element. Two buffers are necessary because the
            //scan procedure would otherwise overwrite values that are read later on.
            dif_acum[featA] = accum[featA];

            //Compute the accumulated sum (a.k.a. scan)
            for(int j = 1; j < BLOCK_SIZE; j *= 2){
                int next_addend = featA-(j/2);
                if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
                    dif_acum[featA] = accum[featA] + accum[next_addend];
                }
                barrier(CLK_LOCAL_MEM_FENCE);

                //copy dif_acum back to accum
                accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA);
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            //tell one of the threads to write the result of the scan into the results array.
            if(featA == 0){
                C[loop] = accum[BLOCK_SIZE-1];
                loop += 1;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
}