OpenMP parallelize for loop inside a function

I am trying to parallelize this for loop inside a function using OpenMP, but when I compile the code I still have an error =(
Error 1 error C3010: 'return' : jump out of OpenMP structured block not allowed.
I am using Visual studio 2010 C++ compiler. Can anyone help me? I appreciate any advice.
int match(char* pattern, int patternSize, char* string, int startFrom, unsigned int &comparisons) {
comparisons = 0;
#pragma omp for
for (int i = 0; i < patternSize; i++){
comparisons++;
if (pattern[i] != string[i + startFrom])
return 0;
}
return 1;
}

As @Hristo has already mentioned, you are not allowed to branch out of a parallel region in OpenMP. Among other reasons, this is not allowed because the compiler cannot know a priori how many iterations each thread should work on when it splits up a for loop like the one that you have written among the different threads.
Furthermore, even if you could branch out of your loop, you should be able to see that comparisons would be computed incorrectly. As is, you have an inherently serial algorithm that breaks at the first different character. How could you split up this work such that throwing more threads at this algorithm possibly makes it faster?
Finally, note that there is very little work being done in this loop anyway. You would be very unlikely to see any benefit from OpenMP even if you could rewrite this algorithm into a parallel algorithm. My suggestion: drop OpenMP from this loop and look to implement it somewhere else (either at a higher level - maybe you call this method on different strings? - or in a section of your code that does more work).
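To illustrate the "higher level" suggestion, here is a minimal sketch that assumes the caller wants to test the pattern at every start position of a text. The helper name countMatches is hypothetical, and the comparisons counter from the question is left out to keep the reduction simple. The inner match stays serial, so its early return is legal; only the outer loop over start positions is shared among threads.

```cpp
#include <cassert>
#include <cstring>

// Serial comparison: no OpenMP construct here, so returning early is fine.
static int match(const char* pattern, int patternSize, const char* text, int startFrom) {
    for (int i = 0; i < patternSize; i++) {
        if (pattern[i] != text[i + startFrom])
            return 0;
    }
    return 1;
}

// Hypothetical caller: counts how many start positions in text match pattern.
int countMatches(const char* pattern, const char* text) {
    const int patternSize = static_cast<int>(std::strlen(pattern));
    const int textSize = static_cast<int>(std::strlen(text));
    assert(patternSize > 0);  // assumption: empty patterns are not meaningful here
    const int positions = textSize - patternSize + 1;
    int count = 0;
    // Each iteration is independent; the reduction combines per-thread counts.
    #pragma omp parallel for reduction(+:count)
    for (int start = 0; start < positions; start++) {
        count += match(pattern, patternSize, text, start);
    }
    return count;
}
```

Built with OpenMP enabled (/openmp in Visual C++, -fopenmp in GCC), the outer loop runs in parallel; without the flag the pragma is ignored and the code runs serially with the same result. Whether this is faster than the serial version depends on how long the text is.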

Related

What is faster on GPU? Typecasting bool to int or using a branch statement?

I am trying to use CUDA in order to parallelize the simulated annealing algorithm. The GPU I am using is NVIDIA GTX660. I am trying to speed the program up and in order to do so I am considering to replace this
int r= rand();
if (condition)
{
r += 1;
}
with
int r = rand() + (condition)*1;
I understand that jump/branch instructions(like if-then-else commands) are the slowest to execute but unless my understanding is incorrect typecasting involves memory access then copying the number in new location as an int before accessing it. Could the result of 'condition' be stored in a register and fed in ALU without modification? if so wouldn't that be a faster way to calculate the value of variable r? The above runs on every thread.
Generally, you'd try very hard to avoid branching on GPUs, since that's classically the point where the GPU needs to halt all threads that don't take that branch, execute those that do, then halt these and execute the other branch.
That being said, the branching doesn't happen because you write if; it happens because you use a comparison such as <, which sets a register based on what you're comparing. How this plays out depends heavily on your actual condition and on the language and architecture you're on. My knowledge is from first-generation CUDA and might not fully apply anymore.
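As a small host-side C++ sketch (the function names are made up for illustration), the two forms from the question compute the same value. Converting a bool to int is a value conversion that yields 0 or 1; the language does not require a memory round trip for it. Which form is faster on a given GPU is not something this sketch can show; that would need a look at the generated code.

```cpp
// Form 1: explicit branch, as in the question.
int addWithBranch(int r, bool condition) {
    if (condition) {
        r += 1;
    }
    return r;
}

// Form 2: branch-free in the source; the bool converts to 0 or 1.
int addWithCast(int r, bool condition) {
    return r + static_cast<int>(condition);
}
```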

CUDA Reduction - atomic vs single thread summation

I've recently tested the algorithm for reduction using CUDA (the one you can find for example at http://www.cuvilib.com/Reduction.pdf, page 16). But at the end of it, I ran into trouble when not using atomic operations. So basically I compute the sum of each block and store it in a shared array. Then I write it back to the global array x (tdx is threadIdx.x, and i is the global index).
if(i==0){
*sum = 0.; // Initialize to 0
}
__syncthreads();
if (tdx == 0){
x[blockIdx.x] = s_x[tdx]; //get the shared sums in global memory
}
__syncthreads();
Then I want to sum the first x elements (as many as I have blocks).
With the atomic add it works fine (same result as the CPU); however, when I use the commented line below it does not work and often yields "nan":
if(i == 0){
for(int k = 0; k < gridDim.x; k++){
atomicAdd(sum, x[k]); //Works good
//sum[0] += x[k]; //or *sum += x[k]; //Does not work, often results in nan
}
}
In fact, I now use atomicAdd directly to sum the shared sums, but I would like to understand why this does not work. An atomic add seems pointless when the operation is restricted to a single thread, and the simple sum should work fine!
__syncthreads() only synchronizes threads in the same block, not across different blocks and CUDA has no safe synchronization mechanism across blocks.
The incorrect result is due to a synchronization problem. The operands x[k] are the outcomes of the computations from different blocks: x[0] is the result from block 0, x[1] is the result from block 1, etc. Thread 0 could start adding them up before some blocks have really finished their computations.
You should put the second code snippet in a different kernel, so that synchronization is enforced, and the line sum[0] += x[k]; can now work.
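The two-kernel structure can be modelled on the CPU as two functions called one after the other. This is only a sketch in plain C++, not CUDA code: blockSums stands in for the first kernel and finalSum for the second, and both names are hypothetical. The point is that the second pass starts only after every partial sum has been written, which is what a kernel boundary gives you on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1 (stands in for the first kernel): one partial sum per block.
std::vector<double> blockSums(const std::vector<double>& data, std::size_t blockSize) {
    assert(blockSize > 0);  // assumption: a block holds at least one element
    std::vector<double> partial((data.size() + blockSize - 1) / blockSize, 0.0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        partial[i / blockSize] += data[i];
    }
    return partial;
}

// Pass 2 (stands in for the second kernel): runs only after pass 1 has
// returned, so every partial sum is complete before it is read.
double finalSum(const std::vector<double>& partial) {
    double sum = 0.0;
    for (double p : partial) {
        sum += p;  // a plain add is safe here, no atomics needed
    }
    return sum;
}
```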
As has been pointed out, your problem is due to missing synchronisation after the first pass since you cannot synchronise between blocks. There is a good walkthrough on reduction in the sample codes provided with the toolkit.
Having said that, I would strongly recommend that people don't write reduction kernels (or other primitives such as scan) where such primitives exist in library code. Much better to invest your effort elsewhere and reuse existing optimised code where it is available. This doesn't apply if you're doing this to learn of course!
I recommend you take a look at Thrust and CUB.

how can a __global__ function RETURN a value or BREAK out like C/C++ does

Recently I've been doing string comparison jobs on CUDA, and I wonder how a __global__ function can return a value when it finds the exact string that I'm looking for.
I mean, I need the __global__ function, which runs a great number of threads, to search for a certain string in a big string pool simultaneously, and I hope that once the exact string is found, the __global__ function can stop all the threads, return to the main function, and tell me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have immediate exit of the kernel as soon as a result is found, it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global__ void kernel(volatile bool *found, ...)
{
while (!(*found) && workLeftToDo()) {
bool iFoundIt = do_some_work(...); // see notes below
if (iFoundIt) *found = true;
}
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found—which must be a device pointer—to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
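The exit-latency trade-off in these notes can be seen in a host-side C++ analogue of the kernel loop. This is a sketch under assumptions, not CUDA: std::atomic<bool> plays the role of the volatile flag, the function name worker and its parameters are hypothetical, and one chunk of elements stands in for do_some_work. The worker checks the flag only at the top of the loop, so it always finishes the chunk it has started.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Searches pool[first, last) for target in chunks of `chunk` elements.
// Returns how many elements this worker examined before it stopped.
std::size_t worker(const std::vector<int>& pool, int target,
                   std::size_t first, std::size_t last,
                   std::size_t chunk, std::atomic<bool>& found) {
    assert(chunk > 0);  // assumption: a zero-sized chunk would never make progress
    std::size_t examined = 0;
    std::size_t next = first;
    // Poll the shared flag, then check whether any work is left.
    while (!found.load() && next < last) {
        const std::size_t end = (next + chunk < last) ? next + chunk : last;
        for (std::size_t i = next; i < end; ++i) {  // one unit of "do_some_work"
            ++examined;
            if (pool[i] == target) {
                found.store(true);
            }
        }
        next = end;
    }
    return examined;
}
```

A larger chunk means fewer flag checks but a longer delay before a worker notices that someone else has found the result; a smaller chunk reverses that.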
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global__ void kernel(volatile bool *found, ...)
{
volatile __shared__ bool someoneFoundIt;
// initialize shared status
if (threadIdx.x == 0) someoneFoundIt = *found;
__syncthreads();
while(!someoneFoundIt && workLeftToDo()) {
bool iFoundIt = do_some_work(...);
// if I found it, tell everyone they can exit
if (iFoundIt) { someoneFoundIt = true; *found = true; }
// if someone in another block found it, tell
// everyone in my block they can exit
if (threadIdx.x == 0 && *found) someoneFoundIt = true;
__syncthreads();
}
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly distinguishes a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The global function doesn't really contain a great number of threads like you think it does. It is simply a kernel, a function that runs on the device, that is called by passing parameters that specify the thread model. The model that CUDA employs is a 2D grid model and then a 3D thread model inside each block of the grid.
With the type of problem you have, it is not really necessary to use anything besides a 1D grid with 1D threads in each block, because it doesn't really make sense to split the string pool into 2D like other problems (e.g. matrix multiplication).
I'll walk through a simple example of say 100 strings in the string pool and you want them all to be checked in a parallelized fashion instead of sequentially.
//main
//Should cudaMalloc and cudaMemcpy the data to the device before this code
dim3 dimGrid(10, 1); // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); //1D Blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(strings, stringToMatch, answerIdx); // device pointers, named after the kernel parameters below
//cudaMemcpy answerIdx back to an integer on the host
//kernel (not positive on these types, as my CUDA is very rusty)
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
int idx = blockIdx.x * 10 + threadIdx.x;
//Obviously use whatever function you've been using for string comparison
//I'm just using == for example's sake
if(strings[idx] == stringToMatch)
{
*answerIdx = idx;
}
}
This is obviously not the most efficient version and is most likely not the exact way to pass parameters and work with memory in CUDA, but I hope it gets across the point of splitting the workload, and that the 'global' functions get executed on many different cores so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speedup you will get by just dividing the workload onto the device (in a sensible fashion of course) will already give you tremendous performance improvements. To get a sense of the thread model I highly recommend reading the CUDA documentation on Nvidia's site. It will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.

How can a compiler apply function elimination to impure functions?

Often times when writing code, I find myself using a value from a particular function call multiple times. I realized that an obvious optimization would be to capture these repeatedly used values in variables.
This (pseudo code):
function add1(x){ x + 1; }
...
do_something(add1(1));
do_something_else(add1(1));
Becomes:
function add1(x){ x + 1; }
...
bar = add1(1);
do_something(bar);
do_something_else(bar);
However, doing this explicitly makes code less readable in my experience. I assumed that compilers could not do this kind of optimization if our language of choice allows functions to have side-effects.
Recently I looked into this, and if I understand correctly, this optimization is/can be done for languages where functions must be pure. That does not surprise me, but supposedly this can also be done for impure functions. With a few quick Google searches I found these snippets:
GCC 4.7 Fortran improvement
When performing front-end-optimization, the -faggressive-function-elimination option allows the removal of duplicate function calls even for impure functions.
Compiler Optimization (Wikipedia)
For example, in some languages functions are not permitted to have side effects. Therefore, if a program makes several calls to the same function with the same arguments, the compiler can immediately infer that the function's result need be computed only once. In languages where functions are allowed to have side effects, another strategy is possible. The optimizer can determine which function has no side effects, and restrict such optimizations to side effect free functions. This optimization is only possible when the optimizer has access to the called function.
From my understanding, this means that an optimizer can determine when a function is or is not pure, and perform this optimization when the function is. I say this because if a function always produces the same output when given the same input, and is side effect free, it would fulfill both conditions to be considered pure.
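To make the distinction concrete, here is a small C++ sketch with made-up functions. square is pure, so computing it once and reusing the value changes nothing. nextId reads and writes a global, so collapsing two calls into one changes both the result and the final value of the global; a compiler may only do that if it can prove the difference is not observable.

```cpp
int callCount = 0;  // hypothetical global state that makes nextId impure

int square(int a) { return a * a; }                      // pure: same input, same output
int nextId(int a) { ++callCount; return a + callCount; } // impure: depends on and changes callCount

int twicePure()         { return square(3) + square(3); }
int twicePureCached()   { const int s = square(3); return s + s; }  // safe rewrite

int twiceImpure()       { return nextId(3) + nextId(3); }
int twiceImpureCached() { const int n = nextId(3); return n + n; }  // NOT equivalent
```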
These two snippets raise two questions for me.
How can a compiler be able to safely make this optimization if a function is not pure? (as in -faggressive-function-elimination)
How can a compiler determine whether a function is pure or not? (as in the strategy suggested in the Wikipedia article)
and finally:
Can this kind of optimization be applied to any language, or only when certain conditions are met?
Is this optimization a worthwhile one even for extremely simple functions?
How much overhead does storing and retrieving a value from the stack incur?
I apologize if these are stupid or illogical questions. They are just some things I have been curious about lately. :)
Disclaimer: I'm not a compiler/optimizer guy, I only have a tendency to peek at the generated code, and like to read about that stuff, so this is not authoritative. A quick search didn't turn up much on -faggressive-function-elimination, so it might do some extra magic not explained here.
An optimizer can
attempt to inline the function call (with link time code generation, this works even across compilation units)
perform common subexpression elimination, and, possibly, side effect reordering.
Modifying your example a bit, and doing it in C++:
extern volatile int RW_A = 0; // see note below
int foo(int a) { return a * a; }
void bar(int x) { RW_A = x; }
int _tmain(int argc, _TCHAR* argv[])
{
bar(foo(2));
bar(foo(2));
}
Resolves to (pseudocode)
<register> = 4;
RW_A = register;
RW_A = register;
(Note: reading from and writing to a volatile variable is an "observable side effect", that the optimizer must preserve in the same order given by the code.)
Modifying the example for foo to have a side effect:
extern volatile int RW_A = 0;
extern volatile int RW_B = 0;
int accu = 1;
int foo(int a) { accu *= 2; return a * a; }
void bar(int x) { RW_A = x; }
int _tmain(int argc, _TCHAR* argv[])
{
bar(foo(2));
bar(foo(2));
RW_B = accu;
return 0;
}
generates the following pseudocode:
registerA = accu;
registerA += registerA;
accu = registerA;
registerA += registerA;
registerC = 4;
accu = registerA;
RW_A = registerC;
RW_A = registerC;
RW_B = registerA;
We observe that common subexpression elimination is still done, and separated from the side effects. Inlining and reordering allow the compiler to separate the side effects from the "pure" part.
Note that the compiler reads and eagerly writes back to accu, which wouldn't be necessary. I'm not sure on the rationale here.
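The same transformation can be written by hand to check that it preserves behaviour. This is a simplified sketch, not the compiler's actual output: the volatile variables are replaced by an ordinary global lastWritten (a stand-in for RW_A), and transformed is a hypothetical hand-written equivalent of what the pseudocode above does.

```cpp
int accu = 1;
int lastWritten = 0;  // stand-in for the volatile RW_A in the example above

int foo(int a) { accu *= 2; return a * a; }
void bar(int x) { lastWritten = x; }

// As written in the source: two calls, each with a side effect on accu.
void original() {
    bar(foo(2));
    bar(foo(2));
}

// Hand-applied version: side effects kept, the pure part computed once.
void transformed() {
    accu *= 2;              // side effect of the first call
    accu *= 2;              // side effect of the second call
    const int value = 2 * 2;  // common subexpression, evaluated once
    bar(value);
    bar(value);
}
```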
To conclude:
A compiler does not need to test for purity. It can identify side effects that need to be preserved, and then transform the rest to its liking.
Such optimizations are worthwhile, even for trivial functions, because, among others,
CPU and memory are shared resources, what you use you might take away from someone else
Battery life
Minor optimizations may be gateways to larger ones
The overhead for a stack memory access is usually ~1 cycle, since the top of stack is usually in Level 1 cache already. Note that the usually should be in bold: it can be "even better", since the read / write may be optimized away, or it can be worse since the increased pressure on L1 cache flushes some other important data back to L2.
Where's the limit?
Theoretically, compile time. In practice, predictability and correctness of the optimizer are additional tradeoffs.
All tests with VC2008, default optimization settings for "Release" build.

Loop termination conditions

These for-loops are among the first basic examples of formal correctness proofs of algorithms. They have different but equivalent termination conditions:
1 for ( int i = 0; i != N; ++i )
2 for ( int i = 0; i < N; ++i )
The difference becomes clear in the postconditions:
The first one gives the strong guarantee that i == N after the loop terminates.
The second one only gives the weak guarantee that i >= N after the loop terminates, but you will be tempted to assume that i == N.
If for any reason the increment ++i is ever changed to something like i += 2, or if i gets modified inside the loop, or if N is negative, the program can fail:
The first one may get stuck in an infinite loop. It fails early, in the loop that has the error. Debugging is easy.
The second loop will terminate, and at some later time the program may fail because of your incorrect assumption of i == N. It can fail far away from the loop that caused the bug, making it hard to trace back. Or it can silently continue doing something unexpected, which is even worse.
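Both failure modes can be checked with a short C++ sketch. The function names are hypothetical, and the second function does not actually run the != loop (which might never end); it only reports whether the counter would ever land on N exactly, assuming a positive step and values small enough that overflow is not a concern.

```cpp
#include <cassert>

// Final value of i after the `<` form with the given step.
int finalIndexLess(int n, int step) {
    assert(step > 0);  // assumption: the counter only moves upward
    int i = 0;
    for (; i < n; i += step) {
    }
    return i;
}

// Would the `!=` form terminate? It does only if i hits n exactly.
bool notEqualFormTerminates(int n, int step) {
    assert(step > 0);
    for (int i = 0; i <= n; i += step) {
        if (i == n) {
            return true;
        }
    }
    return false;  // i stepped past n (or n is negative) without ever equalling it
}
```

With N = 5 and a step of 2, the < form ends with i == 6 rather than 5, and the != form would not terminate.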
Which termination condition do you prefer, and why? Are there other considerations? Why do many programmers who know this, refuse to apply it?
I tend to use the second form, simply because then I can be more sure that the loop will terminate. I.e. it's harder to introduce a non-termination bug by altering i inside the loop.
Of course, it also has the slightly lazy advantage of being one less character to type ;)
I would also argue, that in a language with sensible scope rules, as i is declared inside the loop construct, it shouldn't be available outside the loop. This would mitigate any reliance on i being equal to N at the end of the loop...
We shouldn't look at the counter in isolation - if for any reason someone changed the way the counter is incremented they would change the termination conditions and the resulting logic if it's required for i==N.
I would prefer the second condition since it's more standard and will not result in an endless loop.
In C++, using the != test is preferred for generality. Iterators in C++ have various concepts, like input iterator, forward iterator, bidirectional iterator, random access iterator, each of which extends the previous one with new capabilities. For < to work, random access iterator is required, whereas != merely requires input iterator.
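A minimal sketch of the generality argument (sumAll is a made-up helper): because the loop compares iterators with != only, the same template works for std::list, whose iterators are bidirectional and have no operator<, as well as for std::vector.

```cpp
#include <list>
#include <vector>

// Sums the elements of any container whose iterators support != and ++.
template <typename Container>
int sumAll(const Container& c) {
    int total = 0;
    // Writing `it < c.end()` here would fail to compile for std::list.
    for (auto it = c.begin(); it != c.end(); ++it) {
        total += *it;
    }
    return total;
}
```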
If you trust your code, you can do either.
If you want your code to be readable and easily understood (and thus more tolerant to change from someone who you've got to assume to be a klutz), I'd use something like:
for ( int i = 0 ; i >= 0 && i < N ; ++i)
I always use #2 as then you can be sure the loop will terminate... Relying on it being equal to N afterwards is relying on a side effect... Wouldn't you just be better using the variable N itself?
[edit] Sorry...I meant #2
I think most programmers use the 2nd one, because it helps figure out what goes on inside the loop. I can look at it, and "know" that i will start as 0, and will definitely be less than N.
The 1st variant doesn't have this quality. I can look at it, and all I know is that i will start as 0 and that it won't ever be equal to N. Not quite as helpful.
Irrespective of how you terminate the loop, it is always good to be very wary of using a loop control variable outside the loop. In your examples you (correctly) declare i inside the loop, so it is not in scope outside the loop and the question of its value is moot...
Of course, the 2nd variant also has the advantage that it's what all of the C references I have seen use :-)
In general I would prefer
for ( int i = 0; i < N; ++i )
The punishment for a buggy program in production seems a lot less severe: you will not have a thread stuck forever in a for loop, a situation that can be very risky and very hard to diagnose.
Also, in general I like to avoid these kinds of loops in favour of the more readable foreach style loops.
I prefer to use #2, only because I try not to extend the meaning of i outside of the for loop. If I were tracking a variable like that, I would create an additional test. Some may say this is redundant or inefficient, but it reminds the reader of my intent: At this point, i must equal N
@timyates - I agree one shouldn't rely on side-effects
I think you stated very well the difference between the two. I do have the following comments, though:
This is not "language-agnostic". I can see your examples are in C++, but there are languages where you are not allowed to modify the loop variable inside the loop, and others that don't guarantee that the value of the index is usable after the loop (and some do both).
You have declared the index i within the for, so I would not bet on the value of i after the loop.
The examples are a little bit misleading, as they implicitly assume that for is a definite loop. In reality it is just a more convenient way of writing:
// version 1
{ int i = 0;
while (i != N) {
...
++i;
}
}
Note how i is undefined after the block.
A programmer who knew all of the above would not make general assumptions about the value of i, and would be wise enough to choose i<N as the ending condition to ensure that the exit condition will eventually be met.
Using either of the above in C# would cause a compiler error if you used i outside the loop.
I prefer this sometimes:
for (int i = 0; (i <= (n-1)); i++) { ... }
This version shows directly the range of values that i can have. My take on checking lower and upper bound of the range is that if you really need this, your code has too many side effects and needs to be rewritten.
The other version:
for (int i = 1; (i <= n); i++) { ... }
helps you determine how often the loop body is called. This also has valid use cases.
For general programming work I prefer
for ( int i = 0; i < N; ++i )
to
for ( int i = 0; i != N; ++i )
Because it is less error prone, especially when code gets refactored. I have seen this kind of code turned into an infinite loop by accident.
That argument made that "you will be tempted to assume that i == N", I don't believe is true. I have never made that assumption or seen another programmer make it.
From my standpoint of formal verification and automatic termination analysis, I strongly prefer #2 (<). It is quite easy to track that some variable is increased (before var = x, after var = x+n for some non-negative number n). However, it is not that easy to see that i==N eventually holds. For this, one needs to infer that i is increased by exactly 1 in each step, which (in more complicated examples) might be lost due to abstraction.
If you think about the loop which increments by two (i = i + 2), this general idea becomes more understandable. To guarantee termination one now needs to know that i%2 == N%2, whereas this is irrelevant when using < as the condition.