for-loop mechanism efficiency tips - language-agnostic

As I am using for-loops on large multi-dim arrays, any saving on the for-loop mechanism itself is meaningful.
Accordingly, I am looking for any tips on how to reduce this overhead.
e.g. : counting down using uint instead of int and != 0 as stop instead of >0 allows the CPU to do less work (heard it once, not sure it is always true)

One important suggestion: move as much calculation to the outer loop as possible. Not all compilers can do that automatically. For eample, instead of:
for row = 0 to 999
for col = 0 to 999
cell[row*1000+col] = row * 7 + col
use:
for row = 0 to 999
x = row * 1000
y = row * 7
for col = 0 to 999
cell[x+col] = y + col

Try to make your loops contiguous in memory, this will optimize cache usage. That is, don't do this:
for (int i = 0; i < m; i++)
for (j = 0; j < n; j++)
s += arr[j][i];
If processing images, convert two loops to one loop on the pixels with a single index.
Don't make loops that will run zero times, as the pipeline is optimized to assume a loop will continue rather than end.

Have you measured the overhead? Do you know how much time is spent processing the for loops vs. how much time is spent executing your application code? What is your goal?

Loop-unrolling can be one way. That is:
for (i=0; i<N; i++) {
a[i]=...;
}
transforms into:
for (i=0; i<N; i+=4) {
a[i]=...;
a[i+1]=...;
a[i+2]=...;
a[i+3]=...;
}
You will need special handling when N is not a multiple of 4 in the example above.

First, don't sweat the small stuff. Details like counting up versus counting down are usually completely irrelevant in running time. Humans are notoriously bad at spotting areas in code that need to be sped up. Use a profiler. Pay little or no attention to any part of the loop that is not repeated, unless the profiler says otherwise. Remember that what is written in an inner loop is not necessarily executed in an inner loop, as modern compilers are pretty smart about avoiding unnecessary repetition.
That being said, be very wary of unrolling loops on modern CPUs. The tighter they are, the better they will fit into cache. In a high-performance application I worked on last year, I improved performance significantly by using loops instead of straight-line code, and tightening them up as much as I could. (Yes, I profiled; the function in question took up 80% of the run time. I also benchmarked times over typical input, so I knew the changes helped.)
Moreover, there's no harm in developing habits that favor efficient code. In C++, you should get in the habit of using pre-increment (++i) rather than post-increment (i++) to increment loop variables. It usually doesn't matter, but can make a significant difference, it doesn't make code less readable or writable, and won't hurt.

This isn't a language agnostic question, it depends highly on not only language, but also compiler. Most compilers I believe will compile these two equivalently:
for (int i = 0; i < 10; i++) { /* ... */ }
int i = 0;
while (i < 10) {
// ...
i++;
}
In most languages/compilers, the for loop is just syntactic sugar for the later while loop. Foreach is another question again, and is highly dependant on language/compiler as to how it's implemented, but it's generally less efficient that a normal for/while loop. How much more so is again, language and compiler dependant.
Your best bet would probably be to run some benchmarks with several different variations on a theme and see what comes out on top.
Edit: To that end, the suggestions here will probably save you more time rather than worrying about the loop itself.

BTW, unless you need post-increment, you should always use the pre-increment operator. It is only a minor difference, but it is more efficient.
Internally this is the difference:
Post Increment
i++;
is the same as:
int postincrement( int &i )
{
int itmp = i;
i = i + 1;
return itmp;
}
Pre Increment
++i;
is the same as:
int preincrement( int &i )
{
i = i + 1;
return i;
}

I agree with #Greg. First thing you need to do is put some benchmarking in place. There will be little point optimising anything until you prove where all your processing time is being spent. "Premature optimisation is the root of all evil"!

As your loops will have O(n^d) complexity (d=dimension), what really counts is what you put INTO the loop, not the loop itself. Optimizing a few cycles away in the loop framework from millions of cycles of an inefficient algorithm inside the loop is just snake oil.

By the way, is it good to use short instead of int in for-loop if Int16 capacity is guaranteed to be enough?

There is not enough information to answer your question accurately. What are you doing inside your loops? Does the calculation in one iteration depend on a value calculated in a previous iteration. If not, you can almost cut your time in half by simply using 2 threads, assuming you have at least a dual core processor.
Another thing to look at is how you are accessing your data, if you are doing large array processing, to make sure that you access the data sequentially as it is stored in memory, avoiding flushing your L1/L2 cache on every iteration (seen this before on smaller L1 caches, the difference can be dramatic).
Again, I would look at what is inside the loop first, where most of the gains (>99%) will be, rather than the outer loop plumbing.
But then again, if your loop code is I/O bound, then any time spent on optimization is wasted.

I think most compilers would probably do this anyway, stepping down to zero should be more efficient, as a check for zero is very fast for the processor. Again though, any compiler worth it's weight would do this with most loops anyway. You need to loo at what the compiler is doing.

There is some relevant information among the answers to another stackoverflow question, how cache memory works. I found the paper by Ulrich Drepper referred to in this answer especially useful.

Related

Monte Carlo with rates, system simulation with CUDA C++

So I am trying to simulate a 1-D physical model named Tasep.
I wrote a code to simulate this system in c++, but I definitely need a performance boost.
The model is very simple ( c++ code below ) - an array of 1's and 0's. 1 represent a particle and 0 is no-particle, meaning empty. A particle moves one element to the right, at a rate 1, if that element is empty. A particle at the last location will disappear at a rate beta ( say 0.3 ). Finally, if the first location is empty a particle will appear there, at a rate alpha.
One threaded is easy, I just pick an element at random, and act with probability 1 / alpha / beta, as written above. But this can take a lot of time.
So I tried to do a similar thing with many threads, using the GPU, and that raised a lot of questions:
Is using the GPU and CUDA at all good idea for such a thing?
How many threads should I have? I can have a thread for each site ( 10E+6 ), should I?
How do I synchronize the access to memory between different threads? I used atomic operations so far.
What is the right way to generate random data? If I use a million threads is it ok to have a random generator for each?
How do I take care of the rates?
I am very new to CUDA. I managed to run code from CUDA samples and some tutorials. Although I have some code of the above ( still gives strange result though ), I do not put it here, because I think the questions are more general.
So here is the c++ one threaded version of it:
int Tasep()
{
const int L = 750000;
// rates
int alpha = 330;
int beta = 300;
int ProbabilityNormalizer = 1000;
bool system[L];
int pos = 0;
InitArray(system); // init to 0's and 1's
/* Loop */
for (int j = 0; j < 10*L*L; j++)
{
unsigned long randomNumber = xorshf96();
pos = (randomNumber % (L)); // Pick Random location in the the array
if (pos == 0 && system[0] == 0) // First site and empty
system[0] = (alpha > (xorshf96() % ProbabilityNormalizer)); // Insert a particle with chance alpha
else if (pos == L - 1) // last site
system[L - 1] = system[L - 1] && (beta < (xorshf96() % ProbabilityNormalizer)); // Remove a particle if exists with chance beta
else if (system[pos] && !system[pos + 1]) // If current location have a particle and the next one is empty - Jump right
{
system[pos] = false;
system[pos + 1] = true;
}
if ((j % 1000) == 0) // Just do some Loggingg
Log(system, j);
}
getchar();
return 0;
}
I would be truly grateful for whoever is willing to help and give his/her advice.
I think that your goal is to perform something called Monte Carlo Simulations.
But I have failed to fully understand your main objective (i.e. get a frequency, or average power lost, etc.)
Question 01
Since you asked about random data, I do believe you can have multiple random seeds (maybe one for each thread), I would advise you to generate the seed in the GPU using any pseudo random generator (you can use even the same as CPU), store the seeds in GPU global memory and launch as many threads you can using dynamic parallelism.
So, yes CUDA is a suitable approach, but keep in your mind the balance between time that you will require to learn and how much time you will need to get the result from your current code.
If you will take use this knowledge in the future, learn CUDA maybe worth, or if you can escalate your code in many GPUs and it is taking too much time in CPU and you will need to solve this equation very often it worth too. Looks like that you are close, but if it is a simple one time result, I would advise you to let the CPU solve it, because probably, from my experience, you will take more time learning CUDA than the CPU will take to solve it (IMHO).
Question 02
The number of threads is very usual question for rookies. The answer is very dependent of your project, but taking in your code as an insight, I would take as many I can, using every thread with a different seed.
My suggestion is to use registers are what you call "sites" (be aware that are strong limitations) and then run multiples loops to evaluate your particle, in the very same idea of a car tire a bad road (data in SMEM), so your L is limited to 255 per loop (avoid spill at your cost to your project, and less registers means more warps per block). To create perturbation, I would load vectors in the shared memory, one for alpha (short), one for beta (short) (I do assume different distributions), one "exist or not particle" in the next site (char), and another two to combine as pseudo generator source with threadID, blockID, and some current time info (to help you to pick the initial alpha, beta and exist or not) so u can reuse this rates for every thread in the block, and since the data do not change (only the reading position change) you have to sync only once after reading, also you can "random pick the perturbation position and reuse the data. The initial values can be loaded from global memory and "refreshed" after an specific number of loops to hide the loading latency. In short, you will reuse the same data in shared multiple times, but the values selected for every thread change at every interaction due to the pseudo random value. Taking in account that you are talking about large numbers and you can load different data in every block, the pseudo random algorithm should be good enough. Also, you can even use the result stored in the gpu from previous runs as random source, flip one variable and do some bit operations, so u can use every bit as a particle.
Question 03
For your specific project I would strongly recommend to avoid thread cooperation and make these completely independent. But, you can use shuffle inside the same warp, with no high cost.
Question 04
It is hard to generate truly random data, but you should worry about by how often last your period (since any generator has a period of random and them repeats). I would suggest you to use a single generator which can work in parallel to your kernel and use it feed your kernels (you can use dynamic paralelism). In your case since you want some random you should not worry a lot with consistency. I gave an example of pseudo random data use in the previous question, that may assist. Keep in your mind that there is no real random generator, but there are alternatives as internet bits for example.
Question 05
Already explained in the Question 03, but keep in your mind that you do not need a full sequence of values, only a enough part to use in multiple forms, to give enough time to your kernel just process and then you can refresh you sequence, if you guarantee to not feed block with the same sequence it will be very hard to fall into patterns.
Hope I have help, Iā€™m working with CUDA for a bit more than a year, started just like you, and still every week I do improve my code, now it is almost good enough. Now I see how it perfectly fit my statistical challenge: cluster random things.
Good luck!

OpenMP parallelize for loop inside a function

I am trying to parallelize this for loop inside a function using OpenMP, but when I compile the code I still have an error =(
Error 1 error C3010: 'return' : jump out of OpenMP structured block not allowed.
I am using Visual studio 2010 C++ compiler. Can anyone help me? I appreciate any advice.
int match(char* pattern, int patternSize, char* string, int startFrom, unsigned int &comparisons) {
comparisons = 0;
#pragma omp for
for (int i = 0; i < patternSize; i++){
comparisons++;
if (pattern[i] != string[i + startFrom])
return 0;
}
return 1;
}
As #Hristo has already mentioned, you are not allowed to branch out of a parallel region in OpenMP. Among other reasons, this is not allowed because the compiler cannot know a priori how many iterations each thread should work on when it splits up a for loop like the one that you have written among the different threads.
Furthermore, even if you could branch out of your loop, you should be able to see that comparisons would be computed incorrectly. As is, you have an inherently serial algorithm that breaks at the first different character. How could you split up this work such that throwing more threads at this algorithm possibly makes it faster?
Finally, note that there is very little work being done in this loop anyway. You would be very unlikely to see any benefit from OpenMP even if you could rewrite this algorithm into a parallel algorithm. My suggestion: drop OpenMP from this loop and look to implement it somewhere else (either at a higher level - maybe you call this method on different strings? - or in a section of your code that does more work).

Pros and Cons of i != n vs i < n in an int for loop

What are the pros and cons of using one or the other iteration functions ?
function (int n) {
for (int i = 1; i != n; ++i) { ... }
}
vs
function (int n) {
for (int i = 1; i < n; i++) { ... }
}
I think the main argument against the first version is that it is a much less common idiom.
Remembering that code is read more often than it is written, it does not make sense to use a less familiar form of for loop if there isn't a very clear advantage to doing so. All it achieves is distracting anyone working on the code in future.
So primarily for code maintenance reasons (by others as well as the original coder) I would favour the more common second format.
The version with < will work correctly if n is less than 1. The version with != will go into an infinite loop (well, probably not infinite, as integer variables wrap around in most languages).
Using < also generalizes better. E.g.
for (i = start; i < end; i += increment)
This will work even if end - start is not a multiple of increment.
The first one is quite dangerous and could cause an infinite loop.
If n is ever less than 1, the loop will never exit.
Also if something changes i inside the loop, so that it skips the value of n, then again the loop will never exit.
Edit: OK to be more precise when I say never exit, it will ultimately exit one way or another, but it won't be in the manner most sane developers expect. I can just imagine the look on the poor guy that debugs your code that calls the database 2 billion times.

How do I explain what a "naive implementation" is? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
What is the clearest explanation of what computer scientists mean by "the naive implementation"? I need a good clear example which will illustrate ā€” ideally, even to non-technical people ā€” that the naive implementation may technically be a functioning solution to the problem, but practically be utterly unusable.
I'd try to keep it away from computers altogether. Ask your audience how they find an entry in a dictionary. (A normal dictionary of word definitions.)
The naive implementation is to start at the very beginning, and look at the first word. Oh, that's not the word we're looking for - look at the next one, etc. It's worth pointing out to the audience that they probably didn't even think of that way of doing things - we're smart enough to discount it immediately! It is, however, about the simplest way you could think of. (It might be interesting to ask them whether they can think of anything simpler, and check that they do really understand why it's simpler than the way we actually do it.)
The next implementation (and a pretty good one) is to start in the middle of the dictionary. Does the word we're looking for come before or after that? If it's before, turn to the page half way between the start and where we are now - otherwise, turn to the page half way between where we are now and the end, etc - binary chop.
The actual human implementation is to use our knowledge of letters to get very rapidly to "nearly the right place" - if we see "elephant" then we'll know it'll be "somewhere near the start" maybe about 1/5th of the way through. Once we've got to E (which we can do with very, very simple comparisons) we find EL etc.
StackOverflow's Jeff Atwood had a great example of a naive algorithm related to shuffling an array.
Doing it the most straightforward, least tricky way available. One example is selection sort.
In this case naive does not mean bad or unusable. It just means not particularly good.
Taking Jon Skeet's advice to heart you can describe selection sort as:
Find the highest value in the list and put it first
Find the next highest value and add it to the list
Repeat step 2 until you run out of list
It is easy to do and easy to understand, but not necessarily the best.
another naive implementation would be the use of recursion in computing for an integer's factorial in an imperative language. a more efficient solution in that case is to just use a loop.
What's the most obvious, naive algorithm for exponentiation that you could think of?
base ** exp is base * base * ... * base, exp times:
double pow(double base, int exp) {
double result = 1;
for (int i = 0; i < exp; i++)
result *= base;
return result;
}
It doesn't handle negative exponents, though. Remembering that base ** exp == 1 / base ** (-exp) == (1 / base) ** (-exp):
double pow(double base, int exp) {
double result = 1;
if (exp < 0) {
base = 1 / base;
exp = -exp;
}
for (int i = 0; i < exp; i++)
result *= base;
return result;
}
It's actually possible to compute base ** exp with less than exp multiplications, though!
double pow(double base, int exp) {
double result = 1;
if (exp < 0) {
base = 1 / base;
exp = -exp;
}
while (exp) {
if (exp % 2) {
result *= base;
exp--;
}
else {
base *= base;
exp /= 2;
}
}
return result * base;
}
This takes advantage of the fact that base ** exp == (base * base) ** (exp / 2) if exp is even, and will only require about log2(exp) multiplications.
I took the time to read your question a little closer, and I have the perfect example.
a good clear example which will illustrate -- ideally, even to non-technical people -- that the naive implementation may technically be a functioning solution to the problem, but practically be utterly unusable.
Try Bogosort!
If bogosort were used to sort a deck of cards, it would consist of checking if the deck were in order, and if it were not, one would throw the deck into the air, pick up the cards up at random, and repeat the process until the deck is sorted.
"Naive implementation" is almost always synonymous with "brute-force implementation". Naive implementations are often intuitive and the first to come to mind, but are also often O(n^2) or worse, thus taking too long too be practical for large inputs.
Programming competitions are full of problems where the naive implementation will fail to run in an acceptable amount of time, and the heart of the problem is coming up with an improved algorithm that is generally much less obvious but runs much more quickly.
Naive implementation is:
intuitive;
first to come in mind;
often inffective and/or buggy incorner cases;
Let's say that someone figures out how to extract a single field from a database and then proceeds to write a web page in PHP or any language that makes a separate query on the database for each field on the page. It works, but will be incredibly slow, inefficient, and difficult to maintain.
Naive doesn't mean bad or unusable - it means having certain qualities which pose a problem in a specific context and for a specific purpose.
The classic example of course is sorting. In the context of sorting a list of ten numbers, any old algorithm (except pogo sort) would work pretty well. However, when we get to the scale of thousands of numbers or more, typically we say that selection sort is the naive algorithm because it has the quality of O(n^2) time which would be too slow for our purposes, and that the non-naive algorithm is quicksort because it has the quality of O(n lg n) time which is fast enough for our purposes.
In fact, the case could be made that in the context of sorting a list of ten numbers, quicksort is the naive algorithm, since it will take longer than selection sort.
Determining if a number is prime or not (primality test) is an excellent example.
The naive method just check if n mod x where x = 2..square root(n) is zero for at least one x. This method can get really slow for very large prime numbers and it is not feasible to use in cryptography.
On the other hand there are a couple of probability or fast deterministic tests. These are too complicated to explain here but you might want to check the relevant Wikipedia article on the subject for more information: http://en.wikipedia.org/wiki/Primality_test
Bubble sort over 100,000 thousand entries.
The intuitive algorithms you normally use to sort a deck of cards (insertion sort or selection sort, both O(n^2)) can be considered naive, because they are easy to learn and implement, but would not scale well to a deck of, say, 100000 cards :D . In a general setting, there are faster (O(n log n)) ways to sort a list.
Note, however, that naive does not necessarily mean bad. There are situations where insertion sort is a good choice (say, when you have an already sorted big deck and few unsorted cards to add).
(Haven't seen a truly naive implementation posted yet so...)
The following implementation is "naive", because it does not cover the edge cases, and will break in other cases. It is very simple to understand, and can convey a programming message.
def naive_inverse(x):
return 1/x
It will:
Break on x=0
Do a bad job when passed an integer
You could make it more "mature" by adding these features.
A O(n!) algorithm.
foreach(object o in set1)
{
foreach(object p in set1)
{
// codez
}
}
This will perform fine with small sets and then exponentially worse with larger ones.
Another might be a naive Singleton that doesn't account for threading.
public static SomeObject Instance
{
get
{
if(obj == null)
{
obj = new SomeObject();
}
return obj;
}
}
If two threads access that at the same time it's possible for them to get two different versions. Leading to seriously weird bugs.

Loop termination conditions

These for-loops are among the first basic examples of formal correctness proofs of algorithms. They have different but equivalent termination conditions:
1 for ( int i = 0; i != N; ++i )
2 for ( int i = 0; i < N; ++i )
The difference becomes clear in the postconditions:
The first one gives the strong guarantee that i == N after the loop terminates.
The second one only gives the weak guarantee that i >= N after the loop terminates, but you will be tempted to assume that i == N.
If for any reason the increment ++i is ever changed to something like i += 2, or if i gets modified inside the loop, or if N is negative, the program can fail:
The first one may get stuck in an infinite loop. It fails early, in the loop that has the error. Debugging is easy.
The second loop will terminate, and at some later time the program may fail because of your incorrect assumption of i == N. It can fail far away from the loop that caused the bug, making it hard to trace back. Or it can silently continue doing something unexpected, which is even worse.
Which termination condition do you prefer, and why? Are there other considerations? Why do many programmers who know this, refuse to apply it?
I tend to use the second form, simply because then I can be more sure that the loop will terminate. I.e. it's harder to introduce a non-termination bug by altering i inside the loop.
Of course, it also has the slightly lazy advantage of being one less character to type ;)
I would also argue, that in a language with sensible scope rules, as i is declared inside the loop construct, it shouldn't be available outside the loop. This would mitigate any reliance on i being equal to N at the end of the loop...
We shouldn't look at the counter in isolation - if for any reason someone changed the way the counter is incremented they would change the termination conditions and the resulting logic if it's required for i==N.
I would prefer the the second condition since it's more standard and will not result in endless loop.
In C++, using the != test is preferred for generality. Iterators in C++ have various concepts, like input iterator, forward iterator, bidirectional iterator, random access iterator, each of which extends the previous one with new capabilities. For < to work, random access iterator is required, whereas != merely requires input iterator.
If you trust your code, you can do either.
If you want your code to be readable and easily understood (and thus more tolerant to change from someone who you've got to assume to be a klutz), I'd use something like;
for ( int i = 0 ; i >= 0 && i < N ; ++i)
I always use #2 as then you can be sure the loop will terminate... Relying on it being equal to N afterwards is relying on a side effect... Wouldn't you just be better using the variable N itself?
[edit] Sorry...I meant #2
I think most programmers use the 2nd one, because it helps figure out what goes on inside the loop. I can look at it, and "know" that i will start as 0, and will definitely be less than N.
The 1st variant doesn't have this quality. I can look at it, and all I know is that i will start as 0 and that it won't ever be equal to N. Not quite as helpful.
Irrespective of how you terminate the loop, it is always good to be very wary of using a loop control variable outside the loop. In your examples you (correctly) declare i inside the loop, so it is not in scope outside the loop and the question of its value is moot...
Of course, the 2nd variant also has the advantage that it's what all of the C references I have seen use :-)
In general I would prefer
for ( int i = 0; i < N; ++i )
The punishment for a buggy program in production, seems a lot less severe, you will not have a thread stuck forever in a for loop, a situation that can be very risky and very hard to diagnose.
Also, in general I like to avoid these kind of loops in favour of the more readable foreach style loops.
I prefer to use #2, only because I try not to extend the meaning of i outside of the for loop. If I were tracking a variable like that, I would create an additional test. Some may say this is redundant or inefficient, but it reminds the reader of my intent: At this point, i must equal N
#timyates - I agree one shouldn't rely on side-effects
I think you stated very well the difference between the two. I do have the following comments, though:
This is not "language-agnostic", I can see your examples are in C++ but there
are languages where you are not allowed to modify the loop variable inside the
loop and others that don't guarantee that the value of the index is usable after
the loop (and some do both).
You have declared the i
index within the for so I would not bet on the value of i after the loop.
The examples are a little bit misleading as they implictly assume that for is
a definite loop. In reality it is just a more convenient way of writing:
// version 1
{ int i = 0;
while (i != N) {
...
++i;
}
}
Note how i is undefined after the block.
If a programmer knew all of the above would not make general assumption of the value of i and would be wise enough to choose i<N as the ending conditions, to ensure that the the exit condition will be eventually met.
Using either of the above in c# would cause a compiler error if you used i outside the loop
I prefer this sometimes:
for (int i = 0; (i <= (n-1)); i++) { ... }
This version shows directly the range of values that i can have. My take on checking lower and upper bound of the range is that if you really need this, your code has too many side effects and needs to be rewritten.
The other version:
for (int i = 1; (i <= n); i++) { ... }
helps you determine how often the loop body is called. This also has valid use cases.
For general programming work I prefer
for ( int i = 0; i < N; ++i )
to
for ( int i = 0; i != N; ++i )
Because it is less error prone, especially when code gets refactored. I have seen this kind of code turned into an infinite loop by accident.
That argument made that "you will be tempted to assume that i == N", I don't believe is true. I have never made that assumption or seen another programmer make it.
From my standpoint of formal verification and automatic termination analysis, I strongly prefer #2 (<). It is quite easy to track that some variable is increased (before var = x, after var = x+n for some non-negative number n). However, it is not that easy to see that i==N eventually holds. For this, one needs to infer that i is increased by exactly 1 in each step, which (in more complicated examples) might be lost due to abstraction.
If you think about the loop which increments by two (i = i + 2), this general idea becomes more understandable. To guarantee termination one now needs to know that i%2 == N%2, whereas this is irrelevant when using < as the condition.