How can a compiler apply function elimination to impure functions?

Oftentimes when writing code, I find myself using a value from a particular function call multiple times. I realized that an obvious optimization would be to capture these repeatedly used values in variables.
This (pseudocode):
function add1(foo){ return foo + 1; }
...
do_something(add1(1));
do_something_else(add1(1));
Becomes:
function add1(foo){ return foo + 1; }
...
bar = add1(1);
do_something(bar);
do_something_else(bar);
However, doing this explicitly makes code less readable in my experience. I assumed that compilers could not do this kind of optimization if our language of choice allows functions to have side-effects.
Recently I looked into this, and if I understand correctly, this optimization is/can be done for languages where functions must be pure. That does not surprise me, but supposedly this can also be done for impure functions. With a few quick Google searches I found these snippets:
GCC 4.7 Fortran improvement
When performing front-end-optimization, the -faggressive-function-elimination option allows the removal of duplicate function calls even for impure functions.
Compiler Optimization (Wikipedia)
For example, in some languages functions are not permitted to have side effects. Therefore, if a program makes several calls to the same function with the same arguments, the compiler can immediately infer that the function's result need be computed only once. In languages where functions are allowed to have side effects, another strategy is possible. The optimizer can determine which function has no side effects, and restrict such optimizations to side effect free functions. This optimization is only possible when the optimizer has access to the called function.
From my understanding, this means that an optimizer can determine when a function is or is not pure, and perform this optimization when the function is. I say this because if a function always produces the same output when given the same input, and is side effect free, it would fulfill both conditions to be considered pure.
These two snippets raise two questions for me.
How can a compiler be able to safely make this optimization if a function is not pure? (as in -faggressive-function-elimination)
How can a compiler determine whether a function is pure or not? (as in the strategy suggested in the Wikipedia article)
and finally:
Can this kind of optimization be applied to any language, or only when certain conditions are met?
Is this optimization a worthwhile one even for extremely simple functions?
How much overhead does storing and retrieving a value from the stack incur?
I apologize if these are stupid or illogical questions. They are just some things I have been curious about lately. :)

Disclaimer: I'm not a compiler/optimizer guy; I only have a tendency to peek at the generated code, and I like to read about that stuff - so this isn't authoritative. A quick search didn't turn up much on -faggressive-function-elimination, so it might do some extra magic not explained here.
An optimizer can
attempt to inline the function call (with link time code generation, this works even across compilation units)
perform common subexpression elimination, and, possibly, side effect reordering.
Modifying your example a bit, and doing it in C++:
extern volatile int RW_A = 0; // see note below
int foo(int a) { return a * a; }
void bar(int x) { RW_A = x; }
int _tmain(int argc, _TCHAR* argv[])
{
    bar(foo(2));
    bar(foo(2));
}
Resolves to (pseudocode)
<register> = 4;
RW_A = register;
RW_A = register;
(Note: reading from and writing to a volatile variable is an "observable side effect", that the optimizer must preserve in the same order given by the code.)
Modifying the example for foo to have a side effect:
extern volatile int RW_A = 0;
extern volatile int RW_B = 0;
int accu = 1;
int foo(int a) { accu *= 2; return a * a; }
void bar(int x) { RW_A = x; }
int _tmain(int argc, _TCHAR* argv[])
{
    bar(foo(2));
    bar(foo(2));
    RW_B = accu;
    return 0;
}
generates the following pseudocode:
registerA = accu;
registerA += registerA;
accu = registerA;
registerA += registerA;
registerC = 4;
accu = registerA;
RW_A = registerC;
RW_A = registerC;
RW_B = registerA;
We observe that common subexpression elimination is still done, and separated from the side effects. Inlining and reordering allow the compiler to separate the side effects from the "pure" part.
Note that the compiler reads and eagerly writes back to accu, which wouldn't be necessary. I'm not sure on the rationale here.
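As an aside, and not something I tested with VC2008: GCC and Clang also let the programmer assert purity explicitly through function attributes, which gives the optimizer the same licence even when it cannot see the function body. A rough sketch:
/* Sketch using GCC/Clang extensions - not part of the tests above.
   __attribute__((const)) promises the result depends only on the arguments
   and that there are no side effects; __attribute__((pure)) additionally
   allows reading (but not writing) global memory. */
__attribute__((const)) int foo(int a);   // definition may live in another translation unit

int twice_foo(void)
{
    // With the attribute, the two calls may legally be folded into one.
    return foo(2) + foo(2);
}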
To conclude:
A compiler does not need to test for purity. It can identify side effects that need to be preserved, and then transform the rest to its liking.
Such optimizations are worthwhile, even for trivial functions, because, among others,
CPU and memory are shared resources, what you use you might take away from someone else
Battery life
Minor optimizations may be gateways to larger ones
The overhead for a stack memory access is usually ~1 cycle, since the top of stack is usually in Level 1 cache already. Note that the usually should be in bold: it can be "even better", since the read / write may be optimized away, or it can be worse since the increased pressure on L1 cache flushes some other important data back to L2.
Where's the limit?
Theoretically, compile time. In practice, predictability and correctness of the optimizer are additional tradeoffs.
All tests with VC2008, default optimization settings for "Release" build.

Related

How to reuse code for CPU fallback in CUDA

I have some calculations that I want to parallelize if my user has a CUDA-compliant GPU, otherwise I want to execute the same code on the CPU. I don't want to have two versions of the algorithm code to maintain, one for the CPU and one for the GPU. I'm considering the following approach but am wondering if the extra level of indirection will hurt performance or if there is a better practice.
For my test, I took the basic CUDA template that adds the elements of two integer arrays and stores the result in a third array. I removed the actual addition operation and placed it into its own function marked with both device and host directives...
__device__ __host__ void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}
... then modified the kernel to call the aforementioned function on the element identified by threadIdx...
__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}
So now my application can check for the presence of a CUDA device. If one is found I can use...
addKernel <<<1, size>>> (dev_c, dev_a, dev_b);
... and if not I can forego parallelization and iterate through the elements calling the host version of the function...
int* pA = (int*)a;
int* pB = (int*)b;
int* pC = (int*)c;
for (int i = 0; i < arraySize; i++)
{
    addSingleItem(pC++, pA++, pB++);
}
Everything seems to work in my small test app, but I'm concerned about the extra call involved. Do device-to-device function calls incur any significant performance hits? Is there a more generally accepted way to do CPU fallback that I should adopt?
If addSingleItem and addKernel are defined in the same translation unit/module/file, there should be no cost to having a device-to-device function call. The compiler will aggressively inline that code, as if you wrote it in a single function.
That is undoubtedly the best approach if it can be managed, for the reason described above.
If it's desired to still have some file-level modularity, it is possible to break code into a separate file and include that file in the compilation of the kernel function. Conceptually this is no different than what is described already.
Another possible approach is to use compiler macros to assist in the addition or removal or modification of code to handle the GPU case vs. non-GPU case. There are endless possibilities here, but see here for a simple idea. You can redefine what __host__ __device__ means in different scenarios, for example. I would say this probably only makes sense if you are building separate binaries for the GPU vs. non-GPU case, but you may find a clever way to handle it in the same executable.
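For instance, a single macro can expand to the CUDA qualifiers only when the file is compiled by nvcc, so the same function body serves both builds (my own sketch of that idea; HOST_DEVICE is a made-up name, while __CUDACC__ is the macro nvcc defines):
// Sketch: one source file, compiled either by nvcc or by a plain host compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE      // plain host build: the qualifiers disappear
#endif

HOST_DEVICE void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}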
Finally, if you desire this but must place the __device__ function in a separate translation unit, it is still possible but there may be some performance loss due to the device-to-device function call across module boundaries. The amount of performance loss here is hard to generalize since it depends heavily on code structure, but it's not unusual to see 10% or 20% performance hit. In that case, you may wish to investigate link-time-optimizations that became available in CUDA 11.
This question may also be of interest, although only tangentially related here.

OpenMP parallelize for loop inside a function

I am trying to parallelize this for loop inside a function using OpenMP, but when I compile the code I still have an error =(
Error 1 error C3010: 'return' : jump out of OpenMP structured block not allowed.
I am using Visual studio 2010 C++ compiler. Can anyone help me? I appreciate any advice.
int match(char* pattern, int patternSize, char* string, int startFrom, unsigned int &comparisons) {
    comparisons = 0;
    #pragma omp for
    for (int i = 0; i < patternSize; i++){
        comparisons++;
        if (pattern[i] != string[i + startFrom])
            return 0;
    }
    return 1;
}
As #Hristo has already mentioned, you are not allowed to branch out of a parallel region in OpenMP. Among other reasons, this is not allowed because the compiler cannot know a priori how many iterations each thread should work on when it splits up a for loop like the one that you have written among the different threads.
Furthermore, even if you could branch out of your loop, you should be able to see that comparisons would be computed incorrectly. As is, you have an inherently serial algorithm that breaks at the first different character. How could you split up this work such that throwing more threads at this algorithm possibly makes it faster?
Finally, note that there is very little work being done in this loop anyway. You would be very unlikely to see any benefit from OpenMP even if you could rewrite this algorithm into a parallel algorithm. My suggestion: drop OpenMP from this loop and look to implement it somewhere else (either at a higher level - maybe you call this method on different strings? - or in a section of your code that does more work).
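To make that concrete, here is a minimal sketch of the serial version that this advice points to (my own code, not from the question), with the pragma removed so the early return is legal again:
// Serial version - no OpenMP block inside, so the early return is fine.
int match(char* pattern, int patternSize, char* string, int startFrom, unsigned int &comparisons) {
    comparisons = 0;
    for (int i = 0; i < patternSize; i++) {
        comparisons++;
        if (pattern[i] != string[i + startFrom])
            return 0;
    }
    return 1;
}
If parallelism is wanted at all, it would then go one level up, for example an omp parallel for over the different startFrom offsets being searched.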

What's So Good About Recursion? [duplicate]

Is there a performance hit if we use a loop instead of recursion or vice versa in algorithms where both can serve the same purpose? Eg: Check if the given string is a palindrome.
I have seen many programmers using recursion as a means to show off when a simple iteration algorithm can fit the bill.
Does the compiler play a vital role in deciding what to use?
Loops may achieve a performance gain for your program. Recursion may achieve a performance gain for your programmer. Choose which is more important in your situation!
It is possible that recursion will be more expensive, depending on if the recursive function is tail recursive (the last line is recursive call). Tail recursion should be recognized by the compiler and optimized to its iterative counterpart (while maintaining the concise, clear implementation you have in your code).
I would write the algorithm in the way that makes the most sense and is the clearest for the poor sucker (be it yourself or someone else) that has to maintain the code in a few months or years. If you run into performance issues, then profile your code, and then and only then look into optimizing by moving over to an iterative implementation. You may want to look into memoization and dynamic programming.
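To make the tail-recursion point concrete (my own C++ sketch, not part of the original answer): the usual factorial is not a tail call, because the multiplication happens after the recursive call returns; the accumulator version is, and optimizing compilers commonly turn it into a plain loop.
// NOT tail recursive: the multiply runs after fact(n - 1) returns.
long fact(long n)
{
    if (n <= 1) return 1;
    return n * fact(n - 1);
}

// Tail recursive: the recursive call is the very last thing done.
long fact_acc(long n, long acc = 1)
{
    if (n <= 1) return acc;
    return fact_acc(n - 1, acc * n);
}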
Comparing recursion to iteration is like comparing a Phillips-head screwdriver to a flat-head screwdriver. For the most part you could remove any Phillips-head screw with a flat head, but it would just be easier if you used the screwdriver designed for that screw, right?
Some algorithms just lend themselves to recursion because of the way they are designed (Fibonacci sequences, traversing a tree-like structure, etc.). Recursion makes the algorithm more succinct and easier to understand (and therefore shareable and reusable).
Also, some recursive algorithms use "Lazy Evaluation" which makes them more efficient than their iterative brothers. This means that they only do the expensive calculations at the time they are needed rather than each time the loop runs.
That should be enough to get you started. I'll dig up some articles and examples for you too.
Link 1: Haskell vs PHP (Recursion vs Iteration)
Here is an example where the programmer had to process a large data set using PHP. He shows how easy it would have been to handle in Haskell using recursion, but since PHP had no easy way to accomplish the same method, he was forced to use iteration to get the result.
http://blog.webspecies.co.uk/2011-05-31/lazy-evaluation-with-php.html
Link 2: Mastering Recursion
Most of recursion's bad reputation comes from the high costs and inefficiency in imperative languages. The author of this article talks about how to optimize recursive algorithms to make them faster and more efficient. He also goes over how to convert a traditional loop into a recursive function and the benefits of using tail-end recursion. His closing words really summed up some of my key points I think:
"recursive programming gives the programmer a better way of organizing
code in a way that is both maintainable and logically consistent."
https://developer.ibm.com/articles/l-recurs/
Link 3: Is recursion ever faster than looping? (Answer)
Here is a link to an answer for a stackoverflow question that is similar to yours. The author points out that a lot of the benchmarks associated with either recursing or looping are very language specific. Imperative languages are typically faster using a loop and slower with recursion and vice-versa for functional languages. I guess the main point to take from this link is that it is very difficult to answer the question in a language agnostic / situation blind sense.
Is recursion ever faster than looping?
Recursion is more costly in memory, as each recursive call generally requires a return address to be pushed onto the stack - so that later the program can return to that point.
Still, there are many cases in which recursion is a lot more natural and readable than loops - like when working with trees. In these cases I would recommend sticking to recursion.
Typically, one would expect the performance penalty to lie in the other direction. Recursive calls can lead to the construction of extra stack frames; the penalty for this varies. Also, in some languages like Python (more correctly, in some implementations of some languages...), you can run into stack limits rather easily for tasks you might specify recursively, such as finding the maximum value in a tree data structure. In these cases, you really want to stick with loops.
Writing good recursive functions can reduce the performance penalty somewhat, assuming you have a compiler that optimizes tail recursions, etc. (Also double check to make sure that the function really is tail recursive---it's one of those things that many people make mistakes on.)
Apart from "edge" cases (high performance computing, very large recursion depth, etc.), it's preferable to adopt the approach that most clearly expresses your intent, is well-designed, and is maintainable. Optimize only after identifying a need.
Recursion is better than iteration for problems that can be broken down into multiple, smaller pieces.
For example, to make a recursive Fibonacci algorithm, you break down fib(n) into fib(n-1) and fib(n-2) and compute both parts. Iteration only allows you to repeat a single function over and over again.
However, Fibonacci is actually a broken example and I think iteration is actually more efficient. Notice that fib(n) = fib(n-1) + fib(n-2) and fib(n-1) = fib(n-2) + fib(n-3). fib(n-2) gets calculated twice!
A better example is a recursive algorithm for a tree. The problem of analyzing the parent node can be broken down into multiple smaller problems of analyzing each child node. Unlike the Fibonacci example, the smaller problems are independent of each other.
So yeah - recursion is better than iteration for problems that can be broken down into multiple, smaller, independent, similar problems.
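As a side note (my own sketch, not from this answer): the duplicated work in the naive Fibonacci can be removed by caching already-computed values, the memoization idea mentioned in an earlier answer.
#include <unordered_map>

// Naive version: the same subproblems are recomputed many times (exponential time).
long fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

// Memoized version: each value is computed once and then looked up.
long fib_memo(int n)
{
    static std::unordered_map<int, long> cache;
    if (n < 2) return n;
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;
    long result = fib_memo(n - 1) + fib_memo(n - 2);
    cache[n] = result;
    return result;
}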
Your performance deteriorates when using recursion because calling a method, in any language, implies a lot of preparation: the calling code posts a return address and the call parameters, some other context information such as processor registers might be saved somewhere, and at return time the called method posts a return value which is then retrieved by the caller, and any context information that was previously saved is restored. The performance difference between an iterative and a recursive approach lies in the time these operations take.
From an implementation point of view, you really start noticing the difference when the time it takes to handle the calling context is comparable to the time it takes for your method to execute. If your recursive method takes longer to execute than the calling-context management part, go the recursive way, as the code is generally more readable and easy to understand and you won't notice the performance loss. Otherwise go iterative, for efficiency reasons.
I believe tail recursion in java is not currently optimized. The details are sprinkled throughout this discussion on LtU and the associated links. It may be a feature in the upcoming version 7, but apparently it presents certain difficulties when combined with Stack Inspection since certain frames would be missing. Stack Inspection has been used to implement their fine-grained security model since Java 2.
http://lambda-the-ultimate.org/node/1333
There are many cases where it gives a much more elegant solution over the iterative method, the common example being traversal of a binary tree, so it isn't necessarily more difficult to maintain. In general, iterative versions are usually a bit faster (and during optimization may well replace a recursive version), but recursive versions are simpler to comprehend and implement correctly.
Recursion is very useful in some situations. For example, consider the code for finding the factorial:
int factorial ( int input )
{
    int x, fact = 1;
    for ( x = input; x > 1; x--)
        fact *= x;
    return fact;
}
Now consider it by using the recursive function
int factorial ( int input )
{
    if (input == 0)
    {
        return 1;
    }
    return input * factorial(input - 1);
}
By observing these two, we can see that the recursive version is easier to understand.
But if it is not used with care, it can be quite error-prone too.
For example, if we miss the if (input == 0) base case, the code will run for some time and usually end with a stack overflow.
In many cases recursion is faster because of caching, which improves performance. For example, here is an iterative version of merge sort using the traditional merge routine. It will run slower than the recursive implementation because the recursive version has the better cache behaviour.
Iterative implementation
public static void sort(Comparable[] a)
{
    int N = a.length;
    aux = new Comparable[N];
    for (int sz = 1; sz < N; sz = sz+sz)
        for (int lo = 0; lo < N-sz; lo += sz+sz)
            merge(a, lo, lo+sz-1, Math.min(lo+sz+sz-1, N-1));
}
Recursive implementation
private static void sort(Comparable[] a, Comparable[] aux, int lo, int hi)
{
    if (hi <= lo) return;
    int mid = lo + (hi - lo) / 2;
    sort(a, aux, lo, mid);
    sort(a, aux, mid+1, hi);
    merge(a, aux, lo, mid, hi);
}
PS - this is what was told by Professor Kevin Wayne (Princeton University) on the course on algorithms presented on Coursera.
Using recursion, you're incurring the cost of a function call with each "iteration", whereas with a loop, the only thing you usually pay is an increment/decrement. So, if the code for the loop isn't much more complicated than the code for the recursive solution, loop will usually be superior to recursion.
Whether to use recursion or iteration depends on the business logic that you want to implement, though in most cases they can be used interchangeably. Most developers go for recursion because it is easier to understand.
It depends on the language. In Java you should use loops. Functional languages optimize recursion.
Recursion has the disadvantage that an algorithm written with it has O(n) space complexity, while the iterative approach has a space complexity of O(1). That is the advantage of using iteration over recursion.
Then why do we use recursion?
See below.
Sometimes it is easier to write an algorithm using recursion, while it is slightly tougher to write the same algorithm using iteration. In that case, if you opt to follow the iteration approach, you have to handle the stack yourself.
If you're just iterating over a list, then sure, iterate away.
A couple of other answers have mentioned (depth-first) tree traversal. It really is such a great example, because it's a very common thing to do to a very common data structure. Recursion is extremely intuitive for this problem.
Check out the "find" methods here:
http://penguin.ewu.edu/cscd300/Topic/BSTintro/index.html
Recursion is simpler (and thus more fundamental) than any possible definition of iteration. You can define a Turing-complete system with only a pair of combinators (yes, even recursion itself is a derivative notion in such a system). Lambda calculus is an equally powerful fundamental system, featuring recursive functions. But if you want to define iteration properly, you need many more primitives to start with.
As for the code - no, recursive code is in fact much easier to understand and to maintain than a purely iterative one, since most data structures are recursive. Of course, in order to get it right one would need a language with support for higher-order functions and closures, at least - to get all the standard combinators and iterators in a neat way. In C++, of course, complicated recursive solutions can look a bit ugly, unless you're a hardcore user of FC++ and the like.
I would think that with (non-tail) recursion there would be a performance hit for allocating a new stack frame etc. every time the function is called (dependent on the language, of course).
It depends on the "recursion depth", and on how much the function call overhead will influence the total execution time.
For example, calculating the classical factorial recursively is very inefficient due to:
- risk of data overflow
- risk of stack overflow
- function call overhead occupying 80% of the execution time
whereas a min-max algorithm for position analysis in the game of chess, which analyzes the subsequent N moves, can naturally be implemented as recursion over the "analysis depth" (as I'm doing ^_^).
Recursion? Where do I start? Wiki will tell you "it's the process of repeating items in a self-similar way".
Back in the day when I was doing C and C++, recursion was a godsend - stuff like "tail recursion". You'll also find many sorting algorithms use recursion. Quick sort example: http://alienryderflex.com/quicksort/
Recursion, like any other technique, is useful for specific problems. Perhaps you won't find a use for it straight away or often, but there will be a problem you'll be glad it's available for.
In C++, if the recursive function is a templated one, the compiler has more chance to optimize it, as all the type deduction and function instantiation happens at compile time. Modern compilers can also inline the function if possible. So with optimization flags like -O3 or -O2 in g++, recursion may turn out faster than iteration. In iterative code, the compiler gets less chance to optimize it, as it is already in a more or less optimal state (if written well enough).
In my case, I was trying to implement matrix exponentiation by squaring using Armadillo matrix objects, in both recursive and iterative way. The algorithm can be found here... https://en.wikipedia.org/wiki/Exponentiation_by_squaring.
My functions were templated and I have calculated 1,000,000 12x12 matrices raised to the power 10. I got the following result:
iterative + optimisation flag -O3 -> 2.79.. sec
recursive + optimisation flag -O3 -> 1.32.. sec
iterative + No-optimisation flag -> 2.83.. sec
recursive + No-optimisation flag -> 4.15.. sec
These results have been obtained using gcc-4.8 with c++11 flag (-std=c++11) and Armadillo 6.1 with Intel mkl. Intel compiler also shows similar results.
Mike is correct. Tail recursion is not optimized out by the Java compiler or the JVM. You will always get a stack overflow with something like this:
int count(int i) {
    return i >= 100000000 ? i : count(i+1);
}
You have to keep in mind that if you use recursion that is too deep, you will run into a stack overflow, depending on the allowed stack size. To prevent this, make sure to provide a base case that ends your recursion.
Using just Chrome 45.0.2454.85 m, recursion seems to be a nice amount faster.
Here is the code:
(function recursionVsForLoop(global) {
"use strict";
// Perf test
function perfTest() {}
perfTest.prototype.do = function(ns, fn) {
console.time(ns);
fn();
console.timeEnd(ns);
};
// Recursion method
(function recur() {
var count = 0;
global.recurFn = function recurFn(fn, cycles) {
fn();
count = count + 1;
if (count !== cycles) recurFn(fn, cycles);
};
})();
// Looped method
function loopFn(fn, cycles) {
for (var i = 0; i < cycles; i++) {
fn();
}
}
// Tests
var curTest = new perfTest(),
testsToRun = 100;
curTest.do('recursion', function() {
recurFn(function() {
console.log('a recur run.');
}, testsToRun);
});
curTest.do('loop', function() {
loopFn(function() {
console.log('a loop run.');
}, testsToRun);
});
})(window);
RESULTS
// 100 runs using standard for loop
100x for loop run.
Time to complete: 7.683ms
// 100 runs using functional recursive approach w/ tail recursion
100x recursion run.
Time to complete: 4.841ms
When run at 300 cycles per test, recursion won again by a bigger margin.
If the iterations are atomic and orders of magnitude more expensive than pushing a new stack frame and creating a new thread and you have multiple cores and your runtime environment can use all of them, then a recursive approach could yield a huge performance boost when combined with multithreading. If the average number of iterations is not predictable then it might be a good idea to use a thread pool which will control thread allocation and prevent your process from creating too many threads and hogging the system.
For example, in some languages, there are recursive multithreaded merge sort implementations.
But again, multithreading can be used with looping rather than recursion, so how well this combination will work depends on more factors including the OS and its thread allocation mechanism.
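A rough sketch of what that combination can look like (my own C++ example, not from the answer above; the 10000-element cutoff and the depth limit are arbitrary assumptions):
#include <algorithm>
#include <future>
#include <vector>

// Recursive sort that farms one half out to another thread while the
// pieces are still large, then falls back to plain std::sort near the leaves.
void parallel_sort(std::vector<int>& v, std::size_t lo, std::size_t hi, int depth)
{
    if (hi - lo < 10000 || depth == 0) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           [&] { parallel_sort(v, lo, mid, depth - 1); });
    parallel_sort(v, mid, hi, depth - 1);   // right half on the current thread
    left.get();                             // wait for the left half
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}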
I found another difference between those approaches.
It looks simple and unimportant, but it plays a very important role when you prepare for interviews and this subject arises, so look closely.
In short:
1) iterative post-order traversal is not easy - that makes iterative DFT more complex
2) checking for cycles is easier with recursion
Details:
In the recursive case, it is easy to create pre and post traversals:
Imagine a pretty standard question: "print all tasks that should be executed to execute the task 5, when tasks depend on other tasks"
Example:
//key-task, value-list of tasks the key task depends on
//"adjacency map":
Map<Integer, List<Integer>> tasksMap = new HashMap<>();
tasksMap.put(0, new ArrayList<>());
tasksMap.put(1, new ArrayList<>());
List<Integer> t2 = new ArrayList<>();
t2.add(0);
t2.add(1);
tasksMap.put(2, t2);
List<Integer> t3 = new ArrayList<>();
t3.add(2);
t3.add(10);
tasksMap.put(3, t3);
List<Integer> t4 = new ArrayList<>();
t4.add(3);
tasksMap.put(4, t4);
List<Integer> t5 = new ArrayList<>();
t5.add(3);
tasksMap.put(5, t5);
tasksMap.put(6, new ArrayList<>());
tasksMap.put(7, new ArrayList<>());
List<Integer> t8 = new ArrayList<>();
t8.add(5);
tasksMap.put(8, t8);
List<Integer> t9 = new ArrayList<>();
t9.add(4);
tasksMap.put(9, t9);
tasksMap.put(10, new ArrayList<>());
//task to analyze:
int task = 5;
List<Integer> res11 = getTasksInOrderDftReqPostOrder(tasksMap, task);
System.out.println(res11); //note, no reverse required
List<Integer> res12 = getTasksInOrderDftReqPreOrder(tasksMap, task);
Collections.reverse(res12);//note reverse!
System.out.println(res12);
private static List<Integer> getTasksInOrderDftReqPreOrder(Map<Integer, List<Integer>> tasksMap, int task) {
List<Integer> result = new ArrayList<>();
Set<Integer> visited = new HashSet<>();
reqPreOrder(tasksMap,task,result, visited);
return result;
}
private static void reqPreOrder(Map<Integer, List<Integer>> tasksMap, int task, List<Integer> result, Set<Integer> visited) {
if(!visited.contains(task)) {
visited.add(task);
result.add(task);//pre order!
List<Integer> children = tasksMap.get(task);
if (children != null && children.size() > 0) {
for (Integer child : children) {
reqPreOrder(tasksMap,child,result, visited);
}
}
}
}
private static List<Integer> getTasksInOrderDftReqPostOrder(Map<Integer, List<Integer>> tasksMap, int task) {
List<Integer> result = new ArrayList<>();
Set<Integer> visited = new HashSet<>();
reqPostOrder(tasksMap,task,result, visited);
return result;
}
private static void reqPostOrder(Map<Integer, List<Integer>> tasksMap, int task, List<Integer> result, Set<Integer> visited) {
if(!visited.contains(task)) {
visited.add(task);
List<Integer> children = tasksMap.get(task);
if (children != null && children.size() > 0) {
for (Integer child : children) {
reqPostOrder(tasksMap,child,result, visited);
}
}
result.add(task);//post order!
}
}
Note that the recursive post-order traversal does not require a subsequent reversal of the result: children are printed first and your task from the question is printed last. Everything is fine. You can do a recursive pre-order traversal (also shown above), but that one will require a reversal of the result list.
Not that simple with the iterative approach! With the iterative (one stack) approach you can only do a pre-order traversal, so you are obliged to reverse the result array at the end:
List<Integer> res1 = getTasksInOrderDftStack(tasksMap, task);
Collections.reverse(res1);//note reverse!
System.out.println(res1);
private static List<Integer> getTasksInOrderDftStack(Map<Integer, List<Integer>> tasksMap, int task) {
List<Integer> result = new ArrayList<>();
Set<Integer> visited = new HashSet<>();
Stack<Integer> st = new Stack<>();
st.add(task);
visited.add(task);
while(!st.isEmpty()){
Integer node = st.pop();
List<Integer> children = tasksMap.get(node);
result.add(node);
if(children!=null && children.size() > 0){
for(Integer child:children){
if(!visited.contains(child)){
st.add(child);
visited.add(child);
}
}
}
//If you put it here - it does not matter - it is anyway a pre-order
//result.add(node);
}
return result;
}
Looks simple, no?
But it is a trap in some interviews.
It means the following: with the recursive approach, you can implement a depth-first traversal and then select which order you need, pre or post, simply by changing the location of the "print" (in our case, of the "adding to the result list"). With the iterative (one stack) approach you can easily do only a pre-order traversal, so in situations where the children need to be printed first (pretty much all situations where you need to start printing from the bottom nodes, going upwards) you are in trouble. If you have that trouble you can reverse later, but it will be an addition to your algorithm. And if an interviewer is looking at his watch, it may be a problem for you. There are complex ways to do an iterative post-order traversal - they exist, but they are not simple. Example: https://www.geeksforgeeks.org/iterative-postorder-traversal-using-stack/
Thus, the bottom line: I would use recursion during interviews; it is simpler to manage and to explain. You have an easy way to go from pre- to post-order traversal in any urgent case. With the iterative approach you are not that flexible.
I would use recursion and then say: "OK, but the iterative approach can give me more direct control over the memory used; I can easily measure the stack size and disallow a dangerous overflow..."
Another plus of recursion: it is simpler to avoid/notice cycles in a graph.
Example (pseudocode):
dft(n){
    mark(n)
    for(child: n.children){
        if(marked(child))
            explode - cycle found!!!
        dft(child)
    }
    unmark(n)
}
It may be fun to write it as recursion, or as a practice exercise.
However, if the code is to be used in production, you need to consider the possibility of stack overflow.
Tail recursion optimization can eliminate stack overflow, but do you want to go through the trouble of making it so? And you need to know that you can count on the optimization being present in your environment.
Every time the algorithm recurses, how much is the data size or n reduced by?
If you are reducing the size of data or n by half every time you recurse, then in general you don't need to worry about stack overflow. Say, if it needs to be 4,000 levels deep or 10,000 levels deep for the program to stack overflow, then your data size needs to be roughly 2^4000 for your program to stack overflow. To put that into perspective, the biggest storage devices recently can hold 2^61 bytes, and if you have 2^61 of such devices, you are only dealing with a 2^122 data size. If you are looking at all the atoms in the universe, it is estimated that there may be fewer than 2^84. If you need to deal with all the data in the universe and their states for every millisecond since the birth of the universe, estimated to be 14 billion years ago, it may only be 2^153. So if your program can handle 2^4000 units of data or n, you can handle all data in the universe and the program will not stack overflow. If you don't need to deal with numbers as big as 2^4000 (a 4000-bit integer), then in general you don't need to worry about stack overflow.
However, if you reduce the size of data or n by a constant amount every time you recurse, then you can run into stack overflow when n becomes merely 20,000. That is, the program runs well when n is 1,000, you think the program is good, and then the program stack overflows some time in the future, when n is 5,000 or 20,000.
So if you have a possibility of stack overflow, try to make it an iterative solution.
As far as I know, Perl does not optimize tail-recursive calls, but you can fake it.
sub f{
    my( $l, $r ) = @_;
    if( $l >= $r ){
        return $l;
    } else {
        # return f( $l+1, $r );
        @_ = ( $l+1, $r );
        goto &f;
    }
}
When first called it will allocate space on the stack. Then it will change its arguments and restart the subroutine, without adding anything more to the stack. It will therefore pretend that it never called itself, changing it into an iterative process.
Note that there is no "my @_;" or "local @_;"; if there were, it would no longer work.
"Is there a performance hit if we use a loop instead of
recursion or vice versa in algorithms where both can serve the same purpose?"
Usually yes if you are writing in a imperative language iteration will run faster than recursion, the performance hit is minimized in problems where the iterative solution requires manipulating Stacks and popping items off of a stack due to the recursive nature of the problem. There are a lot of times where the recursive implementation is much easier to read because the code is much shorter,
so you do want to consider maintainability. Especailly in cases where the problem has a recursive nature. So take for example:
The recursive implementation of Tower of Hanoi:
def TowerOfHanoi(n, source, destination, auxiliary):
    if n == 1:
        print("Move disk 1 from source", source, "to destination", destination)
        return
    TowerOfHanoi(n-1, source, auxiliary, destination)
    print("Move disk", n, "from source", source, "to destination", destination)
    TowerOfHanoi(n-1, auxiliary, destination, source)
Fairly short and pretty easy to read. Compare this with its counterpart, the iterative TowerOfHanoi:
# Python3 program for iterative Tower of Hanoi
import sys

# A structure to represent a stack
class Stack:
    # Constructor to set the data of
    # the newly created tree node
    def __init__(self, capacity):
        self.capacity = capacity
        self.top = -1
        self.array = [0]*capacity

# function to create a stack of given capacity.
def createStack(capacity):
    stack = Stack(capacity)
    return stack

# Stack is full when top is equal to the last index
def isFull(stack):
    return (stack.top == (stack.capacity - 1))

# Stack is empty when top is equal to -1
def isEmpty(stack):
    return (stack.top == -1)

# Function to add an item to stack.
# It increases top by 1
def push(stack, item):
    if(isFull(stack)):
        return
    stack.top += 1
    stack.array[stack.top] = item

# Function to remove an item from stack.
# It decreases top by 1
def Pop(stack):
    if(isEmpty(stack)):
        return -sys.maxsize
    Top = stack.top
    stack.top -= 1
    return stack.array[Top]

# Function to implement legal
# movement between two poles
def moveDisksBetweenTwoPoles(src, dest, s, d):
    pole1TopDisk = Pop(src)
    pole2TopDisk = Pop(dest)

    # When pole 1 is empty
    if (pole1TopDisk == -sys.maxsize):
        push(src, pole2TopDisk)
        moveDisk(d, s, pole2TopDisk)
    # When pole2 pole is empty
    elif (pole2TopDisk == -sys.maxsize):
        push(dest, pole1TopDisk)
        moveDisk(s, d, pole1TopDisk)
    # When top disk of pole1 > top disk of pole2
    elif (pole1TopDisk > pole2TopDisk):
        push(src, pole1TopDisk)
        push(src, pole2TopDisk)
        moveDisk(d, s, pole2TopDisk)
    # When top disk of pole1 < top disk of pole2
    else:
        push(dest, pole2TopDisk)
        push(dest, pole1TopDisk)
        moveDisk(s, d, pole1TopDisk)

# Function to show the movement of disks
def moveDisk(fromPeg, toPeg, disk):
    print("Move the disk", disk, "from '", fromPeg, "' to '", toPeg, "'")

# Function to implement TOH puzzle
def tohIterative(num_of_disks, src, aux, dest):
    s, d, a = 'S', 'D', 'A'

    # If number of disks is even, then interchange
    # destination pole and auxiliary pole
    if (num_of_disks % 2 == 0):
        temp = d
        d = a
        a = temp
    total_num_of_moves = int(pow(2, num_of_disks) - 1)

    # Larger disks will be pushed first
    for i in range(num_of_disks, 0, -1):
        push(src, i)

    for i in range(1, total_num_of_moves + 1):
        if (i % 3 == 1):
            moveDisksBetweenTwoPoles(src, dest, s, d)
        elif (i % 3 == 2):
            moveDisksBetweenTwoPoles(src, aux, s, a)
        elif (i % 3 == 0):
            moveDisksBetweenTwoPoles(aux, dest, a, d)

# Input: number of disks
num_of_disks = 3

# Create three stacks of size 'num_of_disks'
# to hold the disks
src = createStack(num_of_disks)
dest = createStack(num_of_disks)
aux = createStack(num_of_disks)

tohIterative(num_of_disks, src, aux, dest)
Now the first one is way easier to read because, surprise surprise, shorter code is usually easier to understand than code that is 10 times longer. Sometimes you want to ask yourself: is the extra performance gain really worth it? Think of the hours wasted debugging the code. Is the iterative TowerOfHanoi faster than the recursive TowerOfHanoi? Probably, but not by a big margin. Would I like to program recursive problems like TowerOfHanoi using iteration? Hell no. Next we have another recursive function, the Ackermann function:
Using recursion:
def ackermann(m, n):
    if m == 0:
        # BASE CASE
        return n + 1
    elif m > 0 and n == 0:
        # RECURSIVE CASE
        return ackermann(m - 1, 1)
    elif m > 0 and n > 0:
        # RECURSIVE CASE
        return ackermann(m - 1, ackermann(m, n - 1))
Using Iteration:
callStack = [{'m': 2, 'n': 3, 'indentation': 0, 'instrPtr': 'start'}]
returnValue = None

while len(callStack) != 0:
    m = callStack[-1]['m']
    n = callStack[-1]['n']
    indentation = callStack[-1]['indentation']
    instrPtr = callStack[-1]['instrPtr']

    if instrPtr == 'start':
        print('%sackermann(%s, %s)' % (' ' * indentation, m, n))
        if m == 0:
            # BASE CASE
            returnValue = n + 1
            callStack.pop()
            continue
        elif m > 0 and n == 0:
            # RECURSIVE CASE
            callStack[-1]['instrPtr'] = 'after first recursive case'
            callStack.append({'m': m - 1, 'n': 1, 'indentation': indentation + 1, 'instrPtr': 'start'})
            continue
        elif m > 0 and n > 0:
            # RECURSIVE CASE
            callStack[-1]['instrPtr'] = 'after second recursive case, inner call'
            callStack.append({'m': m, 'n': n - 1, 'indentation': indentation + 1, 'instrPtr': 'start'})
            continue
    elif instrPtr == 'after first recursive case':
        returnValue = returnValue
        callStack.pop()
        continue
    elif instrPtr == 'after second recursive case, inner call':
        callStack[-1]['innerCallResult'] = returnValue
        callStack[-1]['instrPtr'] = 'after second recursive case, outer call'
        callStack.append({'m': m - 1, 'n': returnValue, 'indentation': indentation + 1, 'instrPtr': 'start'})
        continue
    elif instrPtr == 'after second recursive case, outer call':
        returnValue = returnValue
        callStack.pop()
        continue

print(returnValue)
And once again I will argue that the recursive implementation is much easier to understand. So my conclusion is use recursion if the problem by nature is recursive and requires manipulating items in a stack.
I'm going to answer your question by designing a Haskell data structure by "induction", which is a sort of "dual" to recursion. And then I will show how this duality leads to nice things.
We introduce a type for a simple tree:
data Tree a = Branch (Tree a) (Tree a)
            | Leaf a
            deriving (Eq)
We can read this definition as saying "A tree is a Branch (which contains two trees) or is a leaf (which contains a data value)". So the leaf is a sort of minimal case. If a tree isn't a leaf, then it must be a compound tree containing two trees. These are the only cases.
Let's make a tree:
example :: Tree Int
example = Branch (Leaf 1)
                 (Branch (Leaf 2)
                         (Leaf 3))
Now, let's suppose we want to add 1 to each value in the tree. We can do this by calling:
addOne :: Tree Int -> Tree Int
addOne (Branch a b) = Branch (addOne a) (addOne b)
addOne (Leaf a) = Leaf (a + 1)
First, notice that this is in fact a recursive definition. It takes the data constructors Branch and Leaf as cases (and since Leaf is minimal and these are the only possible cases), we are sure that the function will terminate.
What would it take to write addOne in an iterative style? What will looping into an arbitrary number of branches look like?
Also, this kind of recursion can often be factored out, in terms of a "functor". We can make Trees into Functors by defining:
instance Functor Tree where
    fmap f (Leaf a)     = Leaf (f a)
    fmap f (Branch a b) = Branch (fmap f a) (fmap f b)
and defining:
addOne' = fmap (+1)
We can factor out other recursion schemes, such as the catamorphism (or fold) for an algebraic data type. Using a catamorphism, we can write:
addOne'' = cata go
  where go (Leaf a)     = Leaf (a + 1)
        go (Branch a b) = Branch a b

How does recursion make the use of run-time memory unpredictable?

Quoting from Code Complete 2,
int Factorial( int number ) {
    if ( number == 1 ) {
        return 1;
    }
    else {
        return number * Factorial( number - 1 );
    }
}
In addition to being slow [1] and making the use of run-time memory unpredictable [2], the recursive version of this routine is harder to understand than the iterative version, which follows:
int Factorial( int number ) {
    int intermediateResult = 1;
    for ( int factor = 2; factor <= number; factor++ ) {
        intermediateResult = intermediateResult * factor;
    }
    return intermediateResult;
}
I think the slow part is because of the unnecessary function call overheads.
But how does recursion make the use of run-time memory unpredictable?
Can't we always predict how much memory would be needed (as we know when the recursion is supposed to end)? I think it would be as unpredictable as the iterative case, but not any more.
Because recursive methods call themselves repeatedly, they need lots of stack memory. Since the stack is limited, errors will occur if the stack memory is exceeded.
Can't we always predict how much memory would be needed (as we know when the recursion is supposed to end)? I think it would be as unpredictable as the iterative case, but not any more.
No, not in the general case. See discussion about the halting problem for more background. Now, here's a recursive version of one of the problems posted there:
void u(int x) {
    if (x != 1) {
        u((x % 2 == 0) ? x/2 : 3*x+1);
    }
}
It's even tail-recursive. Since you can't predict if this will even terminate normally, how can you predict how much memory is needed?
If the recursion level becomes too deep, you'll blow the call stack and eat up lots of memory in the process. This can happen if your number is a "large enough" value. Can you do worse than this? Yes, if your function allocates more objects with every recursion call.

Implementation techniques for FSM states

How do you go about implementing FSM (edit: finite state machine) states?
I usually think about an FSM as a set of functions, a dispatcher, and a thread to indicate the current running state. Meaning, I do blocking calls to functions/functors representing the states.
Just now I have implemented one in a different style, where I still represent states with function(object)s, but the thread just calls a state->step() method, which tries to return as quickly as possible. In case the state has finished and a transition should take place, it indicates that accordingly.
I would call this the 'polling' style, since the functions mostly look like:
void step()
{
    if(!HaveReachedGoal)
    {
        doWhateverNecessary();
        return; // get out as fast as possible
    }

    // ... test perhaps some more subgoals
    indicateTransition();
}
I am aware that it is an FSM within an FSM.
It feels rather simplistic, but it has certain advantages. While a thread being blocked, or held in some kind of while (!CanGoForward) checkGoForward(); loop, can be cumbersome and unwieldy, the polling felt much easier to debug. That's because the FSM object regains control after every step, and putting out some debug info is a breeze.
Well I am deviating from my question:
How do you implement states of FSMs?
The State design pattern is an interesting way of implementing an FSM:
http://en.wikipedia.org/wiki/State_pattern
It's a very clean way of implementing the FSM but it can be messy depending on the complexity of your FSM (but not the amount of states). However, the advantages are that:
you eliminate code duplication (especially if/else statements)
It is easier to extend with new states
Your classes have better cohesion so all related logic is in one place - this should also make your code easier to write tests for.
There is a Java and C++ implementation at this site:
http://www.vincehuston.org/dp/state.html
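A bare-bones sketch of how the pattern can look in C++ (my own illustration, not code from the linked pages; Idle and Running are made-up states):
#include <memory>

struct Context;   // forward declaration

// Each state is an object; the transition logic lives inside the states.
struct State {
    virtual ~State() = default;
    // Returns the next state, or nullptr to stay in the current one.
    virtual std::unique_ptr<State> step(Context& ctx) = 0;
};

struct Running : State {
    std::unique_ptr<State> step(Context& ctx) override { return nullptr; }
};

struct Idle : State {
    std::unique_ptr<State> step(Context& ctx) override {
        return std::make_unique<Running>();   // transition Idle -> Running
    }
};

struct Context {
    std::unique_ptr<State> current = std::make_unique<Idle>();
    void step() {
        if (auto next = current->step(*this))
            current = std::move(next);
    }
};
Adding a new state then means adding a new class, without touching a central switch.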
There’s always what I call the Flying Spaghetti Monster’s style of implementing FSMs (FSM-style FSMs): using lotsa gotos. For example:
state1:
    do_something();
    goto state2;

state2:
    if (condition) goto state1;
    else goto state3;

state3:
    accept;
Very nice spaghetti code :-)
I did it as a table, a flat array in memory where each cell is a state. Please have a look at the CVS source of the abandoned DFA project. For example:
class DFA {
    DFA();
    DFA(int mychar_groups,int mycharmap[256],int myi_state);
    ~DFA();
    void add_trans(unsigned int from,char sym,unsigned int to);
    void add_trans(unsigned int from,unsigned int symn,unsigned int to);
    /*adds a transition between state from to state to*/
    int add_state(bool accepting=false);
    int to(int state, int symn);
    int to(int state, char sym);
    void set_char(char used_chars[],int);
    void set_char(set<char> char_set);
    vector<int > table; /*contains the table of the dfa itself*/
    void normalize();
    vector<unsigned int> char_map;
    unsigned int char_groups; /*number of characters the DFA uses,
                                char_groups=0 means 1 character group is used*/
    unsigned int i_state; /*initial state of the DFA*/
    void switch_table_state(int first,int sec);
    unsigned int num_states;
    set<int > accepting_states;
};
But this was for a very specific need (matching regular expressions)
I remember my first FSM program. I wrote it in C with a very simple switch statement. Switching from one state to another or following through to the next state seemed natural.
Then I progressed to a table-lookup approach. I was able to write some very generic code using this approach. However, I was caught out a couple of times when the requirements changed and I had to support some extra events.
I have not written any FSMs lately. The last one I wrote was for a comms module in C++ where I used a "state design pattern" in conjunction with a "command pattern" (action).
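For reference, the plain switch-statement style mentioned above can stay very small (my own C++ sketch; the event and state names are made up):
enum class Event { ConnectRequest, DataReceived, Timeout };
enum class FsmState { Idle, Connected };

// One call per event: current state plus event in, next state out.
FsmState handle(FsmState state, Event ev)
{
    switch (state) {
    case FsmState::Idle:
        return (ev == Event::ConnectRequest) ? FsmState::Connected : FsmState::Idle;
    case FsmState::Connected:
        return (ev == Event::Timeout) ? FsmState::Idle : FsmState::Connected;
    }
    return state;   // unreachable; silences "control reaches end" warnings
}
The table-lookup variant replaces the switch with an array indexed by (state, event), which is essentially what the DFA class in another answer does.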
If you are creating a complex state machine then you may want to check out SMC - the State Machine Compiler. This takes a textual representation of a state machine and compiles it into the language of your choice - it supports Java, C, C++, C#, Python, Ruby, Scala and many others.