What's So Good About Recursion? [duplicate] - language-agnostic

Is there a performance hit if we use a loop instead of recursion or vice versa in algorithms where both can serve the same purpose? Eg: Check if the given string is a palindrome.
I have seen many programmers using recursion as a means to show off when a simple iteration algorithm can fit the bill.
Does the compiler play a vital role in deciding what to use?

Loops may achieve a performance gain for your program. Recursion may achieve a performance gain for your programmer. Choose which is more important in your situation!

It is possible that recursion will be more expensive, depending on if the recursive function is tail recursive (the last line is recursive call). Tail recursion should be recognized by the compiler and optimized to its iterative counterpart (while maintaining the concise, clear implementation you have in your code).
I would write the algorithm in the way that makes the most sense and is the clearest for the poor sucker (be it yourself or someone else) that has to maintain the code in a few months or years. If you run into performance issues, then profile your code, and then and only then look into optimizing by moving over to an iterative implementation. You may want to look into memoization and dynamic programming.

Comparing recursion to iteration is like comparing a phillips head screwdriver to a flat head screwdriver. For the most part you could remove any phillips head screw with a flat head, but it would just be easier if you used the screwdriver designed for that screw right?
Some algorithms just lend themselves to recursion because of the way they are designed (Fibonacci sequences, traversing a tree like structure, etc.). Recursion makes the algorithm more succinct and easier to understand (therefore shareable and reusable).
Also, some recursive algorithms use "Lazy Evaluation" which makes them more efficient than their iterative brothers. This means that they only do the expensive calculations at the time they are needed rather than each time the loop runs.
That should be enough to get you started. I'll dig up some articles and examples for you too.
Link 1: Haskel vs PHP (Recursion vs Iteration)
Here is an example where the programmer had to process a large data set using PHP. He shows how easy it would have been to deal with in Haskel using recursion, but since PHP had no easy way to accomplish the same method, he was forced to use iteration to get the result.
http://blog.webspecies.co.uk/2011-05-31/lazy-evaluation-with-php.html
Link 2: Mastering Recursion
Most of recursion's bad reputation comes from the high costs and inefficiency in imperative languages. The author of this article talks about how to optimize recursive algorithms to make them faster and more efficient. He also goes over how to convert a traditional loop into a recursive function and the benefits of using tail-end recursion. His closing words really summed up some of my key points I think:
"recursive programming gives the programmer a better way of organizing
code in a way that is both maintainable and logically consistent."
https://developer.ibm.com/articles/l-recurs/
Link 3: Is recursion ever faster than looping? (Answer)
Here is a link to an answer for a stackoverflow question that is similar to yours. The author points out that a lot of the benchmarks associated with either recursing or looping are very language specific. Imperative languages are typically faster using a loop and slower with recursion and vice-versa for functional languages. I guess the main point to take from this link is that it is very difficult to answer the question in a language agnostic / situation blind sense.
Is recursion ever faster than looping?

Recursion is more costly in memory, as each recursive call generally requires a memory address to be pushed to the stack - so that later the program could return to that point.
Still, there are many cases in which recursion is a lot more natural and readable than loops - like when working with trees. In these cases I would recommend sticking to recursion.

Typically, one would expect the performance penalty to lie in the other direction. Recursive calls can lead to the construction of extra stack frames; the penalty for this varies. Also, in some languages like Python (more correctly, in some implementations of some languages...), you can run into stack limits rather easily for tasks you might specify recursively, such as finding the maximum value in a tree data structure. In these cases, you really want to stick with loops.
Writing good recursive functions can reduce the performance penalty somewhat, assuming you have a compiler that optimizes tail recursions, etc. (Also double check to make sure that the function really is tail recursive---it's one of those things that many people make mistakes on.)
Apart from "edge" cases (high performance computing, very large recursion depth, etc.), it's preferable to adopt the approach that most clearly expresses your intent, is well-designed, and is maintainable. Optimize only after identifying a need.

Recursion is better than iteration for problems that can be broken down into multiple, smaller pieces.
For example, to make a recursive Fibonnaci algorithm, you break down fib(n) into fib(n-1) and fib(n-2) and compute both parts. Iteration only allows you to repeat a single function over and over again.
However, Fibonacci is actually a broken example and I think iteration is actually more efficient. Notice that fib(n) = fib(n-1) + fib(n-2) and fib(n-1) = fib(n-2) + fib(n-3). fib(n-1) gets calculated twice!
A better example is a recursive algorithm for a tree. The problem of analyzing the parent node can be broken down into multiple smaller problems of analyzing each child node. Unlike the Fibonacci example, the smaller problems are independent of each other.
So yeah - recursion is better than iteration for problems that can be broken down into multiple, smaller, independent, similar problems.

Your performance deteriorates when using recursion because calling a method, in any language, implies a lot of preparation: the calling code posts a return address, call parameters, some other context information such as processor registers might be saved somewhere, and at return time the called method posts a return value which is then retrieved by the caller, and any context information that was previously saved will be restored. the performance diff between an iterative and a recursive approach lies in the time these operations take.
From an implementation point of view, you really start noticing the difference when the time it takes to handle the calling context is comparable to the time it takes for your method to execute. If your recursive method takes longer to execute then the calling context management part, go the recursive way as the code is generally more readable and easy to understand and you won't notice the performance loss. Otherwise go iterative for efficiency reasons.

I believe tail recursion in java is not currently optimized. The details are sprinkled throughout this discussion on LtU and the associated links. It may be a feature in the upcoming version 7, but apparently it presents certain difficulties when combined with Stack Inspection since certain frames would be missing. Stack Inspection has been used to implement their fine-grained security model since Java 2.
http://lambda-the-ultimate.org/node/1333

There are many cases where it gives a much more elegant solution over the iterative method, the common example being traversal of a binary tree, so it isn't necessarily more difficult to maintain. In general, iterative versions are usually a bit faster (and during optimization may well replace a recursive version), but recursive versions are simpler to comprehend and implement correctly.

Recursion is very useful is some situations. For example consider the code for finding the factorial
int factorial ( int input )
{
int x, fact = 1;
for ( x = input; x > 1; x--)
fact *= x;
return fact;
}
Now consider it by using the recursive function
int factorial ( int input )
{
if (input == 0)
{
return 1;
}
return input * factorial(input - 1);
}
By observing these two, we can see that recursion is easy to understand.
But if it is not used with care it can be so much error prone too.
Suppose if we miss if (input == 0), then the code will be executed for some time and ends with usually a stack overflow.

In many cases recursion is faster because of caching, which improves performance. For example, here is an iterative version of merge sort using the traditional merge routine. It will run slower than the recursive implementation because of caching improved performances.
Iterative implementation
public static void sort(Comparable[] a)
{
int N = a.length;
aux = new Comparable[N];
for (int sz = 1; sz < N; sz = sz+sz)
for (int lo = 0; lo < N-sz; lo += sz+sz)
merge(a, lo, lo+sz-1, Math.min(lo+sz+sz-1, N-1));
}
Recursive implementation
private static void sort(Comparable[] a, Comparable[] aux, int lo, int hi)
{
if (hi <= lo) return;
int mid = lo + (hi - lo) / 2;
sort(a, aux, lo, mid);
sort(a, aux, mid+1, hi);
merge(a, aux, lo, mid, hi);
}
PS - this is what was told by Professor Kevin Wayne (Princeton University) on the course on algorithms presented on Coursera.

Using recursion, you're incurring the cost of a function call with each "iteration", whereas with a loop, the only thing you usually pay is an increment/decrement. So, if the code for the loop isn't much more complicated than the code for the recursive solution, loop will usually be superior to recursion.

Recursion and iteration depends on the business logic that you want to implement, though in most of the cases it can be used interchangeably. Most developers go for recursion because it is easier to understand.

It depends on the language. In Java you should use loops. Functional languages optimize recursion.

Recursion has a disadvantage that the algorithm that you write using recursion has O(n) space complexity.
While iterative aproach have a space complexity of O(1).This is the advantange of using iteration over recursion.
Then why do we use recursion?
See below.
Sometimes it is easier to write an algorithm using recursion while it's slightly tougher to write the same algorithm using iteration.In this case if you opt to follow the iteration approach you would have to handle stack yourself.

If you're just iterating over a list, then sure, iterate away.
A couple of other answers have mentioned (depth-first) tree traversal. It really is such a great example, because it's a very common thing to do to a very common data structure. Recursion is extremely intuitive for this problem.
Check out the "find" methods here:
http://penguin.ewu.edu/cscd300/Topic/BSTintro/index.html

Recursion is more simple (and thus - more fundamental) than any possible definition of an iteration. You can define a Turing-complete system with only a pair of combinators (yes, even a recursion itself is a derivative notion in such a system). Lambda calculus is an equally powerful fundamental system, featuring recursive functions. But if you want to define an iteration properly, you'd need much more primitives to start with.
As for the code - no, recursive code is in fact much easier to understand and to maintain than a purely iterative one, since most data structures are recursive. Of course, in order to get it right one would need a language with a support for high order functions and closures, at least - to get all the standard combinators and iterators in a neat way. In C++, of course, complicated recursive solutions can look a bit ugly, unless you're a hardcore user of FC++ and alike.

I would think in (non tail) recursion there would be a performance hit for allocating a new stack etc every time the function is called (dependent on language of course).

it depends on "recursion depth".
it depends on how much the function call overhead will influence the total execution time.
For example, calculating the classical factorial in a recursive way is very inefficient due to:
- risk of data overflowing
- risk of stack overflowing
- function call overhead occupy 80% of execution time
while developing a min-max algorithm for position analysis in the game of chess that will analyze subsequent N moves can be implemented in recursion over the "analysis depth" (as I'm doing ^_^)

Recursion? Where do I start, wiki will tell you “it’s the process of repeating items in a self-similar way"
Back in day when I was doing C, C++ recursion was a god send, stuff like "Tail recursion". You'll also find many sorting algorithms use recursion. Quick sort example: http://alienryderflex.com/quicksort/
Recursion is like any other algorithm useful for a specific problem. Perhaps you mightn't find a use straight away or often but there will be problem you’ll be glad it’s available.

In C++ if the recursive function is a templated one, then the compiler has more chance to optimize it, as all the type deduction and function instantiations will occur in compile time. Modern compilers can also inline the function if possible. So if one uses optimization flags like -O3 or -O2 in g++, then recursions may have the chance to be faster than iterations. In iterative codes, the compiler gets less chance to optimize it, as it is already in the more or less optimal state (if written well enough).
In my case, I was trying to implement matrix exponentiation by squaring using Armadillo matrix objects, in both recursive and iterative way. The algorithm can be found here... https://en.wikipedia.org/wiki/Exponentiation_by_squaring.
My functions were templated and I have calculated 1,000,000 12x12 matrices raised to the power 10. I got the following result:
iterative + optimisation flag -O3 -> 2.79.. sec
recursive + optimisation flag -O3 -> 1.32.. sec
iterative + No-optimisation flag -> 2.83.. sec
recursive + No-optimisation flag -> 4.15.. sec
These results have been obtained using gcc-4.8 with c++11 flag (-std=c++11) and Armadillo 6.1 with Intel mkl. Intel compiler also shows similar results.

Mike is correct. Tail recursion is not optimized out by the Java compiler or the JVM. You will always get a stack overflow with something like this:
int count(int i) {
return i >= 100000000 ? i : count(i+1);
}

You have to keep in mind that utilizing too deep recursion you will run into Stack Overflow, depending on allowed stack size. To prevent this make sure to provide some base case which ends you recursion.

Using just Chrome 45.0.2454.85 m, recursion seems to be a nice amount faster.
Here is the code:
(function recursionVsForLoop(global) {
"use strict";
// Perf test
function perfTest() {}
perfTest.prototype.do = function(ns, fn) {
console.time(ns);
fn();
console.timeEnd(ns);
};
// Recursion method
(function recur() {
var count = 0;
global.recurFn = function recurFn(fn, cycles) {
fn();
count = count + 1;
if (count !== cycles) recurFn(fn, cycles);
};
})();
// Looped method
function loopFn(fn, cycles) {
for (var i = 0; i < cycles; i++) {
fn();
}
}
// Tests
var curTest = new perfTest(),
testsToRun = 100;
curTest.do('recursion', function() {
recurFn(function() {
console.log('a recur run.');
}, testsToRun);
});
curTest.do('loop', function() {
loopFn(function() {
console.log('a loop run.');
}, testsToRun);
});
})(window);
RESULTS
// 100 runs using standard for loop
100x for loop run.
Time to complete: 7.683ms
// 100 runs using functional recursive approach w/ tail recursion
100x recursion run.
Time to complete: 4.841ms
In the screenshot below, recursion wins again by a bigger margin when run at 300 cycles per test

If the iterations are atomic and orders of magnitude more expensive than pushing a new stack frame and creating a new thread and you have multiple cores and your runtime environment can use all of them, then a recursive approach could yield a huge performance boost when combined with multithreading. If the average number of iterations is not predictable then it might be a good idea to use a thread pool which will control thread allocation and prevent your process from creating too many threads and hogging the system.
For example, in some languages, there are recursive multithreaded merge sort implementations.
But again, multithreading can be used with looping rather than recursion, so how well this combination will work depends on more factors including the OS and its thread allocation mechanism.

I found another differences between those approaches.
It looks simple and unimportant, but it has a very important role while you prepare for interviews and this subject arises, so look closely.
In short:
1) iterative post-order traversal is not easy - that makes DFT more complex
2) cycles check easier with recursion
Details:
In the recursive case, it is easy to create pre and post traversals:
Imagine a pretty standard question: "print all tasks that should be executed to execute the task 5, when tasks depend on other tasks"
Example:
//key-task, value-list of tasks the key task depends on
//"adjacency map":
Map<Integer, List<Integer>> tasksMap = new HashMap<>();
tasksMap.put(0, new ArrayList<>());
tasksMap.put(1, new ArrayList<>());
List<Integer> t2 = new ArrayList<>();
t2.add(0);
t2.add(1);
tasksMap.put(2, t2);
List<Integer> t3 = new ArrayList<>();
t3.add(2);
t3.add(10);
tasksMap.put(3, t3);
List<Integer> t4 = new ArrayList<>();
t4.add(3);
tasksMap.put(4, t4);
List<Integer> t5 = new ArrayList<>();
t5.add(3);
tasksMap.put(5, t5);
tasksMap.put(6, new ArrayList<>());
tasksMap.put(7, new ArrayList<>());
List<Integer> t8 = new ArrayList<>();
t8.add(5);
tasksMap.put(8, t8);
List<Integer> t9 = new ArrayList<>();
t9.add(4);
tasksMap.put(9, t9);
tasksMap.put(10, new ArrayList<>());
//task to analyze:
int task = 5;
List<Integer> res11 = getTasksInOrderDftReqPostOrder(tasksMap, task);
System.out.println(res11);**//note, no reverse required**
List<Integer> res12 = getTasksInOrderDftReqPreOrder(tasksMap, task);
Collections.reverse(res12);//note reverse!
System.out.println(res12);
private static List<Integer> getTasksInOrderDftReqPreOrder(Map<Integer, List<Integer>> tasksMap, int task) {
List<Integer> result = new ArrayList<>();
Set<Integer> visited = new HashSet<>();
reqPreOrder(tasksMap,task,result, visited);
return result;
}
private static void reqPreOrder(Map<Integer, List<Integer>> tasksMap, int task, List<Integer> result, Set<Integer> visited) {
if(!visited.contains(task)) {
visited.add(task);
result.add(task);//pre order!
List<Integer> children = tasksMap.get(task);
if (children != null && children.size() > 0) {
for (Integer child : children) {
reqPreOrder(tasksMap,child,result, visited);
}
}
}
}
private static List<Integer> getTasksInOrderDftReqPostOrder(Map<Integer, List<Integer>> tasksMap, int task) {
List<Integer> result = new ArrayList<>();
Set<Integer> visited = new HashSet<>();
reqPostOrder(tasksMap,task,result, visited);
return result;
}
private static void reqPostOrder(Map<Integer, List<Integer>> tasksMap, int task, List<Integer> result, Set<Integer> visited) {
if(!visited.contains(task)) {
visited.add(task);
List<Integer> children = tasksMap.get(task);
if (children != null && children.size() > 0) {
for (Integer child : children) {
reqPostOrder(tasksMap,child,result, visited);
}
}
result.add(task);//post order!
}
}
Note that the recursive post-order-traversal does not require a subsequent reversal of the result. Children printed first and your task in the question printed last. Everything is fine. You can do a recursive pre-order-traversal (also shown above) and that one will require a reversal of the result list.
Not that simple with iterative approach! In iterative (one stack) approach you can only do a pre-ordering-traversal, so you obliged to reverse the result array at the end:
List<Integer> res1 = getTasksInOrderDftStack(tasksMap, task);
Collections.reverse(res1);//note reverse!
System.out.println(res1);
private static List<Integer> getTasksInOrderDftStack(Map<Integer, List<Integer>> tasksMap, int task) {
List<Integer> result = new ArrayList<>();
Set<Integer> visited = new HashSet<>();
Stack<Integer> st = new Stack<>();
st.add(task);
visited.add(task);
while(!st.isEmpty()){
Integer node = st.pop();
List<Integer> children = tasksMap.get(node);
result.add(node);
if(children!=null && children.size() > 0){
for(Integer child:children){
if(!visited.contains(child)){
st.add(child);
visited.add(child);
}
}
}
//If you put it here - it does not matter - it is anyway a pre-order
//result.add(node);
}
return result;
}
Looks simple, no?
But it is a trap in some interviews.
It means the following: with the recursive approach, you can implement Depth First Traversal and then select what order you need pre or post(simply by changing the location of the "print", in our case of the "adding to the result list"). With the iterative (one stack) approach you can easily do only pre-order traversal and so in the situation when children need be printed first(pretty much all situations when you need start print from the bottom nodes, going upwards) - you are in the trouble. If you have that trouble you can reverse later, but it will be an addition to your algorithm. And if an interviewer is looking at his watch it may be a problem for you. There are complex ways to do an iterative post-order traversal, they exist, but they are not simple. Example:https://www.geeksforgeeks.org/iterative-postorder-traversal-using-stack/
Thus, the bottom line: I would use recursion during interviews, it is simpler to manage and to explain. You have an easy way to go from pre to post-order traversal in any urgent case. With iterative you are not that flexible.
I would use recursion and then tell: "Ok, but iterative can provide me more direct control on used memory, I can easily measure the stack size and disallow some dangerous overflow.."
Another plus of recursion - it is simpler to avoid / notice cycles in a graph.
Example (preudocode):
dft(n){
mark(n)
for(child: n.children){
if(marked(child))
explode - cycle found!!!
dft(child)
}
unmark(n)
}

It may be fun to write it as recursion, or as a practice.
However, if the code is to be used in production, you need to consider the possibility of stack overflow.
Tail recursion optimization can eliminate stack overflow, but do you want to go through the trouble of making it so, and you need to know you can count on it having the optimization in your environment.
Every time the algorithm recurses, how much is the data size or n reduced by?
If you are reducing the size of data or n by half every time you recurse, then in general you don't need to worry about stack overflow. Say, if it needs to be 4,000 level deep or 10,000 level deep for the program to stack overflow, then your data size need to be roughly 24000 for your program to stack overflow. To put that into perspective, a biggest storage device recently can hold 261 bytes, and if you have 261 of such devices, you are only dealing with 2122 data size. If you are looking at all the atoms in the universe, it is estimated that it may be less than 284. If you need to deal with all the data in the universe and their states for every millisecond since the birth of the universe estimated to be 14 billion years ago, it may only be 2153. So if your program can handle 24000 units of data or n, you can handle all data in the universe and the program will not stack overflow. If you don't need to deal with numbers that are as big as 24000 (a 4000-bit integer), then in general you don't need to worry about stack overflow.
However, if you reduce the size of data or n by a constant amount every time you recurse, then you can run into stack overflow when n becomes merely 20000. That is, the program runs well when n is 1000, and you think the program is good, and then the program stack overflows when some time in the future, when n is 5000 or 20000.
So if you have a possibility of stack overflow, try to make it an iterative solution.

As far as I know, Perl does not optimize tail-recursive calls, but you can fake it.
sub f{
my($l,$r) = #_;
if( $l >= $r ){
return $l;
} else {
# return f( $l+1, $r );
#_ = ( $l+1, $r );
goto &f;
}
}
When first called it will allocate space on the stack. Then it will change its arguments, and restart the subroutine, without adding anything more to the stack. It will therefore pretend that it never called its self, changing it into an iterative process.
Note that there is no "my #_;" or "local #_;", if you did it would no longer work.

"Is there a performance hit if we use a loop instead of
recursion or vice versa in algorithms where both can serve the same purpose?"
Usually yes if you are writing in a imperative language iteration will run faster than recursion, the performance hit is minimized in problems where the iterative solution requires manipulating Stacks and popping items off of a stack due to the recursive nature of the problem. There are a lot of times where the recursive implementation is much easier to read because the code is much shorter,
so you do want to consider maintainability. Especailly in cases where the problem has a recursive nature. So take for example:
The recursive implementation of Tower of Hanoi:
def TowerOfHanoi(n , source, destination, auxiliary):
if n==1:
print ("Move disk 1 from source",source,"to destination",destination)
return
TowerOfHanoi(n-1, source, auxiliary, destination)
print ("Move disk",n,"from source",source,"to destination",destination)
TowerOfHanoi(n-1, auxiliary, destination, source)
Fairly short and pretty easy to read. Compare this with its Counterpart iterative TowerOfHanoi:
# Python3 program for iterative Tower of Hanoi
import sys
# A structure to represent a stack
class Stack:
# Constructor to set the data of
# the newly created tree node
def __init__(self, capacity):
self.capacity = capacity
self.top = -1
self.array = [0]*capacity
# function to create a stack of given capacity.
def createStack(capacity):
stack = Stack(capacity)
return stack
# Stack is full when top is equal to the last index
def isFull(stack):
return (stack.top == (stack.capacity - 1))
# Stack is empty when top is equal to -1
def isEmpty(stack):
return (stack.top == -1)
# Function to add an item to stack.
# It increases top by 1
def push(stack, item):
if(isFull(stack)):
return
stack.top+=1
stack.array[stack.top] = item
# Function to remove an item from stack.
# It decreases top by 1
def Pop(stack):
if(isEmpty(stack)):
return -sys.maxsize
Top = stack.top
stack.top-=1
return stack.array[Top]
# Function to implement legal
# movement between two poles
def moveDisksBetweenTwoPoles(src, dest, s, d):
pole1TopDisk = Pop(src)
pole2TopDisk = Pop(dest)
# When pole 1 is empty
if (pole1TopDisk == -sys.maxsize):
push(src, pole2TopDisk)
moveDisk(d, s, pole2TopDisk)
# When pole2 pole is empty
else if (pole2TopDisk == -sys.maxsize):
push(dest, pole1TopDisk)
moveDisk(s, d, pole1TopDisk)
# When top disk of pole1 > top disk of pole2
else if (pole1TopDisk > pole2TopDisk):
push(src, pole1TopDisk)
push(src, pole2TopDisk)
moveDisk(d, s, pole2TopDisk)
# When top disk of pole1 < top disk of pole2
else:
push(dest, pole2TopDisk)
push(dest, pole1TopDisk)
moveDisk(s, d, pole1TopDisk)
# Function to show the movement of disks
def moveDisk(fromPeg, toPeg, disk):
print("Move the disk", disk, "from '", fromPeg, "' to '", toPeg, "'")
# Function to implement TOH puzzle
def tohIterative(num_of_disks, src, aux, dest):
s, d, a = 'S', 'D', 'A'
# If number of disks is even, then interchange
# destination pole and auxiliary pole
if (num_of_disks % 2 == 0):
temp = d
d = a
a = temp
total_num_of_moves = int(pow(2, num_of_disks) - 1)
# Larger disks will be pushed first
for i in range(num_of_disks, 0, -1):
push(src, i)
for i in range(1, total_num_of_moves + 1):
if (i % 3 == 1):
moveDisksBetweenTwoPoles(src, dest, s, d)
else if (i % 3 == 2):
moveDisksBetweenTwoPoles(src, aux, s, a)
else if (i % 3 == 0):
moveDisksBetweenTwoPoles(aux, dest, a, d)
# Input: number of disks
num_of_disks = 3
# Create three stacks of size 'num_of_disks'
# to hold the disks
src = createStack(num_of_disks)
dest = createStack(num_of_disks)
aux = createStack(num_of_disks)
tohIterative(num_of_disks, src, aux, dest)
Now the first one is way easier to read because suprise suprise shorter code is usually easier to understand than code that is 10 times longer. Sometimes you want to ask yourself is the extra performance gain really worth it? The amount of hours wasted debugging the code. Is the iterative TowerOfHanoi faster than the Recursive TowerOfHanoi? Probably, but not by a big margin. Would I like to program Recursive problems like TowerOfHanoi using iteration? Hell no. Next we have another recursive function the Ackermann function:
Using recursion:
if m == 0:
# BASE CASE
return n + 1
elif m > 0 and n == 0:
# RECURSIVE CASE
return ackermann(m - 1, 1)
elif m > 0 and n > 0:
# RECURSIVE CASE
return ackermann(m - 1, ackermann(m, n - 1))
Using Iteration:
callStack = [{'m': 2, 'n': 3, 'indentation': 0, 'instrPtr': 'start'}]
returnValue = None
while len(callStack) != 0:
m = callStack[-1]['m']
n = callStack[-1]['n']
indentation = callStack[-1]['indentation']
instrPtr = callStack[-1]['instrPtr']
if instrPtr == 'start':
print('%sackermann(%s, %s)' % (' ' * indentation, m, n))
if m == 0:
# BASE CASE
returnValue = n + 1
callStack.pop()
continue
elif m > 0 and n == 0:
# RECURSIVE CASE
callStack[-1]['instrPtr'] = 'after first recursive case'
callStack.append({'m': m - 1, 'n': 1, 'indentation': indentation + 1, 'instrPtr': 'start'})
continue
elif m > 0 and n > 0:
# RECURSIVE CASE
callStack[-1]['instrPtr'] = 'after second recursive case, inner call'
callStack.append({'m': m, 'n': n - 1, 'indentation': indentation + 1, 'instrPtr': 'start'})
continue
elif instrPtr == 'after first recursive case':
returnValue = returnValue
callStack.pop()
continue
elif instrPtr == 'after second recursive case, inner call':
callStack[-1]['innerCallResult'] = returnValue
callStack[-1]['instrPtr'] = 'after second recursive case, outer call'
callStack.append({'m': m - 1, 'n': returnValue, 'indentation': indentation + 1, 'instrPtr': 'start'})
continue
elif instrPtr == 'after second recursive case, outer call':
returnValue = returnValue
callStack.pop()
continue
print(returnValue)
And once again I will argue that the recursive implementation is much easier to understand. So my conclusion is use recursion if the problem by nature is recursive and requires manipulating items in a stack.

I'm going to answer your question by designing a Haskell data structure by "induction", which is a sort of "dual" to recursion. And then I will show how this duality leads to nice things.
We introduce a type for a simple tree:
data Tree a = Branch (Tree a) (Tree a)
| Leaf a
deriving (Eq)
We can read this definition as saying "A tree is a Branch (which contains two trees) or is a leaf (which contains a data value)". So the leaf is a sort of minimal case. If a tree isn't a leaf, then it must be a compound tree containing two trees. These are the only cases.
Let's make a tree:
example :: Tree Int
example = Branch (Leaf 1)
(Branch (Leaf 2)
(Leaf 3))
Now, let's suppose we want to add 1 to each value in the tree. We can do this by calling:
addOne :: Tree Int -> Tree Int
addOne (Branch a b) = Branch (addOne a) (addOne b)
addOne (Leaf a) = Leaf (a + 1)
First, notice that this is in fact a recursive definition. It takes the data constructors Branch and Leaf as cases (and since Leaf is minimal and these are the only possible cases), we are sure that the function will terminate.
What would it take to write addOne in an iterative style? What will looping into an arbitrary number of branches look like?
Also, this kind of recursion can often be factored out, in terms of a "functor". We can make Trees into Functors by defining:
instance Functor Tree where fmap f (Leaf a) = Leaf (f a)
fmap f (Branch a b) = Branch (fmap f a) (fmap f b)
and defining:
addOne' = fmap (+1)
We can factor out other recursion schemes, such as the catamorphism (or fold) for an algebraic data type. Using a catamorphism, we can write:
addOne'' = cata go where
go (Leaf a) = Leaf (a + 1)
go (Branch a b) = Branch a b

Related

Numerical stability of ODE system

I have to perform a numerical solving of an ODE system which has the following form:
du_j/dt = f_1(u_j, v_j, t) + g_1(t)v_(j-1) + h_1(t)v_(j+1),
dv_j/dt = f_2(u_j, v_j, t) + g_2(t)u_(j-1) + h_2(t)u_(j+1),
where u_j(t) and v_j(t) are complex-valued scalar functions of time t, f_i and g_i are given functions, and j = -N,..N. This is an initial value problem and the task is to find the solution at a certain time T.
If g_i(t) = h_i(t) = 0, then the equations for different values of j can be solved independently. In this case I obtain a stable and accurate solutions with the aid of the fourth-order Runge-Kutta method. However, once I turn on the couplings, the results become very unstable with respect to the time grid step and explicit form of the functions g_i, h_i.
I guess it is reasonable to try to employ an implicit Runge-Kutta scheme, which might be stable in such a case, but if I do so, I will have to evaluate the inverse of a huge matrix of size 4*N*c, where c depends on the order of the method (e. g. c = 3 for the Gauss–Legendre method) at each step. Of course, the matrix will mostly contain zeros and have a block tridiagonal form but it still seems to be very time-consuming.
So I have two questions:
Is there a stable explicit method which works even when the coupling functions g_i and h_i are (very) large?
If an implicit method is, indeed, a good solution, what is the fastest method of the inversion of a block tridiagonal matrix? At the moment I just perform a simple Gauss method avoiding redundant operations which arise due to the specific structure of the matrix.
Additional info and details that might help us:
I use Fortran 95.
I currently consider g_1(t) = h_1(t) = g_2(t) = h_2(t) = -iAF(t)sin(omega*t), where i is the imaginary unit, A and omega are given constants, and F(t) is a smooth envelope going slowly, first, from 0 to 1 and then from 1 to 0, so F(0) = F(T) = 0.
Initially u_j = v_j = 0 unless j = 0. The functions u_j and v_j with great absolute values of j are extremely small for all t, so the initial peak does not reach the "boundaries".
To 1) There will be no stable explicit method if your functions are very large. This is due to the fact that the area of stability of explicit (Runge-Kutta) methods is compact.
To 2) If your matrices are larger then 100x100 you could use this method:
Inverses of Block Tridiagonal Matrices and Rounding Errors.

Is there a way to avoid creating an array in this Julia expression?

Is there a way to avoid creating an array in this Julia expression:
max((filter(n -> string(n) == reverse(string(n)), [x*y for x = 1:N, y = 1:N])))
and make it behave similar to this Python generator expression:
max(x*y for x in range(N+1) for y in range(x, N+1) if str(x*y) == str(x*y)[::-1])
Julia version is 2.3 times slower then Python due to array allocation and N*N iterations vs. Python's N*N/2.
EDIT
After playing a bit with a few implementations in Julia, the fastest loop style version I've got is:
function f(N) # 320ms for N=1000 Julia 0.2.0 i686-w64-mingw32
nMax = NaN
for x = 1:N, y = x:N
n = x*y
s = string(n)
s == reverse(s) || continue
nMax < n && (nMax = n)
end
nMax
end
but an improved functional version isn't far behind (only 14% slower or significantly faster, if you consider 2x larger domain):
function e(N) # 366ms for N=1000 Julia 0.2.0 i686-w64-mingw32
isPalindrome(n) = string(n) == reverse(string(n))
max(filter(isPalindrome, [x*y for x = 1:N, y = 1:N]))
end
There is 2.6x unexpected performance improvement by defining isPalindrome function, compared to original version on the top of this page.
We have talked about allowing the syntax
max(f(x) for x in itr)
as a shorthand for producing each of the values f(x) in one coroutine while computing the max in another coroutine. This would basically be shorthand for something like this:
max(#task for x in itr; produce(f(x)); end)
Note, however, that this syntax that explicitly creates a task already works, although it is somewhat less pretty than the above. Your problem can be expressed like this:
max(#task for x=1:N, y=x:N
string(x*y) == reverse(string(x*y)) && produce(x*y)
end)
With the hypothetical producer syntax above, it could be reduced to something like this:
max(x*y if string(x*y) == reverse(string(x*y) for x=1:N, y=x:N)
While I'm a fan of functional style, in this case I would probably just use a for loop:
m = 0
for x = 1:N, y = x:N
n = x*y
string(n) == reverse(string(n)) || continue
m < n && (m = n)
end
Personally, I don't find this version much harder to read and it will certainly be quite fast in Julia. In general, while functional style can be convenient and pretty, if your primary focus is on performance, then explicit for loops are your friend. Nevertheless, we should make sure that John's max/filter/product version works. The for loop version also makes other optimizations easier to add, like Harlan's suggestion of reversing the loop ordering and exiting on the first palindrome you find. There are also faster ways to check if a number is a palindrome in a given base than actually creating and comparing strings.
As to the general question of "getting flexible generators and list comprehensions in Julia", the language already has
A general high-performance iteration protocol based on the start/done/next functions.
Far more powerful multidimensional array comprehensions than most languages. At this point, the only missing feature is the if guard, which is complicated by the interaction with multidimensional comprehensions and the need to potentially dynamically grow the resulting array.
Coroutines (aka tasks) which allow, among other patterns, the producer-consumer pattern.
Python has the if guard but doesn't worry about comprehension performance nearly as much – if we're going to add that feature to Julia's comprehensions, we're going to do it in a way that's both fast and interacts well with multidimensional arrays, hence the delay.
Update: The max function is now called maximum (maximum is to max as sum is to +) and the generator syntax and/or filters work on master, so for example, you can do this:
julia> #time maximum(100x - x^2 for x = 1:100 if x % 3 == 0)
0.059185 seconds (31.16 k allocations: 1.307 MB)
2499
Once 0.5 is out, I'll update this answer more thoroughly.
There are two questions being mixed together here: (1) can you filter a list comprehension mid-comprehension (for which the answer is currently no) and (2) can you use a generator that doesn't allocate an array (for which the answer is partially yes). Generators are provided by the Iterators package, but the Iterators package seems to not play well with filter at the moment. In principle, the code below should work:
max((x, y) -> x * y,
filter((x, y) -> string(x * y) == reverse(string(x * y)),
product(1:N, 1:N)))
I don't think so. There aren't currently filters in Julia array comprehensions. See discussion in this issue.
In this particular case, I'd suggest just nested for loops if you want to get faster computation.
(There might be faster approaches where you start with N and count backwards, stopping as soon as you find something that succeeds. Figuring out how to do that correctly is left as an exercise, etc...)
As mentioned, this is now possible (using Julia 0.5.0)
isPalindrome(n::String) = n == reverse(n)
fun(N::Int) = maximum(x*y for x in 1:N for y in x:N if isPalindrome(string(x*y)))
I'm sure there are better ways that others can comment on. Time (after warm-up):
julia> #time fun(1000);
0.082785 seconds (2.03 M allocations: 108.109 MB, 27.35% gc time)

How can a compiler apply function elimination to impure functions?

Often times when writing code, I find myself using a value from a particular function call multiple times. I realized that an obvious optimization would be to capture these repeatedly used values in variables.
This (pseudo code):
function add1(foo){ foo + 1; }
...
do_something(foo(1));
do_something_else(foo(1));
Becomes:
function add1(foo){ foo + 1; }
...
bar = foo(1);
do_something(bar);
do_something_else(bar);
However, doing this explicitly makes code less readable in my experience. I assumed that compilers could not do this kind of optimization if our language of choice allows functions to have side-effects.
Recently I looked into this, and if I understand correctly, this optimization is/can be done for languages where functions must be pure. That does not surprise me, but supposedly this can also be done for impure functions. With a few quick Google searches I found these snippets:
GCC 4.7 Fortran improvement
When performing front-end-optimization, the -faggressive-function-elimination option allows the removal of duplicate function calls even for impure functions.
Compiler Optimization (Wikipedia)
For example, in some languages functions are not permitted to have side effects. Therefore, if a program makes several calls to the same function with the same arguments, the compiler can immediately infer that the function's result need be computed only once. In languages where functions are allowed to have side effects, another strategy is possible. The optimizer can determine which function has no side effects, and restrict such optimizations to side effect free functions. This optimization is only possible when the optimizer has access to the called function.
From my understanding, this means that an optimizer can determine when a function is or is not pure, and perform this optimization when the function is. I say this because if a function always produces the same output when given the same input, and is side effect free, it would fulfill both conditions to be considered pure.
These two snippets raise two questions for me.
How can a compiler be able to safely make this optimization if a function is not pure? (as in -faggressive-function-elimination)
How can a compiler determine whether a function is pure or not? (as in the strategy suggested in the Wikipedia article)
and finally:
Can this kind of optimization be applied to any language, or only when certain conditions are met?
Is this optimization a worthwhile one even for extremely simple functions?
How much overhead does storing and retrieving a value from the stack incur?
I apologize if these are stupid or illogical questions. They are just some things I have been curious about lately. :)
Disclaimer: I'm not a compiler/optimizer guy, I only have a tendency to peek at the generated code, and like to read about that stuff - so that's not autorative. A quick search didn't turn up much on -faggressive-function-elimination, so it might do some extra magic not explained here.
An optimizer can
attempt to inline the function call (with link time code generation, this works even across compilation units)
perform common subexpression elimination, and, possibly, side effect reordering.
Modifying your example a bit, and doing it in C++:
extern volatile int RW_A = 0; // see note below
int foo(int a) { return a * a; }
void bar(int x) { RW_A = x; }
int _tmain(int argc, _TCHAR* argv[])
{
bar(foo(2));
bar(foo(2));
}
Resolves to (pseudocode)
<register> = 4;
RW_A = register;
RW_A = register;
(Note: reading from and writing to a volatile variable is an "observable side effect", that the optimizer must preserve in the same order given by the code.)
Modifying the example for foo to have a side effect:
extern volatile int RW_A = 0;
extern volatile int RW_B = 0;
int accu = 1;
int foo(int a) { accu *= 2; return a * a; }
void bar(int x) { RW_A = x; }
int _tmain(int argc, _TCHAR* argv[])
{
bar(foo(2));
bar(foo(2));
RW_B = accu;
return 0;
}
generates the following pseudocode:
registerA = accu;
registerA += registerA;
accu = registerA;
registerA += registerA;
registerC = 4;
accu = registerA;
RW_A = registerC;
RW_A = registerC;
RW_B = registerA;
We observe that common subexpression elimination is still done, and separated from the side effects. Inlining and reordering allows to separate the side effects from the "pure" part.
Note that the compiler reads and eagerly writes back to accu, which wouldn't be necessary. I'm not sure on the rationale here.
To conclude:
A compiler does not need to test for purity. It can identify side effects that need to be preserved, and then transform the rest to its liking.
Such optimizations are worthwhile, even for trivial functions, because, among others,
CPU and memory are shared resources, what you use you might take away from someone else
Battery life
Minor optimizations may be gateways to larger ones
The overhead for a stack memory access is usually ~1 cycle, since the top of stack is usually in Level 1 cache already. Note that the usually should be in bold: it can be "even better", since the read / write may be optimized away, or it can be worse since the increased pressure on L1 cache flushes some other important data back to L2.
Where's the limit?
Theoretically, compile time. In practice, predictability and correctness of the optimizer are additional tradeoffs.
All tests with VC2008, default optimization settings for "Release" build.

Cumulative sum in two dimensions on array in nested loop -- CUDA implementation?

I have been thinking of how to perform this operation on CUDA using reductions, but I'm a bit at a loss as to how to accomplish it. The C code is below. The important part to keep in mind -- the variable precalculatedValue depends on both loop iterators. Also, the variable ngo is not unique to every value of m... e.g. m = 0,1,2 might have ngo = 1, whereas m = 4,5,6,7,8 could have ngo = 2, etc. I have included sizes of loop iterators in case it helps to provide better implementation suggestions.
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
int Nobs = 60480;
int NgS = 1859;
int NgO = 900;
// ngo goes from [1,900]
// rInd is an initialized (and filled earlier) as:
// rInd = new long int [Nobs];
for (m=0; m<Nobs; m++) {
ngo=rInd[m]-1;
for (n=0; n<NgS; n++) {
Aggregation[idx(n,ngo,NgO)] += precalculatedValue;
}
}
In a previous case, when precalculatedValue was only a function of the inner loop variable, I saved the values in unique array indices and added them with a parallel reduction (Thrust) after the fact. However, this case has me stumped: the values of m are not uniquely mapped to the values of ngo. Thus, I don't see a way of making this code efficient (or even workable) to use a reduction on. Any ideas are most welcome.
Just a stab...
I suspect that transposing your loops might help.
for (n=0; n<NgS; n++) {
for (m=0; m<Nobs; m++) {
ngo=rInd[m]-1;
Aggregation[idx(n,ngo,NgO)] += precalculatedValue(m,n);
}
}
The reason I did this is because idx varies more rapidly with ngo (function of m) than with n, so making m the inner loop improves coherence. Note I also made precalculatedValue a function of (m, n) because you said that it is -- this makes the pseudocode clearer.
Then, you could start by leaving the outer loop on the host, and making a kernel for the inner loop (64,480-way parallelism is enough to fill most current GPUs).
In the inner loop, just start by using an atomicAdd() to handle collisions. If they are infrequent, on Fermi GPUs performance shouldn't be too bad. In any case, you will be bandwidth bound since arithmetic intensity of this computation is low. So once this is working, measure the bandwidth you are achieving, and compare to the peak for your GPU. If you are way off, then think about optimizing further (perhaps by parallelizing the outer loop -- one iteration per thread block, and do the inner loop using some shared memory and thread cooperation optimizations).
The key: start simple, measure performance, and then decide how to optimize.
Note that this calculation looks very similar to a histogram calculation, which has similar challenges, so you might want to google for GPU histograms to see how they have been implemented.
One idea is to sort (rInd[m], m) pairs using thrust::sort_by_key() and then (since the rInd duplicates will be grouped together), you can iterate over them and do your reductions without collisions. (This is one way to do histograms.) You could even do this with thrust::reduce_by_key().

How does recursion make the use of run-time memory unpredictable?

Quoting from Code Complete 2,
int Factorial( int number ) {
if ( number == 1 ) {
return 1;
}
else {
return number * Factorial( number - 1 );
}
}
In addition to being slow [1] and making
the use of run-time memory
unpredictable [2], the recursive version
of this routine is harder to
understand than the iterative version,
which follows:
int Factorial( int number ) {
int intermediateResult = 1;
for ( int factor = 2; factor <= number; factor++ ) {
intermediateResult = intermediateResult * factor;
}
return intermediateResult;
}
I think the slow part is because of the unnecessary function call overheads.
But how does recursion make the use of run-time memory unpredictable?
Can't we always predict how much memory would be needed (as we know when the recursion is supposed to end)? I think it would be as unpredictable as the iterative case, but not any more.
Because of the fact recursive methods call them selves repeatedly, the need lots of stack memory. Since the stack is limited, errors will occur if the stack memoy is exceeded.
Can't we always predict how much memory would be needed (as we know when the recursion is supposed to end)? I think it would be as unpredictable as the iterative case, but not any more.
No, not in the general case. See discussion about the halting problem for more background. Now, here's a recursive version of one of the problems posted there:
void u(int x) {
if (x != 1) {
u((x % 2 == 0) ? x/2 : 3*x+1);
}
}
It's even tail-recursive. Since you can't predict if this will even terminate normally, how can you predict how much memory is needed?
If the recursion level becomes too deep, you'll blow the call stack and eat up lots of memory in the process. This can happen if your number is a "large enough" value. Can you do worse than this? Yes, if your function allocates more objects with every recursion call.