What is Map/Reduce? - language-agnostic

I hear a lot about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it?

From the abstract of Google's MapReduce research publication page:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
The advantage of MapReduce is that the processing can be performed in parallel on multiple processing nodes (multiple servers), so it is a system that can scale very well.
Since it's based on the functional programming model, the map and reduce steps each have no side effects (the state and results of each subsection of a map process do not depend on the others), so the data set being mapped and reduced can be split across multiple processing nodes.
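To make those two steps concrete, here is a minimal word-count sketch, written in TypeScript purely as an illustration; the function names and shapes are my own, not Google's API:
// map: emit an intermediate (word, 1) pair for every word in a document
function mapPhase(doc: string): Array<[string, number]> {
    return doc.split(/\s+/)
        .filter(word => word.length > 0)
        .map(word => [word, 1] as [string, number]);
}

// shuffle: group intermediate values by their key
function shufflePhase(pairs: Array<[string, number]>): Map<string, number[]> {
    const groups = new Map<string, number[]>();
    for (const [key, value] of pairs) {
        const bucket = groups.get(key) ?? [];
        bucket.push(value);
        groups.set(key, bucket);
    }
    return groups;
}

// reduce: merge all values associated with the same key
function reducePhase(groups: Map<string, number[]>): Map<string, number> {
    const counts = new Map<string, number>();
    for (const [key, values] of groups) {
        counts.set(key, values.reduce((a, b) => a + b, 0));
    }
    return counts;
}

// reducePhase(shufflePhase(mapPhase("the cat sat on the mat")))
// => Map { "the" => 2, "cat" => 1, "sat" => 1, "on" => 1, "mat" => 1 }
Because mapPhase and reducePhase are side-effect free, each document (and each key group) could be handled on a different node and the results merged.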
Joel's Can Your Programming Language Do This? piece discusses how understanding functional programming was essential at Google in coming up with MapReduce, which powers its search engine. It's a very good read if you're unfamiliar with functional programming and how it enables scalable code.
See also: Wikipedia: MapReduce
Related question: Please explain mapreduce simply

Map is a function that applies another function to all the items on a list, to produce another list with all the return values on it. (Another way of saying "apply f to x" is "call f, passing it x". So sometimes it sounds nicer to say "apply" instead of "call".)
This is how map is probably written in C# (it's called Select and is in the standard library):
public static IEnumerable<R> Select<T, R>(this IEnumerable<T> list, Func<T, R> func)
{
    foreach (T item in list)
        yield return func(item);
}
As you're a Java dude, and Joel Spolsky likes to tell GROSSLY UNFAIR LIES about how crappy Java is (actually, he's not lying, it is crappy, but I'm trying to win you over), here's my very rough attempt at a Java version (I have no Java compiler, and I vaguely remember Java version 1.1!):
// represents a function that takes one argument and returns a result
public interface IFunctor
{
    Object invoke(Object arg);
}

public static Object[] map(Object[] list, IFunctor func)
{
    Object[] returnValues = new Object[list.length];
    for (int n = 0; n < list.length; n++)
        returnValues[n] = func.invoke(list[n]);
    return returnValues;
}
I'm sure this can be improved in a million ways. But it's the basic idea.
Reduce is a function that turns all the items on a list into a single value. To do this, it needs to be given another function, func, that turns two items into a single value. It works by giving the first two items to func, then the result of that along with the third item, then the result of that with the fourth item, and so on until all the items have been consumed and we're left with one value.
In C# reduce is called Aggregate and is again in the standard library. I'll skip straight to a Java version:
// represents a function that takes two arguments and returns a result
public interface IBinaryFunctor
{
    Object invoke(Object arg1, Object arg2);
}

public static Object reduce(Object[] list, IBinaryFunctor func)
{
    if (list.length == 0)
        return null; // or throw something?

    if (list.length == 1)
        return list[0]; // just return the only item

    Object returnValue = func.invoke(list[0], list[1]);
    for (int n = 2; n < list.length; n++)
        returnValue = func.invoke(returnValue, list[n]);
    return returnValue;
}
These Java versions need generics added to them, but I don't know how to do that in Java. But you should be able to pass them anonymous inner classes to provide the functors:
String[] names = getLotsOfNames();
String commaSeparatedNames = (String) reduce(names,
    new IBinaryFunctor() {
        public Object invoke(Object arg1, Object arg2)
        { return ((String) arg1) + ", " + ((String) arg2); }
    });
Hopefully generics would get rid of the casts. The typesafe equivalent in C# is:
string commaSeparatedNames = names.Aggregate((a, b) => a + ", " + b);
Why is this "cool"? Simple ways of breaking up larger calculations into smaller pieces, so they can be put back together in different ways, are always cool. The way Google applies this idea is to parallelization, because both map and reduce can be shared out over several computers.
But the key requirement is NOT that your language can treat functions as values. Any OO language can do that. The actual requirement for parallelization is that the little func functions you pass to map and reduce must not use or update any state. They must return a value that is dependent only on the argument(s) passed to them. Otherwise, the results will be completely screwed up when you try to run the whole thing in parallel.
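To illustrate that last point, here is a small sketch in TypeScript (the names are hypothetical) contrasting a pure func with a stateful one:
// Pure: the result depends only on the argument, so the input can be
// split into chunks, mapped on different machines, and merged safely.
const double = (x: number): number => x * 2;

// Impure: reads and writes shared state, so the result depends on the
// order of execution; running this map in parallel would corrupt it.
let runningTotal = 0;
const addToTotal = (x: number): number => {
    runningTotal += x;
    return runningTotal;
};

console.log([1, 2, 3].map(double));     // always [2, 4, 6]
console.log([1, 2, 3].map(addToTotal)); // [1, 3, 6] sequentially, unpredictable in parallel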

After getting frustrated with either very long, waffly blog posts or very short, vague ones, I eventually discovered this very good, rigorous, concise paper.
Then I went ahead and made it more concise by translating it into Scala, where I've provided the simplest case, in which the user specifies only the map and reduce parts of the application. In Hadoop/Spark, strictly speaking, a more complex model of programming is employed that requires the user to explicitly specify 4 more functions, outlined here: http://en.wikipedia.org/wiki/MapReduce#Dataflow
import scalaz.syntax.id._

trait MapReduceModel {
  type MultiSet[T] = Iterable[T]

  // `map` must be a pure function
  def mapPhase[K1, K2, V1, V2](map: ((K1, V1)) => MultiSet[(K2, V2)])
                              (data: MultiSet[(K1, V1)]): MultiSet[(K2, V2)] =
    data.flatMap(map)

  def shufflePhase[K2, V2](mappedData: MultiSet[(K2, V2)]): Map[K2, MultiSet[V2]] =
    mappedData.groupBy(_._1).mapValues(_.map(_._2))

  // `reduce` must be a monoid
  def reducePhase[K2, V2, V3](reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)])
                             (shuffledData: Map[K2, MultiSet[V2]]): MultiSet[V3] =
    shuffledData.flatMap(reduce).map(_._2)

  def mapReduce[K1, K2, V1, V2, V3](data: MultiSet[(K1, V1)])
                                   (map: ((K1, V1)) => MultiSet[(K2, V2)])
                                   (reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)]): MultiSet[V3] =
    mapPhase(map)(data) |> shufflePhase |> reducePhase(reduce)
}
// This is kind of how MapReduce works in Hadoop and Spark, except `.par` would ensure each element gets a process/thread on a cluster.
// Furthermore, the splitting here doesn't enforce any kind of balance and is quite unnecessary anyway, as one would expect
// the data to already be split on HDFS - i.e. the filename would constitute K1.
// The shuffle phase will also be parallelized, and use the same partition as the map phase.
abstract class ParMapReduce(mapParNum: Int, reduceParNum: Int) extends MapReduceModel {
  def split[T](splitNum: Int)(data: MultiSet[T]): Set[MultiSet[T]]

  override def mapPhase[K1, K2, V1, V2](map: ((K1, V1)) => MultiSet[(K2, V2)])
                                       (data: MultiSet[(K1, V1)]): MultiSet[(K2, V2)] = {
    val groupedByKey = data.groupBy(_._1).map(_._2)
    groupedByKey.flatMap(split(mapParNum / groupedByKey.size + 1))
      .par.flatMap(_.map(map)).flatten.toList
  }

  override def reducePhase[K2, V2, V3](reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)])
                                      (shuffledData: Map[K2, MultiSet[V2]]): MultiSet[V3] =
    shuffledData.map(g => split(reduceParNum / shuffledData.size + 1)(g._2).map((g._1, _)))
      .par.flatMap(_.map(reduce))
      .flatten.map(_._2).toList
}

Map is a native JS method that can be applied to an array. It creates a new array by applying a function to every element in the original array. So if you mapped function(element) { return element * 2; }, it would return a new array with every element doubled. The original array goes unmodified.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map
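For example (shown in TypeScript for the type annotations; plain JavaScript behaves identically):
const original: number[] = [1, 2, 3];
// map returns a new array; the original is untouched
const doubled = original.map(element => element * 2);
console.log(doubled);  // [2, 4, 6]
console.log(original); // [1, 2, 3] - unmodified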
Reduce is a native JS method that can also be applied to an array. It loops through each element in the array, applying a function against an accumulator (which begins as an initial value you supply) and the current element, reducing the array to a single value. It is useful because you can have any output type you want; you just have to start with that type of accumulator. So if I wanted to reduce something into an object, I would start with an accumulator of {}.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/reduce
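For example (again in TypeScript; the Record type annotation just documents the object accumulator):
// Reducing numbers to a single sum; the accumulator starts at 0
const sum = [1, 2, 3].reduce((acc, x) => acc + x, 0); // 6

// Reducing strings into an object of counts; the accumulator starts as {}
const counts = ["a", "b", "a"].reduce((acc: Record<string, number>, key) => {
    acc[key] = (acc[key] ?? 0) + 1;
    return acc;
}, {});
console.log(counts); // { a: 2, b: 1 }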

Related

Returning by reference from struct method in D

I'm starting my journey in D from C++. In C++ passing by reference or value is quite explicit, but in D it seems to vary between structs and classes.
My question is how can I force a return by reference?
I have a simple XmlNode class for building Xml trees (which is a lift from my C++ code):
import std.stdio;

struct XmlNode
{
    string _name;
    string _data;
    XmlNode[] _children;

    this(string name, string data = "")
    {
        _name = name;
        _data = data;
    }

    // Trying to return a reference to the added node
    ref XmlNode addChild(string name, string data = "")
    {
        _children ~= XmlNode(name, data);
        return _children[$-1];
    }

    string toString(bool bPlain = true, string indent = "")
    {
        // Omitted for brevity
    }
}
And here is the testing code:
int main()
{
    auto root = XmlNode("root");

    // Chained call
    root.addChild("Level 1").addChild("Level 2", "42");

    // Call in two parts
    auto n = root.addChild("Level 1");
    n.addChild("Level 2", "101"); // n seems to be a copy, not a reference

    // Chained call
    root.addChild("Level 1").addChild("Level 2", "999");

    writeln(root.toString(false));
    return 0;
}
which gives the following output:
root
  Level 1
    Level 2
      42
  Level 1
  Level 1
    Level 2
      999
As you can see the 'chained' use of addChild() performs as hoped. But if I try to break it up into two separate calls, only the first has an effect, and the second seems to operate on a copy of the first, not a reference. I optimistically added a ref qualifier to the addChild() signature, but that doesn't seem to help.
As ever, I'd be grateful for any advice (using DMD / Visual D / Visual Studio / Windows 10).
auto n = root.addChild("Level 1");
Here, though addChild returns a reference, it is assigned to a variable, and thus dereferenced and copied. Instead, you probably want:
auto n = &root.addChild("Level 1");
Note that D does not have reference variables like C++. Variables can only be pointers (though it's possible to write a wrapper template with reference-like semantics).
Also note that in the current design of XmlNode, the returned reference will only be valid until the next time _children is modified (as that may cause a reallocation, moving the contents to another address and making any extant references outdated). It is a common footgun, which could be avoided by storing references to XmlNode instances in _children (or making XmlNode a reference type, i.e. a class), at the cost of extra dereferences and allocations.

Scala: val foo = (arg: Type) => {...} vs. def(arg:Type) = {...}

Related to this thread
I am still unclear on the distinction between these 2 definitions:
val foo = (arg: Type) => {...}
def(arg:Type) = {...}
As I understand it:
1) the val version is bound once, at compile time:
   - a single Function1 instance is created
   - it can be passed as a method parameter
2) the def version is bound anew on each call:
   - a new method instance is created per call
If the above is true, then why would one ever choose the def version in cases where the operation(s) to perform are not dependent on runtime state?
For example, in a servlet environment you might want to get the IP address of the connecting client; in this case you need to use a def since, of course, there is no connected client at compile time.
On the other hand you often know, at compile time, the operations to perform, and can go with immutable val foo = (i: Type) => {...}
As a rule of thumb then, should one only use defs when there is a runtime state dependency?
Thanks for clarifying
I'm not entirely clear on what you mean by runtime state dependency. Both vals and defs can close over their lexical scope and are hence unlimited in this way. So what are the differences between methods (defs) and functions (as vals) in Scala (which has been asked and answered before)?
You can parameterize a def
For example:
object List {
  def empty[A]: List[A] = Nil // a type parameter is allowed here
  val Empty: List[Nothing] = Nil // cannot take a type parameter
}
I can then call:
List.empty[Int]
But I would have to use:
List.Empty: List[Int]
But of course there are other reasons as well. Such as:
A def is a method at the JVM level
If I were to use the piece of code:
trades filter isEuropean
I could choose a declaration of isEuropean as either:
val isEuropean = (_ : Trade).country.region == Europe
Or:
def isEuropean(t: Trade) = t.country.region == Europe
The latter avoids creating an object (for the function instance) at the point of declaration, but not at the point of use: Scala creates a function instance for the method declaration at the point of use. This would be clearer if I had used the _ syntax, i.e. trades filter (isEuropean _).
However, in the following piece of code:
val b = isEuropean(t)
...if isEuropean is declared a def, no such object is being created and hence the code may be more performant (if used in very tight loops where every last nanosecond is of critical value)

What are some uses of closures for OOP?

PHP and .Net have closures; I have been wondering what are some examples of using closures in OOP and design patterns, and what advantages they have over pure OOP programming.
As a clarification, this is not OOP vs. functional programming, but how to best use closures in an OOP design. How do closures fit in, say, factories or the observer pattern? What are some tricks you can pull which clarify the design and result in looser coupling, for example?
Closures are useful for event-handling. This example is a bit contrived, but I think it conveys the idea:
class FileOpener
{
    public FileOpener(OpenFileTrigger trigger)
    {
        trigger.FileOpenTriggered += (sender, args) => { this.Open(args.PathToFile); };
    }

    public void Open(string pathToFile)
    {
        //…
    }
}
My file opener can either open a file by directly calling instance.Open(pathToFile), or it can be triggered by some event. If I didn't have anonymous functions + closures, I'd have to write a method whose only purpose was to respond to this event.
Any language that has closures can use them for trampolining, which is a technique for refactoring recursion into iteration. This can get you out of "stack overflow" problems that naive implementations of many algorithms run into.
A trampoline is a function that "bounces" a closure back up to its caller. The closure captures "the rest of the work".
For example, in Python you can define a recursive accumulator to sum the values in an array:
testdata = range(0, 1000)

def accum(items):
    if len(items) == 0:
        return 0
    elif len(items) == 1:
        return items[0]
    else:
        return items[0] + accum(items[1:])

print "will blow up:", accum(testdata)
On my machine, this craps out with a stack overflow when the length of items exceeds 998.
The same function can be done in a trampoline style using closures:
def accum2(items):
    bounced = trampoline(items, 0)
    while callable(bounced):
        bounced = bounced()
    return bounced

def trampoline(items, initval):
    if len(items) == 0:
        return initval
    else:
        return lambda: trampoline(items[1:], initval + items[0])
By converting recursion to iteration, you don't blow out the stack. The closure has the property of capturing the state of the computation in itself rather than on the stack as you do with recursion.
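For comparison, the same trampolining shape can be sketched in TypeScript (the type and function names here are my own, not a library API):
type Thunk<T> = T | (() => Thunk<T>);

// Instead of recursing, return a closure that captures the rest of the work
function accum(items: number[], acc: number = 0): Thunk<number> {
    return items.length === 0 ? acc : () => accum(items.slice(1), acc + items[0]);
}

// The trampoline "bounces" closures until a plain value comes back
function run(bounced: Thunk<number>): number {
    while (typeof bounced === "function") {
        bounced = bounced();
    }
    return bounced;
}

console.log(run(accum(Array.from({ length: 100000 }, (_, i) => i)))); // no stack overflow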
Suppose you want to provide a class with the ability to create any number of FileOpener instances, but following IoC principles, you don't want the class creating FileOpeners to actually know how to do so (in other words, you don't want to new them). Instead, you want to use dependency injection. However, you only want this class to be able to generate FileOpener instances, and not just any instance. Here's what you can do:
class AppSetup
{
    private IContainer BuildDiContainer()
    {
        // assume this builds a dependency injection container and registers the types you want to create
    }

    public void setup()
    {
        IContainer container = BuildDiContainer();
        // create a function that uses the dependency injection container to create a `FileOpener` instance
        Func<FileOpener> getFileOpener = () => { return container.Resolve<FileOpener>(); };
        DependsOnFileOpener dofo = new DependsOnFileOpener(getFileOpener);
    }
}
Now you have your class that needs to be able to make FileOpener instances. You can use dependency injection to provide it with this capability, while retaining loose coupling.
class DependsOnFileOpener
{
    public DependsOnFileOpener(Func<FileOpener> getFileOpener)
    {
        // this class can create FileOpener instances any time it wants, without knowing where they come from
        FileOpener f = getFileOpener();
    }
}

Creating a "true" HashMap implementation with Object Equality in ActionScript 3

I've been spending some of my spare time working on a set of collections for ActionScript 3, but I've hit a pretty serious roadblock thanks to the way ActionScript 3 handles equality checks inside Dictionary objects.
When you compare a key in a dictionary, ActionScript uses the === operator to perform the comparison. This has a bit of a nasty side effect whereby only references to the same instance will resolve as true, not objects that are merely equal. Here's what I mean:
const jonny1 : Person = new Person("jonny", 26);
const jonny2 : Person = new Person("jonny", 26);
const table : Dictionary = new Dictionary();
table[jonny1] = "That's me";
trace(table[jonny1]) // traces: "That's me"
trace(table[jonny2]) // traces: undefined.
The way I am attempting to combat this is to provide an Equalizer interface which looks like this:
public interface Equalizer
{
    function equals(object : Object) : Boolean;
}
This allows me to perform an instanceof-esque check whenever I need to perform an equality operation inside my collections (falling back on the === operator when the object doesn't implement Equalizer); however, this doesn't get around the fact that my underlying data structure (the Dictionary object) has no knowledge of this.
The way I am currently working around the issue is by iterating through all the keys in the dictionary and performing the equality check whenever I perform a containsKey() or get() operation - however, this pretty much defeats the entire point of a hashmap (cheap lookup operations).
If I am unable to continue using a Dictionary instance as the backing for my map, how would I go about creating the hashes for unique object instances passed in as keys, so I can still maintain equality?
How about you compute a hash code for your objects when you insert them, and then look them up by the hash code in your backing dictionary? The hash codes will compare with === just fine. Of course, that would require you to have a Hashable interface for your object types instead of your Equalizer interface, so it isn't much less work than what you are already doing, but you do get the cheap lookups.
How about doing this instead:
public interface Hashable {
    function hash():String;
}
Personally, I ask myself why you want to do this... hashing objects to obtain keys makes little sense if they are mutable...
Also, you might consider using a different approach, for example this factory:
package {
    public class Person {
        /**
         * don't use this!
         * @private
         */
        public function Person(name:String, age:int) {
            if (!instantiationAllowed)
                throw new Error("use Person.getPerson instead of constructor");
            //...
        }

        private static var instantiationAllowed:Boolean = false;
        private static var map:Object = {};

        private static function create(name:String, age:int):Person {
            instantiationAllowed = true;
            var ret:Person = new Person(name, age);
            instantiationAllowed = false;
            return ret;
        }

        public static function getPerson(name:String, age:int):Person {
            var ageMap:Array = map[name];
            if (ageMap == null) {
                map[name] = ageMap = [];
                return ageMap[age] = Person.create(name, age);
            }
            if (ageMap.hasOwnProperty(age))
                return ageMap[age];
            return ageMap[age] = Person.create(name, age);
        }
    }
}
It ensures there's only one Person with a given name and age (if that makes any sense)...
Old thread I know, but still worth posting.
const jonny1 : Person = new Person("jonny", 26); const jonny2 : Person = new Person("jonny", 26);
is creating two completely different objects that will not compare using ==, guess I don't see why it's any more of a road block because of as3
The problem with AS3/JavaScript/EcmaScript is not that they create two different, equivalent objects.
The problem is that they cannot equate those two equivalent objects; only identity works, since there are no equals or hashCode methods that can be overridden with class-specific comparison logic.
For Map implementations such as dynamic Object or Dictionary, this means that you have to either use Strings or references as keys: you cannot recover objects from a map using different but equivalent objects.
To work around that problem, people either resort to strict toString implementations (for Object maps), which is undesirable, or to instance control for Dictionaries, as in @back2dos's example, which introduces different problems. (Also note that @back2dos's solution does not really guarantee unique Person instances, since there is a time window during which asynchronous threads would be allowed to instantiate new Persons.)
@A.Levy's solution is good, except that in general hash codes are not strictly required to issue unique values (they are meant to map entries to buckets allowing for fast lookups, wherein fine-grained differentiation is done through an equals method).
You need both a hashCode and an equals method, e.g.
public interface IEquable
{
    function equals(object : Object) : Boolean;
    function hash() : String;
}
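To sketch how a map can use both methods (TypeScript here only because it types compactly; the AS3 version would have the same shape, and all names are invented): hash() selects a bucket cheaply, and equals() resolves collisions inside that one bucket.
interface Equable {
    hash(): string;
    equals(other: Equable): boolean;
}

class EqualityMap<K extends Equable, V> {
    // buckets are keyed by hash(); equals() disambiguates keys sharing a hash
    private buckets = new Map<string, Array<[K, V]>>();

    set(key: K, value: V): void {
        const bucket = this.buckets.get(key.hash()) ?? [];
        const entry = bucket.find(([k]) => k.equals(key));
        if (entry) {
            entry[1] = value; // replace the value for an equal key
        } else {
            bucket.push([key, value]);
        }
        this.buckets.set(key.hash(), bucket);
    }

    get(key: K): V | undefined {
        // cheap lookup by hash, then equals() scans only one small bucket
        const bucket = this.buckets.get(key.hash());
        return bucket?.find(([k]) => k.equals(key))?.[1];
    }
}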

How do you return non-copyable types?

I am trying to understand how you return non-primitives (i.e. types that do not implement Copy). If you return something like an i32, the function creates a new value in memory with a copy of the return value, so it can be used outside the scope of the function. But if you return a type that doesn't implement Copy, it does not do this, and you get ownership errors.
I have tried using Box to create values on the heap so that the caller can take ownership of the return value, but this doesn't seem to work either.
Perhaps I am approaching this in the wrong manner, using the same coding style that I use in C# and other languages, where functions return values, rather than passing in an object reference as a parameter and mutating it so that ownership is easy to indicate in Rust.
The following code example fails to compile. I believe the issue is only within the iterator closure, but I have included the entire function just in case I am not seeing something.
pub fn get_files(path: &Path) -> Vec<&Path> {
    let contents = fs::walk_dir(path);
    match contents {
        Ok(c) => c.filter_map(|i| {
            match i {
                Ok(d) => {
                    let val = d.path();
                    let p = val.as_path();
                    Some(p)
                },
                Err(_) => None,
            }
        }).collect(),
        Err(e) => panic!("An error occurred getting files from {:?}: {}", path, e),
    }
}
The compiler gives the following error (I have removed all the line numbers and extraneous text):
error: `val` does not live long enough
let p = val.as_path();
^~~
in expansion of closure expansion
expansion site
reference must be valid for the anonymous lifetime #1 defined on the block...
...but borrowed value is only valid for the block suffix following statement
let val = d.path();
let p = val.as_path();
Some(p)
},
You return a value by... well returning it. However, your signature shows that you are trying to return a reference to a value. You can't do that when the object will be dropped at the end of the block because the reference would become invalid.
In your case, I'd probably write something like
#![feature(fs_walk)]

use std::fs;
use std::path::{Path, PathBuf};

fn get_files(path: &Path) -> Vec<PathBuf> {
    let contents = fs::walk_dir(path).unwrap();
    contents.filter_map(|i| {
        i.ok().map(|p| p.path())
    }).collect()
}

fn main() {
    for f in get_files(Path::new("/etc")) {
        println!("{:?}", f);
    }
}
The main thing is that the function returns a Vec<PathBuf>, a collection of a type that owns the paths, rather than just references into someone else's memory.
In your code, you do let p = val.as_path(). Here, val is a PathBuf. Then you call as_path, which is defined as fn as_path(&self) -> &Path. This means that given a reference to a PathBuf, you can get a reference to a Path that will live as long as the PathBuf does. However, you are trying to keep that reference around longer than val will exist, as val is dropped at the end of each closure call.
How do you return non-copyable types?
By value.
fn make() -> String { "Hello, World!".into() }
There is a disconnect between:
- the language semantics
- the implementation details
Semantically, returning by value is moving the object, not copying it. In Rust, any object is movable and, optionally, may also be Clonable (implement Clone) and Copyable (implement Clone and Copy).
That the implementation of copying or moving uses a memcpy under the hood is a detail that does not affect the semantics, only performance. Furthermore, this being an implementation detail means that it can be optimized away without affecting the semantics, which the optimizer will try very hard to do.
As for your particular code, you have a lifetime issue. You cannot return a reference to a value if said reference may outlive the value (for then, what would it reference?).
The simple fix is to return the value itself: Vec<PathBuf>. As mentioned, it will move the paths, not copy them.