I want to create a Cython object that has convenient operations such as addition, multiplication, and comparisons. But when I compile such classes, they all seem to have a lot of Python overhead.
A simple example:
%%cython -a

cdef class Pair:
    cdef public:
        int a
        int b

    def __init__(self, int a, int b):
        self.a = a
        self.b = b

    def __add__(self, Pair other):
        return Pair(self.a + other.a, self.b + other.b)

p1 = Pair(1, 2)
p2 = Pair(3, 4)
p3 = p1 + p2
print(p3.a, p3.b)
But I end up with quite large readouts from the annotated compiler. It seems like the __add__ function is converting objects from Python floats to Cython doubles and doing a bunch of type checking. Am I doing something wrong?
There are likely a couple of issues:
I'm assuming that you're using Cython 0.29.x (and not the newer Cython 3 alpha). See https://cython.readthedocs.io/en/stable/src/userguide/special_methods.html#arithmetic-methods
This means that you can’t rely on the first parameter of these methods being “self” or being the right type, and you should test the types of both operands before deciding what to do
It is likely treating self as untyped and thus accessing a and b as Python attributes.
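For 0.29.x the usual pattern is to test both operands yourself. A minimal sketch, keeping the question's Pair (the isinstance checks are the point here):

def __add__(first, second):
    # In Cython 0.29.x this is called for both `a + b` and `b + a`,
    # so neither operand is guaranteed to be a Pair.
    cdef Pair a, b
    if isinstance(first, Pair) and isinstance(second, Pair):
        a = <Pair> first
        b = <Pair> second
        return Pair(a.a + b.a, a.b + b.b)
    return NotImplemented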
The Cython 3 alpha treats special methods differently (see https://cython.readthedocs.io/en/latest/src/userguide/special_methods.html#arithmetic-methods) so you could also consider upgrading to that.
Although the call to __init__ has C-typed arguments, it's still a Python call, so you can't avoid boxing and unboxing the arguments to Python ints. You could avoid this call and do something like:
cdef Pair res = Pair.__new__(Pair)  # note: __new__ takes the type as its argument
res.a = ...  # direct assignment to attribute
res.b = ...
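To keep the whole operation at the C level you could wrap that pattern in a factory method. A minimal sketch (the @staticmethod cdef factory named create is my addition, not part of the original answer):

# inside the cdef class Pair body
@staticmethod
cdef Pair create(int a, int b):
    # Bypasses __init__: allocate the object and assign the C-level
    # fields directly, with no boxing/unboxing of the arguments.
    cdef Pair res = Pair.__new__(Pair)
    res.a = a
    res.b = b
    return res

Calling Pair.create(...) from other cdef/cpdef code then avoids the Python call machinery entirely.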
Now I understand that Defining is to Types as Declaring is to Variables. But which one (Declare or Define) do functions/procedures/methods/subroutines fall under? Or do they have their own terminology?
In C and C++ you can declare a function (a function prototype) like this:
int function(int);
And then you can define it later, say, at the end of the file:
int function(int param) {
    printf("This is the param: %d", param);
    return 0;
}
So functions in C and C++ can be both declared and defined: the prototype declares the function, and the body later defines it. It depends on the language you're using too, but this is how I learned it.
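A minimal complete example of why the distinction matters: the declaration lets you call the function before its definition appears in the file.

#include <stdio.h>

int function(int);           /* declaration: name and type only */

int main(void) {
    return function(42);     /* fine here: the declaration suffices */
}

int function(int param) {    /* definition: supplies the body */
    printf("This is the param: %d\n", param);
    return 0;
}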
Given the following:
open System.Linq
let seqA = { 1..10 }
this works:
seqA.All (fun n -> n > 0)
However this doesn't:
let abc = fun n -> n > 0
seqA.All (abc)
Why does F# offer implicit conversion from lambda expressions to Funcs but not from functions? Pointers to the documentation where I can read up on what's going on here are welcome. :-)
This is covered in the (rather involved) section of the spec on Method Resolution and again in Type-directed Conversions at Member Invocations. Quoting from the latter:
As described in Method Application Resolution (see §14.4), two type-directed conversions are applied at method invocations.

The first type-directed conversion converts anonymous function expressions and other function-valued arguments to delegate types. Given:

A formal parameter of delegate type D
An actual argument farg of known type ty1 -> ... -> tyn -> rty
Precisely n arguments to the Invoke method of delegate type D

Then the parameter is interpreted as if it were written:

new D(fun arg1 ... argn -> farg arg1 ... argn)
It seems to suggest this conversion would be applied to any function value, but observation suggests it's applied only to anonymous functions.
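A common workaround (a sketch reusing the question's seqA and abc) is to eta-expand at the call site, or to construct the delegate explicitly:

let abc = fun n -> n > 0
seqA.All (fun n -> abc n)               // anonymous function, so the conversion applies
seqA.All (System.Func<int, bool>(abc))  // or build the Func yourself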
I am reading about boost::function and I am a bit confused about its use and its relation to other C++ constructs and terms I have found in the documentation.
In the context of C++ (C++11), what is the difference between an instance of boost::function, a function object, a functor, and a lambda expression? When should one use which construct? For example, when should I wrap a function object in a boost::function instead of using the object directly?
Are all the above C++ constructs different ways to implement what in functional languages is called a closure (a function, possibly containing captured variables, that can be passed around as a value and invoked by other functions)?
A function object and a functor are the same thing: an object that implements the function call operator, operator(). A lambda expression produces a function object. Objects whose type is some specialization of boost::function/std::function are also function objects.
Lambdas are special in that each lambda expression has an anonymous and unique type; they are a convenient way to create a functor inline.
boost::function/std::function is special in that it turns any callable entity into a functor with a type that depends only on the signature of the callable entity. For example, lambda expressions each have a unique type, so it's difficult to pass them around non-generic code. If you create an std::function from a lambda then you can easily pass around the wrapped lambda.
Both boost::function and the standard version std::function are wrappers provided by the library. They're potentially expensive and pretty heavy, and you should only use them if you actually need a collection of heterogeneous, callable entities. As long as you only need one callable entity at a time, you are much better off using auto or templates.
Here's an example:
std::vector<std::function<int(int, int)>> v;

v.push_back(some_free_function);                             // free function
v.push_back(std::bind(&Foo::mem_fun, &x, _1, _2));           // member function bound to an object
                                                             // (_1, _2 are std::placeholders)
v.push_back([&](int a, int b) -> int { return a + m[b]; });  // closure

int res = 0;
for (auto & f : v) { res += f(1, 2); }
Here's a counter-example:
template <typename F>
int apply(F && f)
{
    return std::forward<F>(f)(1, 2);
}
In this case, it would have been entirely gratuitous to declare apply like this:
int apply(std::function<int(int,int)>) // wasteful
The conversion is unnecessary, and the templated version can match the actual (often unknowable) type, for example of the bind expression or the lambda expression.
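For instance (a small sketch; std::plus here is just an arbitrary stand-in callable, and <functional> is needed for it):

int r = apply([](int a, int b) { return a + b; });  // F deduced as the lambda's unique type
int s = apply(std::plus<int>{});                    // any other callable works too, no conversion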
Function objects and functors are often described in terms of a concept. That means they describe a set of requirements on a type. A lot of things with respect to functors changed in C++11, and the new concept is called Callable. An object o of callable type is an object where (essentially) the expression o(ARGS) is valid. Examples of Callables:
int f() { return 23; }

struct FO {
    int operator()() const { return 23; }
};
Often some requirements on the return type of the Callable are added too. You use a Callable like this:
template<typename Callable>
int call(Callable c) {
    return c();
}

call(&f);
call(FO());
Constructs like the above require you to know the exact type at compile time. This is not always possible, and this is where std::function comes in.
std::function is such a Callable, but it allows you to erase the actual type you are calling (e.g. your function accepting a callable is not a template anymore). Still, calling a function requires you to know its arguments and return type, thus those have to be specified as template arguments to std::function.
You would use it like this:
int call(std::function<int()> c) {
    return c();
}

call(&f);
call(FO());
You need to remember that using std::function can have an impact on performance, and you should only use it when you are sure you need it. In almost all other cases a template solves your problem.
I hear a lot about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it?
From the abstract of Google's MapReduce research publication page:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
The advantage of MapReduce is that the processing can be performed in parallel on multiple processing nodes (multiple servers), so it is a system that can scale very well.
Since it's based on the functional programming model, the map and reduce steps each have no side effects (the state and results of each subsection of a map process do not depend on the others), so the data set being mapped and reduced can be split across multiple processing nodes.
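For intuition, a minimal sketch of the two user-supplied functions (a hypothetical word count in Python, not Google's actual API):

# map: one input record -> a list of intermediate (key, value) pairs
def map_fn(document):
    return [(word, 1) for word in document.split()]

# reduce: one key plus all its intermediate values -> the merged result
def reduce_fn(word, counts):
    return (word, sum(counts))

Because each call depends only on its own arguments, different documents and different keys can be processed on different nodes.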
Joel's Can Your Programming Language Do This? piece discusses how understanding functional programming was essential at Google in coming up with MapReduce, which powers its search engine. It's a very good read if you're unfamiliar with functional programming and how it enables scalable code.
See also: Wikipedia: MapReduce
Related question: Please explain mapreduce simply
Map is a function that applies another function to all the items on a list, to produce another list with all the return values on it. (Another way of saying "apply f to x" is "call f, passing it x". So sometimes it sounds nicer to say "apply" instead of "call".)
This is how map is probably written in C# (it's called Select and is in the standard library):
public static IEnumerable<R> Select<T, R>(this IEnumerable<T> list, Func<T, R> func)
{
    foreach (T item in list)
        yield return func(item);
}
As you're a Java dude, and Joel Spolsky likes to tell GROSSLY UNFAIR LIES about how crappy Java is (actually, he's not lying, it is crappy, but I'm trying to win you over), here's my very rough attempt at a Java version (I have no Java compiler, and I vaguely remember Java version 1.1!):
// represents a function that takes one arg and returns a result
public interface IFunctor
{
    Object invoke(Object arg);
}

public static Object[] map(Object[] list, IFunctor func)
{
    Object[] returnValues = new Object[list.length];
    for (int n = 0; n < list.length; n++)
        returnValues[n] = func.invoke(list[n]);
    return returnValues;
}
I'm sure this can be improved in a million ways. But it's the basic idea.
Reduce is a function that turns all the items on a list into a single value. To do this, it needs to be given another function func that turns two items into a single value. It works by giving the first two items to func, then the result of that along with the third item, then the result of that with the fourth item, and so on until all the items have gone and we're left with one value.
In C# reduce is called Aggregate and is again in the standard library. I'll skip straight to a Java version:
// represents a function that takes two args and returns a result
public interface IBinaryFunctor
{
    Object invoke(Object arg1, Object arg2);
}

public static Object reduce(Object[] list, IBinaryFunctor func)
{
    if (list.length == 0)
        return null; // or throw something?
    if (list.length == 1)
        return list[0]; // just return the only item
    Object returnValue = func.invoke(list[0], list[1]);
    for (int n = 2; n < list.length; n++)
        returnValue = func.invoke(returnValue, list[n]);
    return returnValue;
}
These Java versions need generics adding to them, but I don't know how to do that in Java. But you should be able to pass them anonymous inner classes to provide the functors:
String[] names = getLotsOfNames();
String commaSeparatedNames = (String) reduce(names,
    new IBinaryFunctor() {
        public Object invoke(Object arg1, Object arg2)
        { return ((String) arg1) + ", " + ((String) arg2); }
    });
Hopefully generics would get rid of the casts. The typesafe equivalent in C# is:
string commaSeparatedNames = names.Aggregate((a, b) => a + ", " + b);
Why is this "cool"? Simple ways of breaking up larger calculations into smaller pieces, so they can be put back together in different ways, are always cool. Google applies this idea to parallelization, because both map and reduce can be shared out over several computers.
But the key requirement is NOT that your language can treat functions as values. Any OO language can do that. The actual requirement for parallelization is that the little func functions you pass to map and reduce must not use or update any state. They must return a value that is dependent only on the argument(s) passed to them. Otherwise, the results will be completely screwed up when you try to run the whole thing in parallel.
After getting frustrated with blog posts that were either very long and waffly or very short and vague, I eventually discovered this very good, rigorous, concise paper.
Then I made it more concise by translating it into Scala, where I've provided the simplest case, in which the user just specifies the map and reduce parts of the application. Strictly speaking, Hadoop/Spark employ a more complex programming model that requires the user to explicitly specify four more functions, outlined here: http://en.wikipedia.org/wiki/MapReduce#Dataflow
import scalaz.syntax.id._

trait MapReduceModel {
  type MultiSet[T] = Iterable[T]

  // `map` must be a pure function
  def mapPhase[K1, K2, V1, V2](map: ((K1, V1)) => MultiSet[(K2, V2)])
                              (data: MultiSet[(K1, V1)]): MultiSet[(K2, V2)] =
    data.flatMap(map)

  def shufflePhase[K2, V2](mappedData: MultiSet[(K2, V2)]): Map[K2, MultiSet[V2]] =
    mappedData.groupBy(_._1).mapValues(_.map(_._2))

  // `reduce` must be a monoid
  def reducePhase[K2, V2, V3](reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)])
                             (shuffledData: Map[K2, MultiSet[V2]]): MultiSet[V3] =
    shuffledData.flatMap(reduce).map(_._2)

  def mapReduce[K1, K2, V1, V2, V3](data: MultiSet[(K1, V1)])
                                   (map: ((K1, V1)) => MultiSet[(K2, V2)])
                                   (reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)]): MultiSet[V3] =
    mapPhase(map)(data) |> shufflePhase |> reducePhase(reduce)
}
// Kinda how MapReduce works in Hadoop and Spark, except `.par` would ensure 1 element gets a process/thread on a cluster.
// Furthermore, the splitting here won't enforce any kind of balance and is quite unnecessary anyway, as one would expect
// the data to already be split on HDFS - i.e. the filename would constitute K1.
// The shuffle phase will also be parallelized, and use the same partition as the map phase.
abstract class ParMapReduce(mapParNum: Int, reduceParNum: Int) extends MapReduceModel {
  def split[T](splitNum: Int)(data: MultiSet[T]): Set[MultiSet[T]]

  override def mapPhase[K1, K2, V1, V2](map: ((K1, V1)) => MultiSet[(K2, V2)])
                                       (data: MultiSet[(K1, V1)]): MultiSet[(K2, V2)] = {
    val groupedByKey = data.groupBy(_._1).map(_._2)
    groupedByKey.flatMap(split(mapParNum / groupedByKey.size + 1))
      .par.flatMap(_.map(map)).flatten.toList
  }

  override def reducePhase[K2, V2, V3](reduce: ((K2, MultiSet[V2])) => MultiSet[(K2, V3)])
                                      (shuffledData: Map[K2, MultiSet[V2]]): MultiSet[V3] =
    shuffledData.map(g => split(reduceParNum / shuffledData.size + 1)(g._2).map((g._1, _)))
      .par.flatMap(_.map(reduce))
      .flatten.map(_._2).toList
}
Map is a native JS method that can be applied to an array. It creates a new array by applying some function to every element of the original array. So if you mapped function(element) { return element * 2; }, it would return a new array with every element doubled. The original array goes unmodified.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map
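For example (a minimal sketch):

const doubled = [1, 2, 3].map(function (element) { return element * 2; });
// doubled is [2, 4, 6]; the original array is untouched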
Reduce is a native JS method that can also be applied to an array. It loops through each element of the array, applying a function to a running value called the accumulator (which you seed with an initial value) and the current element, reducing the array to a single value. It is useful because you can produce any kind of output you want; you just have to start with an accumulator of that type. So if I wanted to reduce something into an object, I would start with an accumulator of {}.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/reduce
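For example, a small sketch of reducing into an object (counting occurrences, seeded with a {} accumulator):

const counts = ['a', 'b', 'a'].reduce(function (acc, item) {
    acc[item] = (acc[item] || 0) + 1;  // build up per-key counts
    return acc;
}, {});
// counts is { a: 2, b: 1 }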