How are classes implemented in dynamic languages? - language-agnostic

How are classes implemented in dynamic languages?
I know that JavaScript uses a prototype pattern (there is 'somewhere' a container of unbound JS functions, which are bound when they are called through an object), but I have no idea how it works in other languages.
I'm curious about this, because I can't think of an efficient way to have native bound methods without wasting memory and/or CPU by copying members for each instance.
(By bound method, I mean that the following code should work:)
class Foo { function bar() : return 42; };
var test = new Foo();
var method = test.bar;
method() == 42;

This highly depends on the language and the implementation. I'll tell you what I know about CPython and PyPy.
The general idea, which is also what CPython does for the most part, goes like this:
Every object has a class, specifically a reference to that class object.
Apart from instance members, which are obviously stored in the individual object, the class also has members. This includes methods, so methods don't have a per-object cost.
A class has a method resolution order (MRO) determined by the inheritance relationships, wherein each base class occurs exactly once. If we didn't have multiple inheritance, this would simply be a reference to the base class; with multiple inheritance the MRO is too expensive to figure out on the fly (you'd have to start from the most derived class every time), so it is computed once and stored with the class.
(Classes are also objects and have classes themselves, but we'll gloss over that for now.)
If attribute lookup on an object fails, the same attribute is looked up on the classes in the MRO, in the order specified by the MRO. (This is the default behavior, which can be changed by defining magic methods like __getattr__ and __getattribute__.)
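To make the lookup concrete, here is a minimal illustration in plain Python (the class names are invented for the example):
class Base:
    def greet(self):
        return "hello"

class Child(Base):
    pass

obj = Child()
print('greet' in obj.__dict__)   # False: the method is not copied into each instance
print(Child.__mro__)             # (Child, Base, object) -- the order lookup follows
print(obj.greet())               # "hello", found on Base by walking the MRO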
So far so simple, and not really an explanation for bound methods. I just wanted to make sure we're talking about the same thing. The missing piece is descriptors. The descriptor protocol is defined in the "deep magic" section of the language reference, but the short and simple story is this: when a lookup on a class yields an object that defines a __get__ method, that object gets to hijack the result of the lookup. More importantly, this __get__ method is told whether the lookup started on an instance or on the "owner" (the class).
In Python 2, we have an ugly and unnecessary UnboundMethod descriptor which (apart from the __get__ method) simply wraps the function to throw errors on Class.method(self) if self is not of an acceptable type. In Python 3, the __get__ is simply part of all function objects, and unbound methods are gone. In both cases, the __get__ method returns itself when you look it up on a class (so you can use Class.method, which is useful in a few cases) and a "bound method" object when you look it up on an object. This bound method object does nothing more than storing the raw function and the instance, and passing the latter as first argument to the former in its __call__ (special method overriding the function call syntax).
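To make that concrete, here is a rough Python sketch of the Python 3 behaviour: a hand-rolled descriptor that mimics what real function objects do. This is purely illustrative (the names BoundMethod and FunctionLike are made up), not CPython's actual implementation:
class BoundMethod:
    # Stand-in for the built-in bound method object: it stores only
    # the raw function and the instance (two references).
    def __init__(self, func, instance):
        self.__func__ = func
        self.__self__ = instance

    def __call__(self, *args, **kwargs):
        # Pass the stored instance as the first argument.
        return self.__func__(self.__self__, *args, **kwargs)

class FunctionLike:
    # Sketch of a function object's __get__ behaviour.
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            return self.func                     # looked up on the class
        return BoundMethod(self.func, instance)  # looked up on an instance

class Foo:
    @FunctionLike
    def bar(self):
        return 42

test = Foo()
method = test.bar       # __get__ runs here and returns a bound-method object
print(method() == 42)   # True -- the behaviour asked about in the question
Note how the bound method is nothing more than the two stored references plus a __call__, which is exactly the cost discussed next.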
So, for CPython: While there is a cost to bound methods, it's smaller than you might think. Only two references are needed space-wise, and the CPU cost is limited to a small memory allocation plus an extra indirection when calling. Note though that this cost applies to all method calls, not just those which actually make use of bound method features. a.f() has to call the descriptor and use its return value, because in a dynamic language we don't know whether it has been monkey-patched to do something different.
In PyPy, things are more interesting. As it's an implementation which doesn't compromise on correctness, the above model is still correct for reasoning about semantics. However, it's actually faster. Apart from the fact that the JIT compiler inlines and then eliminates the entire mess described above in most cases, they also tackle the problem on bytecode level. There are two new bytecode instructions, which preserve the semantics but omit the allocation of the bound method object in the case of a.f(). There is also a method cache which can simplify the lookup process, but requires some additional bookkeeping (though some of that bookkeeping is already done for the JIT).

Code optimization - Unused methods

How can I tell if a method will never be used?
I know that for dll files and libraries you can't really know if someone else (another project) will ever use the code.
In general I assume that anything public might be used somewhere else.
But what about private methods? Is it safe to assume that if I don't see an explicit call to that method, it won't be used?
I assume that for private methods it's easier to decide. But is it safe to decide it ONLY for private methods?
Depends on the language, but commonly, a name that occurs once in the program and is not public/exported is not used. There are exceptions, such as constructors and destructors, operator overloads (in C++ and Python, where the name at the point of definition does not match the name at the call site) and various other methods.
For example, in Python, to allow indexing (foo[x]) to work, you define a method __getitem__ in the class to which foo belongs. But hardly ever would you call __getitem__ explicitly.
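A tiny sketch of that situation (the class name is invented):
class Squares:
    def __getitem__(self, index):
        return index * index

foo = Squares()
print(foo[3])   # 9 -- the runtime calls foo.__getitem__(3) behind the scenes
# A naive "is __getitem__ ever referenced by name?" search would report it as unused.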
What you need to know is the (or all possible) entry point(s) to your code:
For a simple command-line program, this is the "main" method or, in the simplest case, the top of your script.
For libraries, it is effectively everything visible from the outside.
The situation becomes more complicated if methods can be referenced from outside by means of introspection. This is language-specific and requires knowledge of the details of the techniques used.
What you need to do is follow all references from all entry points recursively to mark up all used methods. Whatever remains unmarked can safely - and should - be removed.
Since this is tedious but routine work, there are tools available which do it for various programming languages. Examples include ReSharper for C# and ProGuard for Java.

What's the difference between closures and traditional classes?

What are the pros and cons of closures against classes, and vice versa?
Edit:
As user Faisal put it, both closures and classes can be used to "describe an entity that maintains and manipulates state", so closures provide a way to program in an object-oriented style in functional languages. Like most programmers, I'm more familiar with classes.
The intention of this question is not to open another flame war about which programming paradigm is better, whether closures and classes are fully equivalent, or whether one is a poor man's version of the other.
What I'd like to know is if anyone found a scenario in which one approach really beats the other, and why.
Functionally, closures and objects are equivalent. A closure can emulate an object and vice versa. So which one you use is a matter of syntactic convenience, or which one your programming language can best handle.
In C++ (prior to C++11 lambdas) closures are not syntactically available, so you are forced to go with "functors", which are objects that override operator() and may be called in a way that looks like a function call.
In Java you don't even have functors, so you get things like the Visitor pattern, which would just be a higher order function in a language that supports closures.
In standard Scheme you don't have objects, so sometimes you end up implementing them by writing a closure with a dispatch function, executing different sub-closures depending on the incoming parameters.
In a language like Python, whose syntax offers both functors and closures, it's basically a matter of taste and which you feel is the better way to express what you are doing.
Personally, I would say that in any language that has syntax for both, closures are a much more clear and clean way to express objects with a single method. And vice versa, if your closure starts handling dispatch to sub-closures based on the incoming parameters, you should probably be using an object instead.
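As a rough Python sketch of that equivalence (all names invented), here is the same piece of state expressed once as a closure and once as a single-method class:
# Closure: the captured variable is the state.
def make_counter():
    count = 0
    def increment():
        nonlocal count
        count += 1
        return count
    return increment

# The equivalent class: the same behaviour with more ceremony.
class Counter:
    def __init__(self):
        self.count = 0
    def increment(self):
        self.count += 1
        return self.count

tick = make_counter()
c = Counter()
print(tick(), tick())                 # 1 2
print(c.increment(), c.increment())   # 1 2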
Personally, I think it's a matter of using the right tool for the job...more specifically, of properly communicating your intent.
If you want to explicitly show that all your objects share a common definition and want strong type-checking of such, you probably want to use a class. The disadvantage of not being able to alter the structure of your class at runtime is actually a strength in this case, since you know exactly what you're dealing with.
If instead you want to create a heterogeneous collection of "objects" (i.e. state represented as variables closed over by some function, with inner functions to manipulate that data), you might be better off creating a closure. In this case, there's no real guarantee about the structure of the object you end up with, but you get all the flexibility of defining it exactly as you like at runtime.
Thank you for asking, actually; I'd responded with a sort of knee-jerk "classes and closures are totally different!" attitude at first, but with some research I realize the problem isn't nearly as cut-and-dried as I'd thought.
Closures are only loosely related to classes. Classes let you define fields and methods, while closures hold information about the local variables of a function call. There is no real comparison between the two in a language-agnostic manner: they don't serve the same purpose at all. Besides, closures are much more related to functional programming than to object-oriented programming.
For instance, look at the following C# code:
static void Main(String[] args)
{
    int i = 4;
    Action myDelegate = delegate()
    {
        i = 5;
    };
    Console.WriteLine(i);
    myDelegate();
    Console.WriteLine(i);
}
This gives "4" then "5". myDelegate, being a delegate, is a closure and knows about all the variables currently used by the function. Therefore, when I call it, it is allowed to change the value of i inside the "parent" function. This would not be permitted for a normal function.
Classes, if you know what they are, are completely different.
A possible reason for your confusion is that when a language has no support for closures, it's possible to simulate them using classes that hold every variable we need to keep around. For instance, we could rewrite the above code like this:
class MainClosure
{
    public int i;

    public void Apply()
    {
        i = 5;
    }
}

static void Main(String[] args)
{
    MainClosure closure = new MainClosure();
    closure.i = 4;
    Console.WriteLine(closure.i);
    closure.Apply();
    Console.WriteLine(closure.i);
}
We've transformed the delegate into a class that we've called MainClosure. Instead of creating the variable i inside the Main function, we've created a MainClosure object, which has an i field, and that field is the one we use. Also, the code the delegate used to execute now lives in an instance method instead of an anonymous method.
As you can see, even though this was an easy example (only one variable), it is considerably more work. In a context where you want closures, using objects is a poor solution. However, classes are useful for far more than simulating closures, and their usual purpose is quite different.

Using one method with constants as parameters versus several methods

In Kent Beck's Implementation Patterns, one can read
"A common use of constants is to
communicate variations of a message in
an interface. For example, to center
text you could invoke
setJustification(Justification.CENTERED).
One advantage of this style of API is
that you can add new variants of
existing methods by adding new
constants without breaking
implementors. However, these messages
don't communicate as well as having a
separate method for each variation. In
this style, the message above would be
justifyCentered(). An interface where
all invocations of a method have
literal constants as arguments can be
improved by giving it separate methods
for each constant value."
Why is this? Generally when I'm coding and I notice that I have a couple of similar parameterless methods that could be reduced to just one, with an argument, like in the following example,
void justifyRight()
void justifyLeft()
void justifyCentered()
I'd generally do just the opposite of what Kent advises, which would be to group it into
setJustification(Justification justification)
How do you usually handle this situation? Is this totally subjective, or is there really a very strong reason, which I can't see, in favour of Kent's view of this matter?
Thanks
File access methods usually have parameters regarding read/write mode, whether to create non-existing files, security attributes, locking modes and so on. Imagine the amount of methods you'd have if you'd create a separate method for each valid combination of parameters!
The fail-safety is the biggest argument in favor of separate methods: you have strict control over the API. The caller cannot pass in invalid arguments, or invalid combinations of parameters, if you don't expose such parameters. This also implies less complex parameter validation.
However, I'm not in favor of this practice. APIs should be well designed and should change as little as possible. Kent Beck on breaking API changes:
One advantage of [parameterized methods] is that you can add new variants of existing methods by adding new constants without breaking implementors.
His argument in favor of separate methods is:
However, [parameterized methods] don't communicate as well as having a separate method for each variation.
I disagree. Method parameters can be just as readable, especially in combination with named parameters, a feature supported by several languages. Besides, separate methods would result in a cluttered API.
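For illustration, here is a small Python sketch of the named-parameter point; the function and enum members are hypothetical, loosely following Kent Beck's Justification example:
from enum import Enum

class Justification(Enum):
    LEFT = "left"
    RIGHT = "right"
    CENTERED = "centered"

def set_justification(*, justification: Justification) -> None:
    # Hypothetical implementation; a real one would update layout state.
    print(f"justify {justification.value}")

# With a named argument the call site reads almost like justifyCentered():
set_justification(justification=Justification.CENTERED)

# And because the mode is plain data, it can be stored or iterated over:
for mode in Justification:
    set_justification(justification=mode)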
I suppose it's subjective. Some may argue that justifyLeft is clearer than justify(Justification.LEFT). Collapsing it all into one method may result in a nicer API - less clutter - and the mode can be stored in a variable and simply fed to the single setXY method (with separate methods for each mode, you'd have to decide manually which one to call depending on the value). Therefore I usually prefer this way, though it's usually just:
void justify(Justification justification) {
    switch (justification) {
        case Justification.RIGHT:    this.justifyRight();  break;
        case Justification.LEFT:     this.justifyLeft();   break;
        case Justification.CENTERED: this.justifyCenter(); break;
    }
}
Of course this is only advisable when all these methods are very closely related.

To init or to construct

I'm reviewing some code and I'm seeing a lot of this:
class Foo
{
public:
    Foo()
    {
        // 'nuffin
    }

    void init()
    {
        // actual construction code
    }
};
The only advantage I can see is that if you create a Foo without using a pointer and want to hold off its construction code until later, you can.
Is this a good idea or a bad idea?
I dislike it. It seems to me that after construction, an object should be... well... constructed. That code leaves it in an invalid state instead, which is almost[1] never a good thing.
[1] Weasel word inserted to account for unforeseen circumstances.
In general, I agree that it's something to be avoided. But something none of the answers so far have addressed is the possibility that initialization may fail. A constructor cannot return an error code, so if your constructor allocates memory, opens a file, or does anything else that may fail, you need some other way to tell the caller that an error occurred. If you do the initialization in the constructor, you need a flag that indicates whether or not the initialization succeeded, and then you must ensure that the caller checks that flag.
If you have a separate init() routine that must be called before anything else works, callers are more likely to check that return code than to call a didInitializationSucceed() method after creating the object.
Two-stage construction is generally considered a bad idea if there are methods on the class which rely on the object being in some initialised state. Generally, I prefer constructors which guarantee the object is in a good state or, if that cannot be done (perhaps because some of the arguments to the constructor were invalid), throw an exception, so that there is never an instance of your class in a bad state.
Requiring consumers of your object to remember to call init() is a bad idea, because they won't.
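A small sketch of that failure mode, using a hypothetical Python class (the question is about C++, but the shape of the problem is the same):
class Parser:
    # Two-stage construction: nothing forces callers to run init().
    def __init__(self):
        self.rules = None

    def init(self, rules):
        self.rules = list(rules)

    def parse(self, text):
        # Every method now has to defend against the half-built state.
        if self.rules is None:
            raise RuntimeError("init() was never called")
        return text.split()   # placeholder for the real parsing logic

p = Parser()
try:
    p.parse("hello world")    # the caller forgot init()
except RuntimeError as err:
    print(err)                # "init() was never called"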
One case where this may apply is when 'Foo' is an attribute of another class and cannot be fully constructed before the parent class is done. Only then can 'Foo' be 'filled in'.
I believe the constructor should basically do the init() part as well. Unless the object is fully constructed, it shouldn't be used.
Also, initializing in the constructor allows you to make use of RAII. The basic point of RAII is to represent a resource by a local object, initialized in the constructor, so that the local object's destructor releases the resource. That way, the programmer cannot forget to release the resource.
In some languages (read: C++, at least before C++11 delegating constructors) you can't call a constructor from another constructor, so if you want a common part shared by several constructors you need to put it in a separate method, and I've seen the name init() used for that. But that is not what you're talking about?
I use the constructor and init if I am instantiating objects that are based on a database call. So, if I need an empty object that I can populate and then save to the database, I construct with no parameters and don't call init(). Whereas if I need to retrieve the object's members from the db, I'll construct($param) and pass the $param to init($param).
Generally, source code should be as simple as possible. Your example is presented without context, so as it stands it's just more complicated than necessary, and therefore a bad idea.
However, there may be some semantic contexts, where it may make sense to be able to deliver uninitialized objects - for instance, if the context requires a container to have objects but you don't want to initialize them until later because initialization is slow and/or maybe the objects are not needed. In those cases, the additional complexity may make something else simpler.
Although it cannot be considered a normal or preferred way of constructing objects, under some circumstances it may be the way to go. For example, you may need to construct an object just to indicate its existence (in some list where the count matters, etc.), but initialize it later, only if this particular object is actually used, because initializing the whole collection of objects would take too much time.
In that case it's good to expose the fact that the object may not be initialized by including a method like isInitialized(). Also, that way you can move initialization to another thread in order not to block the main thread of the application.
The difference is that initialization happens after the call to the superclass's constructor but before any code is executed in your local class constructor. Therefore, it really depends on your needs.

How to design a class that has only one heavy duty work method and data returning other methods?

I want to design a class that will parse a string into tokens that are meaningful to my application.
How do I design it?
1. Provide a ctor that accepts a string, provide a Parse method, and provide methods (let's call them "minor") that return individual tokens, the count of tokens, etc.; OR
2. Provide a ctor that accepts nothing, a Parse method that accepts a string, and minor methods as above; OR
3. Provide a ctor that accepts a string and provide only minor methods, but no Parse method. The parsing is done by the ctor.
1 and 2 have the disadvantage that the user may call minor methods without calling the Parse method. I'll have to check in every minor method that the Parse method was called.
The problem I see in 3 is that the parse method may potentially do a lot of things. It just doesn't seem right to put it in the ctor.
2 is convenient in that the user may parse any number of strings without instantiating the class again and again.
What's a good approach? What are some of the considerations?
(The language is C#, if someone cares.)
Thanks
I would have a separate class with a Parse method that takes a string and converts it into a separate new object with a property for each value from the string.
ValueObject values = parsingClass.Parse(theString);
I think this is a really good question...
In general, I'd go with something that resembles option 3 above. Basically, think about your class and what it does: does it have any effective data other than the string to parse and the parsed tokens? If not, then without those things you don't really have an instance of your class; you have an incomplete instance of your class, which is something you'd like to avoid.
One of the considerations that you point out is that parsing the tokens may be a relatively computationally complicated process; it may take a while. I agree that you may not want to take that hit in the constructor; in that case, it may make sense to use a Parse() method. The question then becomes whether there are any sensible operations that can be done on your class before the Parse() method completes. If not, then you're back to the original point: before the Parse() method is complete, you're effectively in an "incomplete instance" state of your class; that is, it's effectively useless.
Of course, this all changes if you're willing and able to use some multithreading in your application. If you're willing to offload the computationally complicated operations onto another thread, and maintain some sort of synchronization on your class methods / accessors until you're done, then the whole Parse() approach makes more sense, as you can choose to spawn that work in a new thread entirely. You still run into issues of attempting to use your class before it has completely parsed everything, though.
I think an even broader question in this design, though, is the larger scope in which this code will be used. What is this code going to be used for, and by that I mean not just now, with the intended use, but is there a possibility that this code may need to grow or change as your application does? In terms of the stability of the implementation, can you expect it to be completely stable, or is it likely that something about the set of data you'll want to parse, the size of the data to parse, or the tokens into which you will parse will change in the future? If the implementation has a possibility of changing, consider all the ways in which it may change; in my experience, those considerations can strongly lead to one or another implementation. And considering those things is not trivial; not by a long shot.
Lest you think this is just nitpicking, I would say, at a conservative estimate, that about 10 - 15 percent of the classes I've written have needed some level of refactoring even before the project was complete; rarely has a design that I've worked on survived implementation to come out the other side looking the way it did before. So considering the possible permutations of the implementation becomes very useful for determining what your implementation should be. If, say, your implementation will never need to vary the size of the string to tokenize, you can make assumptions about the computational complexity that may lead you one way or another in the overall design.
If the sole purpose of the class is to parse the input string into a group of properties, then I don't see any real downside in option 3. The parse operation may be expensive, but you have to do it at some point if you're going to use it.
You mention that option 2 is convenient because you can parse new values without reinstantiating the object, but if the parse operation is that expensive, I don't think that makes much difference. Compare the following code:
// Using option 3
ParsingClass myClass = new ParsingClass(inputString);
// Parse a new string.
myClass = new ParsingClass(anotherInputString);
// Using option 2
ParsingClass myClass = new ParsingClass();
myClass.Parse(inputString);
// Parse a new string.
myClass.Parse(anotherInputString);
There's not much difference in use, but with Option 2, you have to have all your minor methods and properties check whether parsing has occurred before they can proceed. (Option 1 requires you to do everything that option 2 does internally, but also allows you to write Option 3-style code when using it.)
Alternatively, you could make the constructor private and the Parse method static, having the Parse method return an instance of the object.
// Option 4
ParsingClass myClass = ParsingClass.Parse(inputString);
// Parse a new string.
myClass = ParsingClass.Parse(anotherInputString);
Options 1 and 2 provide more flexibility, but require more code to implement. Options 3 and 4 are less flexible, but there's also less code to write. Basically, there is no one right answer to the question. It's really a matter of what fits with your existing code best.
Two important considerations:
1) Can the parsing fail?
If so, and if you put it in the constructor, then it has to throw an exception. A Parse method could instead return a value indicating success. So check how your colleagues feel about throwing exceptions in situations which aren't show-stopping: the default is to assume they won't like it.
2) The constructor must get your object into a valid state.
If you don't mind "hasn't parsed anything yet" being a valid state of your objects, then the parse method is probably the way to go, and call the class SomethingParser.
If you don't want that, then parse in the constructor (or factory, as Garry suggests), and call the class ParsedSomething.
The difference is probably whether you are planning to pass these things as parameters into other methods. If so, then having a "not ready yet" state is a pain, because you either have to check for it in every callee and handle it gracefully, or else you have to write documentation like "the parameter must already have parsed a string". And then most likely check in every callee with an assert anyway.
You might be able to work it so that the initial state is the same as the state after parsing an empty string (or some other base value), thus avoiding the "not ready yet" problem.
Anyway, if these things are likely to be parameters, personally I'd say that they have to be "ready to go" as soon as they're constructed. If they're just going to be used locally, then you might give users a bit more flexibility if they can create them without doing the heavy lifting. The cost is requiring two lines of code instead of one, which makes your class slightly harder to use.
You could consider giving the thing two constructors and a Parse method: the string constructor is equivalent to calling the no-arg constructor, then calling Parse.
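For illustration, here is a rough Python sketch of that last suggestion (hypothetical names; in C# the optional argument would be two constructor overloads):
class TokenParser:
    def __init__(self, text=None):
        self.tokens = []              # valid "parsed nothing yet" state
        if text is not None:
            self.parse(text)          # string ctor == no-arg ctor + Parse

    def parse(self, text):
        self.tokens = text.split()    # placeholder for the real tokenizer
        return self.tokens

    def count(self):
        return len(self.tokens)

ready = TokenParser("one two three")  # parsed immediately, ready to go
print(ready.count())                  # 3

lazy = TokenParser()                  # constructed, heavy lifting deferred
lazy.parse("four five")
print(lazy.count())                   # 2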