ES6 Maps and Sets: how are object keys indexed efficiently?

In ES6, Maps and Sets can use objects as keys. However, since the ES6 specification does not dictate the underlying implementation of these data structures, I was wondering how modern JS engines store the keys in order to guarantee O(1), or at least sublinear, retrieval.
In a language like Java, the programmer can explicitly provide a (good) hashCode method which hashes the keys evenly across the key space in order to guarantee performance. Since JS does not have such a feature, would it still be fair to assume that Maps and Sets use some sort of hashing in their implementation?
Any information will be appreciated!

Yes, the implementation is based on hashing, and has (amortized) constant access times.
"they use object identity" is a simplification; the full story is that ES Maps and Sets use the SameValueZero algorithm for determining equality.
In line with this specification, V8's implementation computes "real" hashes for strings and numbers, and chooses a random number as "hash" for objects, which it stores as a private (hidden) property on these objects for later accesses. (That's not quite ideal and might change in the future, but for now that's what it is.)
Using memoryAddress % keySpace cannot work because the garbage collector moves objects around, and rehashing all Maps and Sets every time any object might have moved would be prohibitively complicated and expensive.
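For illustration, here is a minimal TypeScript sketch of that identity-hash idea: a random hash is computed on first access and cached on the object under a hidden key. The IDENTITY_HASH symbol and the ObjectSet class are illustrative inventions, not V8's actual internals.

const IDENTITY_HASH = Symbol("identityHash"); // stand-in for V8's hidden property

function identityHash(obj: object): number {
  let h = (obj as any)[IDENTITY_HASH];
  if (h === undefined) {
    // First access: pick a random hash and cache it on the object,
    // so the same object always lands in the same bucket.
    h = (Math.random() * 0x7fffffff) | 0;
    (obj as any)[IDENTITY_HASH] = h;
  }
  return h;
}

class ObjectSet {
  private buckets: object[][] = Array.from({ length: 16 }, () => []);

  add(obj: object): void {
    const bucket = this.buckets[identityHash(obj) % this.buckets.length];
    if (!bucket.includes(obj)) bucket.push(obj); // for objects, SameValueZero is identity
  }

  has(obj: object): boolean {
    return this.buckets[identityHash(obj) % this.buckets.length].includes(obj);
  }
}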

Related

If a CPU's stack is 1MB and a programming language is purely pass by value, how can we pass data bigger than 1MB into functions? What happens exactly?

So I hear that with pass by value, copies of the parameters are added to the call stack. Apparently, the stack size on Windows is often 1MB. Obviously, though, we can easily pass data that is far bigger than 1MB between functions/procedures (arrays of primitives/classes/hashmaps/sets or whatever).
So is my understanding of the situation wrong...or are these languages using pass by value for primitive types, but then pass by reference/pass by object model for these other data structures?
Just with a quick google, both Java & JavaScript are exclusively pass by value...so how are you able to pass around data/objects bigger than the stack size in these languages? For example, in JavaScript, it's stated: "For Array the maximum length is 4GB-1 (2^32-1)"
The difference between "pass-by-reference" and "pass-by-value" is that, with a pass-by-reference calling convention, the function can change what the reference points to. In languages that don't support a pass-by-reference calling convention, you can't do this directly; you'd have to simulate it by passing a mutable type that contains a reference to another object.
In pass-by-value, you pass references as values too. So the thing that's allocated on the stack, when you're passing an object that isn't a primitive value, is a copy of the pointer, not a deep copy of the data.
The confusing aspect is that the thing (or instance) you point to, if it's a mutable type, can be mutated.
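A small TypeScript example of exactly this: the callee receives a copy of the reference, so mutation through it is visible to the caller, while rebinding the parameter is not.

function mutate(list: number[]): void {
  list.push(42);      // mutates the object the caller also references
}

function reassign(list: number[]): void {
  list = [1, 2, 3];   // only rebinds the local copy of the reference
}

const data: number[] = [];
mutate(data);
reassign(data);
console.log(data);    // [42] - the mutation is visible, the reassignment is not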

How does a tracing generational GC determine garbage in the young generation?

Let's assume we have a simple generational GC with only two generations: the "old" generation (objects that survived at least one collection) and the "young" generation (newly allocated objects). How exactly would the GC determine a "young" object to be garbage without tracing the whole reference graph from the very roots? Or to put it a different way: what does the GC choose as roots for the trace when intending to collect the "young" generation only?
I'm interested in the general method but in specific examples from existing implementations as well.
Thanks!
There are a few techniques, which all boil down to maintaining knowledge of which old-gen objects (or ranges of old-gen memory) may contain references to young objects.
Pretty much all implementations I can think of maintain this knowledge by adding write barriers. Those barriers trigger when a young-gen reference is stored in an old-gen object, and cause execution of a small code snippet which remembers the new reference.
To store that knowledge, some GCs use card marking, where a compact bitmap marks small-ish memory blocks as "contains references to younger generations". Others maintain explicit "remembered sets", which do something similar for individual objects. In both cases, young-gen collections then add the objects in the remembered set (or in the memory blocks marked by the card table) to the roots.
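A rough TypeScript sketch of the card-marking variant (the sizes and names here are made up, and real engines emit the barrier as inline machine code, not a function call):

const CARD_SIZE = 512;                       // bytes covered by one card
const cardTable = new Uint8Array(1 << 16);   // one dirty byte per card

// The write barrier runs on every "oldObject.field = youngObject" store.
function writeBarrier(slotAddress: number, targetIsYoung: boolean): void {
  if (targetIsYoung) {
    cardTable[Math.floor(slotAddress / CARD_SIZE)] = 1; // remember this card
  }
}

// A young-gen collection scans only the dirty cards for extra roots,
// instead of tracing the whole old generation.
function extraRootCards(): number[] {
  const dirty: number[] = [];
  cardTable.forEach((mark, card) => { if (mark) dirty.push(card); });
  return dirty;
}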
As for specific implementations:
Mono uses remembered sets.
PyPy has several GCs, the newest and shiniest (Minimark) uses remembered sets, with the addition of card marking for individual large arrays.
.NET uses card marking.

Why is memoization not a language feature?

I was wondering: why is memoization not provided natively as a language feature in any language I know about?
Edit: to clarify, what I mean is that the language provides a keyword to mark a given function as memoizable, not that every function is automatically memoized "by default" unless specified otherwise. For example, Fortran provides the keyword PURE to mark a function as such. I guess the compiler can take advantage of this information to memoize the call, but I don't know what happens if you declare a function with side effects PURE.
What YOU want from memoization may not be the same as what the compiler memoization option would provide.
You may know that it is only profitable to memoize the last 10 or so distinct values computed, because you know how the function will be used.
You may know that it only makes sense to memoize the last 2 or 3 values, because you will never use values older than that. (Fibonacci's Sequence comes to mind.)
You may be generating a LOT of values on some runs, and just a few on others.
You may want to "throw away" some of the memoized values and start over. (I memoized a random number generator this way, so I could replay the sequence of random numbers that built a certain structure, while some other parameters of the structure had been changed.)
Memoization as an optimization depends on the search for the memoized value being a lot cheaper than recomputation of the value. This in turn depends on the ordering of the input requests. This has implications for the memoization database: Does it use a stack, an array of all possible input values (which may be very large), a bucket hash, or a b-tree?
The memoizing compiler has to either provide a "one size fits all" memoization, or it has to provide lots of possible alternatives, and parameters to control the alternatives. At some point, it becomes easier for everyone to require the user to provide his own memoization.
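A library-level memoizer in TypeScript makes the point concrete: the cache capacity, the key choice, and the eviction policy are exactly the knobs a one-size-fits-all language feature would have to expose. The helper below is an illustrative sketch, not any particular library's API.

function memoize<A, R>(fn: (arg: A) => R, capacity = 10): (arg: A) => R {
  const cache = new Map<A, R>();
  return (arg: A): R => {
    if (cache.has(arg)) return cache.get(arg)!;
    const result = fn(arg);
    if (cache.size >= capacity) {
      // Evict the oldest entry; Maps iterate in insertion order.
      cache.delete(cache.keys().next().value!);
    }
    cache.set(arg, result);
    return result;
  };
}

// Keeping only the last few values is enough for Fibonacci-style uses:
const fib = memoize((n: number): number => (n < 2 ? n : fib(n - 1) + fib(n - 2)), 3);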
Because compilers have to emit semantically correct programs. You can't memoize a function without changing program semantics unless it is referentially transparent. In most programming languages not all functions are referentially transparent (pure functional programming languages are an exception) so you can't memoize everything. But then a mechanism is needed for detecting referential transparency and that is too hard.
In Haskell, memoization is automatic for (pure) functions you've defined that take no arguments. The Fibonacci example on the Haskell wiki is about the simplest demonstrable example I could think of.
Haskell can do this because your pure functions are defined to produce the same results every time; of course, monadic functions that depend on side effects won't be memoized.
I'm not sure what the upper limits are -- obviously, it won't memoize more than the available memory. And I'm also not sure offhand if the memoization occurs at compile-time (if the values can be determined at compile-time), or if it always occurs the first time the function is called.
Clojure has a memoize function (http://richhickey.github.com/clojure/clojure.core-api.html#clojure.core/memoize):
memoize
function
Usage: (memoize f)
Returns a memoized version of a referentially transparent function. The
memoized version of the function keeps a cache of the mapping from arguments
to results and, when calls with the same arguments are repeated often, has
higher performance at the expense of higher memory use.
A) Memoization trades space for time. I imagine this can turn out to be fairly unbounded, in the sense that the amount of data a program or library would have to store could consume large parts of memory very quickly.
For a couple of languages, memoization is easy to implement and easy to customize for the given requirements.
As an example, take some natural language processing on large bodies of text, where you don't want to compute basic properties of texts (word count, frequency, co-occurrences, ...) over and over again. In that case memoization combined with object serialization can be useful, as opposed to in-memory caching, since you may run your application multiple times on unchanged corpora.
B) Another aspect: it's not true that all functions or methods yield the same output for the same input. In any case, some keyword or syntax for memoization would be necessary, along with configuration (memory limits, invalidation policy, ...).
Because you shouldn't implement something as a language feature when it can easily be implemented in the language itself. A memoization feature belongs in a library, which is exactly where most languages put it.
Your question also leaves open the option of learning more languages. I think that Lisp supports memoization, and I know that Mathematica does.
In order for memoization to work as a language feature, there would be a couple of requirements.
The compiler would need to be able to identify valid functions for memoization (e.g. ones that are referentially transparent).
The run-time would have to be able to intelligently select candidates for memoization without slowing down the overall performance.
Those are strong assumptions, but if we can get performance gains from just-in-time compilation of hot spots in a Java VM, then one can surely write an automated memoization system.
While non-trivial, I think this is all theoretically possible as a way to get performance gains in a language (especially an interpreted one), and it is a worthwhile area for research.
Not all languages natively support function decorators. I guess supporting those would be a more general approach than supporting just memoization.
Reverse the question: why should it? As someone has said, it can be put in a library, so there is no need to add syntax to the language; it's only usable on pure functions, which are hard to identify automatically (unless you force the programmer to annotate them). It's also very hard to determine whether memoization will speed things up or not. I don't think it's a desirable feature for a programming language.
I really think such an option should exist.
In data processing tasks there is often immutable input data (time series, for example, where for a given time, once a value is known it can never change). Given how affordable RAM is today, if a function's result depends only on such immutable data, it is rational to memoize it rather than reread it every time it's needed. Currently I have (in Scala and C#) to manually introduce an in-memory storage table and write three functions instead of one: one reading a value from file/db/ws, one storing it into the in-memory table, and one wrapping them to read from memory if available or call the raw function if not. I think this could and should be implemented as a keyword and done behind the scenes.
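Lacking a keyword, the three functions can at least be collapsed into one generic combinator. A TypeScript sketch of that wrapper pattern (loadFromDb is a hypothetical raw reader, not a real API):

// Generic read-through cache: one combinator instead of three functions.
function cached<K, V>(load: (key: K) => V): (key: K) => V {
  const table = new Map<K, V>();
  return (key: K): V => {
    if (!table.has(key)) table.set(key, load(key)); // raw read on a miss
    return table.get(key)!;                         // in-memory hit otherwise
  };
}

declare function loadFromDb(timestamp: number): number; // hypothetical reader
const valueAt = cached(loadFromDb); // valueAt(t) hits the db only on first use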

Can garbage collection coexist with explicit memory management?

For example, say one were to include a 'delete' keyword in C# 4. Would it be possible to guarantee that you'd never have wild pointers, but still be able to rely on the garbage collector, due to the reference-based system?
The only way I could see it possibly happening is if instead of references to memory locations, a reference would be an index to a table of pointers to actual objects. However, I'm sure that there'd be some condition where that would break, and it'd be possible to break type safety/have dangling pointers.
EDIT: I'm not talking about just .net. I was just using C# as an example.
You can - kind of: make your object disposable, and then dispose of it yourself.
A manual delete is unlikely to improve memory performance in a managed environment. It might help with unmanaged resources, which is what Dispose is all about.
I'd rather have implementing and consuming disposable objects made easier. I have no consistent, complete idea of what this should look like, but managing unmanaged resources is a verbose pain under .NET.
An idea for implementing delete:
delete tags an object for manual deletion. At the next garbage collection cycle, the object is removed and all references to it are set to null.
It sounds cool at first (at least to me), but I doubt it would be useful.
This isn't particularly safe, either - e.g. another thread might be busy executing a member method of that object; such a method would need to throw, for example when accessing object data.
With garbage collection, as long as you hold a reference to the object, it stays alive. With a manual delete you can't guarantee that.
Example (pseudocode):
obj1 = new instance;
obj2 = obj1;
delete obj2;
// obj1 now references the twilight zone.
Just to be short, combining manual memory management with garbage collection defeats the purpose of GC. Besides, why bother? And if you really want to have control, use C++ and not C#. ;-).
The best you could get would be a partition into two “hemispheres”, where one hemisphere is managed and can guarantee the absence of dangling pointers. The other has explicit memory management and gives no guarantees. The two can coexist, but no, you can't extend the strong guarantees to the second hemisphere. All you could do is track all pointers: if an object gets deleted, all other pointers to the same instance could be set to null. Needless to say, this is quite expensive. Your table would help, but introduce other costs (double indirection).
Chris Sells also discussed this on .NET Rocks. I think it was during his first appearance but the subject might have been revisited in later interviews.
http://www.dotnetrocks.com/default.aspx?showNum=10
My first reaction was: why not? I can't imagine that what you want is something as obscure as just leaving an unreferenced chunk out on the heap to find again later, as if a four-byte pointer to the heap were too much to keep around to track that chunk.
So the issue is not leaving unreferenced memory allocated, but intentionally disposing of memory that is still referenced. Since garbage collection performs the function of marking the memory free at some point, it seems that we should just be able to call an alternate sequence of instructions to dispose of this particular chunk of memory.
However, the problem lies here:
String s = "Here is a string.";
String t = s;
String u = s;
junk(s); // hypothetically dispose of the chunk s references
What do t and u point to? In a strict reference system, t and u should now be null. So that means you not only have to do reference counting, but perhaps full reference tracking as well.
However, I can see that you should be done with s at this point in your code. So junk could set the reference to null and pass it to the sweeper with a sort of priority code. The GC could be activated for a limited run, and the memory freed only if it is not reachable. So we can't explicitly free anything that somebody has coded to use again in some way, but if s is the only reference, then the chunk is deallocated.
So, I think it would only work with a limited adherence to the explicit side.
It's possible, and already implemented, in non-managed languages such as C++. Basically, you implement or use an existing garbage collector: when you want manual memory management, you call new and delete as normal, and when you want garbage collection, you call GC_MALLOC or whatever the function or macro is for your garbage collector.
See http://www.hpl.hp.com/personal/Hans_Boehm/gc/ for an example.
Since you were using C# as an example, maybe you only had in mind implementing manual memory management in a managed language, but this is to show you that the reverse is possible.
If the semantics of delete on an object's reference were to make all other references to that object null, then you could do it with two levels of indirection (one more than you hint at). Note, though, that while the underlying object would be destroyed, a fixed amount of information (enough to hold a reference) must be kept alive on the heap.
All references a user works with would point to a hidden reference (presumably living on the heap) to the real object. When doing some operation on the object (such as calling a method, or relying on its identity, such as using the == operator), the reference the programmer uses would dereference the hidden reference it points to. When deleting an object, the actual object would be removed from the heap and the hidden reference set to null; thus the references programmers see evaluate to null.
It would be the GC's job to clean out these hidden references.
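A TypeScript sketch of that double indirection (the names Handle, allocate and del are illustrative):

// Every user-visible reference is a Handle wrapping a hidden slot.
class Handle<T> {
  constructor(private slot: { target: T | null }) {}
  get(): T {
    if (this.slot.target === null) throw new Error("object was deleted");
    return this.slot.target; // one extra dereference on every access
  }
}

function allocate<T>(value: T): { handle: Handle<T>; del: () => void } {
  const slot: { target: T | null } = { target: value };
  return {
    handle: new Handle(slot),
    del: () => { slot.target = null; }, // "delete": every handle now sees null
  };
}

const { handle, del } = allocate({ x: 1 });
const alias = handle; // copies of the handle share the hidden slot
del();
// alias.get() now throws: the deletion is visible through every reference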
This would help in situations with long-lived objects. Garbage collection works well when objects are used for short periods of time and dereferenced quickly. The problem is when some objects live for a long time; the only way to clean them up is a resource-intensive garbage collection.
In these situations, things would be much easier if there were a way to explicitly delete objects, or at least a way to move a graph of objects back to generation 0.
Yes ... but with some abuse.
C# can be abused a little to make that happen.
If you're willing to play around with the Marshal class, StructLayout attribute and unsafe code, you could write your very own manual memory manager.
You can find a demonstration of the concept here: Writing a Manual Memory Manager in C#.

Garbage collection and runtime type information

The fixnum question brought to mind another question I've wondered about for a long time.
Much of the online material about garbage collection does not explain how runtime type information can be implemented. Therefore I know a lot about all sorts of garbage collectors, but not really how I might implement them.
The fixnum solution is actually quite nice: it's very clear which value is a pointer and which isn't. What other commonly used solutions for storing type information are there?
Also, I wonder about the fixnum thing. Doesn't it mean that you are limited to fixnums for every array index? Or is there some sort of workaround for getting full 64-bit integers?
Basically, to achieve accurate marking you need metadata indicating which words are used as pointers and which are not.
This metadata could be stored per reference, as Emacs does. If, for your language/implementation, you don't care much about memory use, you could even make references bigger than words (perhaps twice as big), so that every reference can carry type information as well as its one-word data. That way you could have a fixnum the full size of a 32-bit pointer, at the cost of references all being 64 bits.
Alternatively, the meta-data could be stored along with other type information. So for example a class could contain, as well as the usual function pointer table, one bit per word of the data layout indicating whether or not the word contains a reference that should be followed by the garbage collector. If your language has virtual calls then you must already have a means of working out from an object what function addresses to use, so the same mechanism will allow you to work out what marking data to use - typically you add an extra, secret pointer at the start of every single object, pointing to the class which constitutes its runtime type. Obviously with certain dynamic languages the type data pointed to would need to be copy-on-write, since it is modifiable.
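A sketch of that per-class layout metadata in TypeScript (the names ClassInfo, pointerMap and markChildren are illustrative):

// Each class descriptor carries one bit per field slot saying
// "this word holds a reference the collector must follow".
interface ClassInfo {
  pointerMap: boolean[];
}

// The "secret pointer" at the start of every object is modeled as klass.
interface HeapObject {
  klass: ClassInfo;
  fields: unknown[];
}

function markChildren(obj: HeapObject, mark: (child: HeapObject) => void): void {
  obj.klass.pointerMap.forEach((isRef, i) => {
    if (isRef) mark(obj.fields[i] as HeapObject); // skip non-pointer words
  });
}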
The stack can be handled similarly: store the accurate marking information in data sections of the code itself, and have the garbage collector examine the stored program counter, and/or link pointers on the stack, and/or other information placed on the stack by the code for this purpose, to determine which code each bit of stack relates to and hence which words are pointers. Lightweight exception mechanisms tend to do something similar to record where try/catch occurs in the code, and of course debuggers need to be able to interpret the stack too, so this can quite possibly be folded in with a bunch of other stuff you'd already be doing to implement any language, including ones with built-in garbage collection.
Note that garbage collection doesn't necessarily need accurate marking. You could treat every word as a pointer, regardless of whether it really is or not, look it up in your garbage collector's "big list of everything" to decide whether it plausibly could refer to an object that has not yet been marked, and if so treat it as a reference to that object. This is simple, but the cost of course is that it's somewhere between "quite slow" and "very slow", depending on what data structures your gc uses for the lookup. Furthermore, sometimes an integer just so happens to have the same value as the address of an unreferenced object, and causes you to keep a whole bunch of objects which should have been collected. So such a garbage collector cannot offer strong guarantees about unreferenced objects ever being collected. This might be fine for a toy implementation or first working version, but is unlikely to be popular with users.
A mixed approach might, say, do accurate marking of objects, but not of regions of the stack where things get particularly hairy. For example if you write a JIT which can create code where a referenced object address appears only in registers, not in your usual stack slots, then you might need to non-accurately follow the region of the stack where the OS stored the registers when it descheduled the thread in question to run the garbage collector. Which is probably quite fiddly, so a reasonable approach (potentially resulting in slower code) would be to require the JIT to always keep a copy of all pointer values it's using on the accurately marked stack.
In Squeak (and Scheme and many other dynamic languages, I guess) you have SmallInteger, the class of signed 31-bit integers, and classes for arbitrarily big integers, e.g. LargePositiveInteger. There could very well be other representations, e.g. 64-something-bit integers, either as full objects or with a couple of bits as "I'm not a pointer" flags.
But arithmetic methods are coded to handle overflow and underflow, such that if you add one to SmallInteger maxVal (2^30 - 1), you get 2^30 as an instance of LargePositiveInteger, and if you subtract one back from it, you get back 2^30 - 1 as a SmallInteger.
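A toy TypeScript sketch of the tag-bit scheme (the layout and names are illustrative, not Squeak's actual implementation): the low bit of a word marks a 31-bit fixnum, and arithmetic promotes to a boxed big integer on overflow.

type Value = number | { big: bigint }; // tagged word | boxed LargeInteger

const FIXNUM_MAX = 2 ** 30 - 1;                 // largest signed 31-bit payload
const isFixnum = (w: number) => (w & 1) === 1;  // low bit 1 = "not a pointer"
const tag = (n: number) => (n << 1) | 1;
const untag = (w: number) => w >> 1;            // arithmetic shift keeps the sign

function addOne(v: Value): Value {
  if (typeof v === "number" && isFixnum(v)) {
    const n = untag(v);
    // Overflow: promote to the boxed representation.
    return n === FIXNUM_MAX ? { big: BigInt(n) + 1n } : tag(n + 1);
  }
  return { big: (v as { big: bigint }).big + 1n };
}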