For example, say one were to include a 'delete' keyword in C# 4. Would it be possible to guarantee that you'd never have wild pointers, but still be able to rely on the garbage collector, thanks to the reference-based system?
The only way I could see it possibly happening is if, instead of references to memory locations, a reference were an index into a table of pointers to the actual objects. However, I'm sure there'd be some condition where that would break, making it possible to violate type safety or end up with dangling pointers.
EDIT: I'm not talking about just .net. I was just using C# as an example.
You can - kind of: make your object disposable, and then dispose it yourself.
A manual delete is unlikely to improve memory performance in a managed environment. It might help with unmanaged resources, which is what Dispose is all about.
I'd rather have implementing and consuming disposable objects made easier. I have no consistent, complete idea of what this should look like, but managing unmanaged resources is a verbose pain under .NET.
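For reference, a minimal sketch of the usual pattern in today's C# (NativeBuffer and its AllocHGlobal-backed handle are illustrative placeholders, not anything mandated by .NET):

using System;
using System.Runtime.InteropServices;

// Minimal sketch of the standard dispose pattern. The class name and
// the unmanaged handle are illustrative placeholders.
class NativeBuffer : IDisposable
{
    private IntPtr handle;   // the unmanaged resource
    private bool disposed;

    public NativeBuffer(int size)
    {
        handle = Marshal.AllocHGlobal(size);
    }

    public void Dispose()
    {
        if (disposed) return;
        Marshal.FreeHGlobal(handle);
        disposed = true;
        GC.SuppressFinalize(this);   // the finalizer no longer needs to run
    }

    // Safety net in case the consumer forgets to call Dispose.
    ~NativeBuffer()
    {
        if (!disposed) Marshal.FreeHGlobal(handle);
    }
}

Consumers would wrap it in a using block - correct, but as noted, verbose.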
An idea for implementing delete:
delete tags an object for manual deletion. At the next garbage collection cycle, the object is removed and all references to it are set to null.
It sounds cool at first (at least to me), but I doubt it would be useful.
This isn't particularly safe, either - e.g. another thread might be busy executing a member method of that object; such a method would need to throw, e.g. when accessing object data.
With garbage collection, as long as you hold a reference to the object, it stays alive. With a manual delete you can't guarantee that.
Example (pseudocode):
obj1 = new instance;
obj2 = obj1;
// ... sometime later ...
delete obj2;
// obj1 now references the twilight zone - a dangling reference.
Just to be short, combining manual memory management with garbage collection defeats the purpose of GC. Besides, why bother? And if you really want to have control, use C++ and not C#. ;-).
The best you could get would be a partition into two "hemispheres" where one hemisphere is managed and can guarantee the absence of dangling pointers. The other hemisphere has explicit memory management and gives no guarantees. These two can coexist, but no, you can't extend your strong guarantees to the second hemisphere. All you could do is track all pointers: if one gets deleted, all other pointers to the same instance could be set to null. Needless to say, this is quite expensive. Your table would help, but it introduces other costs (double indirection).
Chris Sells also discussed this on .NET Rocks. I think it was during his first appearance but the subject might have been revisited in later interviews.
http://www.dotnetrocks.com/default.aspx?showNum=10
My first reaction was: why not? I can't imagine that what you want to do is something as obscure as just leaving an unreferenced chunk out on the heap to find again later on - as if a four-byte pointer to the heap were too much to keep around to track this chunk.
So the issue is not leaving unreferenced memory allocated, but intentionally disposing of memory still in reference. Since garbage collection performs the function of marking the memory free at some point, it seems that we should just be able to call an alternate sequence of instructions to dispose of this particular chunk of memory.
However, the problem lies here:
String s = "Here is a string.";
String t = s;
String u = s;
junk( s );
What do t and u point to? In a strict reference system, t and u should now be null. So that means you would have to do not only reference counting, but perhaps reference tracking as well.
However, I can see that you should be done with s at this point in your code. So junk could set the reference to null and pass it to the sweeper with a sort of priority code. The GC could be activated for a limited run, and the memory freed only if it is no longer reachable. So we can't explicitly free anything that somebody has coded to use again in some way, but if s is the only reference, the chunk is deallocated.
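The closest approximation in today's C# is to null the reference and nudge the collector yourself; GC.Collect(0) is the real API for a limited, generation-0-only run, while junk itself remains hypothetical:

String s = "Here is a string.";
String t = s;

s = null;        // drop this reference; t still keeps the object alive
GC.Collect(0);   // a limited run: collect only generation 0
// The string survives because t still reaches it - exactly the
// "freed only if not reachable" behavior described above.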
So, I think it would only work with a limited adherence to the explicit side.
It's possible, and already implemented, in non-managed languages such as C++. Basically, you implement or use an existing garbage collector: when you want manual memory management, you call new and delete as normal, and when you want garbage collection, you call GC_MALLOC or whatever the function or macro is for your garbage collector.
See http://www.hpl.hp.com/personal/Hans_Boehm/gc/ for an example.
Since you were using C# as an example, maybe you only had in mind implementing manual memory management in a managed language, but this is to show you that the reverse is possible.
If the semantics of delete on an object's reference were to make all other references to that object null, then you could do it with two levels of indirection (one more than you hint at). Note, though, that while the underlying object would be destroyed, a fixed amount of information (enough to hold a reference) must be kept alive on the heap.
All references the user sees would reference a hidden reference (presumably living on the heap) to the real object. When performing an operation on the object (such as calling a method or relying on its identity, such as with the == operator), the reference the programmer uses would dereference the hidden reference it points to. When deleting an object, the actual object would be removed from the heap, and the hidden reference would be set to null. Thus the references programmers see would evaluate to null.
It would be the GC's job to clean out these hidden references.
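A sketch of that scheme in ordinary C# - Handle<T>, Cell, and the method names are all illustrative, and a real implementation would hide the extra hop behind the language:

using System;

// User code holds Handle<T> values; the shared Cell is the hidden
// reference. Delete() nulls the Cell, so every alias observes it.
sealed class Handle<T> where T : class
{
    private sealed class Cell { public T Target; }
    private readonly Cell cell;

    public Handle(T target) { cell = new Cell { Target = target }; }
    private Handle(Cell shared) { cell = shared; }

    public Handle<T> Alias() => new Handle<T>(cell);  // copying a "reference"

    public T Deref() => cell.Target
        ?? throw new InvalidOperationException("object was deleted");

    public void Delete() { cell.Target = null; }  // GC later reclaims the Cell
}

After Delete(), every alias obtained via Alias() throws on Deref(): the hidden Cell is exactly the fixed amount of information that must stay alive on the heap until the GC cleans it out.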
This would help in situations with long-lived objects. Garbage collection works well when objects are used for short periods of time and become unreferenced quickly. The problem is when some objects live for a long time: the only way to clean them up is a resource-intensive full garbage collection.
In these situations, things would be much easier if there were a way to explicitly delete objects, or at least a way to move a graph of objects back to generation 0.
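You can watch the promotion problem with today's real API (GC.GetGeneration and GC.Collect exist as written; the scenario itself is just an illustration):

object survivor = new object();
GC.Collect();   // survivor gets promoted out of generation 0...
GC.Collect();   // ...and again, typically landing in generation 2
Console.WriteLine(GC.GetGeneration(survivor));  // usually prints 2
// From here on, only a full (gen 2) collection can reclaim it;
// there is no supported way to move it back to generation 0.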
Yes ... but with some abuse.
C# can be abused a little to make that happen.
If you're willing to play around with the Marshal class, StructLayout attribute and unsafe code, you could write your very own manual memory manager.
You can find a demonstration of the concept here: Writing a Manual Memory Manager in C#.
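Without reproducing that article's code, here is a small taste of the ingredients it names - Marshal, StructLayout, and unmanaged allocation (the Point struct is purely for show):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Point { public int X, Y; }

class Demo
{
    static void Main()
    {
        // Allocate raw unmanaged memory, marshal a struct in and out, free it.
        IntPtr mem = Marshal.AllocHGlobal(Marshal.SizeOf<Point>());
        Marshal.StructureToPtr(new Point { X = 1, Y = 2 }, mem, false);
        Point p = Marshal.PtrToStructure<Point>(mem);
        Marshal.FreeHGlobal(mem);   // manual "delete" - no GC involved
        Console.WriteLine($"({p.X}, {p.Y})");
    }
}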
Related
Which garbage collection algorithms can recognize garbage objects as soon as they become garbage?
The only thing which comes to my mind is reference counting with an added cycle-search every time a reference count is decremented to a non-zero value.
Are there any other interesting collection algorithms which can achieve that? (Note that I'm asking out of curiosity only; I'm aware that all such collectors would probably be incredibly inefficient)
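A bare-bones sketch of the reference-counting half of that idea (the cycle search is left as a comment; RefCounted and its members are illustrative, not any library API):

sealed class RefCounted<T> where T : class
{
    private T target;
    private int count = 1;   // the creator holds the first reference

    public RefCounted(T target) { this.target = target; }

    public void AddRef() => count++;

    public void Release()
    {
        if (--count == 0)
            target = null;   // reclaimed the instant it becomes garbage
        // else: this is where the cycle search would kick in, since a
        // cycle can keep counts above zero forever.
    }
}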
Though not being a garbage collection algorithm, escape analysis allows reasoning about life-time of objects. So, if efficiency is a concern and objects should be collected not in all but in "obvious" cases, it can be handy. The basic idea is to perform static analysis of the program (at compile time or at load time if compiled for a VM) and to figure out if a newly created object may escape a routine it is created in (hence the name of the analysis). If the object is not passed anywhere else, not stored in global memory, not returned from the given routine, etc., it can be released before returning from this routine or even earlier, at the place of its last use.
The objects that do not live longer than the associated method call can be allocated on the stack rather than on the heap, so they are removed from the garbage collection cycle at compile time, thus lowering pressure on the general GC.
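C# exposes the stack-allocation half of this idea manually via stackalloc; there is no automatic escape analysis here, the programmer asserts that the buffer does not escape:

static void Process()
{
    // scratch provably cannot escape Process, so it lives on the stack
    // and never adds pressure on the garbage collector.
    Span<byte> scratch = stackalloc byte[256];
    scratch.Fill(0x2A);
    // ... use scratch; it is reclaimed for free when Process returns ...
}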
Such a mechanism would be called "heap management", not garbage collection.
By definition, garbage collection is decoupled from heap management. This is because in some environments/applications it is more efficient to skip doing a "free" operation and keeping track of what is in use. Instead, every once in a while, you just go around, gather all the unreferenced nodes, and put them back on the free list.
== Addendum ==
I am being downvoted for attempting to correct the terminology of heap management with garbage collection. The Wikipedia article agrees with my usage, and with what I learned at university, even though that was several decades ago. Languages such as Lisp and Snobol invented the need for garbage collection. Languages such as C don't provide such a heavy-duty runtime environment; instead, they rely on the programmer to manage the cleanup of unused bits of memory and resources.
On collection, the garbage collector copies all live objects into another memory space, thus discarding all garbage objects in the process. A forward pointer to the copied object in new space is installed into the 'old' version of an object to ensure the collector updates all remaining references to the object correctly and doesn't erroneously copy the same object twice.
This obviously works quite well for stop-the-world-collectors. However, since pause times are long with stop-the-world, nowadays most garbage collectors allow the mutator threads to run concurrently with the collector, only stopping the mutators for a short time to do the initial stack scan.
So how can the collector ensure that the 'old' version of an object is not accessed by the mutator while/after copying it? I imagine the mutators could check for the forward pointer with some sort of read barrier; however, this seems too costly to me since variables are read so often.
The Loaded Value Barrier implemented in Azul's Generational Pauseless Garbage Collector is an example of a solution to this problem. You can read about it in the article The Azul Garbage Collector posted on InfoQ in early 2011.
You pretty much need to use a read barrier or a write barrier. You're apparently already aware of read barriers so I won't try to get into them.
Write barriers work because as long as you prevent writes from happening, you simply don't care whether somebody accesses the old or the new copy of the data. You set the write barrier, copy the data, and then start adjusting pointers. After the copy is made, you don't really care whether somebody reads the old or the new copy, because the write barrier ensures they're identical. Once you're done adjusting pointers, everything will be working with the new data, so you revoke the write barrier.
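A toy rendering of that protocol, where a lock stands in for the real write barrier and all names are illustrative:

// 1. raise the barrier (block writes), 2. copy, 3. swap pointers, 4. drop it.
sealed class CopyingSlot
{
    private readonly object writeBarrier = new object();
    private int[] data = new int[16];

    public void Write(int i, int v)
    {
        lock (writeBarrier) { data[i] = v; }   // writes wait during a copy
    }

    public int Read(int i) => data[i];         // old or new copy: identical either way

    public void Relocate()
    {
        lock (writeBarrier)                    // barrier up: no writes can land
        {
            var newCopy = (int[])data.Clone(); // copy while contents are frozen
            data = newCopy;                    // "adjust pointers" to the new copy
        }                                      // barrier down
    }
}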
There has been some work done with using page protection bits to mark an area of memory as read-only to create a write-barrier on fairly standard hardware. At least the last time I looked into it, however, this was still pretty much at a proof of concept stage -- working, but too slow to be very practical.
Disclaimer: this is related to Java's Shenandoah GC only.
Your reasons are absolutely spot on for Shenandoah! Some details here for example.
In the not so long ago days, Shenandoah had write and read barriers for all primitive and reference types. The read barrier was actually just a single indirection via the forwarding pointer, as you assume. Because reads so vastly outnumber writes (the write barrier was a much more complicated one), the read barriers were more expensive, cumulative-time-wise, than the write ones, simply because there are so darn many of them.
But things changed in jdk-13, when a Load Reference Barrier was implemented. Now only loads have a barrier; writes happen the usual way. If you think about it, this makes perfect sense: in order to write to an object's field, you need to read that object first, so if your barrier preserves the "to-space invariant", you will always read from the most recent and correct copy of the object, without the need for a read barrier with a forwarding pointer on every access.
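A schematic of the forwarding-pointer indirection in C#, illustrating the general Brooks-pointer idea rather than Shenandoah's actual implementation:

// Every object carries a forwarding slot that points to itself until the
// collector copies the object, after which it points at the new copy.
class ObjectHeader
{
    public ObjectHeader Forward;
    public ObjectHeader() { Forward = this; }
}

static class LoadBarrier
{
    // Executed on every reference load: one unconditional extra hop.
    // After a copy, this transparently yields the to-space version.
    public static ObjectHeader Read(ObjectHeader reference) => reference.Forward;
}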
While I think I understand the gist of the problem (i.e. a good GC tracks objects, not scope), I don't know enough about the subject to convince others.
Can you give me an explanation on why there are no garbage-collected languages with deterministic destructors?
They are NOT mutually exclusive. Feel free to use C++ with libgc (Boehm-Reiser-Detlefs collector). You can still use RAII, smart pointers, and manual deletion, but with the GC running you can also just "forget" to delete some objects.
@Andy's answer regarding resources being disposed too late misses the important point: it isn't the delay in releasing resources which is crucial semantically, but rather the order of release.
The reason GC tends not to order release well is that it would require a topological sort on ordering requirements (dependencies) and that's an expensive algorithm.
Nevertheless, the OCaml GC has an interesting facility where you can attach a finaliser to an object. If the object becomes unreachable, the finaliser is run; however, the object is not deleted (because the finaliser could make it reachable again - in that case you can even attach another finaliser). These finalisers can provide some control over ordering.
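C# has a rough counterpart: a finalizer runs once the object is unreachable and may resurrect it, and GC.ReRegisterForFinalize (a real API) asks for another finalization later; the Phoenix class is illustrative:

class Phoenix
{
    public static Phoenix LastResurrected;   // illustrative resurrection slot

    ~Phoenix()
    {
        // Making the object reachable again "cancels" its death,
        // much like an OCaml finaliser can.
        LastResurrected = this;
        GC.ReRegisterForFinalize(this);  // ask to be finalized again later
    }
}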
From Wikipedia, after noting that tracing garbage collectors are the most common type:
Tracing garbage collection is not deterministic. An object which becomes eligible for garbage collection will usually be cleaned up eventually, but there is no guarantee when (or even if) that will happen.
Therefore, relying on RAII could lead to the resource being disposed of too late.
As a result, for example, Java has a guideline to "avoid finalizers" (Item 6 in "Effective Java" by Joshua Bloch): "Nothing time-critical should ever be done in a finalizer."
The garbage collector can't run all the time (refcounting gets closer, but generally doesn't count as garbage collection), so it doesn't even try; it's plainly impractical. Therefore, there is an inevitable delay between an object becoming unreachable (e.g. because the only reference goes out of scope) and the GC collecting it, possibly firing a finalizer. This delay is not deterministic... unless you force the GC into a deterministic schedule (and then deterministic destruction, in the strictest sense of the word, is possible, although still impractical) - but this gets pretty close to "GC running all the time", which is still incredibly impractical.
So GC and deterministic cleanup are mutually exclusive because the GC does all the cleanup, and it cannot afford to be deterministic; it must instead maximize its efficiency.
Let's say you want to write a high-performance method which processes a large data set.
Why shouldn't developers have the ability to turn on manual memory management instead of being forced to move to C or C++?
void Process()
{
    unmanaged
    {
        Byte[] buffer;
        while (true)
        {
            buffer = new Byte[1024000000];
            // process
            delete buffer;
        }
    }
}
Because allowing you to manually delete a memory block while there may still be references to it (and the runtime has no way of knowing that without doing a GC cycle) can produce dangling pointers, and thus break memory safety. GC languages are generally memory-safe by design.
That said, in C#, in particular, you can do what you want already:
// requires: using System; using System.Runtime.InteropServices; and /unsafe
void Process()
{
    unsafe
    {
        byte* buffer;
        while (true)
        {
            // AllocHGlobal returns an IntPtr, so cast it to a raw pointer
            buffer = (byte*)Marshal.AllocHGlobal(1024000000);
            // process
            Marshal.FreeHGlobal((IntPtr)buffer);
        }
    }
}
Note that, as in C/C++, you have full pointer arithmetic for raw pointer types in C# - so buffer[i] or buffer+i are valid expressions.
If you need high performance and detailed control, maybe you should write what you're doing in C or C++. Not all languages are good for all things.
Edited to add: A single language is not going to be good for all things. If you add up all the useful features in all the good programming languages, you're going to get a really big mess, far worse than C++, even if you can avoid inconsistency.
Features aren't free. If a language has a feature, people are likely to use it. You won't be able to learn C# well enough without learning the new C# manual memory management routines. Compiler teams are going to implement it, at the cost of other compiler features that are useful. The language is likely to become difficult to parse like C or C++, and that leads to slow compilation. (As a C++ guy, I'm always amazed when I compile one of our C# projects. Compilation seems almost instantaneous.)
Features conflict with each other, sometimes in unexpected ways. C90 can't do as well as Fortran at matrix calculations, since the possibility that C pointers are aliased prevents some optimizations. If you allow pointer arithmetic in a language, you have to accept its consequences.
You're suggesting a C# extension to allow manual memory management, and in a few cases that would be useful. That would mean that memory would have to be allocated in separate ways, and there would have to be a way to tell manually managed memory from automatically managed memory. Suddenly, you've complicated memory management, there's more chance for a programmer to screw up, and the memory manager itself is more complicated. You're gaining a little performance that matters in a few cases in exchange for more complication and slower memory management in all cases.
It may be that at some time we'll have a programming language that's good for almost all purposes, from scripting to number crunching, but there's nothing popular that's anywhere near that. In the meantime, we have to be willing to accept the limitations of using only one language, or the challenge of learning several and switching between them.
In the example you posted, why not just erase the buffer and re-use it?
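Something like the following - ProcessingNeeded and Process are placeholders for whatever the loop actually does:

// Allocate once (this lands on the large object heap) and reuse.
byte[] buffer = new byte[1024000000];
while (ProcessingNeeded())
{
    Array.Clear(buffer, 0, buffer.Length);  // "erase" instead of delete
    Process(buffer);
}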
The .NET garbage collector is very very good at working out which objects aren't referenced anymore and freeing the associated memory in a timely manner. In fact, the garbage collector has a special heap (the large object heap) in which it puts large objects like this, that is optimized to deal with them.
On top of this, not allowing references to be explicitly freed simply removes a whole host of bugs with memory leaks and dangling pointers, that leads to much safer code.
Freeing each unused block individually, as done in a language with explicit memory management, can be more expensive than letting a garbage collector do it, because the GC can use copying schemes that spend time linear in the number of blocks left alive (or the number of recent blocks left alive) instead of having to handle each dead block.
The same reason most kernels won't let you schedule your own threads: because 99.99+% of the time you don't really need to, and exposing that functionality the rest of the time will only tempt you to do something potentially stupid/dangerous.
If you really need fine grain memory control, write that section of code in something else.
The fixnum question brought to mind another question I've wondered about for a long time.
Much of the online material about garbage collection does not explain how runtime type information can be implemented. Therefore I know lots about all sorts of garbage collectors, but not really how to implement them.
The fixnum solution is actually quite nice: it's very clear which values are pointers and which aren't. What other commonly used solutions for storing type information are there?
Also, I wonder about the fixnum thing. Doesn't it mean that you are limited to fixnums for every array index? Or is there some sort of workaround for getting full 64-bit integers?
Basically to achieve accurate marking you need meta-data indicating which words are used as pointers and which are not.
This meta-data could be stored per reference, as Emacs does. If, for your language/implementation, you don't care much about memory use, you could even make references bigger than words (perhaps twice as big), so that every reference can carry type information as well as its one-word data. That way you could have a fixnum the full size of a 32-bit pointer, at the cost of all references being 64-bit.
Alternatively, the meta-data could be stored along with other type information. So for example a class could contain, as well as the usual function pointer table, one bit per word of the data layout indicating whether or not the word contains a reference that should be followed by the garbage collector. If your language has virtual calls then you must already have a means of working out from an object what function addresses to use, so the same mechanism will allow you to work out what marking data to use - typically you add an extra, secret pointer at the start of every single object, pointing to the class which constitutes its runtime type. Obviously with certain dynamic languages the type data pointed to would need to be copy-on-write, since it is modifiable.
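As a sketch, the per-class marking data could be as simple as a bitmap over the object's words (ClassInfo and the 64-word cap are simplifications for illustration):

using System.Collections.Generic;

// One bit per word of the object layout: bit i set means word i holds
// a reference the collector must follow.
sealed class ClassInfo
{
    public ulong PointerBitmap;
}

static class Marker
{
    public static IEnumerable<int> ReferenceSlots(ClassInfo ci)
    {
        for (int i = 0; i < 64; i++)
            if ((ci.PointerBitmap & (1UL << i)) != 0)
                yield return i;   // the GC traces the word at this offset
    }
}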
The stack can do similar - store the accurate marking information in data sections of the code itself, and have the garbage collector examine the stored program counter, and/or link pointers on the stack, and/or other information placed on the stack by the code for the purpose, to determine which code each bit of stack relates to and hence which words are pointers. Lightweight exception mechanisms tend to do a similar thing to store information about where try/catch occurs in the code, and of course debuggers need to be able to interpret the stack too, so this can quite possibly be folded in with a bunch of other stuff you'd already be doing to implement any language, including ones with built-in garbage collection.
Note that garbage collection doesn't necessarily need accurate marking. You could treat every word as a pointer, regardless of whether it really is or not, look it up in your garbage collector's "big list of everything" to decide whether it plausibly could refer to an object that has not yet been marked, and if so treat it as a reference to that object. This is simple, but the cost of course is that it's somewhere between "quite slow" and "very slow", depending on what data structures your gc uses for the lookup. Furthermore, sometimes an integer just so happens to have the same value as the address of an unreferenced object, and causes you to keep a whole bunch of objects which should have been collected. So such a garbage collector cannot offer strong guarantees about unreferenced objects ever being collected. This might be fine for a toy implementation or first working version, but is unlikely to be popular with users.
A mixed approach might, say, do accurate marking of objects, but not of regions of the stack where things get particularly hairy. For example if you write a JIT which can create code where a referenced object address appears only in registers, not in your usual stack slots, then you might need to non-accurately follow the region of the stack where the OS stored the registers when it descheduled the thread in question to run the garbage collector. Which is probably quite fiddly, so a reasonable approach (potentially resulting in slower code) would be to require the JIT to always keep a copy of all pointer values it's using on the accurately marked stack.
In Squeak (and Scheme and many other dynamic languages, I guess) you have SmallInteger, the class of signed 31-bit integers, and classes for arbitrarily big integers, e.g. LargePositiveInteger. There could very well be other representations: 64-something-bit integers either as full objects or with a couple of bits as "I'm not a pointer" flags.
But the arithmetic methods are coded to handle overflow and underflow, such that if you add one to SmallInteger maxVal (2^30 - 1), you get 2^30 as an instance of LargePositiveInteger, and if you subtract one back from it, you get back 2^30 - 1 as a SmallInteger.
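The same promote-on-overflow dance, mimicked with C# types (long stands in for SmallInteger and System.Numerics.BigInteger for LargePositiveInteger; real Squeak does this transparently in the arithmetic primitives):

using System.Numerics;

long small = (1L << 30) - 1;                  // SmallInteger maxVal analogue
BigInteger promoted = (BigInteger)small + 1;  // overflows into a "LargePositiveInteger"
long demoted = (long)(promoted - 1);          // fits again: back to a "SmallInteger"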