Tri-Color Incremental Updating GC: Does it need to scan each stack twice? - language-agnostic

Let me give you a short introduction to a tri-color GC (in case somebody reads it who has never heard of it); if you don't care, skip it and jump to The Problem.
How a Tri-Color GC Works
In a tri-color GC an object has one of three possible colors: white, gray and black. A tri-color GC can be described as follows:
1. All objects are initially white.
2. All objects reachable because a global variable or a stack variable refers to them ("the root objects") are colored gray.
3. We take any gray object, find all references it has to white objects and color those white objects gray. Then we color the object itself black.
4. We continue at step 3 as long as we have gray objects.
5. Once we have no gray objects any longer, all remaining objects are either white or black.
6. All black objects have been proven to be reachable and must stay alive. All white objects are unreachable and can be deleted.
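For concreteness, here is a minimal sketch of that marking loop in C-like code. The Object layout, the refs array and the explicit gray work-list are just illustrative assumptions, not taken from any particular collector:

#include <stddef.h>

typedef enum { WHITE, GRAY, BLACK } Color;

typedef struct Object {
    Color           color;
    size_t          num_refs;
    struct Object **refs;              /* outgoing references */
} Object;

/* Explicit work-list of gray objects (bounds checking omitted for brevity). */
#define MAX_GRAY 1024
static Object *gray_stack[MAX_GRAY];
static size_t  gray_top;

static void shade(Object *obj) {       /* white -> gray */
    if (obj != NULL && obj->color == WHITE) {
        obj->color = GRAY;
        gray_stack[gray_top++] = obj;
    }
}

void mark(Object **roots, size_t num_roots) {
    for (size_t i = 0; i < num_roots; i++)
        shade(roots[i]);               /* step 2: gray the root objects */

    while (gray_top > 0) {             /* steps 3-4: drain the gray set */
        Object *obj = gray_stack[--gray_top];
        for (size_t i = 0; i < obj->num_refs; i++)
            shade(obj->refs[i]);       /* gray every white object it refers to */
        obj->color = BLACK;            /* this object is now fully scanned */
    }
    /* Steps 5-6: anything still WHITE is unreachable and can be deleted. */
}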
So far this is not too complicated… at least if the GC is StW (Stop the World), meaning it will pause all threads while collecting garbage. If it is concurrent, a tri-color GC has an invariant that must hold true at all times:
A black object must not refer to a white object!
This holds true automatically for a StW GC, since every object that is colored black has been examined previously and all white objects it was pointing to were colored gray, thus a black object may only refer to other black objects or gray objects.
If threads are not paused, they can execute code that would break this invariant. There are several ways to prevent this:
Capture all read accesses to pointers and check whether the read targets a white object. If it does, color that object gray immediately. If a ref to this object is later assigned to a black object, it won't matter: the object is gray, not white, by then (this implementation uses a read barrier).
Capture all write accesses to pointers and check whether the assigned object is white and the object it is assigned into is black. If so, color the white object gray. This is the more obvious way of doing things, but it also needs a bit more processing time each time the barrier is hit (this implementation uses a write barrier).
Since read accesses are much more common than write accesses, the second possibility is the favored one: even though it involves more processing time when the barrier is hit, it is hit far less often. A GC working like that is called an "incremental updating GC".
There is an alternative to both techniques, called SatB (Snapshot at the Beginning). This variation works slightly differently, exploiting the fact that it is not really necessary to uphold the invariant at all times: it does not matter if a black object refers to a white one, as long as the GC knows that this white object used to be, and still is, accessible during the current GC cycle (either because there are still gray objects referring to this white object as well, or because a ref to this white object is put onto an explicit stack that is also considered by the GC when it runs out of gray objects). SatB collectors are used more often in practice because they have some advantages, but IMHO they are harder to implement.
I'm referring here to an incremental updating GC that uses variant 2: whenever the code tries to make a black object point to a white object, it immediately colors that white object gray. That way the object won't be missed during the collection cycle.
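To make variant 2 concrete, here is a hedged sketch of such a write barrier, building on the Object type and the shade() helper from the sketch above (write_ref is a hypothetical helper a compiler or runtime would emit for every pointer store, not a real API):

/* Dijkstra-style incremental-update write barrier: never let a black
   object end up pointing to a white one. */
static void write_ref(Object *holder, Object **field, Object *new_target) {
    if (holder->color == BLACK && new_target != NULL &&
        new_target->color == WHITE) {
        shade(new_target);             /* the white object becomes gray */
    }
    *field = new_target;               /* the actual pointer store */
}

Under this scheme, a mutator store such as A.ref = B would be compiled to something like write_ref(A, &A->ref, B).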
The Problem
So much about tri-color GCs. But there is one thing I don't understand about tri-color GCs. Let's assume we have an object A, that is referred to by the stack and itself refers to an object B.
stack -> A.ref -> B
Now the GC starts a cycle, halts the thread, scans the stack, sees A as directly accessible and colors A gray. Once it is done scanning the whole stack, it unpauses the thread again and starts processing at step 3. Before it gets anything done, the GC is preempted (which can happen) and the thread runs again and executes the following code:
localRef = A.ref; // localRef points to B
A.ref = NULL; // Now only the stack points to B
sleep(10000); // Sleep for the whole GC cycle
Since the invariant has not been violated (B was white but was never assigned to a black object), B's color has not changed: it is still white. A no longer refers to B, so while processing the "gray" A, B won't change its color, and A becomes black. At the end of the cycle B is still white and looks like garbage. However, localRef is referring to B, so it is not garbage.
The Question
Am I right that a tri-color GC must scan the stack of each thread twice? Once at the very beginning, to identify root objects (coloring them gray), and again before deleting the white objects, since those might still be referenced by the stack even though no other object refers to them any longer. No description of the algorithm I've seen so far mentions anything about scanning the stack twice. They all only say that, when used concurrently, it is important that the invariant is enforced at all times, otherwise reachable objects are missed. But as far as I can see, that is not enough. The stack must be treated like a single big object: once scanned, the "stack is black", and every ref update on the stack must cause the referenced object to be colored gray.
If that is really the case, using incremental updating may be trickier than I initially thought and has some performance drawbacks, since stack writes are the most frequent writes of all.

A bit of terminology:
Let me give some names so that explanations are clearer.
A variable is any slot for data, which may contain a pointer and may change over time. This includes global variables, local variables, CPU registers and fields in allocated objects.
In a tricolor incremental or concurrent GC, there are three types of variables:
the true roots, which are always accessible (CPU registers, global variables);
the fast variables, which are scanned in a stop-the-world fashion;
the slow variables, which are handled with the colors. Slow variables are fields in colored objects.
The "true roots" and "fast variables" will be hereafter collectively called roots.
The application threads are called the mutators because they change the contents of variables.
With an incremental or concurrent GC, GC pauses occur regularly. The world is stopped (mutators are paused), and the roots are scanned. This scan reveals a number of references to colored objects, and object colors are adjusted accordingly (white objects referenced from the roots are made grey).
When the GC is incremental, some object scanning activity takes place: some grey objects are scanned (and painted black), greying referenced white objects. This activity (the "marking") is maintained for some time, but not necessarily as long as there are grey objects. At some point, the marking stops and the world is awakened. The GC is called "incremental" because the GC cycle is performed in small increments, interleaved with mutator activity.
In a concurrent GC, scanning of grey objects occurs concurrently with mutator activity. The world is then awakened as soon as the roots have been scanned. With a concurrent GC, access barriers are a tad complex to implement because they must handle concurrent access from the GC thread; but at a conceptual level, this is not very different from an incremental GC. A concurrent GC can be viewed as an optimization over incremental GC, which takes advantage of the presence of multiple CPU cores (a concurrent GC has little advantage over an incremental GC when there is only one core).
Roots need not be protected by an access barrier, since they are scanned with the world stopped. The GC mark phase ends when the following conditions are simultaneously met:
the roots have just been scanned;
all objects are either black or white, but not grey.
Since the roots are scanned only while the world is stopped, this situation can occur only during a pause. At that point the sweep phase begins, during which white objects are released. The sweep can be done incrementally or concurrently; objects created during the sweep are immediately painted black. When the sweep is finished, a new GC mark phase can take place: objects (which are all black at that point) are all repainted white (this is done atomically by simply changing the way the color bits are interpreted).
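To illustrate, here is a rough sketch of an incremental mark-phase driver that implements this termination rule; all of the helpers, the MARK_BUDGET constant and the gray_top counter are hypothetical, not a real GC API:

#include <stddef.h>

void   stop_the_world(void);       /* pause all mutator threads */
void   resume_the_world(void);     /* let the mutators run again */
void   scan_roots(void);           /* registers, globals, fast variables */
void   mark_some(size_t budget);   /* scan a bounded amount of grey objects */
extern size_t gray_top;            /* number of pending grey objects */

#define MARK_BUDGET 4096

void mark_phase(void) {
    for (;;) {
        stop_the_world();
        scan_roots();              /* may turn some white objects grey */
        if (gray_top == 0)         /* roots just scanned AND nothing grey */
            break;                 /* marking is complete */
        mark_some(MARK_BUDGET);    /* incremental marking, world still stopped */
        resume_the_world();        /* mutators run until the next pause */
    }
    /* The world is still stopped here; the sweep of white objects can
       now begin (incrementally or concurrently). */
    resume_the_world();
}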
Variable Classification:
With that being said, I can now answer your question. With the description above, the question becomes: what are the roots? This is actually up to the implementation; there are several possibilities and trade-offs.
True roots must always be scanned; true roots are the CPU register contents and the global variables. Note that the stacks are not true roots; only the current stack frame pointer is.
Since fast variables are accessed without barriers, it is customary to make stack frames fast variables (i.e. roots as well). This is because while write accesses are rare system-wide, they are quite common in the local variables. It has been measured (on some Lisp programs) that about 99% of writes (of a pointer value) have a local variable as target.
Fast variables are often extended even further, in the case of a generational GC: the "young generation" consists of a special allocation area for new objects, limited in size, and scanned as fast variables. The bright side of fast variables is fast access (hence the name); the downside is that all these fast variables may be scanned only during a pause (the world is stopped). There is a trade-off on the size of the fast variables, which often translates into a limit on the young generation size. A larger young generation promotes average performance (by reducing the number of access barriers) at the cost of longer pauses.
At the other extreme, you may have no fast variables at all, and no roots but the true roots. The stack frames are then handled as objects, each with their own color. Pauses are then minimal (a mere snapshot of a dozen registers) but barriers must be used even for access to local variables. This is expensive, but has some advantages:
Hard guarantees on pause times can be made. This is difficult if stack frames are roots, because each new thread has its own stack, so the total size of the roots may grow to arbitrary amounts as new threads are launched. If only CPU registers and global variables (no more than a few dozen in typical cases, and their number is known at compilation time) are roots, then pauses can be kept very short.
This allows for dynamic allocation of stack frames in the heap. This is needed if you play with co-routines and continuations, as with Scheme's call/cc primitive. In such a case, frames are no longer handled as a pure "stack". Proper handling of continuations in a GC-aware language mostly requires that function frames be allocated dynamically.
It is possible to make stack frames non-root while keeping a young generation as root. Guarantees on pause times can still be made (depending on the young generation size, which is fixed) and some trickery can be applied to make sure that stack frames are in the young generation when their function is active. This can ensure barrier-free access to local variables. None of this is really free, but it can be made efficient enough for most purposes.
Another Conceptual View:
Another way to view root-handling is the following: roots are the variables for which the tricolor rule (no black-to-white pointer) is not maintained at all times; these variables are allowed to be mutated without constraint. But they must be brought back in line regularly, by stopping the world and scanning them.
In practice, the mutators are racing with the GC. Mutators create new objects, and point to them; each pause creates new grey objects. In a concurrent or incremental GC, if you let the mutators play with the roots for too long, then each pause may create a big batch of new grey objects. In the worst case, the GC cannot scan objects fast enough to keep up with the rate of grey object creation. This is an issue because white objects can be released only during the sweep phase, which is reached only if the GC can at some point complete its marking. A usual implementation strategy for an incremental GC is to scan grey objects, during each pause, for a total size proportional to the total size of the roots. Thus, pause time remains bounded by the total size of the roots, and if the proportionality factor is well balanced then it can be guaranteed that the GC will ultimately terminate its marking phase and enter the sweep phase.
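As a hedged sketch of that strategy (MARK_FACTOR and scan_one_gray() are illustrative assumptions, not a real GC API):

#include <stddef.h>

size_t scan_one_gray(void);        /* scans one grey object, returns its size */
extern size_t gray_top;            /* number of pending grey objects */

#define MARK_FACTOR 4              /* tuning knob: marking work per root byte */

/* Per-pause marking increment whose work is proportional to the roots. */
void bounded_mark_increment(size_t total_root_bytes) {
    long budget = (long)(MARK_FACTOR * total_root_bytes);
    while (budget > 0 && gray_top > 0)
        budget -= (long)scan_one_gray();
}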
In a concurrent GC, things are a bit more complex, because the mutators roam freely in the wild. A possible implementation would make a little bit of incremental marking while the world is still stopped.
Bibliography:
Garbage Collection: Algorithms for Automatic Dynamic Memory Management: a must-read book on garbage collection.

Thomas obviously has the best answer. However, I'm just going to add a little side-note here.
The root nodes can be conceptually thought of as black nodes, because any object referred to by a root node must be grey or black.
Therefore, in order to maintain the invariant, assignment to a root variable should automatically grey the object.
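If an implementation chose to barrier root stores (rather than re-scanning the roots during pauses, as Thomas describes), that rule might look like the following sketch, reusing the shade() helper from the question's sketch; write_root is a hypothetical helper:

/* Treat roots as conceptually black: any object stored into a root is
   greyed immediately, so the invariant is preserved. */
void write_root(Object **root_slot, Object *new_target) {
    shade(new_target);             /* no-op if already grey or black */
    *root_slot = new_target;
}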

Related

Is a "single cycle cpu" possible if asynchronous components are used?

I have heard the term "single cycle CPU" and was trying to understand what it actually means. Is there a clear, agreed definition and consensus on what it means?
Some home brew "single cycle CPUs" I've come across seem to use both the rising and the falling edges of the clock to complete a single instruction. Typically, the rising edge acts as fetch/decode and the falling edge as execute.
However, in my reading I came across the reasonable point made here ...
https://zipcpu.com/blog/2017/08/21/rules-for-newbies.html
"Do not transition on any negative (falling) edges.
Falling edge clocks should be considered a violation of the one clock principle,
as they act like separate clocks.".
This rings true to me.
Needing both the rising and falling edges (or high and low phases) is effectively the same as needing the rising edge of two cycles of a single clock that's running twice as fast; and this would be a "two cycle" CPU, wouldn't it?
So is it honest to state that a design is a "single cycle CPU" when both the rising and falling edges are actively used for state change?
It would seem that a true single cycle cpu must perform all state changing operations on a single clock edge of a single clock cycle.
I can imagine such a thing is possible provided the data storage is all synchronous. If we have a synchronous system that has settled, then on the next clock edge we can clock the results into a synchronous data store and simultaneously clock the program counter on to the next address.
But if the target data store is, for example, async RAM, then surely the control lines would be changing while that data is being stored, leading to unintended behaviours.
Am I wrong, are there any examples of a "single cycle cpu" that include async storage in the mix?
It would seem that using async RAM in one's design means one must use at least two logical clock cycles to achieve the state change.
Of course, with some more complexity one could perhaps have a CPU that uses a single edge where instructions use solely synchronous components, but relies on an extra cycle when storing to async data; but then that still wouldn't be a single cycle CPU, rather a mostly single cycle CPU.
So no CPU that writes to async RAM (or another async component) can honestly be considered a single cycle CPU, because the entire instruction cannot be carried out on a single clock edge. The RAM write needs two edges (ie falling and rising) and this breaks the single clock principle.
So is there a commonly accepted single cycle CPU and are we applying the term consistently?
What's the story?
(Also posted in my hackday log https://hackaday.io/project/166922-spam-1-8-bit-cpu/log/181036-single-cycle-cpu-confusion and also on a private group in hackaday)
=====
Update: Looking at simple MIPS designs, it seems the models use synchronous memory and so can probably operate off a single edge, and maybe they do - therefore warranting the category "single cycle".
And perhaps FPGA memory is always synchronous - I don't know about that.
But is the term being used inconsistently elsewhere - ie in most homebrew TTL computers out there?
Or am I just plain wrong?
====
Update:
Some may have misunderstood my point.
Numerous home brew TTL CPUs claim "single cycle CPU" status (I'm not interested, for the purposes of this discussion, in more complex beasts that do pipelining or whatever).
By single cycle, these CPUs typically mean that they do something like advancing the PC on one edge of the clock and then using the opposing edge of the clock to update flip-flops with the result. Or they will use the other phase of the clock to update async components like latches and SRAM.
However, the ZipCPU reference I provided suggests that using the opposing clock edge is akin to using a second clock cycle or even a second clock. BTW Ben Eater in his vids even compares the inverted clock that he uses to update his SRAM to being a second clock.
My objection to the use of "single cycle CPU" for such CPUs (basically most/all home brew TTL CPUs I've seen, as they all work that way) is that I agree with ZipCPU that using the opposing edge (or phase) of the clock for the commit is effectively the same as using a second clock, and this makes a mockery of the "single cycle" claim.
If the use of the opposing edge is effectively the same as using a single edge of two clock cycles, then I think that makes use of the term questionable. So I take ZipCPU's point to heart and tighten the term to mean use of a single edge.
On the other hand, it seems perfectly possible to build a CPU that uses only sync components (ie edge-triggered flip flops) and only a single edge, where on each edge we clock whatever is on the bus into whatever device is selected for write and at the same moment advance the PC.
Between one edge and the next same direction edge, settling occurs.
In this manner we end up with CPI=1 and use of only a single edge - which is very distinctly different to the common TTL CPU pattern of using both edges of the clock.
BTW my impression of FPGAs (which I'm not referring to here) is that the storage elements in an FPGA are all synchronous flip flops. I don't know, but that's what my reading suggests. Anyway, if this is true then a trivial FPGA-based CPU probably has a CPI of 1 and uses only, say, the +ve edge, and so these might well meet my narrow definition of "single cycle CPU". Also, my reading suggests that various MIPS implementations (educational efforts, probably) meet my definition OK.
This seems mostly a question of definitions and terminology, moreso than how to actually build simple CPUs.
If you insist on that strict definition of "single cycle CPU", meaning it truly uses only one clock edge to set everything in motion for that instruction, then yes, that would exclude real-world toy/hobby CPUs that use a 2nd clock edge to give a consistent interval for memory access.
But they certainly still fulfil the spirit of a single-cycle CPU, which is that every instruction runs in 1 clock cycle, with no pipelining and no multi-cycle microcode.
A whole clock cycle does have 2 clock edges, and it's normal for real-world (non-single-cycle) CPUs to use the "other" edge for internal timing in some cases, but we still talk about their frequency in whole cycles, not the edge frequency, except in cases like DDR memory where we do talk about the transfer rate = twice the memory clock frequency. What sets that apart is always using both edges, and for approximately equal things, not just some extra timing / synchronization within a clock cycle.
Now could you build a CPU that keeps a store value on a memory bus for some minimum time, without using a clock edge? Maybe.
Perhaps make sure the critical path leading to the store-data is short enough that the data is always ready. And possibly propagate some "data-ready" signal along with your computations (or just from the longest critical path of any instruction), and a couple of gate delays after the data is on the bus, flip the memory clock. (And on the next CPU clock edge, flip it back). If your memory doesn't mind its clock not having a uniform duty cycle, this might be fine as long as each half of the memory clock is long enough.
For loading from memory, you can maybe do something similar by initiating a memory load cycle some gate-delays after the CPU clock edge that starts this "cycle" of your single-cycle CPU. This might even involve building a long chain of gate delays intentionally with inverters dedicated to that purpose. Or perhaps even an analog RC time delay, but either way that's obviously worse than just using the other edge of the main clock, and you'd only ever do this as an exercise in single-cycle dogmatic purity. (It can also be flaky because gate-delay isn't constant, depending on voltage and temperature, so one side of the chip running hotter than the other could change relative timing.)
The definition says that a single cycle CPU executes one instruction per cycle. So it follows, at least in theory, that there are other CPUs that execute more or fewer instructions per cycle. You can check that there are concepts like the multi-cycle processor and the pipelined processor (instruction pipelining). "Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps," according to the Wiki. I don't know exactly how it works, but maybe it just uses the available registers (maybe instead of using, for example, EAX, ECX is used like EAX, or maybe it works some other way); one thing that is 100% true is that the number of registers keeps increasing, so maybe that's one of the main purposes. Source: https://en.wikipedia.org/wiki/Instruction_pipelining
I think the answer to the question "is a single cycle CPU possible if asynchronous components are used" depends on the CPU controller that controls both the CPU and the RAM with opcodes. I found interesting information about this here: http://people.cs.pitt.edu/~cho/cs1541/current/handouts/lect-pipe_4up.pdf
https://ibb.co/tKy6sR2
CONCLUSION: In my opinion, if we consider the term "single cycle CPU", it should mean the simplest possible construction. The term "asynchronous" implies something more complex than "synchronous", so the two terms are not equivalent. It's something like asking "Can a basic data type be considered a structure?". In my opinion the word "single" means the simplest possible and "asynchronous" means some modification, hence something more complex, so I just think it's not possible. But maybe "are used" can be read as "are used at the time" - if some switch, some controller, can turn off asynchronous mode and make everything as simple as possible. But generally I just think it's not possible.

Why would alpha zero not run out of memory

After having read through DeepMind's AlphaZero paper, I understood that we are building up a tree and adding a new node to it every time we encounter a new state. For a game like Go (or even chess) with such a huge state space, and such a long training time, we should definitely exceed any practical memory size for such a tree.
But as I know, the algorithms have been practically implemented. Where is the gap in my understanding?

Manage resources to minimize garbage collection activity and improve performance

I am working on a graphic design, vector drawing application that needs to render the data in every frame when there is a change. The issue is that if the user is moving nodes, there will be changes during every single frame. This is not an issue with a tiny amount of data, but it becomes a major slowdown with anything more than a minor amount of data.
The reason is that in order to render, I perform calculations and store data inside arrays. Then, when the function responsible for the computation is done, the GC simply discards the data, and the next time the function is called we create new arrays and new data.
In C++ I would probably allocate space in memory and write to that space (over and over), and I would probably improve performance that way. In languages that use GC I cannot allocate space that way. I have to do an ugly hack where I define an array as a class member and then write to that array from the function over and over, although that array is only used in that one function and is not used by other methods of the class.
My question is: what is the best way to reuse memory space in a language that uses GC?
Object pooling would be the major one, see here:
Gotoandplay Tutorial
Also
10 Top Tips around GC
I would also suggest you read through Grant's explanation of the garbage collection system in the Flash Player; it's quite unique, and understanding how Flash handles data is quite important for data-intensive scripts.
This presentation
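Whatever the runtime, the core idea behind pooling is the same: allocate the working storage once and reset it each frame instead of discarding and re-allocating it. A hedged sketch of that shape in C-style terms (the Scratch layout and MAX_POINTS are illustrative assumptions; in a GC language the equivalent is a preallocated array or object pool you reset rather than drop):

#include <stddef.h>

#define MAX_POINTS 4096

typedef struct {
    float  xs[MAX_POINTS];
    float  ys[MAX_POINTS];
    size_t count;
} Scratch;

static Scratch frame_scratch;      /* allocated once, reused every frame */

void render_frame(void) {
    frame_scratch.count = 0;       /* "clear" by resetting, not by freeing */
    /* ... fill frame_scratch with this frame's computed geometry ... */
    /* ... draw from frame_scratch ... */
    /* nothing is allocated per frame, so there is nothing to collect */
}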

What are Zombies and what causes them? Are there Zombie processes and Zombie objects?

I can find questions about zombies but none that directly addresses what they are and why and how they occur. There are a couple that address what zombie processes are in the context of answering a specific question but don't address the cause.
There are also questions regarding zombie processes and questions about Objective-C/Cocoa-related zombie objects. What are the differences, or how are these related? Is an "EXC_BAD_ACCESS" on Mac/iPhone (or a similar error on other platforms) synonymous with a zombie?
How can one prevent zombies and are there any best practices that will help avoid them?
It would be helpful to have this information in one place. This question is intended to be platform/language agnostic, if possible.
Zombie processes and zombie objects are totally unrelated. Zombie processes are when a parent starts a child process and the child process ends, but the parent doesn't pick up the child's exit code. The process object has to stay around until this happens - it consumes no resources and is dead, but it still exists - hence, 'zombie'.
Zombie objects are a debugging feature of Cocoa / CoreFoundation to help you catch memory errors - normally when an object's refcount drops to zero it's freed immediately, but that makes debugging difficult. Instead, if zombie objects are enabled, the object's memory isn't instantly freed, it's just marked as a zombie, and any further attempts to use it will be logged and you can track down where in the code the object was used past its lifetime.
EXC_BAD_ACCESS is your run-of-the-mill "you used a bad pointer" exception, as if I did:
*(int *)0x42 = 5;   // write through a bogus pointer
When a process ends, much of its state still exists in the kernel, because its parent may still want to look at a few things, like its return value, which needs to be stored someplace. When a parent calls wait() or waitpid(), it tells the kernel to throw it all away because it's done with it. Until it does so, the child retains a pid and uses up resources. Those un-reaped child processes are called zombies. Even killing a zombie won't remove it, it must be reaped (wait-ed-upon) by its parent. If the parent dies, they are passed to "init" on unix systems, whose sole job is to wait for things to clean them up.
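A minimal C illustration of that life cycle (the sleep length is arbitrary; while the parent sleeps, the child shows up as <defunct>/Z in ps, and it disappears once waitpid() reaps it):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();            /* error handling omitted for brevity */
    if (pid == 0) {
        _exit(42);                 /* child: exits immediately */
    }
    sleep(30);                     /* parent: the child is now a zombie */

    int status;
    waitpid(pid, &status, 0);      /* reap the child; the zombie is gone */
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}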
I've never heard of "zombie objects", but I assume that it refers to things that have either not been cleaned up by the garbage collector, or that have circular references or some such thing, such that they are not going to be cleaned up by the garbage collector. The metaphor is pretty similar: fork()==malloc(), wait()==free() at a certain level. (Not a perfect metaphor, of course.)

Why do garbage collectors freeze execution?

I was thinking about garbage collection on the way home, and I began wondering, why does the garbage collector totally freeze execution of a program? Personally I would have designed it to block any threads which try to allocate a new object, but threads which were running would be left alone.
I can't imagine any situation where this would be a problem compared to how a garbage collector currently works.
I was thinking about garbage collection on the way home, and I began wondering, why does the garbage collector totally freeze execution of a program?
There is a trade-off between latency and throughput in GC design. You can either process heap-allocated blocks individually ("incremental") or you can batch them up and process them all at the same time ("stop the world"). Fully incremental collection never totally freezes a program and it has very low latency but it also has very poor throughput. Stop the world garbage collectors have the worst possible latency (freezing the program for seconds or even minutes at a time) but near-optimal throughput.
All of the major production GCs today provide a middle ground, typically with generational collection with the per-thread nursery generations collected in batches and incremental or concurrent collection of the shared old generation. Thus, only nursery collections incur pauses and nursery size is bounded so pause times are kept low, e.g. 10-100ms in .NET with the workstation GC.
For a simple GC algorithm that never pauses, see Baker's Treadmill. For more information on garbage collection I highly recommend the Memory Management Reference and the Garbage Collection Handbook.
There is a lot of misinformation in the other answers here. Jon Skeet wrote some source code and started discussing it from the point of view of garbage collection. You need to be very careful doing this because there is little correspondence between source code and what the GC sees. The compiler does instruction block rearrangements, register allocation, promotion and so on, all of which affect what is visible to the GC at run time. In particular, scope in source code is not carried through to compiled code and is typically replaced with the related concept of liveness. Jon also wrote that you must pause in order to get the global roots. That is not strictly true although it is the most efficient way to get the global roots and the resulting pause is almost always tiny (sub-millisecond) because you're just copying less than a kB of stack from each thread.
Powerlord wrote that moving collectors must block reads and, therefore, all threads that read. This is also not true. The simplest counter example is immutable data: referential transparency means you can read from any copy safely.
Kico wrote that pauses are required to determine reachability. This is also not true. See Dijkstra's research on "on-the-fly" collectors and any recent real-time GC such as Staccato.
Jerry Coffin wrote the best answer, but moving isn't the reason GCs pause. There are GCs that don't move but do pause (e.g. HLVM's) and GCs that do move but don't pause (e.g. Staccato).
Modern garbage collectors (in .NET and Java, anyway) don't actually "stop the world" - they do all kinds of clever things to collect concurrently.
However, you might want to consider a situation like this:
object x = null;
object y = new object();
...
x = y;
y = null;
Now, suppose the GC looks at x, then the lines below the ... run, and then the GC looks at y - it won't have seen any live objects... but the object should still be live.
Basically there needs to be a certain amount of pausing in order to get a consistent set of references. Then there's compaction, reference reassignment etc. However, it isn't nearly as bad as it used to be in terms of requiring everything to be stopped for the whole of the GC cycle. It does, however, get painful to think about :)
In addition to what Kico Lobo said, Garbage Collectors can also move things around in memory.
Therefore, they don't just have to block threads that write to memory, but also threads that read from memory.
Which is every thread.
Most GCs stop execution because objects can move in memory during a collection cycle (at least with most reasonably recent designs). That means either reading or writing almost any object at the wrong time can cause a problem.
There are collectors that have been designed around the idea of just blocking reads (or writes) to the specific parts of memory being modified at a given time, so as long as execution only uses objects that aren't (currently) being moved around, it can proceed unhindered. The problem is that most typical hardware doesn't provide efficient support for this, so even though they work in principle, they're fairly inefficient in practice. There has been at least one attempt at adapting that type of algorithm to use the write protection available in a typical paging unit, but I'm not aware of its having been used for much other than research and experimentation.
The primary alternative is to make the collector incremental -- i.e. have it do only a small amount of work at a time, so even though other execution gets stopped, it only has to stop for a little while at any given time.
With multi-core machines becoming so common, however, I'd expect to see more work put into garbage collection algorithms that can run in parallel with other execution. Up until recently, the primary emphasis was on minimizing the total time/effort spent on garbage collection. The growing number of cores available is likely to (often) mean that doing more total work in garbage collection may be easily justified, if doing so allows the mainstream of the code to run with fewer hindrances.
Edit: You might want to read Paul Wilson's Survey of Uniprocessor Garbage Collection Techniques. This isn't definitive (especially any more, given its age), but it's at least a reasonable starting point.
Because that's the only way it can ensure that the references it is going to clean are not being used by anyone else.
If it didn't freeze the execution, it could not ensure that.