I am not able to understand a few things about garbage collection.
Firstly, how is data allocated space? i.e. on the stack or the heap? (As far as I know, all static or global variables are assigned space on the stack and local variables are assigned space on the heap.)
Second, does GC run on data on the stack or the heap? i.e. a GC algorithm like mark/sweep would treat data on the stack as the root set, right? And then find all the reachable objects on the heap by checking which heap objects are referenced from the root set.
What if a program does not have any global variables? How does the algorithm work then?
Regards,
darkie
It might help to clarify what platform's GC you are asking about - JVM, CLR, Lisp, etc. That said:
First, to take a step back: certain local variables are generally allocated on the stack. The specifics can vary by language, however. To take C# as an example, only local value types and method parameters are stored on the stack. So, in C#, foo would be allocated on the stack:
public void Bar() {
    int foo = 2;
    // ...
}
Dynamically allocated objects, in contrast, use memory from the heap. This should make intuitive sense: otherwise the stack would have to grow dynamically each time new was called. It would also mean that such objects could only be used as locals within the function that allocated them, which is of course not true, because we can have (for example) class member variables. So, to take another example from C#, in the following case result is allocated on the heap:
public class MyInt
{
    public int MyValue;
}

// ...
MyInt result = new MyInt();
result.MyValue = foo + 40;
// ...
Now with that background in mind: memory on the heap is garbage-collected. Memory on the stack needs no GC, since it is reclaimed when the current function returns. At a high level, a GC algorithm works by keeping track of all objects that are dynamically allocated on the heap. Once allocated via new, an object is tracked by the GC and collected once it is no longer reachable through any live reference.
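As a toy illustration of that reachability idea (this is not any particular platform's collector; the object graph, root set, and names below are invented for the example), a mark-and-sweep pass can be sketched like this:

```python
# Toy mark-and-sweep sketch: the "heap" is a dict of object id -> list of
# ids it references, and "roots" stands in for references held on the
# stack and in globals. Everything here is illustrative only.

def mark(heap, roots):
    """Return the set of object ids reachable from the roots."""
    reachable = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in reachable:
            continue
        reachable.add(obj)
        stack.extend(heap[obj])          # follow outgoing references
    return reachable

def sweep(heap, reachable):
    """Drop every heap object that was not marked."""
    return {obj: refs for obj, refs in heap.items() if obj in reachable}

heap = {
    "a": ["b"],        # a -> b
    "b": [],
    "c": ["d"],        # c and d reference each other: a garbage cycle
    "d": ["c"],
}
roots = ["a"]          # only "a" is referenced from the stack

live = sweep(heap, mark(heap, roots))
print(sorted(live))    # ['a', 'b']
```

Note that the c/d cycle is unreachable and gets collected even though the two objects still reference each other, which is exactly the case that plain reference counting cannot handle.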
Check out the book Garbage Collection: algorithms for automatic dynamic memory management.
Firstly, how is data allocated space? i.e. on stack or heap (As per my knowledge, all static or global variables are assigned space on stack and local variables are assigned space on heap).
No: the stack holds method-call frames and local variables. A stack frame is created when a method is called and popped off when it returns.
Memory in Java and C# is allocated on the heap by calling "new".
Second, GC runs on data on stacks or heaps? i.e a GC algorithm like Mark/Sweep would refer to data on stack as root set right? And then map all the reachable variables on heap by checking which variables on heap refer to the root set.
GC is used on the heap.
Mark-and-sweep would not be considered a cutting-edge GC algorithm; both the Java and .NET collectors now use generational models.
What if a program does not have a global variable? How does the algorithm work then?
What does "global variable" mean in languages like Java and C# where everything belongs to a class?
The root of the object graph is arbitrary. I'll admit that I don't know how it's chosen.
Read this article; it is a very good survey of uniprocessor garbage collection techniques and will give you the basic understanding and terminology of GC. Then follow up with the Jones and Lins book, "Garbage Collection: Algorithms for Automatic Dynamic Memory Management". Unlike the survey article I point to above, the book is not available for free on the web; you have to buy it, but it is worth it.
Richard and Carl have a very nice show on the Windows Memory Model, including the .NET model and GC, in their .NET Rocks! archives:
Jeffrey Richter on the Windows Memory Model
You might find the short summary of Garbage Collection on the Memory Management Reference useful.
Ultimately, garbage collection has to start at the registers of the processor(s), since any object that can't be reached by the processor can be recycled. Depending on the language and run-time system, it usually makes sense to assume statically that the stacks and registers of all threads are reachable, as well as "global variables".
Scanning the stacks gets you the local variables. So in simple GCs you start by scanning thread contexts, their stacks, and the global variables. But that's certainly not true in every case: some languages don't use stacks or have global variables as such. What's more, GCs can use barriers so that they don't have to look at every stack or global every time. Some specialised hardware, such as the Symbolics Lisp Machine, even had barriers on registers!
Related
Suppose we have a kernel that invokes some functions, for instance:
__device__ int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        int x = fib(n - 1);
        int y = fib(n - 2);
        return x + y;
    }
}

__global__ void fib_kernel(int *n, int *ret) {
    *ret = fib(*n);
}
The kernel fib_kernel invokes the function fib(), which internally invokes fib() twice. Suppose the GPU has 80 SMs, we launch exactly 80 threads to do the computation, and we pass in n as 10. I am aware that there will be a ton of duplicated computation, which violates the idea of data parallelism, but I would like to better understand the stack management of a thread.
According to the CUDA PTX documentation:
the GPU maintains execution state per thread, including a program counter and call stack
The stack is located in local memory. As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that each thread's stack grows and shrinks dynamically?
Each thread's stack is private and not accessible by other threads. Is there a way to manually instrument the compiler/driver so that the stack is allocated in global memory instead of local memory?
Is there a way for threads to obtain the current program counter and frame pointer values? I think they are stored in some specific registers, but the PTX documentation does not provide a way to access those. What would I have to modify (e.g. the driver or the compiler) to be able to read those registers?
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with that? The answer to question 2 might address this; any other thoughts would be appreciated.
You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
As the threads executing the kernel, do they behave just like the calling convention in CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The CUDA compiler will aggressively inline functions when it can. When it can't, it builds a stack-like structure in local memory. However, the GPU instructions I'm aware of don't include explicit stack management (push and pop, for example), so the "stack" is built by the compiler using registers that hold a (local) address and LD/ST instructions to move data to and from the "stack" space. In that sense, the actual stack does/can change size dynamically, although the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Practically, no. The NVIDIA compiler that generates instructions has a front end and a back end that is closed source. If you want to modify an open-source compiler for the GPUs it might be possible, but at the moment there are no widely recognized tool chains that I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largely closed source. There aren't any exposed controls that would affect the location of the stack, either.
May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
There is no published register for the instruction pointer/program counter, so it's impossible to say what modifications would be needed.
If we increase the input to fib(n) to be 10000, it is likely to cause stack overflow, is there a way to deal with it?
As I mentioned, the maximum stack space per thread is limited, so your observation is correct: a stack can eventually grow to exceed the available space (and this is a real hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread local memory size, since the stack lives in the logical local space; in the runtime API this is done with cudaDeviceSetLimit(cudaLimitStackSize, bytes).
I have been working with MIPS-32, and I wondered whether a register, for example $t0, holding a value in one function can be altered by another function, and what this has to do with the stack, i.e. the location of variables in memory. Everything I am talking about is at the assembly-language level. I would also like some examples concerning this: a function altering (or not) a variable belonging to another function, and how that variable survives (or not), i.e. whether it is effectively passed as a copy or as a reference.
(I hope we can create an environment where conceptual questions like the one above can be explored more.)
$t0 declared having the value in one function can be altered by another
$t0 is known as a call-clobbered register. It is no different from the other registers as far as the hardware is concerned: being call-clobbered vs. call-preserved is an aspect of software convention, called the calling convention, which is part of an Application Binary Interface (ABI).
The calling convention, when followed, allows a function F to call another function G knowing only G's signature: name, parameters and their types, and return type. F does not have to change when G changes, as long as both follow the convention.
Call-clobbered doesn't mean the register has to be clobbered, though, and when writing your own code you can use it any way you like (unless your coursework says to follow the MIPS32 calling convention, of course).
By the convention, a call-clobbered register can be used without worry: all you have to do to use it is put a value into it!
Call-preserved registers can also be used, if desired, but they should be presumed to be already in use by some caller (maybe not the immediate caller, but some distant caller), so the values they contain must be restored before the function returns to its caller. That is, of course, only possible by saving the original value before repurposing the register for a new use.
The two sets of registers (call-clobbered and call-preserved) serve two common use cases: cheap temporary storage and longer-term variables. The former requires no effort to preserve and restore; the latter does require that effort, but gives us registers that survive a function call, which is useful, for example, when a loop body contains a function call.
The stack comes into play when we need to first preserve and then restore call-preserved registers. If we want to use call-preserved registers for some reason, we need to save their original values in order to restore them later, and the most reasonable place to do that is on the stack, so we allocate some space from it.
To allocate local memory, the stack pointer is decremented to reserve the function some space. Since the stack pointer must have the same value when we return to the caller, this space is necessarily deallocated upon return; hence the stack is great for local storage. The original values of the preserved registers must likewise be restored upon return, so this local storage is the appropriate place to save them.
https://www.dyncall.org/docs/manual/manualse11.html — search for section "MIPS32".
Let's also make the distinction between variables, a logical concept, and storage, a physical concept.
In a high-level language, variables are named and have scopes (limited lifetimes). In machine code, we have the physical hardware (storage) resources of registers and memory; these simply exist and have no concept of lifetime. In and of themselves, these hardware resources are not variables, but places that we can use to hold variables for their lifetime/scope.
As assembly language programmers, we keep a mental (or even written) map of our logical variables to physical resources. The compiler does the same, knowing the scope/lifetime of program variables and creating that "mental" map of variables to machine code storage. Variables that have overlapping lifetimes cannot share the same hardware resource, of course, but when a variable is out of scope, its (mapped-to) physical resource can be reused for another purpose.
Logical variables can also move around to different physical resources. A logical variable that is a parameter, may be passed in a CPU register, e.g. $a0, but then be moved into an $s register or into a (stack) memory location. Such is the business of machine code.
To allocate some hardware storage to a high level language (or pseudo code) variable, we simply initialize the storage! Hardware resources are necessarily constantly being repurposed to hold a different logical variable.
See also:
How a recursive function works in MIPS? — for discussion on variable analysis.
Mips/assembly language exponentiation recursivley
What's the difference between caller-saved and callee-saved in RISC-V
I am very interested in real-time operating systems for microcontrollers, so I am doing deep research on the topic. At a high level I understand all the general mechanisms of an OS.
In order to learn better, I decided to write a very simple kernel that does nothing but context switching. This raised a lot of additional, practical questions for me. I was able to cope with many of them, but I am still in doubt about the main thing: saving the context (all the CPU registers and the stack pointer) of the current task and restoring the context of the next task.
In general, an OS uses some function (let's say OSContextSwitch()) that performs all the actions for the context switch. The body of OSContextSwitch() is mainly written in assembly (inline assembly inside a C function). But when OSContextSwitch() is called by the scheduler, as far as I know, some of the CPU registers are saved on the stack by the compiler (more precisely, by the function-call code the compiler generates).
Finally, the question is: how can I know which of the CPU registers have already been saved on the stack by the compiler, so I can save the rest? If I saved all the registers regardless of the compiler's behaviour, there would obviously be some stack leakage.
Such a function should be written either as pure assembly (so NOT an assembly block inside a C function) or as a "naked" C function containing nothing more than an assembly block. Doing anything in between is a straight road to messing things up.
As for which registers you should save: generally you need to know the ABI for your platform, which says that some registers are saved by the caller and some by the callee; you generally need to save/restore only the ones that are normally saved by the callee. If you save all of them, nothing bad will happen; your code will just be slightly slower and use a little more RAM, but it is a good place to start.
Here's a typical context switch implementation for ARM Cortex-M microcontrollers - https://github.com/DISTORTEC/distortos/blob/master/source/architecture/ARM/ARMv6-M-ARMv7-M/ARMv6-M-ARMv7-M-PendSV_Handler.cpp#L76
Which garbage collection algorithms can recognize garbage objects as soon as they become garbage?
The only thing that comes to my mind is reference counting with an added cycle search every time a reference count is decremented to a non-zero value.
Are there any other interesting collection algorithms which can achieve that? (Note that I'm asking out of curiosity only; I'm aware that all such collectors would probably be incredibly inefficient)
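To make the reference-counting idea concrete, here is a naive sketch (all names are invented, and a full reachability scan stands in for the cycle search; real cycle detectors such as trial deletion are far more selective, but this shows the timing: garbage, including cyclic garbage, is identified at the moment a count drops):

```python
# Naive reference counting with a cycle search on every decrement that
# leaves a non-zero count. The heap maps object id -> list of referenced
# ids; refcounts include one count per root reference. Illustrative only.

def reachable_from_roots(heap, roots):
    seen, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj not in seen:
            seen.add(obj)
            stack.extend(heap[obj])
    return seen

def release(heap, counts, roots, obj, garbage):
    """Drop one reference to obj; collect garbage immediately."""
    counts[obj] -= 1
    if counts[obj] == 0:
        # Count hit zero: the object is garbage; release its children.
        garbage.add(obj)
        for child in heap.pop(obj):
            release(heap, counts, roots, child, garbage)
    else:
        # Non-zero count: the remaining references may form a dead
        # cycle, so fall back to a (very expensive) reachability scan.
        live = reachable_from_roots(heap, roots)
        for dead in [o for o in heap if o not in live]:
            garbage.add(dead)
            del heap[dead]

heap = {"x": ["y"], "y": ["x"]}      # x and y form a cycle
counts = {"x": 2, "y": 1}            # x: one root ref + one from y
roots = ["x"]

garbage = set()
roots.remove("x")                    # the root drops its reference to x
release(heap, counts, roots, "x", garbage)
print(sorted(garbage))               # ['x', 'y']
```

The cycle is detected on the very decrement that orphaned it, which is the "as soon as it becomes garbage" property the question asks about, at the cost of a whole-heap scan per decrement.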
Though not a garbage collection algorithm itself, escape analysis allows reasoning about the lifetime of objects. So, if efficiency is a concern and objects should be collected not in all cases but only in "obvious" ones, it can be handy. The basic idea is to perform static analysis of the program (at compile time, or at load time if compiled for a VM) and figure out whether a newly created object may escape the routine it is created in (hence the name of the analysis). If the object is not passed anywhere else, not stored in global memory, not returned from the routine, etc., it can be released before returning from the routine, or even earlier, at the point of its last use.
Objects that do not live longer than the associated method call can be allocated on the stack rather than on the heap, so they are removed from the garbage collection cycle at compile time, lowering the pressure on the general GC.
Such a mechanism would be called "heap management", not garbage collection.
By definition, garbage collection is decoupled from heap management. This is because in some environments/applications it is more efficient to skip doing a "free" operation and keeping track of what is in use. Instead, every once in a while, you just go around, gather all the unreferenced nodes, and put them back on the free list.
== Addendum ==
I am being downvoted for attempting to distinguish the terminology of heap management from that of garbage collection. The Wikipedia article agrees with my usage, as does what I learned at university, even though that was several decades ago. Languages such as Lisp and Snobol created the need for garbage collection. Languages such as C don't provide such a heavy-duty runtime environment; instead they rely on the programmer to manage the cleanup of unused bits of memory and resources.
I am trying to "map" a few tasks to a CUDA GPU. There are n tasks to process. (See the pseudo-code.)
malloc a boolean array flag[n] and initialize it to false.

for each work-group in parallel do
    while there are still unfinished tasks do
        Do something;
        for a few j_1, j_2, ..., j_m (j_i < k) do
            Wait until task j_i is finished;  [ while (!flag[j_i]) ; ]
            Do something;
        end for
        Do something;
        Mark task k finished;  [ flag[k] = true; ]
    end while
end for
For some reason, I will have to use threads in different thread blocks.
The question is how to implement Wait until task j_i is finished and Mark task k finished in CUDA. My implementation is to use a boolean array as the flags: set a task's flag once it is done, and read the flag to check whether a task is done.
But this only works on small cases; on one large case, the GPU crashes for an unknown reason. Is there a better way to implement the Wait and Mark in CUDA?
That's basically a problem of inter-thread communication in CUDA.
Synchronising within a thread block is straightforward using __syncthreads(). However, synchronising between thread blocks is trickier; the programming-model answer is to break the work into two kernels.
If you think about it, this makes sense. The execution model (for both CUDA and OpenCL) is that a whole bunch of blocks execute on the processing units, but it says nothing about when. This means some blocks will be executing while others are not (they'll be waiting). So if you had a __syncblocks(), you would risk deadlock: the blocks already executing would stop at the barrier, while those not yet executing would never reach it.
You can share information between blocks (using global memory and atomics, for example), but not global synchronisation.
Depending on what you're trying to do, there is frequently another way of solving or breaking down the problem.
What you're asking for is not easily done since thread blocks can be scheduled in any order, and there is no easy way to synchronize or communicate between them. From the CUDA Programming Guide:
For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data with each other, there are two cases: Either these threads belong to the same block, in which case they should use __syncthreads() and share data through shared memory within the same kernel invocation, or they belong to different blocks, in which case they must share data through global memory using two separate kernel invocations, one for writing to and one for reading from global memory. The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic. Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.
So if you can't fit all the communication you need within a thread block, you would need to have multiple kernel calls in order to accomplish what you want.
I don't believe there is any difference with OpenCL, but I also don't work in OpenCL.
This kind of problem is best solved by a slightly different approach:
Don't assign fixed tasks to your threads, forcing them to wait until their task becomes available (which isn't possible in CUDA anyway, since threads can't block).
Instead, keep a list of available tasks (using atomic operations) and have each thread grab a task from that list.
This is still tricky to implement and to get the corner cases right, but at least it's possible.
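As a CPU-side analogue of that pattern (Python threads and a lock-protected counter stand in for CUDA thread blocks and atomicAdd; all names and numbers here are invented for illustration), the "grab the next task" loop looks like this:

```python
# CPU analogue of the "grab a task from a shared list" pattern.
# next_task plays the role of a global-memory counter bumped with
# atomicAdd(); here a lock provides the atomicity instead.
import threading

NUM_TASKS = 100
NUM_WORKERS = 8

next_task = 0
lock = threading.Lock()
results = [0] * NUM_TASKS

def grab_task():
    """Atomically claim the next unclaimed task index, or None."""
    global next_task
    with lock:
        if next_task >= NUM_TASKS:
            return None
        task = next_task
        next_task += 1
        return task

def worker():
    while True:
        task = grab_task()
        if task is None:
            break                     # no tasks left: exit, don't spin
        results[task] = task * task   # stand-in for "Do something"

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all(results[i] == i * i for i in range(NUM_TASKS)))  # True
```

In CUDA the claim would typically be written as int task = atomicAdd(&next_task, 1); with next_task in global memory, and a thread leaves its loop once task is past the end of the task list, so no thread ever has to block waiting for another block.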
I don't think you need to implement this in CUDA; everything could be implemented on the CPU. You are waiting for a task to complete before doing another task. If you do want to implement it in CUDA, you don't need to wait for all the flags to become true: you know that initially all the flags are false, so just run Do something in parallel across all the threads and then set each flag to true.
If you want to implement it in CUDA, you could use an int flag and keep adding 1 to it after finishing Do something, so that you can detect the change in the flag before and after Do something.
If I got your question wrong, please comment and I'll try to improve the answer.