Stack bounds checking on the Amiga 500

I have a 68000 assembly language program running on my Commodore Amiga 500 that could potentially use a lot of stack space, so I want to do bounds checking.
If I call FindTask(NULL) and check tc_SPUpper and tc_SPLower, I get $c22c24 and $c21fa4, which is 3200 bytes of stack; however, the CLI has 8000 bytes of stack allocated, and the program starts with a stack pointer of $c29598, about 26K higher in memory than tc_SPUpper.
I read in the AmigaDOS Developer's Manual that, on start, 4(sp) contains the stack size. This value does contain 8000. ("Below this on the stack at 4(SP) is the size of the stack in bytes, which may be useful if you wish to perform stack checking.")
Can I safely take sp - 4(sp) as the lower limit of the stack? Do I need to allow for the stack size, the return address, and some other data that the CLI may have on the stack?

After re-re-(…)-reading the manuals, I may have figured it out.
From Amiga ROM Kernel Reference Manual: Libraries & Devices, p.584:
The CLI does not create a new process for a program; it jumps to the
program's code and the program shares the process with the CLI.
From this, I gather that the process returned by FindTask(NULL) is the CLI process, and tc_SPUpper and tc_SPLower refer to the stack for that process.
From AmigaDOS Developer's Manual, p. 160:
When the CLI starts up a program, it allocates a stack for that
program. This stack is initially 4000 bytes, but you may change the
stack size with the STACK command. AmigaDOS obtains this stack from
the general free memory heap just before you run the program; it is
not, however, the same as the stack that the CLI uses.
From this, I conclude that my program stack is separate from the stack in the task returned by FindTask(NULL).
Also from AmigaDOS Developer's Manual, p. 160:
AmigaDOS pushes a suitable return address onto the stack that tells
the CLI to regain control and unload your program. Below this on the
stack at 4(SP) is the size of the stack in bytes…
From this, I conclude that, for programs run from the CLI, the following code will give me the lowest address available on the stack.
move.l sp,d0 ; current stack pointer
addq.l #8,d0 ; return address and stack size
sub.l 4(sp),d0 ; size of stack
move.l d0,stack_lowest ; save for stack checking
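To actually enforce the bound at run time, each stack-hungry routine can compare the current stack pointer against stack_lowest before proceeding. As a rough C sketch (the helper stack_headroom and the probe-variable trick are my illustration, not from the manual), the address of a local variable serves as a stand-in for the current stack pointer on typical m68k C compilers:

/* stack_lowest is the value saved by the assembly above. */
extern char *stack_lowest;

/* Hypothetical helper: approximate number of stack bytes left.
   A local variable lives near the current top of the stack, so
   its address is close enough to SP for a bounds check. */
static long stack_headroom(void)
{
    char probe;
    return (long)(&probe - stack_lowest);
}

/* Before recursing or allocating big locals, keep a safety margin. */
int safe_to_recurse(void)
{
    return stack_headroom() > 1024; /* reserve 1 KB, an arbitrary margin */
}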
For programs launched from Workbench, I think tc_SPUpper and tc_SPLower are the values that I want.
From Amiga ROM Kernel Reference Manual: Libraries & Devices, p.584:
When a user activates a tool or project, Workbench runs a program.
This program is a separate process and runs asynchronously to
Workbench.
I have confirmed that the difference between these two values is, indeed, the stack size specified in the .info file.
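For the Workbench case, the bounds can be read straight from the Task structure. A minimal C sketch, assuming the standard AmigaOS SDK headers (the function name is mine):

#include <exec/types.h>
#include <exec/tasks.h>
#include <proto/exec.h>

/* Compute the stack size of the current task. For a program launched
   from Workbench, tc_SPLower/tc_SPUpper delimit the stack that was
   sized by the .info file. */
ULONG get_stack_size(void)
{
    struct Task *task = FindTask(NULL);
    APTR lower = task->tc_SPLower;   /* lowest usable stack address   */
    APTR upper = task->tc_SPUpper;   /* just above the top of stack   */
    return (ULONG)upper - (ULONG)lower;  /* should match the icon's stack size */
}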

Related

How is the stack frame managed within a thread in CUDA?

Suppose we have a kernel that invokes some functions, for instance:
__device__ int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        int x = fib(n - 1);
        int y = fib(n - 2);
        return x + y;
    }
    return -1;
}
__global__ void fib_kernel(int *n, int *ret) {
    *ret = fib(*n);
}
The kernel fib_kernel will invoke the function fib(), which internally invokes fib() twice. Suppose the GPU has 80 SMs, that we launch exactly 80 threads to do the computation, and that we pass in n as 10. I am aware that there will be a ton of duplicated computation, which violates the idea of data parallelism, but I would like to better understand the stack management of the threads.
The CUDA PTX documentation states the following:
the GPU maintains execution state per thread, including a program counter and call stack
The stack lives in local memory. As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The stack of each thread is private and not accessible by other threads. Is there a way I can manually instrument the compiler/driver so that the stack is allocated in global memory instead of local memory?
Is there a way for threads to obtain the current program counter and frame pointer values? I think they are stored in some specific registers, but the PTX documentation does not provide a way to access those. May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with it? The answer to question 2 might be able to address this. Any other thoughts would be appreciated.
You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The CUDA compiler will aggressively inline functions when it can. When it can't, it builds a stack-like structure in local memory. However, the GPU instructions I'm aware of don't include explicit stack management (push and pop, for example), so the "stack" is "built by the compiler" using registers that hold a (local) address plus LD/ST instructions to move data to and from the "stack" space. In that sense, the actual stack does/can dynamically change in size; however, the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
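One way to see this for yourself (the snippet below is my illustration, not from the question): mark the function __noinline__ so the compiler cannot flatten the recursion, then ask ptxas to report per-thread resource usage. Compiling with nvcc -Xptxas -v should print a non-zero "stack frame" size for fib, whereas a fully inlined version typically shows none:

// fib_noinline.cu -- compile with: nvcc -arch=sm_70 -Xptxas -v fib_noinline.cu
// __noinline__ forces a real call, so the compiler must spill a
// frame to the thread's local-memory "stack" on every recursion.
__device__ __noinline__ int fib(int n) {
    if (n == 0 || n == 1)
        return n;
    return fib(n - 1) + fib(n - 2);
}

__global__ void fib_kernel(const int *n, int *ret) {
    *ret = fib(*n);
}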
Is there a way I can manually instrument the compiler/driver so that the stack is allocated in global memory instead of local memory?
Practically, no. The NVIDIA compiler that generates instructions has a front end and a back end that are closed source. If you want to modify an open-source compiler for the GPUs, it might be possible, but at the moment there are no widely recognized toolchains I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largely closed source. There aren't any exposed controls that would affect the location of the stack, either.
May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
There is no published register for the instruction pointer/program counter. Therefore it's impossible to state what modifications would be needed.
If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with it?
As I mentioned, the maximum stack space per thread is limited, so your observation is correct: eventually a stack could grow to exceed the available space (and this is a possible hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread stack size limit (the stack lives in the logical local space).
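Concretely (this host-side sketch is mine, not part of the answer above), the per-thread stack limit is queried and raised with the CUDA runtime's cudaDeviceGetLimit/cudaDeviceSetLimit calls before launching the kernel; the 32 KB figure is an arbitrary example:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stack_size = 0;
    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    printf("default per-thread stack: %zu bytes\n", stack_size);

    // Raise the per-thread stack to 32 KB before launching a deeply
    // recursive kernel. This comes out of local memory, so the cost
    // is multiplied across every thread that can be resident.
    cudaDeviceSetLimit(cudaLimitStackSize, 32 * 1024);
    // ... launch fib_kernel<<<...>>>(...) here ...
    return 0;
}

Note that the limit applies per thread, so very deep recursion (like fib(10000)) may still be out of reach; in that case, converting the recursion to an explicit iteration is the more robust fix.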

Why my CPU seems to lose the ability to decode

I met this problem while finishing a lab for my OS course. We are trying to implement a kernel with system call support (platform: QEMU/i386).
When testing the kernel, a problem occurred: after the kernel loads the user program into memory and switches the CPU from kernel mode to user mode using the 'iret' instruction, the CPU works in a strange way, as follows.
The %EIP register increases by 2 each time, no matter how long the current instruction is.
No instruction seems to be executed, since no other registers change in the meantime.
Your guest has probably ended up executing a block of zeroed-out memory. On i386, zeroed memory disassembles to a succession of "add BYTE PTR [eax],al" instructions, each of which is two bytes long (0x00 0x00), and if eax happens to point to memory which reads as zeroes, this will effectively be a 2-byte-insn no-op, which corresponds to what you are seeing. This might happen because you set up the iret incorrectly and it isn't returning to the address you expected, or because you've got the MMU setup wrong and the userspace program isn't in the memory where you expect it to be, for instance.
You can confirm this theory using QEMU's debug options (e.g. -d in_asm,cpu,exec,int,unimp,guest_errors -D qemu.log will log a lot of execution information to a file), which should (among a lot of other data) show you what instructions it is actually executing.

Will a recursive function throw a StackOverflowError?

I have code (Java) that calls itself when it does not get some values.
public void recFunction() {
    ---do something----
    if not found a then
        call recFunction();
    if found a, save a and then exit the function.
}
recFunction can be called at most 5 times within itself. I get the value of "a" within 5 calls. Will I get a StackOverflowError if I run this function 1000 times?
Edit: What I am trying to ask is: when the function exits, will the stack frames for all the calls be removed?
You can try it and find out, but no: as long as it is executing actual code and stays within the stack limits, it will not. The default thread stack size in Java is a few hundred kilobytes (it varies by platform and JVM), so as long as you're not keeping tons of variables on each call, you should be fine.
If it only runs 5 times before throwing a StackOverflowError, it's because you exceeded your stack.
The specific answer to your question is, "it depends."
But in general, yes, a recursive function that runs amok can cause a stack overflow error.
Particularly in Java, you have some control of stack size. So you can start the JVM with a smaller or larger stack, which will obviously impact how much stack there is to overflow.
Also, the amount of local storage you allocate in the function will contribute. If your function doesn't use any stack at all (it just calls itself and maybe decrements a counter), then the stack is consumed only by return addresses and basic stack-frame data, and it will take a very deep recursion to use it up.
But if you allocate a 1000-element array of objects in each local stack frame, then obviously you will consume more stack "per call", and thus you can overflow the stack in fewer levels of recursion.
Firstly, that code clearly won't compile, let alone run...
Secondly, assuming it did compile, you could simply run it and find out.
Most importantly, if you were to run it on multiple configurations, you would discover different answers. For example:
java -Xss4m Test
This will increase the stack size to four megabytes.

Strange MIPS AdES exception

I am giving very limited information, I know, but it is probably enough.
System specs: a 64-bit MIPS processor with 8 cores running 4 virtual CPUs each.
OS: Some proprietary Linux based on the 2.6.32.9 kernel
Process: a simple user-land process running 7 POSIX threads. This specific application runs on core 0, to which no other process is pinned by CPU affinity.
The crash is almost impossible to reproduce. There is no specific scenario. We know that, if we perform some minor activity with the application, it might crash once a day.
The specific thread that's crashing wakes up every 5 milliseconds, reads information from one shared memory area and updates another. That's it.
There is not too much activity. The process is not working too hard.
Now: when I open the core dump and load a symbol-less image of the application, gdb points to instruction 100661e0. Instruction 100661e0 looks like this (viewed in an objdump of the non-stripped image):
void class::foo(uint8_t factor)
{
100661d8: ffbf0018 sd ra,24(sp)
100661dc: 0080802d move s0,a0
bar(factor, shared_memory_a->profiles[PROFILE_1]);
100661e0: 0c019852 jal 10066148 <_ZN28class35barEhR30profile_t>
100661e4: 64c673c8 daddiu a2,a2,29640
bar(factor, shared_memory_a->profiles[PROFILE_2]);
100661e8: de060010 ld a2,16(s0)
The line that shows as the exception line is
100661e0: 0c019852 jal 10066148 <_ZN28class35barEhR30profile_t>
Note that 10066148 is a valid instruction.
The BadVAddr register contains the following address, which is aligned, but does not look valid as far as the instruction address space is concerned: c0000003ddc5dd90
The Cause register contains this value: 0000000000800014 (ExcCode 5, i.e. AdES, an address error on a store)
I don't understand why BadVAddr shows the value it does, when the instruction specifically names a valid target address. I am a little concerned about branch delay slot issues, but I shouldn't have to worry about those when running a userland application written in simple C++.
I'd appreciate any thoughts.

kernel stack vs user-mode application stack

Is the kernel stack a different structure to the user-mode stack that is used by applications we (programmers) write?
Can you explain the differences?
Conceptually, both are the same data structure: a stack.
The reason there are two different stacks per thread is that, in user mode, code must not be allowed to mess up kernel memory. When switching to kernel mode, a different stack, in memory only accessible from kernel mode, is used for return addresses and so on.
If user mode had access to the kernel stack, it could modify a jump address (for instance) and then do a system call; when the kernel jumps to the previously modified address, your code is executed in kernel mode!
Also, security-related information and information about other processes (for synchronisation) might be on the kernel stack, so user mode should not have read access to it either.
The stack of a typical modern operating system is just a region of memory used for storing return addresses and local data. It has the same structure in both kernel and user mode, but each thread gets its own memory area for its stack. Context switches restore the stack pointer, so no thread sees another thread's stack, even though threads may be able to share other memory (if they are in the same process).
A thread doesn't have to use the stack, by the way. The operating system makes assumptions about how it will be used, but the thread doesn't have to follow them.