MIPS exception handling (Specifically branch delay slots) - exception

Say an exception has been hit in the branch delay slot of a conditional branch
e.g.
BEQ a0, zero, _true
BREAK (0000)
sw a0, 0000(t0)
_true:
sw a1, 0000(t0)
My exception handler will pick up the exception type 9 from the BREAK instruction and set the BD bit of the CAUSE register to 1 as it is in the branch delay and the EPC will be the address of the branch.
The documentation says that this will require complex processing which isn't described. i.e. Getting the target of the branch/jump, doing any required comparison then setting the PC to the true or false address.
My solution to get around the complex processing (which is a bit of a hack) is as follows:
Store the instruction in the branch delay slot
NOP the instruction in the branch delay slot
Return from the exception handler restoring all registers
re-execute the *BEQ a0, zero, _true* and the branch delay will be a nop so it will have no effect
Place a sw breakpoint at the target(s) of the branch and set a flag
once the sw breakpoint is hit restore the branch delay slot and remove traces of the sw breakpoints.
Parsing branches and jumps is fine (hence why i can get the targets) but in the conditional branches, once i have parsed, i then have to do the comparisons to determine whether to jump to the true part of go to the false (next line) which i feel is more work than i would like. Do I not??
My problem with my hacky method is:
Will the CPU have already stored that it has hit the conditional branch and have determined that after the branch delay slot has been executed whether it is going to take the branch or not, therefore once i point the Program Counter back to the branch and it gets executed instead of executing correctly it thinks it must jump to the true or false part of the branch which was pre-determined before the exception occurred? (try a "double jump")

do you got the MIPS programmers documents? if you want an 100% accurate answer read them - if not I can just tell you the important bits as I remember them.
in short - yes you need to load the instruction from memory, parse it and interpret the result to figure out where you have to continue. "Patching" the code as you expressed would work too, but you need to make sure the instruction cache gets invalidated, else you will be running from the cache and end in an infinite loop.
the updating of the PC follows after the delay slot has been executed, until then it will point to the branch. there is no special handling during an exception except you have a register which says if you are in a delay slot or not.
you`d need to emulate all instructions that can conditionally raise an exception in your handler (load/store) along with the branch instructions. if its another kind of instruction in the DS you just can restart at the branch (the exception was an external interrupt in this case).
if your concern is about performance then simply dont put exception-raising instructions in a delay slot.
edit: and no, MIPS stores nothing about interrupted instructions, but the method you are suggesting likely will be slower due to having to invalidate the ICache twice

Related

winmips64 branch target buffer does 2 cycles stall

So i have been searching why on winmips64 there were 2 cycles stalls each time branch target buffer mispredict but i got nothing .So it mispredicts the first time and the last that bne is running .In the first time says its 2 cycles of branch taken stall and on last 2 cycles of branch misprediction stall any ideas?R12 is mentioned in another part of the code
lw R4,0(R3)
lw R8,0(R2)
dmul R8,R8,R4
daddi R3,R3,-8
daddi R11,R2,-8
dadd R9,R9,R8
daddi R2,R2,-8
bne R11,R12,loop
I don't know the winmips64 architecture specifically, i.e. how it is or isn't different from other MIPS pipelined implementations.  So, if someone else knows specifics, please correct me if I'm wrong.
That both the branch taken and mispredict cost 2 cycles is consistent with standard 5 stage pipeline, where the branch decision (taken/not taken) is fully resolved in EX stage, and thus has to flush the instructions in prior stages, IF and ID, when it's wrong.
The first time the branch executes, it is unknown to the branch predictor, and the processor appears to move forward with an assumption of not taken; thus, message about the branch taken stall.
The last time the branch executes, it is know to the branch predictor, however, the branch is not taken, which doesn't match the prediction as the predictor is assuming it will continue to loop back as it has just been doing.  Thus, the message of misprediction.

How many instructions need to be killed on a miss-predict in a 6-stage scalar or superscalar MIPS?

I am working on a pipeline with 6 stages: F D I X0 X1 W. I am asked how many instructions need to be killed when a branch miss-predict happens.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like it would require killing 4 instructions that are in the process of flowing through the pipeline. Is that correct?
I am also asked how many need to be killed if the pipeline is a three-wide superscalar. This one I am not sure on. I think that it would be 12 because you can fetch 3 instructions at a time. Is that correct?
kill all the instructions that came after the branch
Not if this is a real MIPS. MIPS has one branch-delay slot: The instruction after a branch always executes whether the branch is taken or not. (jal's return address is the end of the delay slot so it doesn't execute twice.)
This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It managed that 1 cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the 2nd half of a clock cycle. This is why MIPS branch conditions are all "simple" (don't need carry propagation through the whole word), like beq between two registers but only one-operand bgez / bltz against an implicit 0 for signed 2's complement comparisons. That only has to check the sign bit.
If your pipeline was well-designed, you'd expect it to resolve branches after X0 because the MIPS ISA is already limited to make low-latency branch decision easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, defeating the purpose of making it run MIPS code instead of RISC-V or whatever other RISC instruction set.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch.
I think 4 cycles looks right for a generic scalar pipeline without a branch delay slot.
At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages, waiting to move to the next stage on that clock edge. (Assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.
(Unless there was an I-cache miss fetching the delay slot instruction, in which case the delay slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest previous instruction in the pipeline. Delay slots are not free to implement, also complicating exception handling.)
So 0..3 instructions need to be killed in pipeline stages from F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. I-cache miss latency long enough that it's either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on X0 being a bubble or not.)
I think that it would be 12 because you can fetch 3 instructions at a time
No. Remember the branch itself is one of a group of 3 instructions that can go through the pipeline. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.
The worst case is I think when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Then (assuming no bubbles) you're cancelling 13 (or 14) instructions, 3 in each previous stage.
The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).
So for a 3-wide version of this pipeline with no delay slot, depending on bubbles in previous pipeline stages, you're killing 0..14 instructions that are in the pipeline already.
Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.

When do we create base pointer in a function - before or after local variables?

I am reading the Programming From Ground Up book. I see two different examples of how the base pointer %ebp is created from the current stack position %esp.
In one case, it is done before the local variables.
_start:
# INITIALIZE PROGRAM
subl $ST_SIZE_RESERVE, %esp # Allocate space for pointers on the
# stack (file descriptors in this
# case)
movl %esp, %ebp
The _start however is not like other functions, it is the entry point of the program.
In another case it is done after.
power:
pushl %ebp # Save old base pointer
movl %esp, %ebp # Make stack pointer the base pointer
subl $4, %esp # Get room for our local storage
So my question is, do we first reserve space for local variables in the stack and create the base pointer or first create the base pointer and then reserve space for local variables?
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc. Does C have a specific convention when it creates the machine code?
My reasoning is that all the code in a function would be relative to the base pointer, so as long as that function follows the convention according to which it created a reference of the stack, it just works?
Few related links for those are interested:
Function Prologue
In your first case you don't care about preservation - this is the entry point. You are trashing %ebp when you exit the program - who cares about the state of the registers? It doesn't matter any more as your application has ended. But in a function, when you return from that function the caller certainly doesn't want %ebp trashed. Now can you modify %esp first then save %ebp then use %ebp? Sure, so long as you unwind the same way on the other end of the function, you may not need to have a frame pointer at all, often that is just a personal choice.
You just need a relative picture of the world. A frame pointer is usually just there to make the compiler author's job easier, actually it is usually there just to waste a register for many instruction sets. Perhaps because some teacher or textbook taught it that way, and nobody asked why.
For coding sanity, the compiler author's sanity etc, it is desirable if you need to use the stack to have a base address from which to offset into your portion of the stack, FOR THE DURATION of the function. Or at least after the setup and before the cleanup. This can be the stack pointer(sp) itself or it can be a frame pointer, sometimes it is obvious from the instruction set. Some have a stack that grows down (in address space toward zero) and the stack pointer can only have positive offsets in sp based address (sane) or some negative only (insane) (unlikely but lets say its there). So you may want a general purpose register. Maybe there are some you cant use the sp in addressing at all and you have to use a general purpose register.
Bottom line, for sanity you want a reference point to offset items in the stack, the more painful way but uses less memory would be to add and remove things as you go:
x is at sp+4
push a
push b
do stuff
x is at sp+12
pop b
x is at sp+8
call something
pop a
x is at sp+4
do stuff
More work but can make a program (compiler) that keeps track and is less error prone than a human by hand, but when debugging the compiler output (a human) it is harder to follow and keep track. So generally we burn the stack space and have one reference point. A frame pointer can be used to separate the incoming parameters and the local variables using base pointer(bp) for example as a static base address within the function and sp as the base address for local variables (athough sp could be used for everything if the instruction set provides that much of an offset). So by pushing bp then modifying sp you are creating this two base address situation, sp can move around perhaps for local stuff (although not usually sane) and bp can be used as a static place to grab parameters if this is a calling convention that dictates all parameters are on the stack (generally when you dont have a lot of general purpose registers) sometimes you see the parameters are copied to local allocation on the stack for later use, but if you have enough registers you may see that instead a register is saved on the stack and used in the function instead of needing to access the stack using a base address and offset.
unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int x )
{
unsigned int y;
y = x;
return(more_fun(x+1)+y);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e2800001 add r0, r0, #1
c: ebfffffe bl 0 <more_fun>
10: e0800004 add r0, r0, r4
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Do not take what you see in a text book, white board (or on answers in StackOverflow) as gospel. Think through the problem, and through alternatives.
Are the alternatives functionally broken?
Are they functionally correct?
Are there disadvantages like readability?
Performance?
Is the performance hit universal or does it depend on just how
slow/fast the memory is?
Do the alternatives generate more code which is a performance hit but
maybe that code is pipelined vs random memory accesses?
If I don't use a frame pointer does the architecture let me regain
that register for general purpose use?
In the first example bp is being trashed, that is bad in general but this is the entry point to the program, there is no need to preserve bp (unless the operating system dictates).
In a function though, based on the calling convention one assumes that bpis used by the caller and must be preserved, so you have to save it on the stack to use it. In this case it appears to want to be used to access parameters passed in by the caller on the stack, then sp is moved to make room for (and possibly access but not necessarily required if bp can be used) local variables.
If you were to modify sp first then push bp, you would basically have two pointers one push width away from each other, does that make much sense? Does it make sense to have two frame pointers anyway and if so does it make sense to have them almost the same address?
By pushing bp first and if the calling convention pushes the first paramemter last then as a compiler author you can make bp+N always or ideally always point at the first parameter for a fixed value N likewise bp+M always points at the second. A bit lazy to me, but if the register is there to be burned then burn it...
In one case, it is done before the local variables.
_start is not a function. It's your entry point. There's no return address, and no caller's value of %ebp to save.
The i386 System V ABI doc suggests (in section 2.3.1 Initial Stack and Register State) that you might want to zero %ebp to mark the deepest stack frame. (i.e. before your first call instruction, so the linked list of saved ebp values has a NULL terminator when that first function pushes the zeroed ebp. See below).
Does C have a specific convention when it creates the machine code?
No, unlike in some other x86 systems, the i386 System V ABI doesn't require much about your stack-frame layout. (Linux uses the System V ABI / calling convention, and the book you're using (PGU) is for Linux.)
In some calling conventions, setting up ebp is not optional, and the function entry sequence has to push ebp just below the return address. This creates a linked list of stack frames which allows an exception handler (or debugger) to backtrace up the stack. (How to generate the backtrace by looking at the stack values?). I think this is required in 32-bit Windows code for SEH (structured exception handling), at least in some cases, but IDK the details.
The i386 SysV ABI defines an alternate mechanism for stack unwinding which makes frame pointers optional, using metadata in another section (.eh_frame and .eh_frame_hdr which contains metadata created by .cfi_... assembler directives, which in theory you could write yourself if you wanted stack-unwinding through your function to work. i.e. if you were calling any C++ code which expected throw to work.)
If you want to use the traditional frame-walking in current gdb, you have to actually do it yourself by defining a GDB function like gdb backtrace by walking frame pointers or Force GDB to use frame-pointer based unwinding. Or apparently if your executable has no .eh_frame section at all, gdb will use the EBP-based stack-walking method.
If you compile with gcc -fno-omit-frame-pointer, your call stack will have this linked-list property, because when C compilers do make proper stack frames, they push ebp first.
IIRC, perf has a mode for using the frame-pointer chain to get backtraces while profiling, and apparently this can be more reliable than the default .eh_frame stuff for correctly accounting which functions are responsible for using the most CPU time. (Or causing the most cache misses, branch mispredicts, or whatever else you're counting with performance counters.)
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc.
Yes, it would work fine. In fact setting up ebp at all is optional, but when writing by hand it's easier to have a fixed base (unlike esp which moves around when you push/pop).
For the same reason, it's easier to stick to the convention of mov %esp, %ebp after one push (of the old %ebp), so the first function arg is always at ebp+8. See What is stack frame in assembly? for the usual convention.
But you could maybe save code size by having ebp point in the middle of some space you reserved, so all the memory addressable with an ebp + disp8 addressing mode is usable. (disp8 is a signed 8-bit displacement: -128 to +124 if we're limiting to 4-byte aligned locations). This saves code bytes vs. needing a disp32 to reach farther. So you might do
bigfunc:
push %ebp
lea -112(%esp), %ebp # first arg at ebp+8+112 = 120(%ebp)
sub $236, %esp # locals from -124(%ebp) ... 108(%ebp)
# saved EBP at 112(%ebp), ret addr at 116(%ebp)
# 236 was chosen to leave %esp 16-byte aligned.
Or delay saving any registers until after reserving space for locals, so we aren't using up any of the locations (other than the ret addr) with saved values we never want to address.
bigfunc2: # first arg at 4(%esp)
sub $252, %esp # first arg at 252+4(%esp)
push %ebp # first arg at 252+4+4(%esp)
lea 140(%esp), %ebp # first arg at 260-140 = 120(%ebp)
push %edi # save the other call-preserved regs
push %esi
push %ebx
# %esp is 16-byte aligned after these pushes, in case that matters
(Remember to be careful how you restore registers and clean up. You can't use leave because esp = ebp isn't right. With the "normal" stack frame sequence, you might restore other pushed registers (from near the saved EBP) with mov, then use leave. Or restore esp to point at the last push (with add), and use pop instructions.)
But if you're going to do this, there's no advantage to using ebp instead of ebx or something. In fact, there's a disadvantage to using ebp: the 0(%ebp) addressing mode requires a disp8 of 0, instead of no displacement, but %ebx wouldn't. So use %ebp for a non-pointer scratch register. Or at least one that you don't dereference without a displacement. (This quirk is irrelevant with a real frame pointer: (%ebp) is the saved EBP value. And BTW, the encoding that would mean (%ebp) with no displacement is how the ModRM byte encodes a disp32 with no base register, like (12345) or my_label)
These example are pretty artifical; you usually don't need that much space for locals unless it's an array, and then you'd use indexed addressing modes or pointers, not just a disp8 relative to ebp. But maybe you need space for a few 32-byte AVX vectors. In 32-bit code with only 8 vector registers, that's plausible.
AVX512 compressed disp8 mostly defeats this argument for 64-byte AVX512 vectors, though. (But AVX512 in 32-bit mode can still only use 8 vector registers, zmm0-zmm7, so you could easily need to spill some. You only get x/ymm8-15 and zmm8-31 in 64-bit mode.)

Organizing pipeline in MIPS

I'm unsure about how the following properties affect pipeline execution for a 5 stage MIPS design (IF, ID, EX, MEM, WB). I just need some clearing up.
only 1 memory port
no data fowarding.
Branch stalls until end of * stage
Does the 1 memory port mean we cannot fetch or write when we read/write to mem (i.e. MEM stage on lw,sw you can't enter IF or another MEM)?
With no forwarding does this means an instruction won't enter the ID stage until after or on the WB stage for the previous instruction it depends on?
Idk what the branch stall means
A common assumption is that you can write in the first half of a cycle, and read in the second half of a cycle.
Lets say than I1 is your first instruction and I2 your second instruction, and I2 is using a register that I1 is modifying.
Only 1 memory port.
This means that you cannot read or write memory at the same time in two different stages of the pipelines.
For instance, if I1 is at the MEM stage, another instruction cannot be at the IF stage at the same time, because both require memory access.
No data forwarding. Data forwarding reflects to the fact that at the end of EX stage for I1, you forward the data to the ID cycle of I2.
Consequently, no forwarding means that the pipeline has to wait for the WB stage of the I1 to go to ID stage of I2. With the asumption, you can go to ID stage at the same time as the WB stage of the previous instruction, because WB will write to memory during the first half of the cycle, and ID will read from memory during the second half of the cycle.
Branch stalls until end of EX stage. This is a common asumption, that doesn't use branch prediction techniques. It simply states that an instruction after a branch has to wait until the end of EX stage to start ID stage. Recall that the address of the next instruction to be executed is known only at the EX stage of the branch instruction.
Comment: IF and MEM access separate sections of memory. One is data memory (.data) and the other instruction memory (.code or .text). It is designed this way so that accessing memory during IF and MEM does not cause a structural stall.
The area which is used by .data is that used by the stack, with the stack is traditionally placed right at the "end" of the .data sector. This is why if you don't subtract from the stack pointer address before saving data to the stack you run the risk of overwriting your program code. As MIPS allows you to designate the stack address manually, some people choose put the stack a bit "before" the end in order to avoid problems if they know they will have space and not overwrite variables in MEM. For instance, placing the stack at 0x300 instead of 0x400 in WinMIPS64. I am not sure if that is good practice or not. But I have heard of people doing it.

get wrong epc on MIPS

I know MIPS would get wrong epc register value when it happens at branch delay, and epc = fault_address - 4.
But now, I often get the wrong EPC value which is even NOT in .text segment such as 0xb6000000, what's wrong with the case??
Thanks for your advance..
The CPU does not know anything about the boundaries of the .text region in your program. It simply implements a 2^32 byte address space.
It is possible for an incorrectly programmed jump to go to any address within the 2^32 byte address space. The jump instruction itself will not cause any sort of exception - in fact the MIPS32® Architecture for Programmers Volume II: The MIPS32® Instruction Set explicitly states that jump (J, JR, JALR) instructions do not trigger any exceptions.
When the processor starts executing from the destination of an incorrectly programmed jump, in presumably uninitialized memory, what happens next depends on the contents of that memory. If uninitialized memory is filled with "random" data, that data will be interpreted as instructions which the processor will execute until an illegal instruction is found, or until an instruction triggers some other exception.