WinMIPS64 branch target buffer causes a 2-cycle stall - MIPS

So I have been searching for why, on WinMIPS64, there are 2-cycle stalls each time the branch target buffer mispredicts, but I found nothing. It mispredicts the first time and the last time the bne runs. The first time it reports 2 cycles of branch-taken stall, and the last time 2 cycles of branch-misprediction stall. Any ideas? (R12 is set in another part of the code.)
loop: lw R4,0(R3)
lw R8,0(R2)
dmul R8,R8,R4
daddi R3,R3,-8
daddi R11,R2,-8
dadd R9,R9,R8
daddi R2,R2,-8
bne R11,R12,loop

I don't know the WinMIPS64 architecture specifically, i.e. how it is or isn't different from other pipelined MIPS implementations. So if someone else knows the specifics, please correct me if I'm wrong.
That both the branch-taken and mispredict cases cost 2 cycles is consistent with a standard 5-stage pipeline, where the branch decision (taken/not taken) is fully resolved in the EX stage, so the instructions in the prior stages, IF and ID, have to be flushed when the prediction is wrong.
The first time the branch executes, it is unknown to the branch predictor, and the processor appears to move forward with an assumption of not-taken; hence the message about a branch-taken stall.
The last time the branch executes, it is known to the branch predictor; however, the branch is not taken, which doesn't match the prediction, since the predictor assumes it will keep looping back as it has just been doing. Hence the message about misprediction.
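A minimal sketch of that behavior (hypothetical Python, my own illustration rather than WinMIPS64's actual mechanism): a branch with no BTB entry is effectively predicted not-taken, so a loop branch mispredicts exactly twice, on its first and its last execution.

btb = set()   # addresses of branches previously seen taken
mispredicts = 0
N = 5         # loop iterations
for i in range(N):
    taken = (i < N - 1)               # bne falls through on the last iteration
    predicted_taken = ('bne' in btb)  # no BTB entry -> predict not taken
    if predicted_taken != taken:
        mispredicts += 1              # each one costs a 2-cycle flush here
    if taken:
        btb.add('bne')                # allocate a BTB entry once taken
print(mispredicts)                    # 2: the first and last executions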

Related

How many instructions need to be killed on a mispredict in a 6-stage scalar or superscalar MIPS?

I am working on a pipeline with 6 stages: F D I X0 X1 W. I am asked how many instructions need to be killed when a branch mispredict happens.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like it would require killing 4 instructions that are in the process of flowing through the pipeline. Is that correct?
I am also asked how many need to be killed if the pipeline is a three-wide superscalar. This one I am not sure on. I think that it would be 12 because you can fetch 3 instructions at a time. Is that correct?
kill all the instructions that came after the branch
Not if this is a real MIPS. MIPS has one branch-delay slot: The instruction after a branch always executes whether the branch is taken or not. (jal's return address is the end of the delay slot so it doesn't execute twice.)
This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It managed that 1 cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the 2nd half of a clock cycle. This is why MIPS branch conditions are all "simple" (don't need carry propagation through the whole word), like beq between two registers but only one-operand bgez / bltz against an implicit 0 for signed 2's complement comparisons. That only has to check the sign bit.
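To see why those conditions are cheap (a hedged illustration in Python, not actual hardware): equality is a bitwise XOR plus an all-zero check, and a sign test reads a single bit, so neither needs a carry to ripple across the whole word the way a general magnitude compare does.

MASK = 0xFFFFFFFF  # model 32-bit registers

def beq_taken(a, b):
    return ((a ^ b) & MASK) == 0     # equal iff no bit differs; no carry chain

def bltz_taken(a):
    return ((a >> 31) & 1) == 1      # signed "less than zero" is just the sign bit

print(beq_taken(7, 7), bltz_taken(0x80000000))  # True True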
If your pipeline was well-designed, you'd expect it to resolve branches after X0 because the MIPS ISA is already limited to make low-latency branch decision easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, defeating the purpose of making it run MIPS code instead of RISC-V or whatever other RISC instruction set.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch.
I think 4 looks right for a generic scalar pipeline without a branch-delay slot.
At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages, waiting to move to the next stage on that clock edge. (Assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.
(Unless there was an I-cache miss fetching the delay slot instruction, in which case the delay slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest previous instruction in the pipeline. Delay slots are not free to implement, also complicating exception handling.)
So 0..3 instructions need to be killed in pipeline stages from F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. I-cache miss latency long enough that it's either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on X0 being a bubble or not.)
I think that it would be 12 because you can fetch 3 instructions at a time
No. Remember the branch itself is one of a group of 3 instructions that can go through the pipeline. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.
The worst case is I think when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Then (assuming no bubbles) you're cancelling 13 (or 14) instructions, 3 in each previous stage.
The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).
So for a 3-wide version of this pipeline with no delay slot, depending on bubbles in previous pipeline stages, you're killing 0..14 instructions that are in the pipeline already.
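As a sanity check on those counts, here's a hypothetical Python helper (my own illustration, not from the question), assuming no bubbles and that the branch resolves at the end of X1, so four full stages (F, D, I, X0) hold younger work, plus any younger instructions in the branch's own fetch group:

def kills(width, younger_in_group, delay_slot):
    stages_behind = 4                          # F, D, I, X0 when X1 resolves
    total = younger_in_group + stages_behind * width
    return total - (1 if delay_slot else 0)    # the delay-slot insn survives

print(kills(1, 0, False), kills(1, 0, True))   # scalar: 4 and 3
print(kills(3, 2, False), kills(3, 0, False))  # 3-wide, branch first/last in group: 14 and 12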
Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.

Correlated branch prediction

I have this exercise related to correlated predictors that states the following:
A: BEQZ R1, D
…
D: BEQZ R1, F
…
F: NOT R1, R1
G: JUMP A
The prediction works as follows:
1. Fetch the current instruction.
2. If it is a branch, determine the current state of the predictor and predict the branch:
   a. The row is determined by the branch address (in this case either A or D).
   b. The column is determined by the current global shift register.
   c. Use the value in the cell to determine the prediction from the state machine (the current state is saved in the cell).
3. Execute the branch and determine the actual decision (Taken: 1, Not Taken: 0):
   a. Update the cell based on the current state and the actual decision.
   b. Update the global shift register (shift left and append the actual decision bit on the right).
4. Go to step 1.
This is the solution:
[image: Solved exercise]
I understand the scheme and know that a 2-bit predictor means fewer errors, but I cannot solve this question and have trouble seeing how the solution was found; any help would be appreciated.
This is a variation of the two-level adaptive predictor with global history table briefly described in Agner Fog's microarchitecture paper (page 15).
In this variant, the history register is shared across all branches; however, the pattern history table is local to a branch1.
The outcome of the last n (n = 2 in your case) branches is remembered (0 = Not taken, 1 = Taken), ordered from left to right in chronological order, forming an n-bit value that is used, along with the branch address2, to index a table of 2-bit saturating counters.
Each counter is incremented if the branch is taken and decremented otherwise (this is the canonical implementation, any 4-state FSA will do).
The meaning of each counter value is:
00b (0) Strongly not taken
01b (1) Weakly not taken
10b (2) Weakly taken
11b (3) Strongly taken
Saturating means that 3+1 = 3 (a strongly taken branch that is taken again stays strongly taken) and 0-1 = 0 (a strongly not taken branch that is again not taken stays strongly not taken), whereas normal register arithmetic is modulo 2^n.
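As a quick sketch (plain Python, my own illustration), the canonical saturating update is:

def update_counter(c, taken):
    if taken:
        return min(c + 1, 3)   # 3 + 1 saturates at 3 (strongly taken)
    return max(c - 1, 0)       # 0 - 1 saturates at 0 (strongly not taken)

print(update_counter(3, True), update_counter(0, False))  # 3 0 - no wraparound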
In your exercise the assumptions are:
The pattern history table is given as a 2D-table with rows corresponding to the full address of the branch and columns to the value of the global history register.
All counters start in the state 01b (Weakly not taken).
The global history register is 00b at reset.
R1 is 0 at the beginning.
Let's see the first iteration only.
First iteration
The instruction is BEQZ R1, D (a branch, obviously), its address is A.
Since R1 is 0, the branch will be taken (towards D).
Indexing into the table with a global history of 00b and address A gives us the counter value 01b (Weakly not taken) thus the prediction is not taken.
Once the CPU has executed the branch and flushed the mispredicted instructions, the table must be updated.
Since the branch was taken, the counter is incremented from 01b to 10b.
Finally, the global history goes from 00b to 01b since the branch is taken (a 1 is shifted in from the right).
Note that the yellow highlighted entries are those read when the corresponding instruction is executed, while the green ones are those updated by the previous prediction.
Thus to see that the counter value has been incremented you have to look at the next row.
Since the branch was taken, the CPU is now at D (BEQZ R1, F); this proceeds exactly as before, except that the global history register now has the value 01b.
After this instruction is executed the CPU is at F, so R1 becomes 111..11b (the solution just writes it as 1) and the two branches above are re-executed.
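To watch the whole table evolve, here is a hedged Python sketch of the exercise (my own reconstruction under the stated assumptions, not the official solution). Both branches execute on every trip around the loop (taken or not, control falls through to D and then F), and F toggles R1 between 0 and all-ones, so the outcomes alternate T,T,N,N,... and the predictor converges after the first two trips:

counters = {(b, h): 1 for b in 'AD' for h in range(4)}  # all start 01b, weakly not taken
history = 0b00    # global 2-bit shift register, 00b at reset
R1 = 0            # R1 is 0 at the beginning

def execute_branch(branch):
    global history
    taken = (R1 == 0)                             # BEQZ R1: taken iff R1 == 0
    predicted = counters[(branch, history)] >= 2  # 10b/11b mean predict taken
    print(f'{branch}: history={history:02b} '
          f'predicted={"T" if predicted else "N"} actual={"T" if taken else "N"}')
    c = counters[(branch, history)]
    counters[(branch, history)] = min(c + 1, 3) if taken else max(c - 1, 0)
    history = ((history << 1) | taken) & 0b11     # shift the outcome in from the right

for _ in range(4):                # four trips around the A..G loop
    execute_branch('A')           # A: BEQZ R1, D (reached every trip)
    execute_branch('D')           # D: BEQZ R1, F (reached every trip)
    R1 ^= 0xFFFFFFFFFFFFFFFF      # F: NOT R1, R1 toggles 0 <-> 111..11b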
1This is a simplification; the table is almost always a cache, as it's impractical to have an entry for every possible memory address where a branch can be found.
2Part of the address is used as the index into the cache; once a set has been selected, the rest of the address is compared with the tag in each way of the set.

MIPS Datapath Confusion

Been learning about the MIPS datapath and had a couple of questions.
Why is there a writeback stage?
-Thoughts: If it didn't add more latency or lengthen the clock cycle, it seems like you could move the mux in the writeback stage into the MEM stage, remove the MEM/WB buffer, and get rid of the writeback stage entirely. Why is this not the case?
Confusion about branch prediction and stalls.
-Thoughts: If an add instruction follows a beq instruction into the pipeline (beq in the ID stage, add in the fetch stage) but the branch is taken, how does the add instruction then get converted to a no-op? (Which control signals are set, and how?)
When are the inter-stage buffers updated?
Thoughts: I think they are updated at the end of the clock cycle, but I have been unable to verify this. Also, I am trying to understand what exactly happens during a stall. When a stall is needed, does the IF/ID inter-stage buffer get locked? If so, how is this done? Is the buffer then read to determine which instruction should be in the ID stage?
Thanks for any help
Here's a picture of the pipeline: [pipeline diagram not included here]
The writeback stage is for writing the result back to the registers. The MEM/WB buffer is there to hold data from the previous stage. By getting rid of the writeback stage, what you'd essentially be doing is extending the MEM stage. For example, in an instruction like
LW R1, 8(R2)
the contents of the memory location addressed by 8(R2) will be stored in the MEM/WB buffer. By copying the contents into the buffer, the MEM stage can immediately accept another LW instruction, hence more ILP.
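A tiny Python sketch of that overlap (my own illustration with hypothetical names): the MEM/WB latch holds last cycle's load result, so WB consumes it while MEM starts the next access.

def writeback(value):
    if value is not None:
        print('WB writes', value, 'to the register file')

mem_wb = None                                   # the MEM/WB inter-stage buffer
for mem_result in ('lw#1 data', 'lw#2 data', None):
    writeback(mem_wb)                           # WB uses the value latched last cycle
    mem_wb = mem_result                         # MEM latches its new result at the clock edge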
@Craig Estey has answered this correctly. However, even if you don't do the swapping @Craig mentioned, you can always use control signals and flush things in the IF and ID stages for the following instructions.
I am not sure there is a precise answer as to when an inter-stage buffer is updated. The way I see it: at the beginning of a clock cycle the data in the inter-stage buffer is not yet relevant, and at the end of the clock cycle it is. Control signals are used to control what is happening in each stage of the pipeline, meaning they can be used to tell the IF stage not to fetch anything.
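As a hedged sketch of the no-op conversion the question asks about (hypothetical Python, not any specific datapath): squashing means replacing the fetched word with a NOP, or equivalently zeroing the control signals decoded from it. Conveniently, the all-zero word is a MIPS nop (sll $0, $0, 0).

NOP = 0x00000000   # MIPS nop: sll $0, $0, 0 encodes as all zeros

def squash_if_id(if_id):
    # Turn the wrong-path instruction into a bubble; with all control
    # signals deasserted it writes no register and touches no memory.
    if_id['instr'] = NOP
    if_id['controls'] = {'RegWrite': 0, 'MemWrite': 0, 'Branch': 0}
    return if_id

latch = {'instr': 0x01095020,   # add $10, $8, $9 - the wrong-path add
         'controls': {'RegWrite': 1, 'MemWrite': 0, 'Branch': 0}}
print(squash_if_id(latch))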

Organizing pipeline in MIPS

I'm unsure about how the following properties affect pipeline execution for a 5 stage MIPS design (IF, ID, EX, MEM, WB). I just need some clearing up.
only 1 memory port
no data forwarding.
Branch stalls until end of * stage
Does the 1 memory port mean we cannot fetch or write while reading/writing memory (i.e., while a lw/sw is in the MEM stage, no other instruction can be in IF or another MEM)?
With no forwarding, does this mean an instruction won't enter the ID stage until after (or during) the WB stage of the previous instruction it depends on?
I don't know what the branch stall means.
A common assumption is that you can write in the first half of a cycle, and read in the second half of a cycle.
Let's say that I1 is your first instruction and I2 your second instruction, and I2 uses a register that I1 modifies.
Only 1 memory port.
This means that you cannot read or write memory at the same time in two different stages of the pipelines.
For instance, if I1 is at the MEM stage, another instruction cannot be at the IF stage at the same time, because both require memory access.
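A one-line sketch of that structural hazard (illustrative Python, my own names): with a single port, at most one in-flight instruction may occupy a memory-using stage per cycle.

def memory_port_conflict(stages_in_flight):
    # IF and MEM both need the single memory port
    return sum(stage in ('IF', 'MEM') for stage in stages_in_flight) > 1

print(memory_port_conflict(['MEM', 'EX', 'ID', 'IF']))  # True -> stall the fetch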
No data forwarding. Data forwarding refers to the fact that at the end of the EX stage of I1, the data is forwarded to the ID stage of I2.
Consequently, no forwarding means that the pipeline has to wait for the WB stage of I1 before running the ID stage of I2. With the assumption above, ID of I2 can happen in the same cycle as WB of I1, because WB writes to the register file during the first half of the cycle and ID reads from it during the second half.
Branch stalls until end of EX stage. This is a common assumption that doesn't use branch prediction techniques. It simply states that an instruction after a branch has to wait until the end of the branch's EX stage to start its ID stage. Recall that the address of the next instruction to be executed is known only at the EX stage of the branch instruction.
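Putting the no-forwarding rule in numbers (a hypothetical Python sketch under the half-cycle write/read assumption): if I1 occupies IF..WB in cycles 1..5, then I2's ID can share cycle 5 with I1's WB, so I2 sits stalled for 2 cycles past its natural ID slot in cycle 3.

stages = ['IF', 'ID', 'EX', 'MEM', 'WB']
i1 = {s: c for c, s in enumerate(stages, start=1)}    # I1 occupies cycles 1..5
i2_natural_id = i1['ID'] + 1                          # cycle 3 with no hazard
i2_actual_id = i1['WB']                               # cycle 5: WB writes first half, ID reads second
print('stall cycles:', i2_actual_id - i2_natural_id)  # stall cycles: 2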
Comment: IF and MEM access separate sections of memory. One is data memory (.data) and the other instruction memory (.code or .text). It is designed this way so that accessing memory during IF and MEM does not cause a structural stall.
The area used by .data is also where the stack lives, with the stack traditionally placed right at the "end" of the .data sector. This is why, if you don't subtract from the stack pointer before saving data to the stack, you run the risk of overwriting your program code. As MIPS allows you to set the stack address manually, some people choose to put the stack a bit "before" the end in order to avoid problems, if they know they will have the space and won't overwrite variables in memory. For instance, placing the stack at 0x300 instead of 0x400 in WinMIPS64. I am not sure whether that is good practice, but I have heard of people doing it.

MIPS exception handling (Specifically branch delay slots)

Say an exception has been hit in the branch delay slot of a conditional branch, e.g.:
BEQ a0, zero, _true
BREAK (0000)
sw a0, 0000(t0)
_true:
sw a1, 0000(t0)
My exception handler will pick up exception type 9 from the BREAK instruction and set the BD bit of the CAUSE register to 1, since the exception is in the branch delay slot, and the EPC will be the address of the branch.
The documentation says that this will require complex processing, which isn't described, i.e. getting the target of the branch/jump, doing any required comparison, then setting the PC to the true or false address.
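For the "getting the target" part, a small sketch (Python, assuming the standard MIPS32 I-type encoding; check your programmer's manual): the 16-bit immediate is a signed word offset relative to the delay-slot address, i.e. branch PC + 4.

def cond_branch_target(branch_pc, instr_word):
    imm = instr_word & 0xFFFF
    if imm & 0x8000:                       # sign-extend the 16-bit immediate
        imm -= 0x10000
    return branch_pc + 4 + (imm << 2)      # offset is relative to the delay slot

# e.g. a beq with an offset of +3 instructions, sitting at 0x1000:
print(hex(cond_branch_target(0x1000, 0x10000003)))   # 0x1010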
My solution to get around the complex processing (which is a bit of a hack) is as follows:
1. Store the instruction that is in the branch delay slot.
2. NOP out the instruction in the branch delay slot.
3. Return from the exception handler, restoring all registers.
4. Re-execute the BEQ a0, zero, _true; the branch delay slot is now a NOP, so it has no effect.
5. Place a sw breakpoint at the target(s) of the branch and set a flag.
6. Once the sw breakpoint is hit, restore the branch delay slot and remove all traces of the sw breakpoints.
Parsing branches and jumps is fine (hence why I can get the targets), but with conditional branches, once I have parsed one, I then have to do the comparison to determine whether to jump to the true part or fall through to the false part (the next line), which feels like more work than I would like. Do I not?
My problem with my hacky method is:
Will the CPU have already recorded that it hit the conditional branch, and determined once the branch delay slot executed whether it is going to take the branch or not? If so, when I point the program counter back at the branch and it re-executes, instead of executing correctly, might it think it must jump to the true or false part that was pre-determined before the exception occurred (attempting a kind of "double jump")?
Do you have the MIPS programmer's documents? If you want a 100% accurate answer, read them; if not, I can just tell you the important bits as I remember them.
In short: yes, you need to load the instruction from memory, parse it, and interpret the result to figure out where you have to continue. "Patching" the code as you described would work too, but you need to make sure the instruction cache gets invalidated, or else you will be running from the cache and end up in an infinite loop.
The updating of the PC happens after the delay slot has been executed; until then it will point to the branch. There is no special handling during an exception, except that you have a register bit which says whether you are in a delay slot or not.
You'd need to emulate in your handler all instructions that can conditionally raise an exception (loads/stores), along with the branch instructions. If it's another kind of instruction in the delay slot, you can just restart at the branch (in that case the exception was an external interrupt).
If your concern is performance, then simply don't put exception-raising instructions in a delay slot.
Edit: and no, MIPS stores nothing about interrupted instructions, but the method you are suggesting will likely be slower because it has to invalidate the I-cache twice.