How many instructions need to be killed on a miss-predict in a 6-stage scalar or superscalar MIPS? - mips

I am working on a pipeline with 6 stages: F D I X0 X1 W. I am asked how many instructions need to be killed when a branch miss-predict happens.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like it would require killing 4 instructions that are in the process of flowing through the pipeline. Is that correct?
I am also asked how many need to be killed if the pipeline is a three-wide superscalar. This one I am not sure on. I think that it would be 12 because you can fetch 3 instructions at a time. Is that correct?

kill all the instructions that came after the branch
Not if this is a real MIPS. MIPS has one branch-delay slot: The instruction after a branch always executes whether the branch is taken or not. (jal's return address is the end of the delay slot so it doesn't execute twice.)
This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It managed that 1 cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the 2nd half of a clock cycle. This is why MIPS branch conditions are all "simple" (don't need carry propagation through the whole word), like beq between two registers but only one-operand bgez / bltz against an implicit 0 for signed 2's complement comparisons. That only has to check the sign bit.
If your pipeline was well-designed, you'd expect it to resolve branches after X0 because the MIPS ISA is already limited to make low-latency branch decision easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, defeating the purpose of making it run MIPS code instead of RISC-V or whatever other RISC instruction set.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch.
I think 4 cycles looks right for a generic scalar pipeline without a branch delay slot.
At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages, waiting to move to the next stage on that clock edge. (Assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.
(Unless there was an I-cache miss fetching the delay slot instruction, in which case the delay slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest previous instruction in the pipeline. Delay slots are not free to implement, also complicating exception handling.)
So 0..3 instructions need to be killed in pipeline stages from F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. I-cache miss latency long enough that it's either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on X0 being a bubble or not.)
I think that it would be 12 because you can fetch 3 instructions at a time
No. Remember the branch itself is one of a group of 3 instructions that can go through the pipeline. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.
The worst case is I think when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Then (assuming no bubbles) you're cancelling 13 (or 14) instructions, 3 in each previous stage.
The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).
So for a 3-wide version of this pipeline with no delay slot, depending on bubbles in previous pipeline stages, you're killing 0..14 instructions that are in the pipeline already.
Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.

Related

Why A and B registers are used in multicycle Datapath?

Why are registers A and B whose inputs are ReadData1 and ReadData2 of RegisterFile are necessary? Isn't it possible to use directly the values which are on ReadData1 and ReadData2 outputs of Register File?
Instruction Register is already loaded with an instruction, so the value of IR is fixed, which means that $rs, $rt, $rd reg numbers are obviously the same within a single instruction. Hence, there are always the same values on the ReadRegister1 and ReadRegister2 inputs of the Register File. So the values that are on RD1 and RD2 outputs are the same unless the corresponding registers inside RegisterFile are overwritten.
That means that A and B registers are necessary only for the instructions that require to have the values of $rs or $rd registers that were overwritten on previous cycle. Can anybody give me an example of such an instruction.
The general pattern is that during a clock cycle: at the start of the clock, some register(s) feed values to computational logic which feed values to (the same or other) register(s) by the end of the clock, so that it can start all over again for the next cycle.
In the single cycle datapath, the value in the PC starts the process of the cycle, and by the end of the cycle, the PC register is updated to repeat a new cycle with another value.  Along the way, the register file is both consulted and also (potentially) updated.  You may note that these A & B registers are not present in the single cycle datapath.
You are correct that those values do not change during the execution of any one single instruction on the multicycle datapath.
However, the multicycle processor uses multiple cycles to execute a single instruction (so that it can speed the clock).  In order to support the successive cycles in that processor design, some internal registers are used — they capture the output of a prior cycle in order for a next cycle to do something different.
The problem with the multicycle datapath diagrams is that they don't make it clear what part of the processor runs in what cycles.  Those A & B registers are there to support a cycle boundary, so the decode is happening in one cycle and the arithmetic/ALU in another cycle.  (Without those registers, the processor would have to perform decode again in the subsequent cycle, which would decrease the clock rate and defeat the nature of the multicycle datapath.)
Boundaries are made much more clear in pipeline datapath diagrams.  Search for "MIPS pipeline datapath".  (Note that some pipeline datapath diagrams show registers between stages and others simply outline what's in what stage without showing those registers.)  The large vertical bars are registers and they separate the pipeline stages.  Of course, the pipeline processor executes all pipeline stages in parallel, though in theory, similar boundaries are applicable to the cycles in a multicycle processor.  Note that the ID/EX pipeline register in the pipeline datapath serves the purpose of the A & B registers in the multicycle datapath.

winmips64 branch target buffer does 2 cycles stall

So i have been searching why on winmips64 there were 2 cycles stalls each time branch target buffer mispredict but i got nothing .So it mispredicts the first time and the last that bne is running .In the first time says its 2 cycles of branch taken stall and on last 2 cycles of branch misprediction stall any ideas?R12 is mentioned in another part of the code
lw R4,0(R3)
lw R8,0(R2)
dmul R8,R8,R4
daddi R3,R3,-8
daddi R11,R2,-8
dadd R9,R9,R8
daddi R2,R2,-8
bne R11,R12,loop
I don't know the winmips64 architecture specifically, i.e. how it is or isn't different from other MIPS pipelined implementations.  So, if someone else knows specifics, please correct me if I'm wrong.
That both the branch taken and mispredict cost 2 cycles is consistent with standard 5 stage pipeline, where the branch decision (taken/not taken) is fully resolved in EX stage, and thus has to flush the instructions in prior stages, IF and ID, when it's wrong.
The first time the branch executes, it is unknown to the branch predictor, and the processor appears to move forward with an assumption of not taken; thus, message about the branch taken stall.
The last time the branch executes, it is know to the branch predictor, however, the branch is not taken, which doesn't match the prediction as the predictor is assuming it will continue to loop back as it has just been doing.  Thus, the message of misprediction.

Organizing pipeline in MIPS

I'm unsure about how the following properties affect pipeline execution for a 5 stage MIPS design (IF, ID, EX, MEM, WB). I just need some clearing up.
only 1 memory port
no data fowarding.
Branch stalls until end of * stage
Does the 1 memory port mean we cannot fetch or write when we read/write to mem (i.e. MEM stage on lw,sw you can't enter IF or another MEM)?
With no forwarding does this means an instruction won't enter the ID stage until after or on the WB stage for the previous instruction it depends on?
Idk what the branch stall means
A common assumption is that you can write in the first half of a cycle, and read in the second half of a cycle.
Lets say than I1 is your first instruction and I2 your second instruction, and I2 is using a register that I1 is modifying.
Only 1 memory port.
This means that you cannot read or write memory at the same time in two different stages of the pipelines.
For instance, if I1 is at the MEM stage, another instruction cannot be at the IF stage at the same time, because both require memory access.
No data forwarding. Data forwarding reflects to the fact that at the end of EX stage for I1, you forward the data to the ID cycle of I2.
Consequently, no forwarding means that the pipeline has to wait for the WB stage of the I1 to go to ID stage of I2. With the asumption, you can go to ID stage at the same time as the WB stage of the previous instruction, because WB will write to memory during the first half of the cycle, and ID will read from memory during the second half of the cycle.
Branch stalls until end of EX stage. This is a common asumption, that doesn't use branch prediction techniques. It simply states that an instruction after a branch has to wait until the end of EX stage to start ID stage. Recall that the address of the next instruction to be executed is known only at the EX stage of the branch instruction.
Comment: IF and MEM access separate sections of memory. One is data memory (.data) and the other instruction memory (.code or .text). It is designed this way so that accessing memory during IF and MEM does not cause a structural stall.
The area which is used by .data is that used by the stack, with the stack is traditionally placed right at the "end" of the .data sector. This is why if you don't subtract from the stack pointer address before saving data to the stack you run the risk of overwriting your program code. As MIPS allows you to designate the stack address manually, some people choose put the stack a bit "before" the end in order to avoid problems if they know they will have space and not overwrite variables in MEM. For instance, placing the stack at 0x300 instead of 0x400 in WinMIPS64. I am not sure if that is good practice or not. But I have heard of people doing it.

Pipeline Hazards questions

I'm currently studying for an exam tomorrow and need some help understanding the following:
The following program is given:
ADDF R12, R13, R14
ADD R1,R8,R9
MUL R4,R2,R3
MUL R5,R6,R7
ADD R10,R5,R7
ADD R11,R2,R3
Find the potential conficts that can arise if the architecture has:
a) No pipeline
b) A Pipeline
c) Multiple pipelines
So for (b) I would say the instruction on line 5 is a Data Hazard because it fetches the value of R5 which is from the previous line given the result of a multiplication, so that instruction is not yet finished.
But what happens if an architecture doesn't have a pipeline? My best guess is that no hazards exist, but I'm not sure.
Also, what happens if it has 2 or more pipelines?
Cheers.
You are correct to suggest that for a) there are no hazards as each instruction must complete before the next starts.
For b):
There is a "Read After Write" dependency between lines 4 and 5.
There are "Read After Read" dependencies between lines 4 and 5 and also between lines 2 and 6.
I suspect that the difference between parts b) and c) is that the question assumes you know ahead of time that the pipe-line has a well defined number of stages. For example we know that if the pipe-line has 3 stages then the RAR dependency between lines 2 and 6 is irrelevant.
In a system with multiple pipelines however the system could fetch say 4 instructions per cycle making dependencies that were formally too far apart now potential hazards.

MIPS exception handling (Specifically branch delay slots)

Say an exception has been hit in the branch delay slot of a conditional branch
e.g.
BEQ a0, zero, _true
BREAK (0000)
sw a0, 0000(t0)
_true:
sw a1, 0000(t0)
My exception handler will pick up the exception type 9 from the BREAK instruction and set the BD bit of the CAUSE register to 1 as it is in the branch delay and the EPC will be the address of the branch.
The documentation says that this will require complex processing which isn't described. i.e. Getting the target of the branch/jump, doing any required comparison then setting the PC to the true or false address.
My solution to get around the complex processing (which is a bit of a hack) is as follows:
Store the instruction in the branch delay slot
NOP the instruction in the branch delay slot
Return from the exception handler restoring all registers
re-execute the *BEQ a0, zero, _true* and the branch delay will be a nop so it will have no effect
Place a sw breakpoint at the target(s) of the branch and set a flag
once the sw breakpoint is hit restore the branch delay slot and remove traces of the sw breakpoints.
Parsing branches and jumps is fine (hence why i can get the targets) but in the conditional branches, once i have parsed, i then have to do the comparisons to determine whether to jump to the true part of go to the false (next line) which i feel is more work than i would like. Do I not??
My problem with my hacky method is:
Will the CPU have already stored that it has hit the conditional branch and have determined that after the branch delay slot has been executed whether it is going to take the branch or not, therefore once i point the Program Counter back to the branch and it gets executed instead of executing correctly it thinks it must jump to the true or false part of the branch which was pre-determined before the exception occurred? (try a "double jump")
do you got the MIPS programmers documents? if you want an 100% accurate answer read them - if not I can just tell you the important bits as I remember them.
in short - yes you need to load the instruction from memory, parse it and interpret the result to figure out where you have to continue. "Patching" the code as you expressed would work too, but you need to make sure the instruction cache gets invalidated, else you will be running from the cache and end in an infinite loop.
the updating of the PC follows after the delay slot has been executed, until then it will point to the branch. there is no special handling during an exception except you have a register which says if you are in a delay slot or not.
you`d need to emulate all instructions that can conditionally raise an exception in your handler (load/store) along with the branch instructions. if its another kind of instruction in the DS you just can restart at the branch (the exception was an external interrupt in this case).
if your concern is about performance then simply dont put exception-raising instructions in a delay slot.
edit: and no, MIPS stores nothing about interrupted instructions, but the method you are suggesting likely will be slower due to having to invalidate the ICache twice