In which pipeline stage is the branch decision made? - mips

In which RISC pipeline stage is the branch decision made? Is it in the "Decode" or "Execute" stage, or some other stage? Assume the pipeline has 5 stages - "IF", "ID", "EX", "MEM" and "WB".

There are a few ways to implement this in a classic 5-stage RISC in general. For unconditional direct (not register) branches, obviously you can detect them in ID and have the target PC ready for the next IF cycle (with 1 cycle of branch latency, i.e. 1 wasted IF cycle if you don't hide that latency somehow, e.g. MIPS's branch delay slot or branch prediction).
Some toy pipelines, like the one described in this answer, do the simplest thing and evaluate the branch in the ALU in EX, forwarding the result to a mux between PC+4 and PC+4+rel_offset and eventually on to IF, with 3 cycles of branch latency (end of EX to start of IF).
Actual commercial MIPS I (R2000) evaluated branch conditions in the first half-cycle of EX, forwarding to IF which only needed an address in the second half-cycle. See How does MIPS I handle branching on the previous ALU instruction without stalling? This gives a branch latency of 1 cycle, short enough to be fully hidden by 1 branch-delay slot, even for conditional or indirect jr $reg branches.
This half-cycle speed is why MIPS branch conditions are simple: only checking a whole register (or the XOR of two registers, for eq/ne) for zero / non-zero, or checking the MSB (sign bit), never a full magnitude compare like x<y. Simple RISCs with a FLAGS / status register (like PowerPC or ARM) could use a similar strategy of very quickly checking a flags condition.
(Note that RISC-V allows a full set of branch conditions; as described in RISC-V's design rationale, checking a whole register for all-zeros in modern CMOS designs is apparently not much less gate delay than comparing two registers for equality, or even for > or < with a good comparator, presumably something smarter than subtract with ripple-carry.
RISC-V assumes branch-prediction will hide branch delays.)
The previous version of this answer incorrectly claimed that MIPS I evaluated branch conditions in ID itself. A toy pipeline in this question does that, but it would require the inputs to be ready earlier than usual. It introduces the problem of a b?? instruction stalling while waiting for the EX result of the previous ALU instruction, as in common sequences like slt $at, $t1, $t2 / bnez $at, target (the expansion of a pseudo-instruction like blt $t1, $t2, target).
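To make that concrete, here's roughly what that pseudo-instruction expansion looks like (register names and the target label are just the ones from the example above):

        # blt $t1, $t2, target is not a real MIPS I instruction; the assembler expands it to:
        slt   $at, $t1, $t2      # $at = ($t1 < $t2) ? 1 : 0 ; ALU result available at the end of EX
        bnez  $at, target        # pseudo-instruction for: bne $at, $zero, target
        nop                      # branch-delay slot
    target:

If the branch condition were evaluated in ID, the bnez here would have to stall waiting for the slt's EX result; evaluating it in the first half of EX (with forwarding), as described above, avoids that stall.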
Wikipedia's Classic RISC (5-stage pipeline) article's Instruction Decode section was misleading at best, but has since been fixed. It now says "The branch condition is computed in the following cycle (after the register file is read)" - I think that was a bugfix, not just a clarification: it was all described in the ID section, implying that's where it happened, with no explicit phrasing to the contrary. Also, the still-present claim that "Some architectures made use of the Arithmetic logic unit (ALU) in the Execute stage, at the cost of slightly decreased instruction throughput" only makes sense if it's contrasting with evaluating branches earlier, since nothing else could be using the ALU during that time in a scalar in-order pipeline.
Other sources (like these slides: http://home.deib.polimi.it/santambr/dida/phd/wonderland/2014/doc/PDF/4_BranchHazard_StaticPrediction_V0.pdf) say "Branch Outcome and Branch Target Address are ready at the end of the EX stage (3rd stage)" for a classic MIPS beq instruction. That's not how the commercial R2000 worked, but it may be describing a simple MIPS implementation from a textbook or course material that does work that way.
Much discussion of MIPS is actually about hypothetical MIPS-like 5-stage RISC pipelines in general, not real MIPS R2000, or the classic Stanford MIPS CPU that R2000 was based on (but it was a full re-design). So it's hard to know whether something you find about "MIPS" applies to R2000 (gcc -march=mips1) or if it's for a simplified teaching version of MIPS.
Some "MIPS" implementations aren't even the same ISA, e.g. without branch-delay slots (which complicate exception handling significantly).
This originally wasn't a MIPS question at all, just generic classic
5-stage RISC. There were multiple early RISC ISAs, many of them originally designed around a 5-stage pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline). I don't know a lot about their internals:
Different architectures could make different choices, e.g. stall or use branch prediction + speculative fetch/decode if needed while they wait for the branch result to be ready from whatever stage produces it.
And even speculative execution is possible, even with a static prediction like forward not-taken / backward taken. If still in-order, mis-speculation can be caught before it reaches write-back or MEM. You don't want any speculative stores written to cache, but you can definitely catch it by the time the branch reaches EX. All instructions which have a control dependency on the branch are younger and therefore are in earlier pipeline stages (if present at all; IF could have missed in I-cache).
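For example, under a forward-not-taken / backward-taken static heuristic, a loop's closing branch is a backward branch and gets predicted taken, while a branch skipping ahead to a rare error path is a forward branch and gets predicted not-taken. A hypothetical fragment (registers and labels are made up):

    loop:
        lw    $t0, 0($a0)        # load the next element
        beq   $t0, $zero, error  # forward branch to a rarely-taken error path: predict not-taken
        addiu $a0, $a0, 4        # branch-delay slot: executes whether or not the beq is taken
        bne   $a0, $a1, loop     # backward branch closing the loop: predict taken
        nop                      # delay slot left unfilled for simplicity
        jr    $ra                # normal return
        nop                      # jr delay slot
    error:
        jr    $ra                # placeholder error path
        nop                      # jr delay slot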

Related

MIPS and interlocked stages

I'm studying the MIPS architecture, but I really don't understand some concepts.
For example, the "without interlocked pipelined stages".
I've also read that only the first implementations of MIPS didn't have interlocked pipeline stages, but I couldn't find out in which implementation interlocking was introduced. Can someone tell me?
I want to focus on "interlocked stages". I understand this concept to mean, for example, adding explicit nop operations that delay the execution of an instruction (for example a branch). Is that true, or is my conclusion off?
Moreover, if the first version of MIPS didn't have the "interlocked stages", how did it manage branch prediction?
Thank you all in advance!
adding explicit nop operations that delay the execution of an instruction
Rather than "adding a nop", the instruction in a delay slot is executed with the understanding that the architectural state is delayed — that the effect of the immediately prior instruction won't be see by the instruction in the delay slot.
It goes to having software do the "interlocking" instead of the hardware.
When the hardware doesn't interlock, delay slots are exposed to software and so software must accommodate, by shuffling code, or if it can't find anything useful to do in the delay stot, then filling that delay slot with a nop instruction.
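A small (made-up) MIPS fragment to illustrate: the instruction immediately after a branch is the delay slot and always executes, so the compiler tries to put something useful there, and falls back to a nop when it can't:

        beq   $a0, $zero, done   # conditional branch
        addiu $t0, $t0, 1        # branch-delay slot: executes whether or not the branch is taken,
                                 # so it must be something wanted on both paths
        sw    $t0, 0($a1)        # only reached on the fall-through (not-taken) path
    done:
        jr    $ra                # return; jr has a delay slot too
        nop                      # nothing useful to put there, so software "interlocks" with a nop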
See https://en.m.wikipedia.org/wiki/Delay_slot :
A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem.
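A minimal sketch of what that meant for MIPS I code (registers picked arbitrarily): the instruction right after a lw must not consume the loaded value, because it would see the register's old contents:

        lw    $t0, 0($a0)        # load a word from memory
        addiu $a0, $a0, 4        # load-delay slot: must not read $t0 (it would see the stale value on MIPS I)
        addu  $t1, $t1, $t0      # earliest instruction that may safely use the loaded $t0

From MIPS II onward the hardware interlocks instead, stalling as needed if the loaded value is used too soon.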
From https://en.wikipedia.org/wiki/MIPS_architecture#MIPS_II :
MIPS II removed the load delay slot
Although it's not totally clear from those Wikipedia articles, the branch delay slot apparently disappeared gradually: first MIPS II provided "branch likely" versions that only execute the delay-slot instruction on a taken branch, and then MIPS32 added "a new family of branches with no delay slot".
In summary, delay slots have been abandoned, as they really only worked for specific microarchitectures; as the technology evolved, those design features became burdens rather than offering their original benefit.
Moreover, if the first version of MIPS didn't have the "interlocked stages", how did it manage branch prediction?
As I understand it, MIPS I didn't really do dynamic branch prediction; rather, it simply assumed branches were not taken. However, by delaying the effect of a branch by one instruction (the branch-delay slot), it reduced the cost of assuming not-taken when the branch was actually taken.  It also supported only very simple conditional branch instructions (e.g. equal or not equal), so that the branch taken/not-taken computation could be performed early in the pipeline, perhaps as early as the ID stage.

What does the TargetWrite/IorD control line do on a multicycle MIPS processor

We learned all the main details about control lines and the general functionality of the MIPS chip in single cycle and also with pipelining.
But in the multicycle design, the control lines aren't identical, and there are other changes as well.
Specifically what does the TargetWrite (ALUout) and IorD control lines actually modify?
Based on my analysis, TW seems to modify where the PC points to depending on the bits it receives (for Jump, Branch, or standard moving to the next line)... Am I missing something?
Also what exactly does the IorD line do?
I looked at both course textbooks, See MIPS Run and Computer Architecture: A Quantitative Approach by Hennessy and Patterson, which don't seem to mention these lines...
First, let's note that this block diagram does not have separate instruction memory and data memory.  That means that it either has a unified cache or goes directly to memory.  Most other block diagrams for MIPS will have separate dedicated instruction memory (cache) and data memory (cache); the advantage of that is that the processor can read instructions and read/write data in parallel.  In a simple version of a multicycle processor, there is likely no need to read instructions and data in parallel, so a unified cache simplifies the hardware.
So, what IorD is doing is selecting the source for the address provided to the Memory — as to whether it is doing a fetch cycle for an instruction, or a read/write from/to data.
When IorD=0 then the PC provides the address from which to read (i.e. instruction fetch), and, when IorD=1 then the ALU provides the address to read/write data from.  For data operations, the ALU is computing a base + displacement addressing mode: Reg[rs] + SignExt32(imm16) as the effective address to use for the data read or write operation.
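For example (made-up registers and offsets), a load or store drives IorD=1 during its memory cycle, with the ALU having already computed the base + displacement effective address, while every instruction fetch uses IorD=0 and takes its address from the PC:

        lw    $t0, 16($sp)       # data address = Reg[$sp] + SignExt32(16); memory read with IorD=1
        sw    $t0, 0($a0)        # data address = Reg[$a0] + SignExt32(0);  memory write with IorD=1
                                 # (the fetch of each of these instructions itself used IorD=0, address = PC)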
Further, let's note that this block diagram does not contain a separate adder for incrementing the PC by 4, whereas most other block diagrams do.  Look up any of the first few MIPS single-cycle datapath images, and you'll see the dedicated adder for that PC increment.  Using a dedicated adder allows the PC to be incremented in parallel with operations done by the ALU, whereas omitting it means the main ALU must perform the PC increment.  However, omitting it probably saves transistors in a simple multicycle implementation, where the ALU is not in use every cycle and so can be used for this.
Since Target has a control TargetWrite, we might presume this is an internal register that might be useful in buffering the intended branch target address, for example, if the branch target is computed in one cycle, and finally used in another.
(I thought this could be about buffering for branch delay slot implementation (since those branches are delayed one instruction), but were that the case, the J-Type instructions would have also gone through Target, and they don't.)
So, it looks to me like the machinery there for this multicycle processor is to handle the branch instructions, say beq, which has to (see the annotated example after this list):
compute the next sequential PC address from PC + 4
compute the branch target address from (PC+4) + SignExt32(imm16)
compute the branch condition (does Reg[rs] == Reg[rt] ?)
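Concretely, for a hypothetical beq (registers and label invented for the example), those three quantities are:

        beq   $t1, $t2, taken    # condition:        is Reg[$t1] == Reg[$t2] ?
                                 # fall-through PC:  PC + 4
                                 # branch target:    (PC + 4) + SignExt32(imm16), where imm16 encodes
                                 #                   the (word) offset to the label 'taken'
        addiu $v0, $zero, 0      # fall-through path
    taken:
        jr    $ra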
But in what order would they be computed?  It is clear from the control signals in state 0 that PC+4 is computed first and written back to the PC, for all instructions (i.e. for branches, whether the branch is taken or not).
It seems to me that in a subsequent cycle, (PC+4) + SignExt32(imm16) is computed (reusing the prior PC+4, which is now in the PC register), and the result is stored in Target to buffer that value, since it isn't yet known whether the branch is taken.  In a following cycle, the contents of rs and rt are compared for equality: if equal, the branch should be taken, so PCSource=1, PCWrite=1 selects Target from the buffer to update the PC; if not taken, since the PC has already been updated to PC+4, that PC+4 stands (PCWrite=0, PCSource=don't care) as the start of the next instruction.  In either case the next instruction runs with whatever address the PC holds.
Alternatively, since the processor is multicycle, the order of computation could be: compute PC+4 and store it into the PC; compute the branch condition, and decide what kind of cycle to run next - namely, for the not-taken condition, go right to the next instruction-fetch cycle (with PC+4 in the PC), or, for a taken branch, compute (PC+4) + SignExt32(imm16), put that into the PC, and then go on to the next instruction-fetch cycle.
This alternative approach would require dynamic alteration of the cycles/state for branches, so would complicate the multicycle state machine somewhat and would also not require buffering of a branch Target — so I think it is more likely the former rather than this alternative.

Why does MIPS use one delay slot instead of two?

This seems to be the case in many RISC architectures. Since filling one delay slot saves us 50% of otherwise wasted cycles, why not give the programmer a chance to use both slots?
On MIPS R2000, the classic MIPS I that the ISA is designed around, 1 branch-delay slot is enough to hide branch latency: How does MIPS I forward from EX to ID for branches without stalling?.
Being able to check branch conditions with low latency (half a clock cycle) is why MIPS conditional branches are limited to eq/ne and/or checking the sign bit for x<0, not an arbitrary x<y.
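So the MIPS I condition forms look like the following (a catalogue of made-up examples, not a real instruction sequence); a general x<y has to be synthesized with slt plus a branch against $zero, as in the expansion shown earlier:

        beq   $t0, $t1, L1       # taken if the two registers are equal (cheap: XOR, then all-zero check)
        bne   $t0, $t1, L1       # taken if they differ
        bltz  $t0, L1            # taken if $t0 < 0: just the sign bit
        bgez  $t0, L1            # taken if $t0 >= 0
        # no single-instruction "blt $t0, $t1, L1": that needs slt $at, $t0, $t1 then bne $at, $zero, L1
    L1:
        jr    $ra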
On "Is that true if we can always fill the delay slot there is no need for branch prediction?", Paul Clayton answers that yes, filling the branch-delay slot in asm makes branch prediction useless on early MIPS. So that's more evidence that real commercial MIPS R2000 worked that way. You only need branch prediction if you lengthen the pipeline, like they did for the next-gen MIPS R4000. And then you have branch prediction instead of expecting compilers to statically fill the branch latency.
But anyway, one branch-delay slot is hard enough for compilers to fill reliably, and in hindsight is baggage for future implementations that use branch prediction instead of depending on a delay slot to hide branch latency. Especially superscalar implementations.
Making it architectural burdens all future CPUs with it, if they want to be binary-compatible. So even if early MIPS CPUs did have 2 cycles of branch latency, some foresight and cost/benefit consideration makes it pretty reasonable to only expose 1 cycle architecturally.

What's the role of EX stage for branching in Pipelined MIPS w Forwarding?

Consider the following Pipelined Processor structure:
Notice that the condition test for branching (the = circuit), as well as the target address calculation for the next instruction in case of branch taken are executed in the ID phase - as a way to save on stalls/flushes (as opposed to doing all that in the EX phase and forwarding the results in the MEM phase of the given branch instruction).
Since all the work gets done in Instruction Decode stage, why bother waiting for the given branching instruction to reach the EX stage? Does the EX stage ALU unit have a role in this, somehow?
Thank you in advance.
When a beq presents a control hazard, the pipelined processor does not know in advance which instruction to fetch next, because the branch decision has not been made by the time the next instruction is fetched.
Because the decision is made in the MEM stage, we need to stall the pipeline for three cycles at every branch, which of course hurts system performance.
Another way is to predict whether the branch will be taken and begin executing instructions based on the prediction. Once the branch decision is made and available, the processor can throw out (flush) those instructions if the prediction was wrong (the branch misprediction penalty), which also affects performance.
To reduce the branch misprediction penalty, one could make the branch decision earlier.
Making the decision simply requires comparing two registers. Using a dedicated equality comparator is faster than performing a subtraction and zero detection. If the comparator is fast enough, the branch decision can be moved back into the Decode stage, so that the operands are read from the register file and compared to determine the next PC by the end of the Decode stage.
Unfortunately, the early branch decision hardware introduces a new RAW data hazard.
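For example (hypothetical registers), with the comparator moved into Decode, a branch that immediately follows the instruction producing one of its source registers can no longer get that value in time through the usual EX/MEM forwarding:

        addu  $t0, $t1, $t2      # result is only produced at the end of this instruction's EX stage
        beq   $t0, $t3, L        # compares in ID, one stage earlier, so it needs $t0 before it exists:
                                 # the hardware must stall the beq (or add extra forwarding into ID)
    L:
        jr    $ra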
Since all the work gets done in Instruction Decode stage, why bother waiting for the given branching instruction to reach the EX stage? Does the EX stage ALU unit have a role in this, somehow?
The branch instruction is decoded and resolved entirely in the Decode stage; we do not wait for it to reach the EX stage.
As you pointed out in the question, both the branch comparison and the target address calculation are done in the DEC stage. The hardware takes care of RAW hazards by forwarding the required data from the correct stages (notice the little muxes right after the RegFile is read). As a result, the branch equality check sees the correct operands, and the result drives the PCSrcD signal. This signal in turn decides the output of the first mux in the diagram (which essentially chooses between PC+4 and the branch target). Hence it is safe and quick to do this in the DEC stage itself.
Also, none of the branch-related signals (PCSrcD, BranchD, PCBranchD) make it to the EX stage. If you look at the inputs to the ISS/EX register, it doesn't take in any of the above-mentioned signals. Hence, the information isn't passed on to the EX stage, and the branch is completely resolved and retired at the end of the DEC stage itself.

How does a zero register improve performance?

In the MIPS ISA, there's a zero register ($r0) which always gives a value of zero. This allows the processor to:
Discard the result of any instruction by directing the instruction's target to this register
Use it as a source of 0
It is said in this source that this improved the speed of the CPU. How does it improve performance? And why don't all ISAs adopt such a zero register?
$r0 is not general purpose. It is hardwired to 0. No matter what you
do to this register, it always has a value of 0. You might wonder why
such a register is needed in MIPS.
The designers of MIPS used benchmarks (programs used to determine the
performance of a CPU), which convinced them that having a register
hardwired to 0 would improve the performance (speed) of the CPU as
opposed to not having it. Not everyone agrees a register hardwired to
0 is essential, so not all ISAs have a zero register.
There are a few potential ways this can improve performance; it's not clear which ones apply to that particular processor, but I've listed them roughly in order from most to least likely.
It avoids spurious pipeline stalls. Without an explicit zero register, it's necessary to take a register, zero it out, and use its value. This means that the zero-using operation is dependent on the zeroing operation, and (depending on how powerful the pipeline forwarding system is) possibly on the zeroed register's previous value. Architectures like x86, which have quite small register files and basically virtualize their registers to keep that from causing problems, have extremely powerful hazard analysis tools. The same is not generally true of RISC processors.
Certain operations may be more pipelineable if they can avoid a register read. If an explicit zero register is used, the fact that the operand will be zero is known at the instruction decode stage, rather than later on in the register fetch stage. Thus, the register read stage can be skipped.
Similarly, the ability to explicitly discard results avoids the need for a register write stage.
Certain operations may generate simpler microcode when one of their operands is known to be zero, or when the result is known to be discarded.
An explicit zero register takes some pressure off the compiler's optimizer, as it doesn't need to be as careful with its register assignment (no need to identify a register which won't cause a stall on read or write).
For each of your items, here's an answer.
Consider instructions that compulsorily take a register for output, where you want to discard this output. Normally, you'd have to make sure that you have a free register available, and if not, push some of your current registers onto the stack, which is a costly operation. Evidently, it happens a lot that the output of an operation is discarded, and the easiest way to deal with this is to have an 'unused' register available.
Now that we have such an unused register, why not use it? It happens a lot that you want to zero-initialize something or compare something to zero. The long way is to first write zero to a register (which requires an extra instruction and the literal zero in your machine code, possibly a full 0x00000000, which is rather long) and then use it. So, one instruction shaved off, and a little bit of your program size as well.
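A few (made-up) MIPS examples of both uses:

        move  $t0, $zero          # zero-initialize: pseudo-instruction for  addu $t0, $zero, $zero
        beq   $t1, $zero, is_zero # compare against zero with no extra "load 0" instruction first
        sll   $zero, $zero, 0     # a write to $zero is simply discarded; this encoding is literally MIPS's nop
    is_zero:
        jr    $ra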
These optimizations may seem a bit trivial and may raise the question 'how much does that actually improve anything?' The answer here is that the operations described above are apparently used a lot on your MIPS processor.
The concept of a zero register is not new. I first encountered it on a CDC 6600 mainframe, which dates back to the mid-to-late 1960's. In some ways it was one of the first RISC processors, and was the world's fastest computer for 5 years. In that architecture, the "B0" register was hardwired to always be zero. http://en.wikipedia.org/wiki/CDC_6600
The benefit of such a register is primarily that it simplified the instruction set. When the decoding and orchestration of simple and regular instruction sets can be implemented without microcode, it increases performance. In addition, for the 6600, as for most LSI chips today, the time spent for a signal to travel the length of a "wire" becomes one of the key factors in execution speed, and keeping the instruction set simple (and avoiding microcode) requires fewer transistors and results in shorter circuit paths.
A zero register allows saving some opcodes when designing a new
instruction set architecture (ISA).
For example, the main RISC-V spec has 32 pseudo-instructions that
depend on the zero register (cf. Tables 26.2 and 26.3). A pseudo-instruction is an
instruction that is mapped by the assembler to another real
instruction (for example, branch-if-equal-to-zero is mapped to
branch-if-equal). For comparison: the main RISC-V spec lists 164
real instruction opcodes (i.e. counting RV(32|64)[IMAFD] base/extensions, a.k.a. RV64G). That means that without a zero register, RISC-V RV64G would occupy 32 more opcodes for those instructions (i.e. 20% more). For a concrete RISC-V CPU
implementation, this real-to-pseudo instruction ratio may shift in either direction
depending on which extensions are selected.
Having fewer opcodes simplifies the instruction decoder.
A more complex decoder needs more time for decoding instructions
or occupies more gates (that can't be used for more useful CPU units)
or both.
Existing, incrementally developed ISAs have to deal with
backwards-compatibility. Thus, if your original ISA design
doesn't include a zero register, you can't just add it in a later
revision without breaking compatibility. Also, if your existing
ISA already requires a very complex decoder, then adding a zero
register doesn't pay off.
Besides the modern RISC-V ISA (developed since 2010, first ratified in 2019), ARMv8 AArch64 (a 64-bit ISA released in 2011) also features a zero register, in contrast to the previous 32-bit ARM ISAs. Because of this and other changes, the AArch64 ISA has much less in common with the previous 32-bit ARM ISAs than - say - x86-64 has with x86.
In contrast to AArch64, x86-64 doesn't have a zero register. Although x86-64 is more modern than the previous 32-bit x86 ISA, it only changed incrementally. Thus, it features all the existing x86 opcodes plus 64-bit variants, and its decoder is already very complex.