Why does MIPS use one delay slot instead of two?

This seems to be the case in many RISC architectures. Since filling one delay slot saves us 50% of otherwise wasted cycles, why not give the programmer a chance to use both slots?

On the MIPS R2000, the classic MIPS I implementation that the ISA is designed around, 1 branch-delay slot is enough to hide branch latency; see How does MIPS I forward from EX to ID for branches without stalling?
Being able to check branch conditions with low latency (half a clock cycle) is why MIPS conditional branches are limited to eq/ne and/or checking the sign bit for x<0, not an arbitrary x<y.
On Is that true if we can always fill the delay slot there is no need for branch prediction?, Paul Clayton answers that yes, filling the branch-delay slot in asm makes branch prediction useless on early MIPS. So that's more evidence that the real commercial MIPS R2000 worked that way. You only need branch prediction if you lengthen the pipeline, as they did for the next-gen MIPS R4000. And then you have branch prediction instead of expecting compilers to statically fill the branch latency.
But anyway, one branch-delay slot is hard enough for compilers to fill reliably, and in hindsight is baggage for future implementations that use branch prediction instead of depending on a delay slot to hide branch latency. Especially superscalar implementations.
Making it architectural burdens all future CPUs with it, if they want to be binary-compatible. So even if early MIPS CPUs did have 2 cycles of branch latency, some foresight and cost/benefit consideration makes it pretty reasonable to expose only 1 cycle architecturally.
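For concreteness, here is a minimal sketch of filling the branch-delay slot (register names and the target label are placeholders): an independent instruction from before the branch is moved into the slot, where it executes whether or not the branch is taken.

    # delay slot left unfilled: a nop burns the cycle
    addiu $t0, $t0, 1          # independent work done before the branch
    beq   $t1, $t2, target
    nop                        # branch-delay slot wasted

    # delay slot filled: the independent work moves into the slot
    beq   $t1, $t2, target
    addiu $t0, $t0, 1          # executes on both the taken and not-taken paths

Either way the instruction after the branch executes; the difference is whether that slot does useful work or wastes a cycle on a nop.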

Related

MIPS and interlocked stages

I'm studying the MIPS architecture, but I really don't understand some concepts.
For example, the "without interlocked pipelined stages".
I've also read that only the first implementations of MIPS didn't have it (interlocked pipelined stages), but I couldn't find out in which one it was introduced. Can someone tell me in which one it was introduced?
I want to focus on "interlocked stages". I understand this concept to mean, for example, adding explicit nop operations that delay the execution of an instruction (for example a branch). Is that true? Or is my conclusion poor?
Moreover, if the first version of MIPS didn't have the "interlocked stages", how did it manage branch prediction?
Thank you all in advance!
adding explicit nop operations that delay the execution of an instruction
Rather than "adding a nop", the instruction in a delay slot is executed with the understanding that the architectural state is delayed — that the effect of the immediately prior instruction won't be see by the instruction in the delay slot.
It goes to having software do the "interlocking" instead of the hardware.
When the hardware doesn't interlock, delay slots are exposed to software, and so software must accommodate them, by shuffling code or, if it can't find anything useful to do in the delay slot, by filling that delay slot with a nop instruction.
See https://en.m.wikipedia.org/wiki/Delay_slot :
A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem.
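As a rough sketch of what that looked like in MIPS I code (registers and offsets are just for illustration), the instruction right after a lw must not consume the loaded value, so the compiler either finds independent work for the slot or drops in a nop:

    lw    $t0, 0($a0)          # load
    nop                        # MIPS I load delay slot: $t0 is not visible here yet
    addu  $t1, $t0, $t2        # first instruction that may use the loaded value

    lw    $t0, 0($a0)
    addiu $a0, $a0, 4          # independent work fills the slot instead of a nop
    addu  $t1, $t0, $t2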
From https://en.wikipedia.org/wiki/MIPS_architecture#MIPS_II :
MIPS II removed the load delay slot
Although not totally clear from those Wikipedia articles, the branch delay slot apparently slowly disappeared: first MIPS II provided versions of branches that only execute the delay-slot instruction on a taken branch, then MIPS32 added "a new family of branches with no delay slot".
In summary, delay slots have been abandoned, as they really only worked for specific microarchitectures; as the technology evolved, those design features became challenges rather than offering their original benefit.
Moreover, if the first version of MIPS didn't have the "interlocked stages", how did it manage branch prediction?
As I understand it, MIPS I didn't really do dynamic branch prediction; rather, it simply assumed the branch was not taken. However, by delaying the execution of the branch by one instruction, it reduced the cost of assuming not-taken when the branch was actually taken.  It also supported only very simple conditional branch instructions (e.g. equal or not equal), so that the taken/not-taken computation could be executed earlier in the pipeline, perhaps as early as the ID stage.
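To make "very simple conditional branch instructions" concrete, these are the MIPS I conditional branches (registers and the label are placeholders); each one tests two registers for equality or a single register against zero / its sign bit, never an arbitrary register-vs-register less-than:

    beq   $t0, $t1, L1         # equal
    bne   $t0, $t1, L1         # not equal
    bltz  $t0, L1              # sign bit set:   $t0 <  0
    bgez  $t0, L1              # sign bit clear: $t0 >= 0
    blez  $t0, L1              # $t0 <= 0
    bgtz  $t0, L1              # $t0 >  0
L1:

A general register-vs-register comparison like blt $t0, $t1, L1 exists only as a pseudo-instruction (slt followed by bne).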

In which pipeline stage is the branch decision made?

In which RISC pipeline stage is the branch decision made? Is it in the "Decode" stage, the "Execute" stage, or another stage? Assume the pipeline has 5 stages: "IF", "ID", "EX", "MEM" and "WB".
There are a few ways to implement this in a classic 5-stage RISC in general. For unconditional direct (not register) branches, obviously you can detect them in ID and have the target PC ready for the next IF cycle (with 1 cycle of branch latency, i.e. 1 wasted IF cycle if you don't hide that latency somehow, e.g. MIPS's branch delay slot or branch prediction).
Some toy pipelines, like the one described in this answer, do the simplest thing and evaluate branches in the ALU in EX, forwarding the result to a mux that selects between PC+4 and PC+4+rel_offset and eventually on to IF, with 3 cycles of branch latency (end of EX to start of IF).
Actual commercial MIPS I (R2000) evaluated branch conditions in the first half-cycle of EX, forwarding to IF which only needed an address in the second half-cycle. See How does MIPS I handle branching on the previous ALU instruction without stalling? This gives a branch latency of 1 cycle, short enough to be fully hidden by 1 branch-delay slot, even for conditional or indirect jr $reg branches.
This half-cycle speed is why MIPS branch conditions are simple, only checking whether a whole register is zero or not, or checking the MSB (sign bit). Simple RISCs with a FLAGS / status register (like PowerPC or ARM) could use a similar strategy of very quickly checking a flags condition.
(Note that RISC-V allows a full set of branch conditions; as described in RISC-V's design rationale, checking a whole register for all-zeros in modern CMOS designs is apparently not much shorter gate-delay than comparing two registers for equality or even > or < with a good comparator, presumably something smarter than subtract with ripple-carry.
RISC-V assumes branch-prediction will hide branch delays.)
The previous version of this answer incorrectly claimed that MIPS I evaluated branch conditions in ID itself. A toy pipeline in this question does that, but it requires the inputs to be ready earlier than usual. It introduces the problem of a b?? instruction stalling while waiting for the EX result of the previous ALU instruction, as in common sequences like slt $at, $t1, $t2 / bnez $at, target, i.e. the expansion of a pseudo-instruction like blt $t1, $t2, target.
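Written out (a sketch; the target label is a placeholder), that expansion looks like this. The point is that bne consumes the slt result produced in EX, which works with a condition check early in EX plus forwarding, but would force a stall if the condition had to be ready in ID:

    blt   $t1, $t2, target     # pseudo-instruction the programmer writes

    # assembler expansion, using the assembler temporary $at:
    slt   $at, $t1, $t2        # $at = ($t1 < $t2) ? 1 : 0, produced in EX
    bne   $at, $zero, target   # needs $at: fine with a late-EX condition check, stalls if checked in ID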
Wikipedia's Classic RISC (5-stage pipeline) article's Instruction Decode section was misleading at best, but has been fixed. It now says "The branch condition is computed in the following cycle (after the register file is read)". I think that was a bugfix, not just a clarification: the branch discussion was all in the ID section, implying it happened there, without explicit phrasing to the contrary. Also, the still-present claim that "Some architectures made use of the Arithmetic logic unit (ALU) in the Execute stage, at the cost of slightly decreased instruction throughput" only makes sense as a contrast with evaluating conditions earlier than EX, since nothing else could be using the ALU during that time in a scalar in-order pipeline.
Other sources (like these slides: http://home.deib.polimi.it/santambr/dida/phd/wonderland/2014/doc/PDF/4_BranchHazard_StaticPrediction_V0.pdf) say "Branch Outcome and Branch Target Address are ready at the end of the EX stage (3th stage)" for a classic MIPS beq instruction. That's not how the commercial R2000 worked, but it may be describing a simplified MIPS implementation from a textbook or course material that does work that way.
Much discussion of MIPS is actually about hypothetical MIPS-like 5-stage RISC pipelines in general, not real MIPS R2000, or the classic Stanford MIPS CPU that R2000 was based on (but it was a full re-design). So it's hard to know whether something you find about "MIPS" applies to R2000 (gcc -march=mips1) or if it's for a simplified teaching version of MIPS.
Some "MIPS" implementations aren't even the same ISA, e.g. without branch-delay slots (which complicate exception handling significantly).
This originally wasn't a MIPS question at all, just generic classic
5-stage RISC. There were multiple early RISC ISAs, many of them originally designed around a 5-stage pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline). I don't know a lot about their internals:
Different architectures could make different choices, e.g. stall or use branch prediction + speculative fetch/decode if needed while they wait for the branch result to be ready from whatever stage produces it.
Even speculative execution is possible, with a static prediction like forward not-taken / backward taken. If the pipeline is still in-order, mis-speculation can be caught before it reaches write-back or MEM. You don't want any speculative stores written to cache, but you can definitely catch mis-speculation by the time the branch reaches EX. All instructions with a control dependency on the branch are younger and therefore in earlier pipeline stages (if present at all; IF could have missed in I-cache).

Is that true if we can always fill the delay slot there is no need for branch prediction?

I'm looking at the five-stage MIPS pipeline (IF, ID, EX, MEM, WB) in H&P 3rd ed., and it seems to me that the branch decision is resolved in the ID stage, so that while the branch instruction is in its EX stage, the second instruction after the branch can be fetched correctly. But this still leaves the problem of possibly wasting the 1st instruction right after the branch instruction.
I also encountered the concept of the branch delay slot, which means you want to fill the 1st slot right after the branch with something useful as well as "harmless", so that whether the branch is taken or not, that instruction executes as desired and the 1st slot after the branch is not wasted.
My question is, first of all, is my above understanding correct? If it is, then the confusion comes from the concept of branch prediction, which seems to be trying to fill the first slot with an instruction from the predicted place the program is going to. But if we can always find some instruction to fill the branch delay slot, we would not need branch prediction, right?
For the classic MIPS (R2000) pipeline, the branch delay slot makes branch prediction useless as you perceive. (Technically, a design could combine a predictor/indicator of whether the delay slot instruction is a nop with a branch predictor. This would allow the nop to be skipped, modestly improving performance on a correct branch prediction.)
However, processor pipelines are often long and wide enough (and branch condition evaluation sufficiently delayed) that a single delay slot is not sufficient to fill the delay between when the post-branch instruction address is needed and the branch direction and target are known.
For example, a follow-on processor, the MIPS R4000, significantly lengthened the pipeline and as a result could not determine the location of the post-branch instruction early enough. The designers chose to use a simple static predict not-taken strategy.
If one did not care about binary compatibility, one could add more delay slots. However, finding useful instructions to fill such slots increases in difficulty as the number of slots increases. For certain loop-rich code, regularly filling two delay slots might be practical, and I think at least one DSP had two delay slots.
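As a purely hypothetical sketch (no real MIPS ever had two branch-delay slots; the registers are placeholders), a simple copy loop shows why a second slot is harder to fill: one slot can usually take an independent pointer increment, but often nothing independent is left for a second slot except a nop:

    loop:
        lw    $t0, 0($a0)
        addiu $a0, $a0, 4          # independent work (also keeps $t0 out of the load delay slot)
        sw    $t0, 0($a1)
        bne   $a0, $a2, loop
        addiu $a1, $a1, 4          # delay slot 1: useful on both paths
        nop                        # hypothetical delay slot 2: nothing independent remains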
Branch prediction can also be used to decouple fetch from execution so that even if the condition cannot be evaluated (e.g., depending on the result of a high latency operation such as a data cache miss or a division), fetch can continue. Such decoupling could be used to generate instruction cache misses early (hiding some of their latency) and to reduce the impact of variable throughput at different stages (so an earlier stage can continue operating with maximum throughput when a later stage stalls or has reduced throughput and the buffered instructions can then hide later stalls or reduced throughput in the earlier stage).
The fact is that the compiler may not always find an instruction to fill the delay slot.
What is more, branch behavior is highly predictable.
Before the IF stage you don't even know whether it is a branch instruction (you have to fetch it from instruction memory first).
Within a MIPS core like that, with zero-wait-state randomly accessed RAM, sure. But depending on how fetching is implemented and on the caching behind it, you may still want/need the concept of branch prediction to start those fetches earlier. The pipeline is just a small part of a bigger system. System buses are usually not single-cycle "here is my address, I want my data by the end of this cycle"; there are address buses and data buses and tags that cross them, so you can have multiple transactions in flight at the same time, like a pipeline, trying to optimize the bandwidth of the data bus, knowing the peripherals and memory on the far side are too slow for that bus.
Prediction "could" be used to assist these other features in getting instructions into the pipe faster or more efficiently.
In an academic sense, though, the idea of the slot is to give the pipe a cycle to switch gears onto another execution path. It only actually saves you anything if the incoming end of the pipe can be fed any random thing it wants every clock cycle, which isn't the real world.
Another academic solution is the ARM one of conditional execution on every instruction: you can construct execution sequences that keep the pipe full without having to flush or stall, again only so long as whatever feeds the pipe can keep up. ARM dropped the conditional-execution idea in its new 64-bit instruction set, and on some/newer MIPS you can disable the branch shadow/delay slot.

How does a zero register improve performance?

In the MIPS ISA, there's a zero register ($r0) which always gives a value of zero. This allows the processor to:
Direct the target of any instruction whose result is to be discarded to this register
Use it as a source of 0
It is said in this source that this improved the speed of the CPU. How does it improve performance? And what are the reasons why not all ISAs adopt such a zero register?
$r0 is not general purpose. It is hardwired to 0. No matter what you
do to this register, it always has a value of 0. You might wonder why
such a register is needed in MIPS.
The designers of MIPS used benchmarks (programs used to determine the
performance of a CPU), which convinced them that having a register
hardwired to 0 would improve the performance (speed) of the CPU as
opposed to not having it. Not everyone agrees a register hardwired to
0 is essential, so not all ISAs have a zero register.
There are a few potential ways this can improve performance; it's not clear which ones apply to that particular processor, but I've listed them roughly in order from most to least likely. A short MIPS assembly sketch follows the list.
It avoids spurious pipeline stalls. Without an explicit zero register, it's necessary to take a register, zero it out, and use its value. This means that the zero-using operation is dependent on the zeroing operation, and (depending on how powerful the pipeline forwarding system is) possibly on the zeroed register's previous value. Architectures like x86, which have quite small register files and basically virtualize their registers to keep that from causing problems, have extremely powerful hazard analysis tools. The same is not generally true of RISC processors.
Certain operations may be more pipelineable if they can avoid a register read. If an explicit zero register is used, the fact that the operand will be zero is known at the instruction decode stage, rather than later on in the register fetch stage. Thus, the register read stage can be skipped.
Similarly, the ability to explicitly discard results avoids the need for a register write stage.
Certain operations may generate simpler microcode when one of their operands is known to be zero, or when the result is known to be discarded.
An explicit zero register takes some pressure off the compiler's optimizer, as it doesn't need to be as careful with its register assignment (no need to identify a register which won't cause a stall on read or write).
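A minimal MIPS-flavored sketch of those ideas (register names are arbitrary): the zero register gives a free source of 0 with no prior zeroing instruction to depend on, a free "discard" destination, and a cheap compare against zero:

    addu  $t0, $zero, $zero    # clear a register: 0 + 0, no constant or prior instruction needed
    beq   $t2, $zero, skip     # compare against zero without first materializing a 0
    addu  $zero, $t3, $t4      # compute something and throw the result away (writes to $zero are ignored)
skip: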
For each of your items, here's an answer.
Consider instructions that necessarily take a register for output, where you want to discard this output. Normally, you'd have to make sure that you have a free register available, and if not, push some of your current registers onto the stack, which is a costly operation. Evidently, it happens a lot that the output of operations is discarded, and the easiest way to deal with this is to have an 'unused' register available.
Now that we have such an unused register, why not use it? It happens a lot that you want to zero-initialize something or compare something to zero. The long way is to first write zero to a register (which requires an extra instruction and a literal zero in your machine code, possibly a full 0x00000000, which is rather long) and then use it. So that's one instruction shaved off, and a little bit of your program size as well.
These optimizations may seem a bit trivial and may raise the question 'how much does that actually improve anything?' The answer here is that the operations described above are apparently used a lot on your MIPS processor.
The concept of a zero register is not new. I first encountered it on a CDC 6600 mainframe, which dates back to the mid-to-late 1960's. In some ways it was one of the first RISC processors, and was the world's fastest computer for 5 years. In that architecture, the "B0" register was hardwired to always be zero. http://en.wikipedia.org/wiki/CDC_6600
The benefit of such a register is primarily that it simplified the instruction set. When the decoding and orchestration of simple and regular instruction sets can be implemented without microcode, it increases performance. In addition, for the 6600, as for most LSI chips today, the time it takes a signal to travel the length of a "wire" becomes one of the key factors in execution speed, and keeping the instruction set simple (and avoiding microcode) allows fewer transistors and results in shorter circuit paths.
A zero register allows saving some opcodes when designing a new
instruction set architecture (ISA).
For example, the main RISC-V spec has 32 pseudo-instructions that
depend on the zero register (cf. Tables 26.2 and 26.3). A pseudo-instruction is an
instruction that is mapped by the assembler to another real
instruction (for example, branch-if-equal-to-zero is mapped to
branch-if-equal). For comparison: the main RISC-V spec lists 164 real instruction opcodes (i.e. counting the RV(32|64)[IMAFD] base/extensions, a.k.a. RV64G). That means that without a zero register, RISC-V RV64G would need 32 more opcodes for those instructions (i.e. 20% more). For a concrete RISC-V CPU implementation, this real-to-pseudo instruction ratio may shift in either direction depending on which extensions are selected.
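The same effect is visible in MIPS assemblers, to pick an example closer to the rest of this page: many familiar pseudo-instructions are synthesized on top of $zero rather than consuming their own opcodes. A few typical expansions (the exact real instruction chosen can vary between assemblers):

    move  $t0, $t1             # ->  addu  $t0, $zero, $t1
    li    $t0, 42              # ->  addiu $t0, $zero, 42      (small immediates)
    negu  $t0, $t1             # ->  subu  $t0, $zero, $t1
    not   $t0, $t1             # ->  nor   $t0, $t1, $zero
    beqz  $t0, L2              # ->  beq   $t0, $zero, L2
    b     L2                   # ->  beq   $zero, $zero, L2
L2: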
Having fewer opcodes simplifies the instruction decoder.
A more complex decoder needs more time for decoding instructions
or occupies more gates (that can't be used for more useful CPU units)
or both.
Existing, incrementally developed ISAs have to deal with
backwards-compatibility. Thus, if your original ISA design
doesn't include a zero register, you can't just add it in a later
revision without breaking compatibility. Also, if your existing
ISA already requires a very complex decoder, adding a zero register later doesn't pay off.
Besides the modern RISC-V ISA (developed since 2010, first ratification in 2019), ARMv8 AArch64 (a 64-bit ISA released in 2011), in contrast to the previous 32-bit ARM ISAs, also features a zero register. Because of this and other changes, the AArch64 ISA has much less in common with the previous 32-bit ARM ISAs than, say, x86-64 has with x86.
In contrast to AArch64, x86-64 doesn't have a zero register. Although x86-64 is more modern than the previous 32-bit x86 ISA, its ISA only changed incrementally. Thus, it features all the existing x86 opcodes plus 64-bit variants, and so the decoder is already very complex.

MIPS Architecture : NOP (No-Operation) Vs Data Forwarding in Hazard Prevention

I learnt in a computer architecture course that a data hazard can be prevented by inserting several independent nop instructions between two mutually dependent instructions. This can be done at the assembly level by the compiler.
The alternative way to avoid data hazard is to use data forwarding.
I am a bit confused about how these two alternatives differ as far as performance, speed, and hardware are concerned, because as far as I know data forwarding is implemented at the hardware level, whereas nops can be inserted at the assembly level.
Can anybody please explain which approach is better if we consider factors such as performance, speed, hardware, etc.?
Thanks.
Obviously, having the compiler insert nops into the code stream to fill pipeline slots allows hardware to be simplified which can reduce the duration of a pipeline stage or the depth of the pipeline, reduce design effort (time to market, project risk, design cost), or allow a full processor core to fit on a single chip (which helps performance). However, this benefit is tiny compared to the loss of performance from not using forwarding. Higher latency for dependent instructions is very bad for typical programs.
The MIPS R2000, which had both delayed branches and delayed loads, provided result forwarding. (MIPS is an acronym for "Microprocessor without Interlocked Pipeline Stages"). Delayed loads were soon removed from MIPS (which was possible because such did not affect binary compatibility of correct code). The use of delayed instructions was partially from a belief that most delay slots could be filled by the compiler with useful instructions and partially from believing that the increase in code size was not important relative to the simplification of hardware.
Reducing the latency of a load operation was not practical, so the pipeline would need to be stalled for a cycle anyway. The cost of a nop is in cache and memory capacity effects (i.e., the effect of lower code density), and in some cases a single load delay slot could be filled.
Exposing the pipeline organization also has implications for binary compatibility. Later binary compatible implementations must accommodate the ISA designed for the original pipeline organization. A single delayed branch slot works reasonably well for a simple 5-stage scalar implementation (it can be filled with a useful instruction most of the time and allows zero-effective-delay branches [i.e., no stall to resolve the branch or prediction and flushing the pipeline on misprediction]), but when the pipeline is deepened (or made wider) prediction or stalling becomes necessary anyway.
If sufficient parallelism exists in the targeted workloads, hardware simplicity is sufficiently important, and binary compatibility is not a problem, then exposing a pipeline with minimal support for dynamically detecting and handling stall conditions may be sensible. (There are also ways of encoding nops that avoid most of the code size expansion issues.) Having reliably sufficient parallelism (whether instruction-level or thread-level) allows the avoiding of nops; by compiler scheduling with instruction-level parallelism or by hardware thread interleaving with thread-level parallelism.
Hardware simplicity tends to reduce energy per unit of work (as well as chip area), and many modern designs are limited by power use. It also makes sense to perform optimizations at compile time (when they are less latency critical and can be done once rather than each time the code is executed) if the storage and communication cost of additional information is not too expensive (assuming information necessary to perform the optimization is available at compile time [dynamic branch prediction is a classic example of where dynamic information is helpful]).
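To make the nop-vs-forwarding comparison concrete, here is a sketch for a classic 5-stage pipeline (register names are arbitrary; the nop count assumes the usual write-in-the-first-half / read-in-the-second-half register file and no hardware interlock):

    # no forwarding, no interlock: software must pad the dependency
    add   $t2, $t0, $t1
    nop                        # wait for $t2 to travel down the pipeline
    nop                        # ...until the add reaches writeback
    sub   $t4, $t2, $t3        # now reads the updated $t2 from the register file

    # with EX->EX forwarding, no padding is needed
    add   $t2, $t0, $t1
    sub   $t4, $t2, $t3        # the add result is forwarded straight into this EX stage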
Well, basically, since the hardware is optimized with forwarding, you might expect there to be no need for explicitly inserted software NOPs. But that's not the case.
Although forwarding does help reduce data hazards, some hazards cannot be dealt with by forwarding at all. It just isn't possible.
For example:
beq R1,R5,label
instruction 2nd
Here the second instruction cannot proceed until the first instruction has completed its execute stage and decided whether or not to branch. Until then, the second instruction has to be stalled (for 2 clock cycles). In software, this is done by emitting NOPs.
With improvements in technology and hardware optimizations, the branch decision for beq can be made as early as the register fetch/decode stage, by putting a comparator in the decode stage itself. Even so, the second instruction will still be stalled (now for 1 clock cycle). Again, a NOP is needed.
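Spelled out as a sketch (register names follow the example above), that is the difference between resolving the branch in EX and resolving it in ID:

    # branch resolved in EX: two bubbles after the branch
    beq   R1, R5, label
    nop                        # bubble 1
    nop                        # bubble 2
    # "instruction 2nd" finally fetched from the correct path

    # comparator moved into ID: only one bubble remains
    beq   R1, R5, label
    nop                        # bubble 1 (on real MIPS this is the architectural branch-delay slot)
    # "instruction 2nd"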