Query about MIPS R3051 pipeline behaviour (MIPS-I architecture) - mips

I am currently implementing a MIPS R3051 in software as part of my university project.
I notice the programmer's manual from IDT specifies that computational instructions can access the results of other computational instructions ahead of them in the pipeline at their RD stage, even though the earlier instruction has not yet committed its result to the relevant register in the WB stage. This is done via "special logic within the execution engine" so that a stall is not necessary.
My query is: does this also apply to non-computational instructions (a jump-type instruction, for example)?
An example: if an ADD instruction calculates a value at its ALU stage destined for r1, with a JR [r1] instruction behind it in the pipeline at RD, will the JR instruction get:
(a) the old contents of r1
or
(b) will this "special logic" allow the new value of r1 to be forwarded to it? or
(c) will the pipeline stall until r1 has been committed properly at WB?
Apologies if this is asked elsewhere (I have not spotted it). Many thanks.
Regards,
Phil

The key here is to keep in mind that this "special logic" is only an optimization: it makes things faster, here by bypassing a result to avoid a stall, but it must still ensure that the outcome is unchanged. Otherwise it would be impossible, or at least too difficult, to program this hardware.
So, to answer your question, you will see either case (b) or (c) but never case (a).
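As a concrete illustration (hypothetical sequence, arbitrary register choices) of what that means in practice:

    # An ALU result destined for $t0 is consumed by a jr one instruction later.
    # Per the above, the jr sees the new $t0, either forwarded (case b) or after
    # an interlock stall (case c) - never the stale value (case a).
    la    $t0, target        # assembler pseudo-instruction: $t0 = address of target
    addiu $t0, $t0, 0        # computational instruction writing $t0
    jr    $t0                # jump through $t0
    nop                      # branch-delay slot
target:
    nop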

Related

MIPS and interlocked stages

I'm studying the MIPS architecture, but I really don't understand some concepts.
For example, the phrase "without interlocked pipelined stages".
I've also read that only the first implementations of MIPS didn't have interlocked pipeline stages, but I couldn't find out which one introduced them. Can someone tell me in which one they were introduced?
I want to focus on "interlocked stages". I understand this concept to mean, for example, adding explicit nop operations that delay the execution of an instruction (for example a branch). Is that true, or is my conclusion poor?
Moreover, if the first version of MIPS didn't have "interlocked stages", how did it manage branch prediction?
Thank you all in advance!
adding explicit nop operations that delay the execution of an instruction
Rather than "adding a nop", the instruction in a delay slot is executed with the understanding that the architectural state is delayed — that the effect of the immediately prior instruction won't be see by the instruction in the delay slot.
It goes to having software do the "interlocking" instead of the hardware.
When the hardware doesn't interlock, delay slots are exposed to software and so software must accommodate, by shuffling code, or if it can't find anything useful to do in the delay stot, then filling that delay slot with a nop instruction.
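For instance (register usage made up for illustration), a counted summing loop on MIPS I where the pointer increment has been moved into the branch-delay slot instead of padding with a nop:

    # Sum $a1 words starting at $a0 into $v0.  The instruction after the bne
    # sits in the branch-delay slot and executes whether or not the branch
    # is taken, so useful work is placed there rather than a nop.
loop:
    lw    $t0, 0($a0)        # load next word ($t0 is not used in the next slot)
    addiu $a1, $a1, -1       # decrement word count
    addu  $v0, $v0, $t0      # accumulate
    bne   $a1, $zero, loop   # loop while count != 0
    addiu $a0, $a0, 4        # branch-delay slot: advance pointer (always executes)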
See https://en.m.wikipedia.org/wiki/Delay_slot :
A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem.
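Concretely, on MIPS I that means a sequence like the following hypothetical one, where the instruction in the load delay slot must not read the register being loaded:

    lw    $t0, 0($a0)        # load from memory
    addiu $a0, $a0, 4        # load delay slot: independent instruction (or a nop)
    addu  $v0, $v0, $t0      # earliest instruction allowed to use the loaded $t0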
From https://en.wikipedia.org/wiki/MIPS_architecture#MIPS_II :
MIPS II removed the load delay slot
Although not totally clear from those Wikipedia articles, the branch delay slot apparently disappeared slowly: first MIPS II provided "branch likely" versions of branches that executed the delay slot instruction only on a taken branch, and then MIPS32r6 added "a new family of branches with no delay slot".
In summary, delay slots have been abandoned, as they really only worked for specific microarchitectures; as the technology evolved, those design features became challenges rather than offering their original benefit.
Moreover, if the first version of MIPS didn't have "interlocked stages", how did it manage branch prediction?
As I understand it, MIPS I didn't really do dynamic branch prediction; rather, it simply assumed branches were not taken. However, by delaying the effect of a branch by one instruction, it reduced the cost of assuming not-taken when the branch was actually taken. It also supported only very simple conditional branch instructions (e.g. equal or not equal), so that the taken/not-taken decision could be computed early in the pipeline, perhaps as early as the ID stage.

In which pipeline stage is the branch decision made?

In which RISC pipeline stage is the branch decision made? Is it in the "Decode" stage, the "Execute" stage, or another stage? Assume the pipeline has 5 stages: "IF", "ID", "EX", "MEM" and "WB".
There are a few ways to implement this in a classic 5-stage RISC in general. For unconditional direct (not register) branches, obviously you can detect them in ID and have the target PC ready for the next IF cycle (with 1 cycle of branch latency, i.e. 1 wasted IF cycle if you don't hide that latency somehow, e.g. MIPS's branch delay slot or branch prediction).
Some toy pipelines, like the one described in this answer, do the simplest thing and evaluate branches in the ALU in EX, forwarding to a mux between PC+4 and PC+4+rel_offset and eventually on to IF, with 3 cycles of branch latency (end of EX to start of IF).
Actual commercial MIPS I (R2000) evaluated branch conditions in the first half-cycle of EX, forwarding to IF which only needed an address in the second half-cycle. See How does MIPS I handle branching on the previous ALU instruction without stalling? This gives a branch latency of 1 cycle, short enough to be fully hidden by 1 branch-delay slot, even for conditional or indirect jr $reg branches.
This half-cycle speed is why MIPS branch conditions are simple, only checking a whole register for zero / non-zero, or checking the MSB (sign bit). Simple RISCs with a FLAGS / status register (like PowerPC or ARM) could use a similar strategy of very quickly checking a flags condition.
(Note that RISC-V allows a full set of branch conditions; as described in RISC-V's design rationale, checking a whole register for all-zeros in modern CMOS designs apparently has not much shorter gate-delay than comparing two registers for equality, or even > or <, with a good comparator, presumably something smarter than subtract with ripple-carry.
RISC-V assumes branch-prediction will hide branch delays.)
The previous version of this answer incorrectly claimed that MIPS I evaluated branch conditions in ID itself. A toy pipeline in this question does that, but that would require the inputs to be ready earlier than usual. It introduces the problem of a b?? instruction stalling while waiting for the EX result of the previous ALU instruction, like in common sequences like slt $at, $t1, $t2 / bnez $at, target, i.e. the expansion of a pseudo-instruction like blt $t1, $t2.
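For reference, that expansion looks like this (the assembler performs it automatically; bnez is itself shorthand for bne against $zero):

    # blt $t1, $t2, target  -- pseudo-instruction, expanded by the assembler to:
    slt  $at, $t1, $t2       # $at = ($t1 < $t2) ? 1 : 0
    bne  $at, $zero, target  # i.e. bnez $at, target; consumes the slt result immediately
    nop                      # branch-delay slot (if nothing useful can be scheduled here)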
Wikipedia's Classic RISC (5-stage pipeline) article's Instruction Decode section was misleading at best, but has been fixed. It now says "The branch condition is computed in the following cycle (after the register file is read)" - I think that was a bugfix, not just clarification: this is all described in the ID section, implying it happened there without explicit phrasing to the contrary. Also, the still-present claim that "Some architectures made use of the Arithmetic logic unit (ALU) in the Execute stage, at the cost of slightly decreased instruction throughput." makes no sense if it wasn't talking about evaluating them earlier, since nothing else could be using the ALU during that time in a scalar in-order pipeline.
Other sources (like these slides: http://home.deib.polimi.it/santambr/dida/phd/wonderland/2014/doc/PDF/4_BranchHazard_StaticPrediction_V0.pdf) say "Branch Outcome and Branch Target Address are ready at the end of the EX stage (3th stage)" for a classic MIPS beq instruction. That's not how the commercial R2000 worked, but it may be describing a simple MIPS implementation from a textbook or course material that does work that way.
Much discussion of MIPS is actually about hypothetical MIPS-like 5-stage RISC pipelines in general, not real MIPS R2000, or the classic Stanford MIPS CPU that R2000 was based on (but it was a full re-design). So it's hard to know whether something you find about "MIPS" applies to R2000 (gcc -march=mips1) or if it's for a simplified teaching version of MIPS.
Some "MIPS" implementations aren't even the same ISA, e.g. without branch-delay slots (which complicate exception handling significantly).
This originally wasn't a MIPS question at all, just generic classic
5-stage RISC. There were multiple early RISC ISAs, many of them originally designed around a 5-stage pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline). I don't know a lot about their internals:
Different architectures could make different choices, e.g. stall or use branch prediction + speculative fetch/decode if needed while they wait for the branch result to be ready from whatever stage produces it.
And even speculative execution is possible, even with a static prediction like forward not-taken / backward taken. If still in-order, mis-speculation can be caught before it reaches write-back or MEM. You don't want any speculative stores written to cache, but you can definitely catch it by the time the branch reaches EX. All instructions which have a control dependency on the branch are younger and therefore are in earlier pipeline stages (if present at all; IF could have missed in I-cache).

MIPS: why is ISR surrounded with rdpgpr $sp, $sp; wrpgpr $sp, $sp instructions?

I'm working with PIC32 MCUs (MIPS M4K core), and I'm trying to understand how interrupts work in MIPS; I'm armed with the "See MIPS Run" book, the official MIPS reference and Google. None of them has helped me understand the following:
I have an interrupt handler declared like this:
void __ISR(_CORE_TIMER_VECTOR) my_int_handler(void)
I look at the disassembly, and I see that RDPGPR SP, SP is executed in the ISR prologue (as the first instruction, actually); and the balancing WRPGPR SP, SP instruction is executed in the ISR epilogue (before writing the previously-saved Status register back to CP0 and executing ERET).
I see that the purpose of these instructions is to read from and write to the previous shadow register set, so RDPGPR SP, SP reads $sp from the previous shadow register set and WRPGPR SP, SP writes it back, but I can't understand the reason for this. This ISR is not intended to use a shadow register set, and indeed in the disassembly I see that context is saved to the stack. But, for some reason, $sp is read from and written to the shadow $sp. Why is this?
And, a related question: is there some really comprehensive resource (a book, or something) on MIPS assembly language? "See MIPS Run" seems really good and is a great starting point for me to dig into the MIPS architecture, but it does not cover several topics well enough; several things off the top of my head:
Very little information about EIC (external interrupt controller) mode: it has the diagram of the Cause register that shows that in EIC mode we have RIPL instead of IP7-2, but there is nothing about how it works (say, that an interrupt is raised only if Cause->RIPL is greater than Status->IPL). There's not even an explanation of what RIPL means ("Requested Interrupt Priority Level"; well, Google helped). I understand that EIC is implementation-dependent, but the things I just mentioned are generic.
Assembly language is not covered completely enough: say, nothing about macros (the .macro and .endm directives), and I couldn't find anything about some assembler directives I've seen in existing code, say, .set mips32r2, and so on.
I can't find anything about using rdpgpr/wrpgpr in an ISR; it covers these instructions (and shadow register sets in general) very briefly.
The official MIPS reference doesn't help much on these topics either. Is there a really good book that covers all possible assembler directives, and so on?
When the MIPS core enters an ISR it can swap the interrupted code's active register set for a new one (there can be several different shadow register sets), specific to that interrupt priority.
Usually interrupt routines don't have a stack of their own, and because the just-switched-in shadow register set will have an sp register holding a different value than the interrupted code's, the ISR copies the sp value from the just-switched-out shadow register set into its own, so that it can use the interrupted code's stack.
If you wish, you could set your ISR's stack to a previously allocated stack of its own, but that is usually not useful.
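A rough, hand-written sketch of the prologue/epilogue shape being described (simplified; the register saves and CP0 bookkeeping a real compiler-generated ISR emits will differ):

    my_int_handler:
        rdpgpr $sp, $sp       # fetch the interrupted code's $sp from the previous shadow set
        addiu  $sp, $sp, -8   # make room on the interrupted code's stack
        sw     $ra, 0($sp)    # save whatever context the handler clobbers
        # ... handler body ...
        lw     $ra, 0($sp)    # restore saved context
        addiu  $sp, $sp, 8
        wrpgpr $sp, $sp       # write $sp back to the previous shadow set
        eret                  # return from the interrupt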

How does a zero register improve performance?

In the MIPS ISA, there's a zero register ($r0) which always gives a value of zero. This allows the processor to:
Direct the target of any instruction whose result is to be discarded to this register
Use it as a source of 0
It is said in this source that this improved the speed of the CPU. How does it improve performance? And what are the reasons why not all ISA adopt this zero register?
$r0 is not general purpose. It is hardwired to 0. No matter what you
do to this register, it always has a value of 0. You might wonder why
such a register is needed in MIPS.
The designers of MIPS used benchmarks (programs used to determine the
performance of a CPU), which convinced them that having a register
hardwired to 0 would improve the performance (speed) of the CPU as
opposed to not having it. Not everyone agrees a register hardwired to
0 is essential, so not all ISAs have a zero register.
There are a few potential ways that this can improve performance; it's not clear which ones apply to that particular processor, but I've listed them roughly in order from most to least likely.
It avoids spurious pipeline stalls. Without an explicit zero register, it's necessary to take a register, zero it out, and use its value. This means that the zero-using operation is dependent on the zeroing operation, and (depending on how powerful the pipeline forwarding system is) possibly on the zeroed register's previous value. Architectures like x86, which have quite small register files and basically virtualize their registers to keep that from causing problems, have extremely powerful hazard analysis tools. The same is not generally true of RISC processors.
Certain operations may be more pipelineable if they can avoid a register read. If an explicit zero register is used, the fact that the operand will be zero is known at the instruction decode stage, rather than later on in the register fetch stage. Thus, the register read stage can be skipped.
Similarly, the ability to explicitly discard results avoids the need for a register write stage.
Certain operations may generate simpler microcode when one of their operands is known to be zero, or when the result is known to be discarded.
An explicit zero register takes some pressure off the compiler's optimizer, as it doesn't need to be as careful with its register assignment (no need to identify a register which won't cause a stall on read or write).
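To make the above concrete, here are a few common MIPS idioms that lean on $zero (a hedged sample, not an exhaustive list):

    addu  $t0, $zero, $zero   # zero a register without loading an immediate
    addu  $t1, $a0, $zero     # copy a register ("move $t1, $a0")
    beq   $v0, $zero, done    # compare against zero directly ("beqz $v0, done")
    nop                       # branch-delay slot
    addu  $zero, $t2, $t3     # result written to $zero, i.e. discarded
done:
    nop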
For each of your items, here's an answer.
Consider instructions that must take a register for output, where you want to discard that output. Normally, you'd have to make sure you have a free register available, and if not, push some of your current registers onto the stack, which is a costly operation. Evidently, it happens a lot that the output of operations is discarded, and the easiest way to deal with this is to have an 'unused' register available.
Now that we have such an unused register, why not use it? It happens a lot that you want to zero-initialize something or compare something to zero. The long way is to first write zero to a register (which requires an extra instruction and the literal zero in your machine code, which may be of the form 0x00000000 and thus rather long) and then use it. So, one instruction is shaved off, and a little bit of your program size as well.
These optimizations may seem a bit trivial and may raise the question 'how much does that actually improve anything?' The answer here is that the operations described above are apparently used a lot on your MIPS processor.
The concept of a zero register is not new. I first encountered it on a CDC 6600 mainframe, which dates back to the mid-to-late 1960s. In some ways it was one of the first RISC processors, and it was the world's fastest computer for 5 years. In that architecture, the "B0" register was hardwired to always be zero. http://en.wikipedia.org/wiki/CDC_6600
The benefit of such a register is primarily that it simplified the instruction set. When the decoding and orchestration of a simple and regular instruction set can be implemented without microcode, it increases performance. In addition, for the 6600, as for most LSI chips today, the time it takes a signal to travel the length of a "wire" becomes one of the key factors in execution speed, and keeping the instruction set simple (and avoiding microcode) needs fewer transistors and results in shorter circuit paths.
A zero register allows saving some opcodes when designing a new
instruction set architecture (ISA).
For example, the main RISC-V spec has 32 pseudo-instructions that
depend on the zero register (cf. Tables 26.2 and 26.3). A pseudo-instruction is an
instruction that is mapped by the assembler to another real
instruction (for example, branch-if-equal-to-zero is mapped to
branch-if-equal). For comparison: the main RISC-V spec lists 164
real instruction opcodes (i.e. counting RV(32|64)[IMAFD] base/extensions, a.k.a. RV64G). That means without a zero register RISC-V RV64G would occupy 32 more opcodes for those instructions (i.e. 20 % more). For a concrete RISC-V CPU
implementation, this real-to-pseudo instruction ratio may shift in either direction
depending on which extensions are selected.
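MIPS relies on the same trick; several of its pseudo-instructions exist only because $zero does (a small illustrative sample, with the expansions commonly emitted by GNU as):

    beqz $t0, skip       # expands to: beq  $t0, $zero, skip
    negu $t1, $t2        # expands to: subu $t1, $zero, $t2
    not  $t3, $t4        # expands to: nor  $t3, $t4, $zero
    nop                  # encoded as:  sll  $zero, $zero, 0
skip:
    nop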
Having fewer opcodes simplifies the instruction decoder.
A more complex decoder needs more time for decoding instructions
or occupies more gates (that can't be used for more useful CPU units)
or both.
Existing, incrementally developed ISAs have to deal with
backwards-compatibility. Thus, if your original ISA design
doesn't include a zero register, you can't just add it in a later
revision without breaking compatibility. Also, if your existing
ISA already requires a very complex decoder, then adding a zero
register doesn't pay off.
Besides the modern RISC-V ISA (developed since 2010, first
ratification in 2019), ARMv8 AArch64 (a 64-bit ISA released in 2011),
in contrast to the previous 32-bit ARM ISAs, also features a zero register. Because of this and other changes,
the AArch64 ISA has much less in common with the previous 32-bit ARM
ISAs than, say, x86-64 has with x86.
In contrast to AArch64, x86-64
doesn't have a zero register. Although x86-64 is more modern than
the previous 32-bit x86 ISA, it only changed incrementally.
Thus, it features all the existing x86 opcodes plus 64-bit
variants, and its decoder is already very complex.

MIPS forwarding implementation (tough)

I think I understand the first part,
(i); I at least have answers for it. I am not sure where this implementation would fail, though, for part (ii). Part (ii) has me completely stumped. Does anyone know situations where it would fail?
If you can shine some light on part (iii) you would be my entire class's hero. We're all stumped there. Thanks for any input.
Tim FlimFlam, the infamous architect of the MN-4363 processor, is struggling with a pipelined implementation of the basic MIPS ISA.
(i) To implement forwarding, Tim connected the output of logic from EX and MEM stages (these logic outputs represent inputs to EXMEM and MEMWB latches, respectively) to the input of IDEX register. He claims that he will be able to cover any dependency in this manner.
• Would this implementation work?
• Would he need to insert any muxes? Explain for
1. the producer instruction is a load.
2. the producer instruction is of R-type.
3. the consumer instruction is of R-type.
4. the consumer instruction is a branch.
5. the consumer instruction is a store.
(ii) Tim claims that forwarding to EX stage only suffices to cover all dependencies.
• Provide two examples where his implementation would fail.
• Would “fail” in this case correspond to breaking correctness constraints?
(iii) Tim tries to identify the minimum amount of information to be transferred across pipeline stages. Considering R-type, data transfer, and branch instructions, explain how wide each pipeline register should be, demarcating the different fields per latch.
Not sure if this is late, but the answer rests in "all dependencies" in part (ii). Dependencies/hazards come in multiple types, viz. control and data. Some data hazards can be fixed by forwarding (from the MEM and WB stages to the execute stage). Other data hazards, like a load-use dependency, cannot be fixed by forwarding alone. To see why, note that a LOAD instruction in the MEM stage will have its result from memory only at the end of that clock cycle. In that same clock cycle, any instruction in the execute stage which requires the value of the LOAD instruction would get an incorrect value: at the beginning of the cycle the ALU is beginning to execute while the memory is only beginning to fetch the data, and at the end of the cycle the memory has finished fetching the data, but the ALU has also finished computing with the wrong value. To prevent this hazard, the ALU must begin computing only after the data memory has finished fetching; i.e. the ALU must stall for one cycle, or you must have a nop between the LOAD and the dependent ALU instruction. Hope this helps!
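A minimal sketch of that failing case (arbitrary registers): with forwarding only into EX, the consumer cannot receive the loaded value during the cycle in which the load is still in MEM, so one bubble is required (a hardware stall on an interlocked pipeline, or a nop / independent instruction on MIPS I):

    lw   $t0, 0($a0)      # loaded value is available only at the end of MEM
    nop                   # one-cycle bubble: stall, nop, or an independent instruction
    addu $t1, $t0, $t2    # now forwarding into EX can supply the correct $t0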