What does the TargetWrite/IorD control line do on a multicycle MIPS processor?

We learned all the main details about control lines and the general functionality of the MIPS chip in single cycle and also with pipelining.
But in the multicycle design the control lines are different, along with other changes.
Specifically, what do the TargetWrite (ALUout) and IorD control lines actually modify?
Based on my analysis, TW seems to modify where the PC points to depending on the bits it receives (for Jump, Branch, or standard moving to the next line)... Am I missing something?
Also what exactly does the IorD line do?
I looked at both course textbooks, See MIPS Run and Computer Architecture: A Quantitative Approach by Hennessy and Patterson, and neither seems to mention these lines...

First, let's note that this block diagram does not have separate instruction memory and data memory.  That means that it either has a unified cache or goes directly to memory.  Most other block diagrams for MIPS will have separate dedicated Instruction Memory (cache) and Data memory (cache).  The advantage of separate memories is that the processor can read instructions and read/write data in parallel.  In a simple version of a multicycle processor, there is likely no need to read instructions and data in parallel, so a unified cache simplifies the hardware.
So, what IorD is doing is selecting the source for the address provided to the Memory — as to whether it is doing a fetch cycle for an instruction, or a read/write from/to data.
When IorD=0 then the PC provides the address from which to read (i.e. instruction fetch), and, when IorD=1 then the ALU provides the address to read/write data from.  For data operations, the ALU is computing a base + displacement addressing mode: Reg[rs] + SignExt32(imm16) as the effective address to use for the data read or write operation.
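The address selection described above can be sketched as a two-input mux. This is only an illustrative Python model, not the actual hardware: the function names `memory_address` and `sign_extend16`, and the example addresses, are assumptions made for the sketch.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def memory_address(ior_d: int, pc: int, alu_out: int) -> int:
    """Model the mux in front of the unified memory: IorD=0 picks PC, IorD=1 picks the ALU result."""
    return alu_out if ior_d else pc


# Instruction fetch: IorD=0, so the PC supplies the address.
fetch_addr = memory_address(0, pc=0x00400000, alu_out=0)

# Data access such as lw $t0, -4($sp): IorD=1, so the ALU result
# Reg[rs] + SignExt32(imm16) supplies the address.
sp = 0x7FFFFFF0  # hypothetical value of Reg[rs]
effective = (sp + sign_extend16(0xFFFC)) & 0xFFFFFFFF
data_addr = memory_address(1, pc=0x00400004, alu_out=effective)
```

The same memory port serves both cases; only the select line changes between the fetch cycle and the data cycle.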
Further, let's note that this block diagram does not contain a separate adder for incrementing the PC by 4, whereas most other block diagrams do.  Look up any of the first few MIPS single cycle datapath images, and you'll see the dedicated adder for that PC increment.  Using a dedicated adder allows the PC to be incremented in parallel with operations done by the ALU, whereas omitting that dedicated adder means that the main ALU must perform the increment of the PC.  However, this probably saves transistors in a simple version of a multicycle implementation where the ALU is not in use every cycle, and so can be used otherwise.
Since Target has a control TargetWrite, we might presume this is an internal register that might be useful in buffering the intended branch target address, for example, if the branch target is computed in one cycle, and finally used in another.
(I thought this could be about buffering for branch delay slot implementation (since those branches are delayed one instruction), but were that the case, the J-Type instructions would have also gone through Target, and they don't.)
So, it looks to me like the machinery there for this multicycle processor is to handle the branch instructions, say beq, which has to:
compute the next sequential PC address from PC + 4
compute the branch target address from (PC+4) + SignExt32(imm16)
compute the branch condition (does Reg[rs] == Reg[rt] ?)
But in what order would they be computed?  It is clear from the control signals in state 0 that PC+4 is computed first and written back to the PC, for all instructions (i.e. for branches, whether the branch is taken or not).
It seems to me that in the next cycle, (PC+4) + SignExt32(imm16) is computed (by reusing the prior PC+4, which is now in the PC register), and this result is stored in Target to buffer that value, since the processor doesn't yet know whether the branch is taken.  In the following cycle, the contents of rs and rt are compared for equality. If they are equal, the branch should be taken, so PCSource=1, PCWrite=1 selects the Target from the buffer to update the PC. If the branch is not taken, the PC has already been updated to PC+4, so that PC+4 stands (PCWrite=0, PCSource=don't care) for the start of the next instruction.  In either case the next instruction runs with whatever address the PC holds.
Alternately, since the processor is multicycle, the order of computation could be: compute PC+4 and store into the PC.  Compute the branch condition, and decide what kind of cycle to run next, namely, for the not-taken condition, go right to the next instruction fetch cycle (with PC+4 in the PC), or, for taken branch condition, compute (PC+4) + SignExt32(imm16) and put that into the PC, and then go on to the next instruction fetch cycle.
This alternative approach would require dynamic alteration of the cycles/state for branches, so would complicate the multicycle state machine somewhat and would also not require buffering of a branch Target — so I think it is more likely the former rather than this alternative.
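The first ordering (compute PC+4, buffer the target, then compare) can be traced cycle by cycle. The following is a rough Python sketch under that assumption; `run_beq` and its trace format are hypothetical names, and it follows the simplified formula used above (real MIPS additionally shifts the immediate left by 2 before adding).

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def run_beq(pc: int, rs_val: int, rt_val: int, imm16: int):
    """Step one beq through three cycles; return (final PC, trace of (state, PC, Target))."""
    trace = []
    target = None

    # Cycle 0 (fetch): the ALU computes PC + 4, which is written back to the PC.
    pc = (pc + 4) & 0xFFFFFFFF
    trace.append(("fetch", pc, target))

    # Cycle 1 (decode): the ALU computes the branch target from the updated PC.
    # TargetWrite=1 latches it, since the outcome is not known yet.
    target = (pc + sign_extend16(imm16)) & 0xFFFFFFFF
    trace.append(("decode", pc, target))

    # Cycle 2 (compare): the ALU compares rs and rt. If equal, PCWrite=1 and
    # PCSource=1 select Target; otherwise the PC keeps PC+4.
    if rs_val == rt_val:
        pc = target
    trace.append(("compare", pc, target))
    return pc, trace


taken_pc, taken_trace = run_beq(0x1000, rs_val=5, rt_val=5, imm16=0x0010)
not_taken_pc, _ = run_beq(0x1000, rs_val=5, rt_val=6, imm16=0x0010)
```

Note that Target is written in every run of the sketch, taken or not; only the final cycle decides whether it is used.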

Related

implementing jump register (jr, sll, slti) in multicycle

I've been asked to sketch a multicycle datapath and control unit for the instructions (jr, sll, slti) all together, and to draw the main controller FSM for these 3 instructions.
I'm struggling with it; I know how to do it for a single cycle datapath but not for multicycle.
Please help.
If you know the single cycle datapath, and want to take that to multi cycle, generally speaking we subdivide the single cycle into stages.
As only one stage will execute at a time, a relatively simple state machine controls what stage to execute currently/next to activate the appropriate stage, and deactivate the others.
The stages would be similar to the stages of a pipelined processor, namely those corresponding to Instruction Fetch, Decode, Execute, Memory, and Writeback.
In a pipelined machine, all the instructions need to go through all the stages, however, in a multicycle machine, certain instructions can skip certain states, and this should be manifest in the state machine. For example, while all instructions share Instruction Fetch and Decode, only loads and stores interact with the Data Memory, so any other instruction can skip the Mem stage.
The simplest possible state machine simply chooses all the states in succession, for every instruction, without ever consulting the instruction. This would fail to take advantage of the ability to skip stages for certain instructions that don't need those stages, and of course, you're being asked to do more than the simplest state machine.
A better state machine might start with Instruction Fetch as the active state, go to Decode state next, and by then it can use knowledge of the actual instruction to decide which of the remaining stages to skip for that given instruction.
You can imagine that the Control logic fetches some bits that inform the state machine of which stages can be skipped for any given opcode. There are only three states/stages left, so all that is needed from Control is a boolean for each.
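As a rough illustration of that idea, the state sequence can be derived from one boolean per remaining stage. This is a minimal Python sketch, not a prescribed design: the table `STAGES_NEEDED` and the choice of which stages each instruction uses are assumptions for illustration (for example, `jr` is assumed to update the PC in the EX state).

```python
# Hypothetical per-opcode flags: which of the three remaining stages are needed.
STAGES_NEEDED = {
    "lw":   {"EX": True, "MEM": True,  "WB": True},
    "sw":   {"EX": True, "MEM": True,  "WB": False},
    "sll":  {"EX": True, "MEM": False, "WB": True},
    "slti": {"EX": True, "MEM": False, "WB": True},
    "jr":   {"EX": True, "MEM": False, "WB": False},
}


def state_sequence(opcode: str) -> list:
    """Return the states the controller visits for one instruction."""
    # Every instruction goes through fetch and decode.
    states = ["IF", "ID"]
    # After decode, the opcode is known, so unused stages are skipped.
    for stage in ("EX", "MEM", "WB"):
        if STAGES_NEEDED[opcode][stage]:
            states.append(stage)
    return states


for op in ("jr", "sll", "slti", "lw"):
    print(op, "->", " ".join(state_sequence(op)))
```

After the last listed state, the controller returns to IF for the next instruction.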
For more information, especially on the stages and what they do, I suggest searching on Pipelined/Pipelining MIPS, as I believe there is more material on pipelined processors than on the multicycle design. You won't be concerned with pipeline hazards, but the breakdown of the single cycle design into the stages might help.

In which pipeline stage is the branch decision made?

In which RISC pipeline stage is the branch decision made? Is it in the "Decode" or "Execute" stage, or in another stage? Assume the pipeline has 5 stages: "IF", "ID", "EX", "MEM" and "WB".
There are a few ways to implement this in a classic 5-stage RISC in general. For unconditional direct (not register) branches, obviously you can detect them in ID and have the target PC ready for the next IF cycle (with 1 cycle of branch latency, i.e. 1 wasted IF cycle if you don't hide that latency somehow, e.g. MIPS's branch delay slot or branch prediction).
Some toy pipelines, like the one described in this answer, do the simplest thing and evaluate the condition in the ALU in EX, forwarding to a mux between PC+4 and PC+4+rel_offset and eventually on to IF, with 3 cycles of branch latency (end of EX to start of IF).
Actual commercial MIPS I (R2000) evaluated branch conditions in the first half-cycle of EX, forwarding to IF which only needed an address in the second half-cycle. See How does MIPS I handle branching on the previous ALU instruction without stalling? This gives a branch latency of 1 cycle, short enough to be fully hidden by 1 branch-delay slot, even for conditional or indirect jr $reg branches.
This half-cycle speed is why MIPS branch conditions are simple, only checking the whole register for non-zero or not, or checking the MSB (sign bit) for non-zero. Simple RISCs with a FLAGS / status register (like PowerPC or ARM) could use a similar strategy of very quickly checking a flags condition.
(Note that RISC-V allows a full set of branch conditions; as described in RISC-V's design rationale, checking a whole register for all-zeros in modern CMOS designs is apparently not much shorter gate-delay than comparing two registers for equality or even > or < with a good comparator, presumably something smarter than subtract with ripple-carry.
RISC-V assumes branch-prediction will hide branch delays.)
The previous version of this answer incorrectly claimed that MIPS I evaluated branch conditions in ID itself. A toy pipeline in this question does that, but that would require the inputs to be ready earlier than usual. It introduces the problem of a b?? instruction stalling while waiting for the EX result of the previous ALU instruction, like in common sequences like slt $at, $t1, $t2 / bnez $at, target, i.e. the expansion of a pseudo-instruction like blt $t1, $t2.
Wikipedia's Classic RISC (5-stage pipeline) article's Instruction Decode section was misleading at best, but has been fixed. It now says "The branch condition is computed in the following cycle (after the register file is read)" - I think that was a bugfix, not just clarification: this is all described in the ID section, implying it happened there without explicit phrasing to the contrary. Also, the still-present claim that "Some architectures made use of the Arithmetic logic unit (ALU) in the Execute stage, at the cost of slightly decreased instruction throughput." makes no sense if it wasn't talking about evaluating them earlier, since nothing else could be using the ALU during that time in a scalar in-order pipeline.
Other sources (like these slides: http://home.deib.polimi.it/santambr/dida/phd/wonderland/2014/doc/PDF/4_BranchHazard_StaticPrediction_V0.pdf) say "Branch Outcome and Branch Target Address are ready at the end of the EX stage (3th stage)" for a classic MIPS beq instruction. That's not how commercial R2000 worked, but may be describing a simple MIPS implementation from a textbook or course material that does work that way.
Much discussion of MIPS is actually about hypothetical MIPS-like 5-stage RISC pipelines in general, not real MIPS R2000, or the classic Stanford MIPS CPU that R2000 was based on (but it was a full re-design). So it's hard to know whether something you find about "MIPS" applies to R2000 (gcc -march=mips1) or if it's for a simplified teaching version of MIPS.
Some "MIPS" implementations aren't even the same ISA, e.g. without branch-delay slots (which complicate exception handling significantly).
This originally wasn't a MIPS question at all, just generic classic 5-stage RISC. There were multiple early RISC ISAs, many of them originally designed around a 5-stage pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline). I don't know a lot about their internals:
Different architectures could make different choices, e.g. stall or use branch prediction + speculative fetch/decode if needed while they wait for the branch result to be ready from whatever stage produces it.
And even speculative execution is possible, even with a static prediction like forward not-taken / backward taken. If still in-order, mis-speculation can be caught before it reaches write-back or MEM. You don't want any speculative stores written to cache, but you can definitely catch it by the time the branch reaches EX. All instructions which have a control dependency on the branch are younger and therefore are in earlier pipeline stages (if present at all; IF could have missed in I-cache).

Is that true if we can always fill the delay slot there is no need for branch prediction?

I'm looking at the five-stage MIPS pipeline (IF, ID, EXE, MEM, WB) in H&P 3rd ed. and it seems to me that the branch decision is resolved in the ID stage, so that by the time the branch instruction reaches its EXE stage, the second instruction after the branch can be fetched correctly. But this leaves us the problem of possibly still wasting the 1st instruction right after the branch instruction.
I also encountered the concept of branch delay slot, which means you want to fill the 1st instruction soon after the branch with something useful as well as "harmless" that whether the branch is taken or not the instruction is executed as desired and the 1st instruction after the branch is not wasted.
My question is, first of all, is my above understanding correct? If it's correct, then the problem comes from the concept of branch prediction, which seems to be trying to fill the first instruction with instruction from the predicted place that the program is going to. But if we can always find some instruction to fill the branch delay slot, we would not need the feature of branch prediction, right?
For the classic MIPS (R2000) pipeline, the branch delay slot makes branch prediction useless as you perceive. (Technically, a design could combine a predictor/indicator of whether the delay slot instruction is a nop with a branch predictor. This would allow the nop to be skipped, modestly improving performance on a correct branch prediction.)
However, processor pipelines are often long and wide enough (and branch condition evaluation sufficiently delayed) that a single delay slot is not sufficient to fill the delay between when the post-branch instruction address is needed and the branch direction and target are known.
For example, a follow-on processor, the MIPS R4000, significantly lengthened the pipeline and as a result could not determine the location of the post-branch instruction early enough. The designers chose to use a simple static predict not-taken strategy.
If one did not care about binary compatibility, one could add more delay slots. However, finding useful instructions to fill such slots increases in difficulty as the number of slots increases. For certain loop-rich code, regularly filling two delay slots might be practical, and I think at least one DSP had two delay slots.
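The trade-off with extra delay slots can be put into a toy cost model. The sketch below is only an assumption-laden illustration: the function names are hypothetical, and the branch fraction and fill probabilities are made-up inputs, not measurements from any real compiler or processor.

```python
def wasted_cycles_per_branch(fill_probabilities):
    """Expected number of delay slots per branch that end up holding a nop.

    fill_probabilities[i] is the assumed chance that the compiler finds a
    useful instruction for delay slot i.
    """
    return sum(1.0 - p for p in fill_probabilities)


def effective_cpi(branch_fraction, fill_probabilities, base_cpi=1.0):
    """Base CPI plus the cycles lost to unfilled delay slots."""
    return base_cpi + branch_fraction * wasted_cycles_per_branch(fill_probabilities)


# Made-up inputs: 20% of instructions are branches; the first slot is filled
# 60% of the time, and a second slot only 30% of the time.
one_slot = effective_cpi(0.2, [0.6])
two_slots = effective_cpi(0.2, [0.6, 0.3])
```

With these inputs the second slot costs more than the first because it is harder to fill, which is the effect described above.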
Branch prediction can also be used to decouple fetch from execution so that even if the condition cannot be evaluated (e.g., depending on the result of a high latency operation such as a data cache miss or a division), fetch can continue. Such decoupling could be used to generate instruction cache misses early (hiding some of their latency) and to reduce the impact of variable throughput at different stages (so an earlier stage can continue operating with maximum throughput when a later stage stalls or has reduced throughput and the buffered instructions can then hide later stalls or reduced throughput in the earlier stage).
The fact is that the compiler may not always find an instruction to fill the delay slot.
What is more, branches are highly predictable.
Before the IF stage, you don't even know whether it is a branch instruction (you have to fetch it from instruction memory).
within a mips core like that with zero wait state randomly accessed ram sure. but depending on how the fetching is implemented and caching behind that, you may still want/need the concept of branch prediction to start those fetches earlier. the pipeline is just a small part of a bigger system. system busses are usually not single cycle here is my address I want my data by the end of this cycle, there are address busses and data busses and tags that cross them so you can have multiple transactions in flight at the same time, like a pipeline trying to optimize the bandwidth of the data bus knowing the peripherals and memory on the far side are too slow for that bus.
prediction "could" be used to assist these other features in getting instructions into the pipe faster or more efficiently.
from an academic sense though, the idea of the slot is to give the pipe a cycle to switch gears along another execution path. It only actually saves you if the incoming end of the pipe can be fed any random thing it wants every clock cycle, which isn't real world.
another academic solution is the arm one of conditional execution on every instruction: you can construct execution sequences to keep the pipe full and not have to flush or stall... again so long as what feeds the pipe can keep up... arm dumped the conditional instruction idea in the new 64 bit instruction set. on some/newer mips you can disable the branch shadow/delay slot.

How does a zero register improve performance?

In the MIPS ISA, there's a zero register ($r0) which always gives a value of zero. It serves two purposes:
Any instruction which produces a result that is to be discarded can direct its target to this register
It can be used as a source of 0
It is said in this source that this improved the speed of the CPU. How does it improve performance? And what are the reasons why not all ISA adopt this zero register?
$r0 is not general purpose. It is hardwired to 0. No matter what you
do to this register, it always has a value of 0. You might wonder why
such a register is needed in MIPS.
The designers of MIPS used benchmarks (programs used to determine the
performance of a CPU), which convinced them that having a register
hardwired to 0 would improve the performance (speed) of the CPU as
opposed to not having it. Not everyone agrees a register hardwired to
0 is essential, so not all ISAs have a zero register.
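The "hardwired to 0" behaviour can be modelled in a few lines. This is a minimal sketch in Python, assuming a 32-entry, 32-bit register file; the class name `RegisterFile` and its methods are hypothetical, and the register numbers in the usage lines follow the usual MIPS convention ($t0 = 8, $t1 = 9).

```python
class RegisterFile:
    """Toy register file in which register 0 always reads as zero."""

    def __init__(self, count: int = 32):
        self._regs = [0] * count

    def read(self, index: int) -> int:
        # Register 0 is a constant source of zero.
        return 0 if index == 0 else self._regs[index]

    def write(self, index: int, value: int) -> None:
        # Writes to register 0 are silently discarded.
        if index != 0:
            self._regs[index] = value & 0xFFFFFFFF


rf = RegisterFile()
rf.write(9, 1234)                   # $t1 = 1234
rf.write(0, 999)                    # result thrown away
rf.write(8, rf.read(9) + rf.read(0))  # addu $t0, $t1, $zero acts as a move
```

Both uses from the question show up here: register 0 works as a discard target and as a source operand of zero.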
There are a few potential ways that this can improve performance; it's not clear which ones apply to that particular processor, but I've listed them roughly in order from most to least likely.
It avoids spurious pipeline stalls. Without an explicit zero register, it's necessary to take a register, zero it out, and use its value. This means that the zero-using operation is dependent on the zeroing operation, and (depending on how powerful the pipeline forwarding system is) possibly on the zeroed register's previous value. Architectures like x86, which have quite small register files and basically virtualize their registers to keep that from causing problems, have extremely powerful hazard analysis tools. The same is not generally true of RISC processors.
Certain operations may be more pipelineable if they can avoid a register read. If an explicit zero register is used, the fact that the operand will be zero is known at the instruction decode stage, rather than later on in the register fetch stage. Thus, the register read stage can be skipped.
Similarly, the ability to explicitly discard results avoids the need for a register write stage.
Certain operations may generate simpler microcode when one of their operands is known to be zero, or when the result is known to be discarded.
An explicit zero register takes some pressure off the compiler's optimizer, as it doesn't need to be as careful with its register assignment (no need to identify a register which won't cause a stall on read or write).
For each of your items, here's an answer.
Consider instructions that must take a register for output, where you want to discard this output. Normally, you'd have to make sure that you have a free register available, and if not, push some of your current registers onto the stack, which is a costly operation. Evidently, it happens a lot that the output of operations is discarded, and the easiest way to deal with this is to have an 'unused' register available.
Now that we have such an unused register, why not use it? It happens a lot that you want to zero-initialize something or compare something to zero. The long way is to first write zero to that register (which requires an extra instruction and the literal for zero in your machine code, which may be of the form 0x00000000 which is rather long) and then use it. So, one instruction shaved off and a little bit of your program size as well.
These optimizations may seem a bit trivial and may raise the question 'how much does that actually improve anything?' The answer here is that the operations described above are apparently used a lot on your MIPS processor.
The concept of a zero register is not new. I first encountered it on a CDC 6600 mainframe, which dates back to the mid-to-late 1960's. In some ways it was one of the first RISC processors, and was the world's fastest computer for 5 years. In that architecture, the "B0" register was hardwired to always be zero. http://en.wikipedia.org/wiki/CDC_6600
The benefit of such a register is primarily that it simplified the instruction set. When the decoding and orchestration of simple and regular instruction sets can be implemented without microcode, it increases performance. In addition, for the 6600, like most LSI chips today, the time spent for a signal to travel the length of a "wire" becomes one of the key factors in execution speed, and keeping the instruction set simple (and avoiding microcode) allows fewer transistors and results in shorter circuit paths.
A zero register allows saving some opcodes when designing a new
instruction set architecture (ISA).
For example, the main RISC-V spec has 32 pseudo-instructions that
depend on the zero register (cf. Tables 26.2 and 26.3). A pseudo-instruction is an
instruction that is mapped by the assembler to another real
instruction (for example, branch-if-equal-to-zero is mapped to
branch-if-equal). For comparison: the main RISC-V spec lists 164
real instruction opcodes (i.e. counting RV(32|64)[IMAFD] base/extensions, a.k.a. RV64G). That means without a zero register RISC-V RV64G would occupy 32 more opcodes for those instructions (i.e. 20 % more). For a concrete RISC-V CPU
implementation, this real-to-pseudo instruction ratio may shift in either direction
depending on which extensions are selected.
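To make the mapping concrete, here is a small Python sketch of how an assembler might rewrite a few pseudo-instructions into real ones using the zero register `x0`. The table name `PSEUDO` and the function `expand` are hypothetical; the handful of expansions shown are only a sample of the spec's pseudo-instruction tables, and the last line just re-checks the 20 % figure from the counts quoted above.

```python
# A few zero-register pseudo-instructions and their real-instruction forms.
PSEUDO = {
    "nop":  lambda: "addi x0, x0, 0",
    "beqz": lambda rs, offset: f"beq {rs}, x0, {offset}",
    "bnez": lambda rs, offset: f"bne {rs}, x0, {offset}",
    "neg":  lambda rd, rs: f"sub {rd}, x0, {rs}",
    "snez": lambda rd, rs: f"sltu {rd}, x0, {rs}",
    "j":    lambda offset: f"jal x0, {offset}",
}


def expand(mnemonic: str, *operands: str) -> str:
    """Rewrite one pseudo-instruction as the real instruction it stands for."""
    return PSEUDO[mnemonic](*operands)


print(expand("beqz", "a0", "loop"))  # beq a0, x0, loop
print(expand("neg", "t0", "t1"))     # sub t0, x0, t1

# 32 pseudo-instructions relative to 164 real opcodes, as a percentage.
extra_opcode_percent = 32 / 164 * 100
```

None of these needs its own opcode: each one reuses an existing instruction with `x0` as a source or destination.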
Having fewer opcodes simplifies the instruction decoder.
A more complex decoder needs more time for decoding instructions
or occupies more gates (that can't be used for more useful CPU units)
or both.
Existing, incrementally developed ISAs have to deal with
backwards-compatibility. Thus, if your original ISA design
doesn't include a zero register, you can't just add it in a later
revision without breaking compatibility. Also, if your existing
ISA already requires a very complex decoder, then adding a zero
register doesn't pay off.
Besides the modern RISC-V ISA (developed since 2010, first
ratification in 2019), ARMv8 AArch64 (a 64 Bit ISA released in 2011),
in contrast to the previous ARM 32 bit ISAs, also features a zero register. Because of this and other changes
AArch64 ISA has much less in common with previous ARM 32 Bit
ISAs than - say - x86 and x86-64 ISAs.
In contrast to AArch64, x86-64
doesn't have a zero register. Although x86-64 is more modern than
the previous 32 bit x86 ISA, its ISA only changed incrementally.
Thus, it features all the existing x86 opcodes plus 64 bit
variants, and thus the decoder already is very complex.

Is it possible that in MIPS an instruction's certain steps come before that of its predecessor in a pipelined structure?

This is a question about computer architecture and I hope somebody has a clue. More specifically, it is about MIPS pipelined instruction flow, and I am unclear about some aspects of it. I currently do not have enough reputation, so I cannot post an image.
Does an S (stall) mean no following instructions can utilize the time slot taken by the stall?
Can two consecutive instructions both have X (execute) in the same time slot?
Is it possible that the M (Memory Access) and W (Write Back) of an instruction come before that of its predecessor in a pipelined structure?
In the situation of a loop and the last instruction is a repetition of the first instruction, why there are 2 F's (fetch) in the last instruction?
For issue 1, in a simple, scalar pipeline, a stall introduces a pipeline bubble which cannot be "popped". To allow an instruction later in program order to fill a pipeline bubble, that instruction would have to go past the stalled instruction. Supporting such reordering of instructions increases the complexity of the pipeline, which tends to increase design and production costs and to increase either pipeline depth or cycle time (as well as use more energy per active cycle [out-of-order execution can be more energy efficient in total even when more energy is used when active]). The mechanisms needed to support such reordering also increase the complexity of explaining pipelines.
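The way a bubble propagates to every later instruction can be shown with a tiny scheduler for a scalar in-order pipeline. This is a rough Python sketch under simplifying assumptions (one instruction per stage, stalls only while waiting in D before X); the names `schedule` and `render` are made up for the illustration, and `S` marks a cycle in which an instruction is held in its current stage.

```python
STAGES = ["F", "D", "X", "M", "W"]


def schedule(stalls):
    """stalls[i] = extra cycles instruction i waits in D before entering X.

    Returns one dict per instruction giving the cycle at which it enters each stage.
    An instruction enters a stage only after its predecessor has left that stage.
    """
    entries = []
    for stall in stalls:
        prev = entries[-1] if entries else None
        e = {}
        e["F"] = prev["D"] if prev else 0
        e["D"] = max(e["F"] + 1, prev["X"] if prev else 0)
        e["X"] = max(e["D"] + 1 + stall, prev["M"] if prev else 0)
        e["M"] = max(e["X"] + 1, prev["W"] if prev else 0)
        e["W"] = max(e["M"] + 1, prev["W"] + 1 if prev else 0)
        entries.append(e)
    return entries


def render(entries):
    """Draw one text row per instruction; '.' means the instruction is not in the pipe."""
    last_cycle = max(e["W"] for e in entries)
    rows = []
    for e in entries:
        row = ["."] * (last_cycle + 1)
        for stage, nxt in zip(STAGES, STAGES[1:] + [None]):
            leave = e[nxt] if nxt else e[stage] + 1
            row[e[stage]] = stage
            for cycle in range(e[stage] + 1, leave):
                row[cycle] = "S"
        rows.append(" ".join(row))
    return rows


# The second instruction stalls for two cycles; the third is held up behind it.
diagram = render(schedule([0, 2, 0]))
print("\n".join(diagram))
```

In the printed diagram the third instruction sits in F for two extra cycles: it cannot pass the stalled instruction, so the lost slots are never reclaimed.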
For issue 2, with a more complex pipeline it is possible to begin execution of more than one instruction at the same time. Such processors are called superscalar. With in-order execution, only instructions in a consecutive sequence (in program order) can begin execution at the same time, and this requires that the instructions do not have data dependencies and that sufficient hardware resources are available to execute the instructions and handle their results. For an in-order microarchitecture, the width of the earlier pipeline stages is typically the same as the width of later pipeline stages, though buffering would allow multiple instructions to accumulate behind a stall.
(Even at only two-wide execution, there are usually additional restrictions on what kinds of instructions can be executed in parallel. E.g., one execution port might not handle memory accesses or branches while the other execution port might handle those instructions but not shifts or multiplies. Having two copies of hardware for relatively expensive operations [like shifts and multiplies] increases size and limiting the data paths for memory accesses and branches can simplify design and potentially reduce delay.)
For issue 3, out-of-order execution allows the reordering of instructions, so an instruction later in program order could execute and writeback results to the register file before an earlier instruction. With some additional complexity in handling exceptions/interrupts and arbitrating register write port use (or increasing the number of write ports), it is also possible for an in-order processor to writeback results out of program order. The Motorola 88110 (from the early 1990s) is an example of a processor which did such. In order to handle exceptions, the 88110 had a history buffer to hold data that is overwritten by instructions that are later in program order than where the exception is. The 88110 had two additional read ports to each of the register files to read the data in the destination registers and write such to the history buffer.
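The history-buffer idea can be sketched as "save the old value before every write, and undo the younger writes on an exception". The Python below is a loose illustration of that principle only, not a model of the 88110's actual design; the class and method names are hypothetical, and it assumes in-flight instructions never write the same register out of program order.

```python
class HistoryBufferRegFile:
    """Register file that allows early writeback and can undo it on an exception."""

    def __init__(self, count: int = 32):
        self.regs = [0] * count
        self.history = []  # (sequence number, register, overwritten value), in write order

    def write(self, seq: int, reg: int, value: int) -> None:
        # Read the destination's old value first and save it in the history buffer.
        self.history.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def retire_up_to(self, seq: int) -> None:
        # Instructions at or before seq can no longer be rolled back.
        self.history = [h for h in self.history if h[0] > seq]

    def rollback_after(self, faulting_seq: int) -> None:
        # Undo, newest first, every write made by an instruction that is
        # later in program order than the faulting instruction.
        for seq, reg, old in reversed(self.history):
            if seq > faulting_seq:
                self.regs[reg] = old
        self.history = [h for h in self.history if h[0] <= faulting_seq]


rf = HistoryBufferRegFile()
rf.write(seq=0, reg=1, value=5)   # instruction 0 completes normally
rf.retire_up_to(0)
rf.write(seq=2, reg=2, value=99)  # instruction 2 writes back before instruction 1
rf.rollback_after(1)              # instruction 1 raises an exception
```

After the rollback, register 2 holds its old value again, so the exception handler sees the state as if instruction 2 had never run.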
For issue 4, I am guessing that you mean the case where the body of the loop is composed of only one instruction. For a typical RISC instruction set the branch instruction controlling the loop is a separate instruction from the instruction performing a computation or memory access, so the loop would actually contain two instructions. (Power, formerly PowerPC, could have a one instruction delay loop using branch on counter which decrements the special counter register, but optimizing instruction fetch for a simple implementation for such peculiar code would be foolish.)
For the simple classic 5-stage pipeline with delayed branches, it does not make sense from a performance perspective to avoid an instruction cache access since the loop branch does not introduce a pipeline bubble even when taken. This means that there is no opportunity to execute more instructions. However, in some microarchitectures where redirecting instruction fetch to a non-sequential address introduces a pipeline bubble (particularly if from instruction fetch taking more than one cycle), providing a small fast-access buffer can improve performance. (Instruction fetch bandwidth limitations could also justify a buffer for performance; a small buffer could provide higher bandwidth than a large cache or an off-chip memory.) In addition, to reduce energy use, the use of a loop buffer makes considerable sense, but one would almost certainly not want to limit the size of the buffer to only two instructions (the branch plus one "body" instruction) because such tiny loops are rare and even increasing the buffer size to eight instructions would only add a modest amount of hardware.
In order to specially handle the case of small bodied loops, such loops must be detected. While the buffer could always be filled with the last N instructions (to avoid the first encounter of the short backward branch not "hitting" in the loop buffer--and such a buffer could also even out variations in instruction fetch which might be caused by crossing cache line boundaries, cache misses, fetch redirection delays, etc.), it would be necessary to check each branch instruction to see if it targeted an instruction within the buffer. (It would even be possible to provide a special storage for the loop branch instruction since storage is only needed for the condition checked, a small index into the loop buffer and an indication of where the branch is, but short loops are probably not sufficiently common for such specialized hardware.) In effect, a loop buffer can be a very small Level 0 instruction cache.
(A branch target instruction cache [BTIC] is a mechanism similar to a loop buffer, but instead of caching instructions only from the target of the most recent loop branch a BTIC caches instructions from the targets of a number of recent branches. BTICs are primarily used to hide instruction fetch latency.)
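The "last N instructions" variant of a loop buffer, with the check of whether a branch targets something inside it, can be sketched as follows. This is only a Python illustration under assumptions: the class `LoopBuffer`, its capacity and the instruction strings are hypothetical, and real hardware would compare addresses in parallel rather than search a list.

```python
from collections import deque


class LoopBuffer:
    """Keeps the last N fetched instructions so a short backward branch can hit in it."""

    def __init__(self, capacity: int = 8):
        # A bounded deque drops the oldest entry automatically when full.
        self.entries = deque(maxlen=capacity)

    def record(self, address: int, instruction: str) -> None:
        self.entries.append((address, instruction))

    def lookup(self, target: int):
        """Return the buffered instructions from target onward, or None on a miss."""
        addresses = [addr for addr, _ in self.entries]
        if target not in addresses:
            return None
        start = addresses.index(target)
        return [instr for _, instr in list(self.entries)[start:]]


buf = LoopBuffer(capacity=4)
program = ["addiu", "lw", "addu", "sw", "bne"]
for i, instr in enumerate(program):
    buf.record(0x100 + 4 * i, instr)

hit = buf.lookup(0x104)   # branch target is still in the buffer
miss = buf.lookup(0x100)  # oldest instruction has already been evicted
```

On a hit, fetch can be served from the buffer instead of the instruction cache; on a miss, the branch is handled by the normal fetch path.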
When teaching pipelines, such complicating factors are usually avoided initially.