MIPS forwarding implementation (tough)

MIPS forwarding implementation (tough) - mips

I think I understand the first part
(i). I at least have answers for this. I am not sure about where this implementation would fail though, for part ii? Part ii has me completely stumped. Does anyone know situations where this would fail?
If you want to shine some light on part iii you would be my entire classes hero. Were all stumped there. Thanks for any input.
Tim FlimFlam, the infamous architect of the MN-4363 processor, is struggling with a pipelined implementation of the basic MIPS ISA.
(i) To implement forwarding, Tim connected the output of logic from EX and MEM stages (these logic outputs represent inputs to EXMEM and MEMWB latches, respectively) to the input of IDEX register. He claims that he will be able to cover any dependency in this manner.
• Would this implementation work?
• Would he need to insert any muxes? Explain for
1. the producer instruction is a load.
2. the producer instruction is of R-type. 3. the consumer instruction is of R-type. 4. the consumer instruction is a branch. 5. the consumer instruction is a store.
(ii) Tim claims that forwarding to EX stage only suffices to cover all dependencies.
• Provide two examples where his implementation would fail.
• Would “fail” in this case correspond to breaking correctness constraints?
(iii) Tim tries to identify the minimum amount of information to be transferred acros pipeline stages. Considering R-type, data transfer, and branch instructions, explain how wide each pipeline register should be, demarcating different fields per latch.

Not sure if this is late, but the answer rests in "all dependencies" in part 2. Dependencies/hazards are of multiple types, viz, control, data. Some data hazards can be fixed by forwarding (from the MEM && WB stages to execute stage. Other data hazards like LOAD dependency is not possible to fix by forwarding. To see why this happens, note that a LOAD instruction in the MEM stage will have the output ready from the memory only in the end of that clock cycle. In that same clock cycle, any intstruction in the execute stage which requires the value of the LOAD instruction will get the incorrect value. In such a scenario, at any instant of time within the clock cycle say beginning, the alu is beginning to execute while the memory is 'beginning' to fetch the data. At the end of the cycle, while the memory has finished fetching the data, the alu has also finished computing with the wrong values. To prevent hazards, you need alu to be beginning computing while the data memory has finished fetching (i.e the alu must stall for 1 cycle or you must have a nop between LOAD and ALU instrcution. Hope this helps!

Related

What does the TargetWrite/IorD Control Line do on a multicycle MIPS processer

We learned all the main details about control lines and the general functionality of the MIPS chip in single cycle and also with pipelining.
But, in multicycle the control lines aren't identical in addition to other changes.
Specifically what does the TargetWrite (ALUout) and IorD control lines actually modify?
Based on my analysis, TW seems to modify where the PC points to depending on the bits it receives (for Jump, Branch, or standard moving to the next line)... Am I missing something?
Also what exactly does the IorD line do?
I looked at both course textbooks: See Mips Run and the Computer Architecture: A Quantitative Approach by Patterson and Hennessy which don't seem to mention these lines...

First, let's note that this block diagram does not have separate instruction memory and data memory.  That means that it either has a unified cache or goes directly to memory.  Most other block diagrams for MIPS will have separate dedicated Instruction Memory (cache) and Data memory (cache).  The advantage of this is that the processor can read instructions and read/write data in parallel.  In the a simple version of a multicycle processor, there is likely no need to read instructions and data in parallel, so a unified cache simplifies the hardware.
So, what IorD is doing is selecting the source for the address provided to the Memory — as to whether it is doing a fetch cycle for an instruction, or a read/write from/to data.
When IorD=0 then the PC provides the address from which to read (i.e. instruction fetch), and, when IorD=1 then the ALU provides the address to read/write data from.  For data operations, the ALU is computing a base + displacement addressing mode: Reg[rs] + SignExt32(imm16) as the effective address to use for the data read or write operation.
Further, let's note that this block diagram does not contain a separate adder for incrementing the PC by 4, whereas most other block diagrams do.  Lookup any of the first few MIPS single cycle datapath images, and you'll see the dedicated adder for that PC increment.  Using a dedicated adder allows the PC to be incremented in parallel with operations done by the ALU, whereas omitting that dedicated adder means that the main ALU must perform the increment of the PC.  However, this probably saves transistors in a simple version of a multicycle implementation where the ALU is not in use every cycle, and so can be used otherwise.
Since Target has a control TargetWrite, we might presume this is an internal register that might be useful in buffering the intended branch target address, for example, if the branch target is computed in one cycle, and finally used in another.
(I thought this could be about buffering for branch delay slot implementation (since those branches are delayed one instruction), but were that the case, the J-Type instructions would have also gone through Target, and they don't.)
So, it looks to me like the machinery there for this multicycle processor is to handle the branch instructions, say beq, which has to:
compute the next sequential PC address from PC + 4
compute the branch target address from (PC+4) + SignExt32(imm32)
compute the branch condition (does Reg[rs] == Reg[rt] ?)
But what order would they be computed?  It is clear from control signals in state 0 is that: PC+4 is computed first, and written back to the PC, for all instructions (i.e. for branches, whether the branch is taken or not).
It seems to me that in a next cycle, (PC+4) + SignExt32(imm16) is computed (by reusing the prior PC+4 which is now in the PC register — this result is stored in Target to buffer that value since it doesn't yet know if the branch is taken or not.  In a next cycle, contents of rs and rt are compared for equality and if equal, the branch should be taken, so PCSource=1, PCWrite=1 selects the Target from the buffer to update the PC, and if not taken, since the PC already has been updated to PC+4, that PC+4 stands (PCWrite=0, PCSource=don't care) for the start of the next instruction.  In either case the next instruction runs with what address the PC holds.
Alternately, since the processor is multicycle, the order of computation could be: compute PC+4 and store into the PC.  Compute the branch condition, and decide what kind of cycle to run next, namely, for the not-taken condition, go right to the next instruction fetch cycle (with PC+4 in the PC), or, for taken branch condition, compute (PC+4) + SignExt32(imm16) and put that into the PC, and then go on to the next instruction fetch cycle.
This alternative approach would require dynamic alteration of the cycles/state for branches, so would complicate the multicycle state machine somewhat and would also not require buffering of a branch Target — so I think it is more likely the former rather than this alternative.

implementing jump register (jr, sll, slti) in multicycle

I've been asked to sketch a multicycle datapath and control unit for the instructions (js, sll, slti) all together. and draw the main controller FSM for these 3 instructions.
I'm struggle with it, I know how to make it with single cycle datapath but not for multicycle.
please, help

If you know the single cycle datapath, and want to take that to multi cycle, generally speaking we subdivide the single cycle into stages.
As only one stage will execute at a time, a relatively simple state machine controls what stage to execute currently/next to activate the appropriate stage, and deactivate the others.
The stages would be similar to the stages of a pipelined processor, namely those corresponding to Instruction Fetch, Decode, Execute, Memory, and Writeback.
In a pipelined machine, all the instructions need to go through all the stages, however, in a multicycle machine, certain instructions can skip certain states, and this should be manifest in the state machine. For example, while all instructions share Instruction Fetch and Decode, only loads and stores interact with the Data Memory, so any other instruction can skip the Mem stage.
The simplest possible state machine simply chooses all the states in succession, for every instruction, without ever consulting the instruction. This would fail to take advantage of the ability to skip stages for certain instructions that don't need those stages, and of course, you're being asked to do more than the simplest state machine.
A better state machine might start with Instruction Fetch as the active state, go to Decode state next, and by then it can use knowledge of the actual instruction to decide which of the remaining stages to skip for that given instruction.
You can imagine that the Control logic fetches some bits that inform the state machine of which stages can be skipped for any given opcode. There are only three states/stages left, so all that is needed from Control is a boolean for each.
For more information, especially on the stages and what they do, I suggest searching on Pipelined/Pipelining MIPS, as I believe there is more material on pipelined processors than on the multicycle design. You won't be concerned with pipeline hazzards, but the break down of the single cycle design into the stages might help.

What's the role of EX stage for branching in Pipelined MIPS w Forwarding?

Consider the following Pipelined Processor structure:
Notice that the condition test for branching (the = circuit), as well as the target address calculation for the next instruction in case of branch taken are executed in the ID phase - as a way to save on stalls/flushes (as opposed to doing all that in the EX phase and forwarding the results in the MEM phase of the given branch instruction).
Since all the work gets done in Instruction Decode stage, why bother waiting for the given branching instruction to reach the EX stage? Does the EX stage ALU unit have a role in this, somehow?
Thank you in advance.

When the beq presents a control hazard, the pipelined processor does not know in advance what instruction to fetch next,because the branch decision has not been made by the time the next instruction is fetched.
Because the decision is made in the MEM-stage we need to stall the pipeline for three cycles at every branch which of course effect the system performance.
another way is to predict whether the branch will be taken and begin to executing instructions based on the prediction.Once the branch decision is made and is available, the processor can throw out(flushes) the instructions if the prediction was wrong (this called branch misprediction penalty) which also effect the performance.
To reduce the branch misprediction penalty one could make the branch decision made earlier.
Making the decision simply requires comparing two registers. using a dedicated equality comparator is faster than performing a subtraction and zero detection. If the comparator is fast enough, it could be moved back into the Decode stage, so that the operands are read from the register file and compared to determine the next PC by the end of the Decode stage.
Unfortunately the early branch decision hardware introduce a new RAW data hazard.

Since all the work gets done in Instruction Decode stage, why bother waiting for the given branching instruction to reach the EX stage? Does the EX stage ALU unit have a role in this, somehow?
The branch instruction is decoded and resolved in the Decode stage only and we do not wait for it to go to the EX stage.
Like you pointed out in the question, both the branch result and the target address calculation is done in the DEC stage. The hardware takes care of the RAW hazards by forwarding the required data from the correct stages (notice the little mux'es right after the RegFile is read). As a result the branch equality check sees the correct operands and the result drives the PCSrcD signal. This signal further decides the output of the first Mux in the diagram (which essentially decides between PC+4 or Branch Target. Hence it becomes safe and quick to do this in the DEC stage itself.
Also, none of the branch instruction related signals (PCSrcD, BranchD, PCBranchD) make it to the EX stage. If you see the inputs to the ISS/EX register, it doesn't take in any of the above mentioned signals. Hence, the information isn't passed the to EX stage and the branch is completely resolved and retired at the end of the DEC stage itself.

Is that true if we can always fill the delay slot there is no need for branch prediction?

I'm looking at the five stages MIPS pipeline (ID,IF,EXE,MEM,WB) in H&P 3rd ed. and it seems to me that the branch decision is resolved at the stage of ID so that while the branch instruction reaches its EXE stage, the second instruction after the branch can be executed correctly (can be fetched). But this leaves us the problem of possibly still wasting the 1st instruction soon after the branch instruction.
I also encountered the concept of branch delay slot, which means you want to fill the 1st instruction soon after the branch with something useful as well as "harmless" that whether the branch is taken or not the instruction is executed as desired and the 1st instruction after the branch is not wasted.
My question is, first of all, is my above understanding correct? If it's correct, then the problem comes from the concept of branch prediction, which seems to be trying to fill the first instruction with instruction from the predicted place that the program is going to. But if we can always find some instruction to fill the branch delay slot, we would not need the feature of branch prediction, right?

For the classic MIPS (R2000) pipeline, the branch delay slot makes branch prediction useless as you perceive. (Technically, a design could combine a predictor/indicator of whether the delay slot instruction is a nop with a branch predictor. This would allow the nop to be skipped, modestly improving performance on a correct branch prediction.)
However, processor pipelines are often long and wide enough (and branch condition evaluation sufficiently delayed) that a single delay slot is not sufficient to fill the delay between when the post-branch instruction address is needed and the branch direction and target are known.
For example, a follow-on processor, the MIPS R4000, significantly lengthened the pipeline and as a result could not determine the location of the post-branch instruction early enough. The designers chose to use a simple static predict not-taken strategy.
If one did not care about binary compatibility, one could add more delay slots. However, finding useful instructions to fill such slots increases in difficulty as the number of slots increases. For certain loop-rich code, regularly filling two delay slots might be practical, and I think at least one DSP had two delay slots.
Branch prediction can also be used to decouple fetch from execution so that even if the condition cannot be evaluated (e.g., depending on the result of a high latency operation such as a data cache miss or a division), fetch can continue. Such decoupling could be used to generate instruction cache misses early (hiding some of their latency) and to reduce the impact of variable throughput at different stages (so an earlier stage can continue operating with maximum throughput when a later stage stalls or has reduced throughput and the buffered instructions can then hide later stalls or reduced throughput in the earlier stage).

The fact is that complier may not always find a instruction to fill the delay slot.
What is more, instruction is highly predictable.
Before IF stage, u even not know whether it is branch instruction.( u have to fetch it from instruction memory)

within a mips core like that with zero wait state randomly accessed ram sure. but depending on how the fetching is implemented and caching behind that, you may still want/need the concept of branch prediction to start those fetches earlier. the pipeline is just a small part of a bigger system. system busses are usually not single cycle here is my address I want my data by the end of this cycle, there are address busses and data busses and tags that cross them so you can have multiple transactions in flight at the same time, like a pipeline trying to optimize the bandwidth of the data bus knowing the peripherals and memory on the far side are too slow for that bus.
prediction "could" be used to assist these other features in getting instructions into the pipe faster or more efficiently.
from an academic sense though, the idea of the slot is to give the pipe a cycle to switch gears along another execution path. It only actually saves you if the incoming end of the pipe can be fed any random thing it wants every clock cycle. which isnt real world.
another academic solution is the arm one of conditional execution on every instruction, you can construction execution sequences to keep the pipe full and not have to flush or stall... again so long as what feeds the pipe can keep up...arm dumped the conditional instruction idea in the new 64 bit instruction set. some/newer mips you can disable the branch shadow/delay slot.

Instruction Execution in MIPS

This is an abstract view of the implementation of the MIPS subset showing the
major functional units and the major connections between them
Why we need to add the result of (PC+4) with instruction address?
I know that the PC (Program Counter) is a register in a computer processor that contains the address (location) of the instruction being executed at the current time, but i didn't understand why we add the second adder in this picture?

Some of the operations that can be performed by the CPU are 'jumps'.
If your operation is a Jump, from the second block you get the address of the new instructions OR the lenght of the jump you have to do.

It's not the instruction address, the output of the instruction memory is an instruction itself.
They've obviously hidden most of the components (there's NO control circuitry). What they probably meant is the data path for branches, though they really should have put at least the link with the ALU output in there. Even so it would be better to explicitly decode the instruction, sign extend and shift left. So it's really inaccurate, but I don't see what else they could mean.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008