Pipelining the datapath essentially partitions its resources into stages. But does pipelining the control mean that each resource at each piped stage gets its own control signals?
For instance, most classic RISC architectures have a 5-stage pipeline; does the MEM pipe stage get a separate control signal for load or store?
Are there practical examples of control pipelining?
In a classic 5-stage pipeline, each stage of the pipe has inputs that come from the previous stage (except the first one, of course), and each stage of the pipe has outputs that go to the next stage (except the last one, of course). It stands to reason that these inputs & outputs comprise both data and control signals.
The EX stage needs to know what ALU operation to perform (control: ALUOp) and the ALU input operands (data).
The MEM stage needs to know whether to read memory (control: MemRead) or to write memory (control: MemWrite) (plus size & type for extension, usually glossed over) and where to read (data: Address) and what to write (data: Write Data).
The WB stage needs to know whether to write a register (control: RegWrite) and what register to write (data: Write Register) and what value to write to the register (data: Write Data).
In the single-cycle processor, all these control signals are generated by lookup (using the opcode) in the ID stage. When the processor is pipelined, either those signals are forwarded from one stage to the next, or else each stage would have to repeat the lookup using the opcode (the opcode would then need to be forwarded from stage to stage so that each stage could repeat the lookup, though it is possible the opcode is forwarded anyway, perhaps for exceptions). I believe that repeating the lookup in each stage would incur time and hardware costs compared with forwarding the control signals, especially for WB, which is supposed to execute in the first half of a cycle.
Because the WB stage needs to know whether to write a register, that information (control: RegWrite) must be passed to it from the MEM stage, which gets it from the EX stage, which gets it from the ID stage, where it is generated by lookup of the opcode. EX & MEM don't use the RegWrite control signal, but must accept it as an input so as to pass it through as output to the next stage.
The same is true for the control signals needed by MEM: MemRead and MemWrite, which are generated in ID and passed from EX to MEM (not used in EX). MEM need not pass them further, since WB doesn't use those signals either.
If you look in chapter 4 of Computer Organization and Design RISC-V edition, towards the end of the chapter (Fig 4.44 in the 1st edition), it shows the control signals output from one stage passing through stage pipeline registers and into the next intermediate stage. For example, Instruction [30, 14-12] is fed into ID/EX and then read by ALU Control in the EX stage. That is an example of pipelining a control signal.
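To make the pass-through concrete, here is a minimal Python sketch (the struct and decode table are my own illustration, not from the book) of control being decoded once in ID and then simply carried along through the pipeline latches:

```python
# Sketch: control bits are decoded once in ID, then carried stage to stage.
from dataclasses import dataclass

@dataclass
class Ctrl:
    ALUOp: int = 0          # consumed in EX
    MemRead: bool = False   # consumed in MEM
    MemWrite: bool = False  # consumed in MEM
    RegWrite: bool = False  # consumed in WB

def decode(opcode: str) -> Ctrl:
    """One-time lookup in ID; this tiny table is illustrative only."""
    table = {
        "lw":  Ctrl(ALUOp=0, MemRead=True, RegWrite=True),
        "sw":  Ctrl(ALUOp=0, MemWrite=True),
        "add": Ctrl(ALUOp=2, RegWrite=True),
        "beq": Ctrl(ALUOp=1),
    }
    return table[opcode]

# The latches just hand the struct forward; EX and MEM never *use* RegWrite,
# they only pass it on so that WB eventually sees it.
id_ex = decode("lw")
ex_mem = id_ex    # EX ignores MemRead/MemWrite/RegWrite, passes them on
mem_wb = ex_mem   # MEM uses MemRead/MemWrite, passes RegWrite on
assert mem_wb.RegWrite  # WB finally consumes RegWrite
```

The point is that EX and MEM treat RegWrite as opaque baggage riding in their pipeline registers; only WB actually reads it.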
According to the book Computer Organization and Design by Patterson and Hennessy (5th edition), page 304, the RegDst control signal is used in the execution stage of the datapath.
It acts as the selection bit for the mux that chooses the destination register from one of the two register addresses passed in the ID/EX pipeline register.
Why are the RegDst control signal and the associated mux not put in the instruction decode stage (the stage immediately before the execution stage)? That way, we could send only the selected destination register address, from bits [20-16] or [15-11] of the instruction, through the ID/EX pipeline register, instead of sending both of them.
We learned all the main details about control lines and the general functionality of the MIPS chip in single cycle and also with pipelining.
But in the multicycle design, the control lines aren't identical, in addition to other changes.
Specifically, what do the TargetWrite (ALUOut) and IorD control lines actually modify?
Based on my analysis, TargetWrite seems to modify where the PC points, depending on the bits it receives (for a jump, a branch, or simply moving to the next line)... Am I missing something?
Also, what exactly does the IorD line do?
I looked at both course textbooks, See MIPS Run and Computer Architecture: A Quantitative Approach by Hennessy and Patterson, and neither seems to mention these lines...
First, let's note that this block diagram does not have separate instruction memory and data memory. That means that it either has a unified cache or goes directly to memory. Most other block diagrams for MIPS will have separate dedicated Instruction Memory (cache) and Data Memory (cache). The advantage of separate memories is that the processor can read instructions and read/write data in parallel. In a simple version of a multicycle processor, there is likely no need to read instructions and data in parallel, so a unified cache simplifies the hardware.
So, what IorD is doing is selecting the source for the address provided to the Memory — as to whether it is doing a fetch cycle for an instruction, or a read/write from/to data.
When IorD=0 then the PC provides the address from which to read (i.e. instruction fetch), and, when IorD=1 then the ALU provides the address to read/write data from. For data operations, the ALU is computing a base + displacement addressing mode: Reg[rs] + SignExt32(imm16) as the effective address to use for the data read or write operation.
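As a sketch (the signal names mirror the discussion above; the Python functions are my own), the IorD mux and the base + displacement effective-address computation behave like:

```python
# Sketch of the IorD address mux and the ALU's effective-address computation.
def mem_address(IorD: int, pc: int, alu_out: int) -> int:
    # IorD=0: instruction fetch, address comes from the PC.
    # IorD=1: data access, address is the ALU's base+displacement result.
    return pc if IorD == 0 else alu_out

def effective_address(reg_rs: int, imm16: int) -> int:
    # Sign-extend the 16-bit immediate, then add the base register.
    if imm16 & 0x8000:
        imm16 -= 0x10000
    return (reg_rs + imm16) & 0xFFFFFFFF

# Fetch cycle: the PC drives the unified memory's address port.
assert mem_address(0, pc=0x400000, alu_out=0x10008000) == 0x400000
# Data cycle: the ALU result drives it, e.g. lw $t0, -4($s0).
assert mem_address(1, pc=0x400000, alu_out=0x10008000) == 0x10008000
assert effective_address(0x10008000, 0xFFFC) == 0x10007FFC
```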
Further, let's note that this block diagram does not contain a separate adder for incrementing the PC by 4, whereas most other block diagrams do. Look up any of the first few MIPS single-cycle datapath images, and you'll see the dedicated adder for that PC increment. Using a dedicated adder allows the PC to be incremented in parallel with operations done by the ALU, whereas omitting that dedicated adder means that the main ALU must perform the increment of the PC. However, this probably saves transistors in a simple version of a multicycle implementation, where the ALU is not in use every cycle and so can be used for the increment.
Since Target has a control TargetWrite, we might presume this is an internal register that might be useful in buffering the intended branch target address, for example, if the branch target is computed in one cycle, and finally used in another.
(I thought this could be about buffering for branch delay slot implementation (since those branches are delayed one instruction), but were that the case, the J-Type instructions would have also gone through Target, and they don't.)
So, it looks to me like the machinery there for this multicycle processor is to handle the branch instructions, say beq, which has to:
compute the next sequential PC address from PC + 4
compute the branch target address from (PC+4) + SignExt32(imm16)
compute the branch condition (does Reg[rs] == Reg[rt] ?)
But in what order would they be computed? It is clear from the control signals in state 0 that PC+4 is computed first and written back to the PC, for all instructions (i.e. for branches, whether the branch is taken or not).
It seems to me that in a subsequent cycle, (PC+4) + SignExt32(imm16) is computed (reusing the prior PC+4, which is now in the PC register); this result is stored in Target to buffer the value, since the processor doesn't yet know whether the branch is taken. In a later cycle, the contents of rs and rt are compared for equality. If they are equal, the branch should be taken, so PCSource=1, PCWrite=1 selects the buffered Target to update the PC; if not taken, the PC has already been updated to PC+4, so that PC+4 stands (PCWrite=0, PCSource=don't care) as the start of the next instruction. In either case the next instruction runs with whatever address the PC holds.
Alternately, since the processor is multicycle, the order of computation could be: compute PC+4 and store into the PC. Compute the branch condition, and decide what kind of cycle to run next, namely, for the not-taken condition, go right to the next instruction fetch cycle (with PC+4 in the PC), or, for taken branch condition, compute (PC+4) + SignExt32(imm16) and put that into the PC, and then go on to the next instruction fetch cycle.
This alternative approach would require dynamically altering the cycles/state for branches, so it would complicate the multicycle state machine somewhat, though it would not require buffering a branch Target; so I think the former is more likely than this alternative.
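Here is a toy Python rendering of the first (buffered-Target) ordering. The cycle sequence follows the analysis above; note that real MIPS also shifts the branch offset left by 2, which the text's SignExt32(imm16) formula omits, so this sketch omits it too:

```python
# Toy model of the buffered-Target beq sequence (state ordering as argued above).
def sign_ext16(x: int) -> int:
    return x - 0x10000 if x & 0x8000 else x

def run_beq(pc: int, rs_val: int, rt_val: int, imm16: int) -> int:
    # Cycle 1 (fetch): ALU computes PC+4, written back to PC for all instructions.
    pc = (pc + 4) & 0xFFFFFFFF
    # Cycle 2: ALU computes the branch target, buffered in Target
    # (the branch decision has not been made yet, hence the buffer).
    target = (pc + sign_ext16(imm16)) & 0xFFFFFFFF
    # Cycle 3: ALU compares rs and rt; taken -> PCSource=1, PCWrite=1.
    if rs_val == rt_val:
        pc = target   # PC <- Target from the buffer
    # Not taken: PCWrite=0, the PC+4 from cycle 1 simply stands.
    return pc

assert run_beq(0x1000, 5, 5, 8) == 0x100C   # taken: (0x1000+4)+8
assert run_beq(0x1000, 5, 6, 8) == 0x1004   # not taken: PC+4 stands
```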
In a PIPELINED processor, are MEM (memory) and IF (instruction fetch) the same hardware element?
If they are the same memory, then 2 instructions can't load or store in the same clock cycle, am I right?
MIPS processor diagram
Are MEM (memory) and IF (instruction fetch) the same hardware element?
No, they are not, because a) otherwise they would not be drawn as separate blocks, and b) code loads (i.e. fetches) are not the same as data loads. A code fetch obtains the instruction, which says what to do with data (the function), while loads/stores obtain the arguments of that function.
If they are the same memory, then 2 instructions can't load or store in the same clock cycle, am I right?
Both loads and stores are done in the MEM stage, not IF. Because there is only one MEM block on the diagram, at most one data-memory operation can be done per clock. This does not mean that the IF stage is necessarily blocked by MEM: whether the instruction/data memories are separate, or whether there is an instruction cache, would determine that, but it is outside the scope of the diagram you showed.
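A toy way to see the potential structural hazard with a single memory port (classic 5-stage timing and one-instruction-per-cycle issue assumed; all names here are mine):

```python
# Count cycles where a lw/sw in MEM collides with a fetch in IF, assuming one
# shared memory port (unified memory, no caches -- an assumption for this toy).
def port_conflicts(program):
    """program: list of booleans, True if that instruction accesses data
    memory (lw/sw). In a 5-stage pipeline, instruction i is in MEM in the
    same cycle that instruction i+3 is in IF."""
    conflicts = 0
    for i, is_mem_op in enumerate(program):
        fetching = i + 3 < len(program)   # someone is in IF that cycle
        if is_mem_op and fetching:
            conflicts += 1  # both want the single memory port this cycle
    return conflicts

# lw, sw, add, add, add, add: each memory op overlaps a later instruction's fetch.
assert port_conflicts([True, True, False, False, False, False]) == 2
```

With separate instruction and data memories (or split caches), those cycles are not conflicts at all, which is exactly why the usual diagrams draw two memory blocks.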
Consider the following Pipelined Processor structure:
Notice that the condition test for branching (the = circuit), as well as the target address calculation for the next instruction in case of branch taken are executed in the ID phase - as a way to save on stalls/flushes (as opposed to doing all that in the EX phase and forwarding the results in the MEM phase of the given branch instruction).
Since all the work gets done in Instruction Decode stage, why bother waiting for the given branching instruction to reach the EX stage? Does the EX stage ALU unit have a role in this, somehow?
Thank you in advance.
When a beq presents a control hazard, the pipelined processor does not know in advance which instruction to fetch next, because the branch decision has not been made by the time the next instruction is fetched.
Because the decision is made in the MEM stage, we need to stall the pipeline for three cycles at every branch, which of course affects system performance.
Another way is to predict whether the branch will be taken and begin executing instructions based on the prediction. Once the branch decision is made and available, the processor can throw out (flush) the instructions if the prediction was wrong (this is called the branch misprediction penalty), which also affects performance.
To reduce the branch misprediction penalty, one could make the branch decision earlier.
Making the decision simply requires comparing two registers. Using a dedicated equality comparator is faster than performing a subtraction and zero detection. If the comparator is fast enough, it can be moved back into the Decode stage, so that the operands are read from the register file and compared to determine the next PC by the end of the Decode stage.
Unfortunately, the early branch decision hardware introduces a new RAW data hazard.
Since all the work gets done in Instruction Decode stage, why bother waiting for the given branching instruction to reach the EX stage? Does the EX stage ALU unit have a role in this, somehow?
The branch instruction is decoded and resolved in the Decode stage only and we do not wait for it to go to the EX stage.
Like you pointed out in the question, both the branch comparison and the target address calculation are done in the DEC stage. The hardware takes care of RAW hazards by forwarding the required data from the correct stages (notice the little muxes right after the RegFile is read). As a result, the branch equality check sees the correct operands, and the result drives the PCSrcD signal. This signal in turn selects the output of the first mux in the diagram, which decides between PC+4 and the branch target. Hence it is safe and quick to do this in the DEC stage itself.
Also, none of the branch-related signals (PCSrcD, BranchD, PCBranchD) make it to the EX stage. If you look at the inputs to the ID/EX register, it doesn't take in any of the above-mentioned signals. Hence, the information isn't passed to the EX stage, and the branch is completely resolved and retired at the end of the DEC stage itself.
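A small Python sketch of that ID-stage path (PCSrcD matches the signal named above; the single forwarding source is my simplification of the muxes after the RegFile):

```python
# Sketch: early branch resolution in Decode, with forwarding into the comparator.
def branch_in_decode(regfile, rs, rt, fwd=None):
    """Compare operands in ID with a dedicated equality comparator (no ALU).
    fwd optionally supplies (reg, value) forwarded from a later stage,
    covering the RAW hazard the early decision introduces."""
    a, b = regfile[rs], regfile[rt]
    if fwd is not None:
        reg, val = fwd
        if reg == rs:
            a = val
        if reg == rt:
            b = val
    PCSrcD = (a == b)   # drives the PC mux: branch target vs. PC+4
    return PCSrcD

regs = {8: 1, 9: 2}
assert branch_in_decode(regs, 8, 9) is False
# An older instruction further down the pipe is about to write 2 into $8;
# the forwarding mux lets the comparator see that value in time.
assert branch_in_decode(regs, 8, 9, fwd=(8, 2)) is True
```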
I think I understand the first part,
(i). I at least have answers for it. I am not sure where this implementation would fail, though, for part (ii)? Part (ii) has me completely stumped. Does anyone know situations where this would fail?
If you can shed some light on part (iii), you would be my entire class's hero. We're all stumped there. Thanks for any input.
Tim FlimFlam, the infamous architect of the MN-4363 processor, is struggling with a pipelined implementation of the basic MIPS ISA.
(i) To implement forwarding, Tim connected the output of logic from EX and MEM stages (these logic outputs represent inputs to EXMEM and MEMWB latches, respectively) to the input of IDEX register. He claims that he will be able to cover any dependency in this manner.
• Would this implementation work?
• Would he need to insert any muxes? Explain for
1. the producer instruction is a load.
2. the producer instruction is of R-type.
3. the consumer instruction is of R-type.
4. the consumer instruction is a branch.
5. the consumer instruction is a store.
(ii) Tim claims that forwarding to EX stage only suffices to cover all dependencies.
• Provide two examples where his implementation would fail.
• Would “fail” in this case correspond to breaking correctness constraints?
(iii) Tim tries to identify the minimum amount of information to be transferred across pipeline stages. Considering R-type, data transfer, and branch instructions, explain how wide each pipeline register should be, demarcating the different fields per latch.
Not sure if this is late, but the answer rests in "all dependencies" in part (ii). Dependencies/hazards come in multiple types, viz. control and data. Some data hazards can be fixed by forwarding (from the MEM and WB stages to the execute stage). Other data hazards, like a load dependency, cannot be fixed by forwarding alone.

To see why, note that a LOAD instruction in the MEM stage will have its output ready from the memory only at the end of that clock cycle. In that same clock cycle, any instruction in the execute stage that requires the value produced by the LOAD will get an incorrect value. At the beginning of the cycle, the ALU is beginning to execute while the memory is beginning to fetch the data; at the end of the cycle, the memory has finished fetching the data, but the ALU has also finished computing, with the wrong values. To prevent this hazard, the ALU must begin computing only after the data memory has finished fetching (i.e. the ALU must stall for 1 cycle, or there must be a nop between the LOAD and the ALU instruction). Hope this helps!
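A minimal sketch of that load-use check (the tuple encoding of instructions is invented purely for illustration):

```python
# Sketch: detect the one data hazard that forwarding into EX cannot cover,
# a load immediately followed by a consumer of the loaded register.
def needs_load_use_stall(producer, consumer):
    """producer/consumer: (op, dest_reg, src_regs). Returns True when a
    one-cycle stall (or a nop) is required between the two instructions."""
    p_op, p_dest, _ = producer
    _, _, c_srcs = consumer
    # MEM->EX forwarding works for ALU results, which exist at end-of-EX.
    # A lw's data only exists at end-of-MEM, the very cycle the dependent
    # instruction's EX stage already needs it, so forwarding is too late.
    return p_op == "lw" and p_dest in c_srcs

# lw $8, ... followed by add $10, $8, $11: must stall one cycle.
assert needs_load_use_stall(("lw", 8, [9]), ("add", 10, [8, 11])) is True
# add producing $8 can be forwarded from end-of-EX: no stall needed.
assert needs_load_use_stall(("add", 8, [9, 9]), ("add", 10, [8, 11])) is False
```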