Is it possible in MIPS for certain steps of an instruction to come before those of its predecessor in a pipelined structure? - mips

This is a question about computer architecture, specifically the flow of MIPS instructions through a pipeline, and I hope somebody can help. Some aspects of it are unclear to me. I do not currently have enough reputation to post an image.
1. Does an S (stall) mean that no following instruction can use the time slot taken by the stall?
2. Can two consecutive instructions both have X (execute) in the same time slot?
3. Is it possible for the M (memory access) and W (write back) of an instruction to come before those of its predecessor in a pipelined structure?
4. In the case of a loop where the last instruction is a repetition of the first instruction, why are there two F's (fetch) in the last instruction?

For issue 1, in a simple, scalar pipeline, a stall introduces a pipeline bubble which cannot be "popped". To allow an instruction later in program order to fill a pipeline bubble, that instruction would have to go past the stalled instruction. Supporting such reordering of instructions increases the complexity of the pipeline, which tends to increase design and production costs and to increase either pipeline depth or cycle time (as well as use more energy per active cycle [out-of-order execution can be more energy efficient in total even when more energy is used when active]). The mechanisms needed to support such reordering also increase the complexity of explaining pipelines.
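To make the bubble concrete, here is a minimal Python sketch of an in-order scalar 5-stage pipeline. It is not taken from the question's diagram: the `schedule` and `render` helpers and the per-instruction stall counts are hypothetical. The only rule modelled is that a stage becomes free when the previous instruction moves on to the next stage.

```python
STAGES = ["F", "D", "X", "M", "W"]

def schedule(stalls):
    """Return, per instruction, the cycle in which it enters each stage.

    stalls[i] is the number of extra cycles instruction i must wait in D
    before it may enter X (a hypothetical input for illustration).
    """
    table = []
    for extra in stalls:
        enter = []
        for s in range(5):
            ready = 0 if s == 0 else enter[s - 1] + 1
            if s == 2:
                ready += extra  # the stall holds the instruction in D
            if table:
                prev = table[-1]
                # Stage s is free once the previous instruction has moved to s+1.
                free = prev[s + 1] if s < 4 else prev[s] + 1
                ready = max(ready, free)
            enter.append(ready)
        table.append(enter)
    return table

def render(table):
    """Draw one text row per instruction; 'S' marks a cycle spent waiting."""
    rows = []
    for enter in table:
        cells = ["."] * (enter[-1] + 1)
        for s, start in enumerate(enter):
            cells[start] = STAGES[s]
            end = enter[s + 1] if s < 4 else start + 1
            for cycle in range(start + 1, end):
                cells[cycle] = "S"
        rows.append("".join(cells))
    return rows
```

With `render(schedule([0, 1, 0]))` the rows are `FDXMW`, `.FDSXMW` and `..FSDXMW`: the third instruction has no stall of its own, yet it finishes one cycle late because it cannot pass the stalled instruction ahead of it.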
For issue 2, with a more complex pipeline it is possible to begin execution of more than one instruction at the same time. Such processors are called superscalar. With in-order execution, only instructions in a consecutive sequence (in program order) can begin execution at the same time, and this requires that the instructions do not have data dependencies and that sufficient hardware resources are available to execute the instructions and handle their results. For an in-order microarchitecture, the width of the earlier pipeline stages is typically the same as the width of later pipeline stages, though buffering would allow multiple instructions to accumulate behind a stall.
(Even at only two-wide execution, there are usually additional restrictions on what kinds of instructions can be executed in parallel. E.g., one execution port might not handle memory accesses or branches while the other execution port might handle those instructions but not shifts or multiplies. Having two copies of hardware for relatively expensive operations [like shifts and multiplies] increases size and limiting the data paths for memory accesses and branches can simplify design and potentially reduce delay.)
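As a rough illustration of the pairing rules described above, the following Python sketch checks whether two consecutive instructions could begin execution together on a two-wide in-order machine. The port assignment, the opcode sets and the `Instr` record are assumptions made up for this example, not a description of any particular processor.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")

# Assumed port restrictions: only port 0 handles memory accesses and branches,
# only port 1 handles shifts and multiplies; simple ALU ops can use either port.
PORT0_ONLY = {"lw", "sw", "beq"}
PORT1_ONLY = {"sll", "mul"}

def can_dual_issue(first, second):
    """True if `second` may start execution in the same cycle as `first`."""
    # Data dependency: the second instruction reads the first one's result.
    if first.dest is not None and first.dest in second.srcs:
        return False
    # Both write the same register: keep them in separate cycles for simplicity.
    if first.dest is not None and first.dest == second.dest:
        return False
    # Structural hazard: both need the single port that handles their kind of op.
    for restricted in (PORT0_ONLY, PORT1_ONLY):
        if first.op in restricted and second.op in restricted:
            return False
    return True
```

For example, a load followed by an independent shift can pair under these assumptions, while two loads, or an add followed by an instruction that reads its result, cannot.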
For issue 3, out-of-order execution allows the reordering of instructions, so an instruction later in program order could execute and write back results to the register file before an earlier instruction. With some additional complexity in handling exceptions/interrupts and arbitrating register write port use (or increasing the number of write ports), it is also possible for an in-order processor to write back results out of program order. The Motorola 88110 (from the early 1990s) is an example of a processor which did so. In order to handle exceptions, the 88110 had a history buffer to hold data that is overwritten by instructions that are later in program order than where the exception is. The 88110 had two additional read ports to each of the register files to read the data in the destination registers and write it to the history buffer.
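The history buffer idea can be sketched in a few lines of Python. This is only a toy model of the general technique: the class name, the sequence numbers and the method names are hypothetical, and it says nothing about how the 88110 actually organised its buffer.

```python
class HistoryBufferRegFile:
    """Register file that lets results be written back out of program order."""

    def __init__(self, nregs=8):
        self.regs = [0] * nregs
        self.history = []  # (program-order sequence number, register, old value)

    def write(self, seq, reg, value):
        # Read the old destination value first and save it, then overwrite.
        self.history.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def retire(self, seq):
        # Instructions up to `seq` can no longer be undone; drop their entries.
        self.history = [h for h in self.history if h[0] > seq]

    def rollback(self, faulting_seq):
        # Undo, newest first, every write made by an instruction that is later
        # in program order than the instruction that raised the exception.
        for seq, reg, old in reversed(self.history):
            if seq > faulting_seq:
                self.regs[reg] = old
        self.history = [h for h in self.history if h[0] <= faulting_seq]
```

If instruction 1 is slow and eventually faults, while instructions 2 and 3 have already written their results, `rollback(1)` restores the registers they overwrote so the exception handler sees the state as of instruction 1.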
For issue 4, I am guessing that you mean the case where the body of the loop is composed of only one instruction. For a typical RISC instruction set the branch instruction controlling the loop is a separate instruction from the instruction performing a computation or memory access, so the loop would actually contain two instructions. (Power, formerly PowerPC, could have a one-instruction delay loop using branch on counter which decrements the special counter register, but optimizing instruction fetch for a simple implementation for such peculiar code would be foolish.)
For the simple classic 5-stage pipeline with delayed branches, it does not make sense from a performance perspective to avoid an instruction cache access since the loop branch does not introduce a pipeline bubble even when taken. This means that there is no opportunity to execute more instructions. However, in some microarchitectures where redirecting instruction fetch to a non-sequential address introduces a pipeline bubble (particularly if from instruction fetch taking more than one cycle), providing a small fast-access buffer can improve performance. (Instruction fetch bandwidth limitations could also justify a buffer for performance; a small buffer could provide higher bandwidth than a large cache or an off-chip memory.) In addition, to reduce energy use, the use of a loop buffer makes considerable sense, but one would almost certainly not want to limit the size of the buffer to only two instructions (the branch plus one "body" instruction) because such tiny loops are rare and even increasing the buffer size to eight instructions would only add a modest amount of hardware.
In order to specially handle the case of small bodied loops, such loops must be detected. While the buffer could always be filled with the last N instructions (to avoid the first encounter of the short backward branch not "hitting" in the loop buffer--and such a buffer could also even out variations in instruction fetch which might be caused by crossing cache line boundaries, cache misses, fetch redirection delays, etc.), it would be necessary to check each branch instruction to see if it targeted an instruction within the buffer. (It would even be possible to provide a special storage for the loop branch instruction since storage is only needed for the condition checked, a small index into the loop buffer and an indication of where the branch is, but short loops are probably not sufficiently common for such specialized hardware.) In effect, a loop buffer can be a very small Level 0 instruction cache.
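A loop buffer that always holds the last N fetched instructions, with a check of each branch target against it, might be sketched as follows. The `LoopBuffer` class, its default size and its method names are made up for this example.

```python
from collections import deque

class LoopBuffer:
    """Keeps the last `size` fetched instructions, like a tiny level 0 cache."""

    def __init__(self, size=8):
        self.entries = deque(maxlen=size)  # (address, instruction) pairs

    def record(self, addr, instr):
        # Called on every normal fetch; the oldest entry falls out when full.
        self.entries.append((addr, instr))

    def lookup(self, target):
        """Return the buffered instructions from `target` onward, or None on a miss."""
        entries = list(self.entries)
        # Search from the newest entry backward for the branch target.
        for i in range(len(entries) - 1, -1, -1):
            if entries[i][0] == target:
                return [instr for _, instr in entries[i:]]
        return None
```

A short backward branch whose target is still in the buffer "hits" and can be served without an instruction cache access; a branch to an older address misses and falls back to normal fetch.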
(A branch target instruction cache [BTIC] is a mechanism similar to a loop buffer, but instead of caching instructions only from the target of the most recent loop branch a BTIC caches instructions from the targets of a number of recent branches. BTICs are primarily used to hide instruction fetch latency.)
When teaching pipelines, such complicating factors are usually avoided initially.

Related

What does the TargetWrite/IorD control line do on a multicycle MIPS processor?

We learned all the main details about control lines and the general functionality of the MIPS chip in single cycle and also with pipelining.
But in the multicycle design the control lines are not identical, among other changes.
Specifically, what do the TargetWrite (ALUout) and IorD control lines actually modify?
Based on my analysis, TW seems to modify where the PC points, depending on the bits it receives (for a jump, a branch, or simply moving to the next instruction)... Am I missing something?
Also, what exactly does the IorD line do?
I looked at both course textbooks, See MIPS Run and Computer Architecture: A Quantitative Approach by Hennessy and Patterson, which don't seem to mention these lines...
First, let's note that this block diagram does not have separate instruction memory and data memory.  That means that it either has a unified cache or goes directly to memory.  Most other block diagrams for MIPS will have separate dedicated Instruction Memory (cache) and Data memory (cache).  The advantage of separate memories is that the processor can read instructions and read/write data in parallel.  In a simple version of a multicycle processor, there is likely no need to read instructions and data in parallel, so a unified cache simplifies the hardware.
So, what IorD is doing is selecting the source for the address provided to the Memory — as to whether it is doing a fetch cycle for an instruction, or a read/write from/to data.
When IorD=0 then the PC provides the address from which to read (i.e. instruction fetch), and, when IorD=1 then the ALU provides the address to read/write data from.  For data operations, the ALU is computing a base + displacement addressing mode: Reg[rs] + SignExt32(imm16) as the effective address to use for the data read or write operation.
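The address multiplexer described above can be modelled with a few lines of Python. This is a sketch only: the function names are hypothetical, and the ALU is assumed to have already been given Reg[rs] and the 16-bit immediate.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def memory_address(iord, pc, reg_rs, imm16):
    """Address presented to the unified memory for a given IorD value."""
    # Base + displacement effective address computed by the ALU (32-bit wrap).
    alu_out = (reg_rs + sign_extend16(imm16)) & 0xFFFFFFFF
    # IorD = 0: instruction fetch from the PC; IorD = 1: data access via the ALU.
    return pc if iord == 0 else alu_out
```

For instance, with PC = 0x400000, Reg[rs] = 0x1000 and an immediate of 0xFFFC (that is, -4), IorD=0 yields 0x400000 and IorD=1 yields 0xFFC.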
Further, let's note that this block diagram does not contain a separate adder for incrementing the PC by 4, whereas most other block diagrams do.  Look up any of the first few MIPS single cycle datapath images, and you'll see the dedicated adder for that PC increment.  Using a dedicated adder allows the PC to be incremented in parallel with operations done by the ALU, whereas omitting that dedicated adder means that the main ALU must perform the increment of the PC.  However, this probably saves transistors in a simple version of a multicycle implementation where the ALU is not in use every cycle, and so can be used otherwise.
Since Target has a control TargetWrite, we might presume this is an internal register that might be useful in buffering the intended branch target address, for example, if the branch target is computed in one cycle, and finally used in another.
(I thought this could be about buffering for branch delay slot implementation (since those branches are delayed one instruction), but were that the case, the J-Type instructions would have also gone through Target, and they don't.)
So, it looks to me like the machinery there for this multicycle processor is to handle the branch instructions, say beq, which has to:
compute the next sequential PC address from PC + 4
compute the branch target address from (PC+4) + SignExt32(imm16)
compute the branch condition (does Reg[rs] == Reg[rt] ?)
But in what order would they be computed?  It is clear from the control signals in state 0 that PC+4 is computed first and written back to the PC, for all instructions (i.e. for branches, whether the branch is taken or not).
It seems to me that in the next cycle, (PC+4) + SignExt32(imm16) is computed (by reusing the prior PC+4, which is now in the PC register); this result is stored in Target to buffer the value, since the processor doesn't yet know whether the branch is taken.  In the following cycle, the contents of rs and rt are compared for equality.  If they are equal, the branch should be taken, so PCSource=1, PCWrite=1 selects Target from the buffer to update the PC; if not, since the PC has already been updated to PC+4, that PC+4 stands (PCWrite=0, PCSource=don't care) as the start of the next instruction.  In either case the next instruction runs from whatever address the PC holds.
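The three-cycle ordering guessed at above can be written out as a small Python sketch. The state names and the `run_beq` helper are hypothetical, and the target formula follows the text as written; a real MIPS beq also shifts the offset left by two bits, which is left out here.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def run_beq(pc, imm16, rs_val, rt_val):
    """Step a beq through three cycles; return the final PC and a trace."""
    trace = []
    # Cycle 0 (fetch): the ALU computes PC + 4 and it is written to the PC.
    pc = (pc + 4) & 0xFFFFFFFF
    trace.append(("fetch", pc))
    # Cycle 1: the ALU reuses the updated PC to compute the branch target;
    # TargetWrite=1 latches it because the outcome is not known yet.
    target = (pc + sign_extend16(imm16)) & 0xFFFFFFFF
    trace.append(("target", target))
    # Cycle 2: compare rs and rt; only a taken branch asserts PCWrite,
    # with PCSource=1 selecting the buffered Target.
    if rs_val == rt_val:
        pc = target
    trace.append(("compare", pc))
    return pc, trace
```

Starting from PC = 0x100 with an immediate of 0x10, a taken branch ends with PC = 0x114 and a not-taken branch ends with PC = 0x104, which is simply the PC+4 written in the first cycle.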
Alternately, since the processor is multicycle, the order of computation could be: compute PC+4 and store into the PC.  Compute the branch condition, and decide what kind of cycle to run next, namely, for the not-taken condition, go right to the next instruction fetch cycle (with PC+4 in the PC), or, for taken branch condition, compute (PC+4) + SignExt32(imm16) and put that into the PC, and then go on to the next instruction fetch cycle.
This alternative approach would require dynamic alteration of the cycles/state for branches, so would complicate the multicycle state machine somewhat and would also not require buffering of a branch Target — so I think it is more likely the former rather than this alternative.

Cache Implementation in Pipelined Processor

I have recently started coding in Verilog. I have completed my first project, prototyping a 32-bit MIPS processor with a 5-stage pipeline. My next task is to implement a single-level cache hierarchy on the instruction memory.
I have successfully implemented a 2-way set-associative cache.
Previously I had declared the instruction memory as an array of registers, so whenever the IF stage needed the next instruction, the data (instruction) was assigned to the register instantaneously for further decoding (since a blocking/non-blocking assignment from any memory location is instantaneous).
But now that a single-level cache sits on top of it, the cache FSM takes a few more cycles to do its work (searching for the data, and applying the replacement policy on a cache miss). The maximum delay is about 5 cycles, on a cache miss.
Since each pipeline stage advances in a single cycle, whenever there is a cache miss the cache fails to deliver the instruction before the pipeline moves on to the next stage, so the output is always wrong.
To counteract this, I have made the cache clock 5 times faster than the pipeline clock. This does work: since the cache clock is much faster, the cache no longer has to worry about the processor clock.
But is this workaround legit?? I mean, I haven't heard of multiple clocks in a processor system. How do real-world processors overcome this issue?
Yes, of course, there is another way: stalling the pipeline until the data is available in the cache (hit). But I am just wondering, is making the memory system faster by increasing its clock justified?
P.S. I am a newbie to computer architecture and Verilog, and I don't know much about VLSI. This is my first question ever; usually whatever question comes up, I find the answer readily on web pages, but I can't find much detail about this problem, so I am here.
I also asked my professor; she told me to research the topic further, because none of my colleagues or seniors has worked much on pipelined processors.
But is this workaround legit??
No, it isn't :P You're not only increasing the cache clock, but also apparently the memory clock. And if you can run your cache 5x faster and still make the timing constraints, that means you should clock your whole CPU 5x faster if you're aiming for max performance.
A classic 5-stage RISC pipeline assumes and is designed around single-cycle latency for cache hits (and simultaneous data and instruction cache access), but stalls on cache misses. (Data load/store address calculation happens in EX, and cache access in MEM, which is why that stage exists.)
A stall is logically equivalent to inserting a NOP, so you can do that on cache miss. The program counter needs to not increment, but otherwise it should be a pretty local change.
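Here is a small behavioural sketch in Python of that idea (not Verilog, and not the asker's design): on an instruction cache miss the fetch stage holds the PC and hands a NOP to the rest of the pipeline each cycle until the instruction arrives. The `fetch_trace` function and the `miss_latency` map are hypothetical.

```python
NOP = "nop"

def fetch_trace(program, miss_latency):
    """Return what IF passes to ID on each cycle.

    program: list of instructions indexed by PC (in instruction units).
    miss_latency: maps a PC to the number of extra cycles its first fetch
    takes because of a cache miss (assumed values for illustration).
    """
    issued = []
    pending = dict(miss_latency)
    pc = 0
    waiting = 0
    while pc < len(program):
        if waiting == 0 and pc in pending:
            waiting = pending.pop(pc)  # miss detected: start counting down
        if waiting > 0:
            issued.append(NOP)         # bubble goes down the pipe, PC is held
            waiting -= 1
        else:
            issued.append(program[pc])
            pc += 1                    # hit: PC increments as usual
    return issued
```

With a two-cycle miss on the second instruction, the stream seen by the decode stage is the first instruction, two NOPs, then the remaining instructions in order. Counting the entries that are not NOPs gives the number of real instructions.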
If you had hardware performance counters, you'd maybe want to distinguish between real instructions vs. fake stall NOPs so you could count real instructions executed.
You'll need to implement pipeline interlocks for other stages that stall to wait for their inputs to be ready, e.g. a cache-miss load followed by an add that uses the result.
MIPS I had load-delay slots (you can't use the result of a load in the following instruction, because the MEM stage is after EX). So that ISA rule hides the 1 cycle latency of a cache hit without requiring the HW to detect the dependency and stall for it.
But a cache miss still had to be detected. Probably it stalled the whole pipeline whether there was a dependency or not. (Again, like inserting a NOP for the rest of the pipeline while holding on to the incoming instruction. Except this isn't the first stage, so it has to signal to the previous stage that it's stalling.)
Later versions of MIPS removed the load delay slot to avoid bloating code with NOPs when compilers couldn't fill the slot. Simple HW then had to detect the dependency and stall if needed, but smarter hardware probably tracked loads anyway so they could do hit under miss and so on. Not stalling the pipeline until an instruction actually tried to read a load result that wasn't ready.
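The simple interlock mentioned above (detect a load followed immediately by a consumer of its result, and stall one cycle) could be modelled like this in Python. The `Instr` record and the function name are made up for the example, and only the one-cycle load-use case on a cache hit is covered.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")
BUBBLE = Instr("bubble", None, ())

def insert_load_use_stalls(program):
    """Return the stream that reaches EX, with a bubble after each load whose
    result is needed by the very next instruction."""
    out = []
    for instr in program:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "lw" and prev.dest in instr.srcs:
            out.append(BUBBLE)  # hardware stall replaces the old load delay slot
        out.append(instr)
    return out
```

When an independent instruction sits between the load and its consumer, no bubble is inserted, which is exactly the case a compiler tried to create by filling the load delay slot.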
MIPS = "Microprocessor without Interlocked Pipeline Stages" (i.e. no data-hazard detection). But it still had to stall for cache misses.
An alternate expansion for the acronym (which still fits MIPS II, where the load delay slot was removed, requiring HW interlocks to detect that data hazard) would be "Minimally Interlocked Pipeline Stages", but apparently I made that up in my head; thanks @PaulClayton for catching that.

How are nested branches handled on the Pascal architecture?

While reading through the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
I came across the following paragraph:
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
However, at the start of the same section, it says:
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
This appears to contradict the other paragraph, because it says that threads have their own program counter, while the first paragraph claims they do not.
How is this active mask handled when a program has nested branches (such as if statements)?
How does a thread know when the divergent part which it did not need to execute is done, if it supposedly does not have its own program counter?
This answer is highly speculative, but based on the available information and some educated guessing, I believe the way it used to work before Volta is that each warp would basically have a stack of "return addresses" as well as the active mask or probably actually the inverse of the active mask, i.e., the mask for running the other part of the branch once you return. With this design, each warp can only have a single active branch at any point in time. A consequence of this is that the warp scheduler could only ever schedule the one active branch of a warp. This makes fair, starvation-free scheduling impossible and gives rise to all the limitations there used to be, e.g., concerning locks.
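As a purely illustrative sketch of the kind of per-warp stack guessed at above, here is a textbook-style reconvergence stack in Python. It is not a documented description of NVIDIA hardware. The toy instruction format, the reconvergence address stored in each branch and the `run_warp` function are all assumptions for this example.

```python
def run_warp(program, cond_masks, full_mask):
    """Execute a toy program for one warp; return (op name, active mask) pairs.

    Instructions: ("op", name), ("jmp", target),
    ("bra", cond, target, reconv): lanes where cond holds jump to target,
    the others fall through, and both groups meet again at reconv.
    """
    stack = [[None, 0, full_mask]]  # [reconvergence pc, next pc, active mask]
    log = []
    while stack:
        rpc, pc, mask = stack[-1]
        if pc == rpc or pc >= len(program):
            stack.pop()             # this path is done; resume the entry below
            continue
        instr = program[pc]
        if instr[0] == "op":
            log.append((instr[1], mask))
            stack[-1][1] = pc + 1
        elif instr[0] == "jmp":
            stack[-1][1] = instr[1]
        else:
            _, cond, target, reconv = instr
            taken = mask & cond_masks[cond]
            not_taken = mask & ~cond_masks[cond]
            if taken and not_taken:
                stack[-1][1] = reconv                       # resume here afterwards
                stack.append([reconv, pc + 1, not_taken])   # run second
                stack.append([reconv, target, taken])       # run first
            else:
                stack[-1][1] = target if taken else pc + 1  # no divergence
    return log

# Nested example: A; if c1 { B; if c2 { C } else { D }; E } else { F }; G
NESTED = [
    ("op", "A"), ("bra", "c1", 4, 10), ("op", "F"), ("jmp", 10),
    ("op", "B"), ("bra", "c2", 8, 9), ("op", "D"), ("jmp", 9),
    ("op", "C"), ("op", "E"), ("op", "G"),
]
```

In this model only the entry on top of the stack ever runs, which matches the observation that a single path of the warp is active at a time; a lane "knows" the other path is done simply because its entry becomes the top of the stack again.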
I believe what they basically did with Volta is that there is now a separate such stack and program counter for each branch (or maybe even for each thread; it should be functionally indistinguishable whether each thread has its own physical program counter or whether there is one shared program counter per branch; if you really want to find out about this implementation detail you maybe could design some experiment based on checking at which point you run out of stack space). This change gives all current branches an explicit representation and allows the warp scheduler to at any time pick threads from any branch to run. As a result, the warp scheduling can be made starvation-free, which gets rid of many of the restrictions that earlier architectures had…

Is it true that if we can always fill the delay slot, there is no need for branch prediction?

I'm looking at the five-stage MIPS pipeline (IF, ID, EXE, MEM, WB) in H&P 3rd ed. It seems to me that the branch decision is resolved in the ID stage, so that by the time the branch instruction reaches its EXE stage, the second instruction after the branch can be fetched correctly. But this still leaves the problem of possibly wasting the first instruction right after the branch.
I also encountered the concept of the branch delay slot, which means filling the slot right after the branch with an instruction that is useful as well as "harmless": it executes as desired whether or not the branch is taken, so the first instruction after the branch is not wasted.
My question is, first of all, is my understanding above correct? If so, my confusion comes from the concept of branch prediction, which seems to try to fill that first slot with an instruction from the place the program is predicted to go. But if we can always find some instruction to fill the branch delay slot, we would not need branch prediction, right?
For the classic MIPS (R2000) pipeline, the branch delay slot makes branch prediction useless as you perceive. (Technically, a design could combine a predictor/indicator of whether the delay slot instruction is a nop with a branch predictor. This would allow the nop to be skipped, modestly improving performance on a correct branch prediction.)
However, processor pipelines are often long and wide enough (and branch condition evaluation sufficiently delayed) that a single delay slot is not sufficient to fill the delay between when the post-branch instruction address is needed and the branch direction and target are known.
For example, a follow-on processor, the MIPS R4000, significantly lengthened the pipeline and as a result could not determine the location of the post-branch instruction early enough. The designers chose to use a simple static predict not-taken strategy.
If one did not care about binary compatibility, one could add more delay slots. However, finding useful instructions to fill such slots increases in difficulty as the number of slots increases. For certain loop-rich code, regularly filling two delay slots might be practical, and I think at least one DSP had two delay slots.
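A back-of-the-envelope comparison may help here. The sketch below uses a simple CPI model with made-up parameters (branch frequency, fraction of unfilled delay slots, taken fraction, redirect penalty); none of these numbers come from a real processor, and the function names are hypothetical.

```python
def cpi_with_delay_slot(branch_frac, unfilled_frac):
    """Classic 5-stage pipeline with one delay slot: a branch wastes a cycle
    only when the compiler had to put a nop in its slot."""
    return 1.0 + branch_frac * unfilled_frac

def cpi_predict_not_taken(branch_frac, taken_frac, redirect_cycles):
    """Deeper pipeline with static predict-not-taken: each taken branch
    discards `redirect_cycles` cycles of wrongly fetched instructions."""
    return 1.0 + branch_frac * taken_frac * redirect_cycles
```

With 20% branches and 30% unfilled slots, the first model gives a CPI of about 1.06. With 20% branches, 60% of them taken and a three-cycle redirect, the second gives about 1.36, which is the kind of gap that makes better prediction attractive as pipelines deepen.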
Branch prediction can also be used to decouple fetch from execution so that even if the condition cannot be evaluated (e.g., depending on the result of a high latency operation such as a data cache miss or a division), fetch can continue. Such decoupling could be used to generate instruction cache misses early (hiding some of their latency) and to reduce the impact of variable throughput at different stages (so an earlier stage can continue operating with maximum throughput when a later stage stalls or has reduced throughput and the buffered instructions can then hide later stalls or reduced throughput in the earlier stage).
The fact is that the compiler may not always find an instruction to fill the delay slot.
What is more, branch instructions are highly predictable.
Before the IF stage, you do not even know whether it is a branch instruction (you have to fetch it from instruction memory first).
Within a MIPS core like that, with zero-wait-state randomly accessed RAM, sure. But depending on how fetching is implemented and the caching behind it, you may still want or need branch prediction in order to start those fetches earlier. The pipeline is just a small part of a bigger system. System buses are usually not single-cycle ("here is my address, I want my data by the end of this cycle"); there are address buses and data buses, with tags that cross them, so that you can have multiple transactions in flight at the same time, like a pipeline, trying to optimize the bandwidth of the data bus knowing that the peripherals and memory on the far side are too slow for that bus.
Prediction "could" be used to assist these other features in getting instructions into the pipe faster or more efficiently.
In an academic sense, though, the idea of the slot is to give the pipe a cycle to switch gears onto another execution path. It only actually saves you anything if the incoming end of the pipe can be fed any random thing it wants every clock cycle, which isn't the real world.
Another academic solution is the ARM one of conditional execution on every instruction: you can construct execution sequences that keep the pipe full without having to flush or stall... again, so long as what feeds the pipe can keep up. ARM dumped the conditional-instruction idea in the new 64-bit instruction set. On some/newer MIPS you can disable the branch shadow/delay slot.

MIPS Architecture : NOP (No-Operation) Vs Data Forwarding in Hazard Prevention

I learnt in a computer architecture course that data hazards can be prevented by inserting several independent nop instructions between two mutually dependent instructions. This can be done at the assembly level by the compiler.
The alternative way to avoid data hazards is to use data forwarding.
I am a bit confused about how these two alternatives differ as far as performance, speed and hardware are concerned, because as far as I know data forwarding has to be implemented at the hardware level, whereas nops can be inserted at the assembly level.
Could anybody please explain which approach is better, considering factors such as performance, speed and hardware?
Thanks.
Obviously, having the compiler insert nops into the code stream to fill pipeline slots allows hardware to be simplified which can reduce the duration of a pipeline stage or the depth of the pipeline, reduce design effort (time to market, project risk, design cost), or allow a full processor core to fit on a single chip (which helps performance). However, this benefit is tiny compared to the loss of performance from not using forwarding. Higher latency for dependent instructions is very bad for typical programs.
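To put a rough number on that performance loss, here is a small Python sketch of a compiler pass that pads dependent instructions with nops. It assumes a classic 5-stage pipeline in which, without forwarding, two other instructions must separate a producer from its consumer (register file written in the first half of a cycle and read in the second half), and none are needed for ALU results with forwarding. The `Instr` record and `pad_with_nops` are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")
NOP = Instr("nop", None, ())

def pad_with_nops(program, required_gap):
    """Insert nops so that at least `required_gap` instructions separate any
    producer from an instruction that reads its result."""
    out = []
    for instr in program:
        needed = 0
        window = out[-required_gap:] if required_gap > 0 else []
        # distance 1 is the instruction emitted immediately before this one.
        for distance, prev in enumerate(reversed(window), start=1):
            if prev.dest is not None and prev.dest in instr.srcs:
                needed = max(needed, required_gap - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out
```

For a chain of three dependent ALU instructions, a gap of two (no forwarding) turns 3 instructions into 7 issue slots, while a gap of zero (forwarding) leaves them at 3, so the dependent chain takes more than twice as long to issue.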
The MIPS R2000, which had both delayed branches and delayed loads, provided result forwarding. (MIPS is an acronym for "Microprocessor without Interlocked Pipeline Stages"). Delayed loads were soon removed from MIPS (which was possible because such did not affect binary compatibility of correct code). The use of delayed instructions was partially from a belief that most delay slots could be filled by the compiler with useful instructions and partially from believing that the increase in code size was not important relative to the simplification of hardware.
Reducing the latency of a load operation was not practical, so the pipeline would need to be stalled for a cycle anyway. The cost of a nop is in cache and memory capacity effects (i.e., the effect of lower code density), and in some cases a single load delay slot could be filled.
Exposing the pipeline organization also has implications for binary compatibility. Later binary compatible implementations must accommodate the ISA designed for the original pipeline organization. A single delayed branch slot works reasonably well for a simple 5-stage scalar implementation (it can be filled with a useful instruction most of the time and allows zero-effective-delay branches [i.e., no stall to resolve the branch or prediction and flushing the pipeline on misprediction]), but when the pipeline is deepened (or made wider) prediction or stalling becomes necessary anyway.
If sufficient parallelism exists in the targeted workloads, hardware simplicity is sufficiently important, and binary compatibility is not a problem, then exposing a pipeline with minimal support for dynamically detecting and handling stall conditions may be sensible. (There are also ways of encoding nops that avoid most of the code size expansion issues.) Having reliably sufficient parallelism (whether instruction-level or thread-level) allows the avoiding of nops; by compiler scheduling with instruction-level parallelism or by hardware thread interleaving with thread-level parallelism.
Hardware simplicity tends to reduce energy per unit of work (as well as chip area), and many modern designs are limited by power use. It also makes sense to perform optimizations at compile time (when they are less latency critical and can be done once rather than each time the code is executed) if the storage and communication cost of additional information is not too expensive (assuming information necessary to perform the optimization is available at compile time [dynamic branch prediction is a classic example of where dynamic information is helpful]).
Well, since hardware is optimised with data forwarding, one might expect explicit software NOPs to be unnecessary. But that's not the case.
Although forwarding helps reduce data hazards, some hazards cannot be handled by forwarding. It just isn't possible.
E.g.:
beq R1,R5,label
instruction 2nd
Here the 2nd instruction cannot be fetched until instruction 1 has completed its execution stage and decided whether or not to branch. Until then the 2nd instruction has to be stalled (for 2 cycles). This can be done in software by inserting NOPs.
With improvements in technology and hardware optimizations, the beq instruction can resolve the branch in its register fetch/decode stage by adding a comparator to that stage. Even so, the 2nd instruction will be stalled (for 1 cycle now). Again a NOP is needed.