Well looks too simple a question to be asked but i asked after going through few ppts on both.
Both methods increase instruction throughput. And Superscaling almost always makes use of pipelining as well. Superscaling has more than one execution unit and so does pipelining or am I wrong here?
Superscalar design involves the processor being able to issue multiple instructions in a single clock, with redundant facilities to execute an instruction. We're talking about within a single core, mind you -- multicore processing is different.
Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different "phases" each clock.
They're almost always used together. This image from Wikipedia shows both concepts in use, as these concepts are best explained graphically:
Here, two instructions are being executed at a time in a five-stage pipeline.
To break it down further, given your recent edit:
In the example above, an instruction goes through 5 stages to be "performed". These are IF (instruction fetch), ID (instruction decode), EX (execute), MEM (update memory), WB (writeback to cache).
In a very simple processor design, every clock a different stage would be completed so we'd have:
IF
ID
EX
MEM
WB
Which would do one instruction in five clocks. If we then add a redundant execution unit and introduce superscalar design, we'd have this, for two instructions A and B:
IF(A) IF(B)
ID(A) ID(B)
EX(A) EX(B)
MEM(A) MEM(B)
WB(A) WB(B)
Two instructions in five clocks -- a theoretical maximum gain of 100%.
Pipelining allows the parts to be executed simultaneously, so we would end up with something like (for ten instructions A through J):
IF(A) IF(B)
ID(A) ID(B) IF(C) IF(D)
EX(A) EX(B) ID(C) ID(D) IF(E) IF(F)
MEM(A) MEM(B) EX(C) EX(D) ID(E) ID(F) IF(G) IF(H)
WB(A) WB(B) MEM(C) MEM(D) EX(E) EX(F) ID(G) ID(H) IF(I) IF(J)
WB(C) WB(D) MEM(E) MEM(F) EX(G) EX(H) ID(I) ID(J)
WB(E) WB(F) MEM(G) MEM(H) EX(I) EX(J)
WB(G) WB(H) MEM(I) MEM(J)
WB(I) WB(J)
In nine clocks, we've executed ten instructions -- you can see where pipelining really moves things along. And that is an explanation of the example graphic, not how it's actually implemented in the field (that's black magic).
The Wikipedia articles for Superscalar and Instruction pipeline are pretty good.
A long time ago, CPUs executed only one machine instruction at a time. Only when it was completely finished did the CPU fetch the next instruction from memory (or, later, the instruction cache).
Eventually, someone noticed that this meant that most of a CPU did nothing most of the time, since there were several execution subunits (such as the instruction decoder, the integer arithmetic unit, and FP arithmetic unit, etc.) and executing an instruction kept only one of them busy at a time.
Thus, "simple" pipelining was born: once one instruction was done decoding and went on towards the next execution subunit, why not already fetch and decode the next instruction? If you had 10 such "stages", then by having each stage process a different instruction you could theoretically increase the instruction throughput tenfold without increasing the CPU clock at all! Of course, this only works flawlessly when there are no conditional jumps in the code (this led to a lot of extra effort to handle conditional jumps specially).
Later, with Moore's law continuing to be correct for longer than expected, CPU makers found themselves with ever more transistors to make use of and thought "why have only one of each execution subunit?". Thus, superscalar CPUs with multiple execution subunits able to do the same thing in parallel were born, and CPU designs became much, much more complex to distribute instructions across these fully parallel units while ensuring the results were the same as if the instructions had been executed sequentially.
An Analogy: Washing Clothes
Imagine a dry cleaning store with the following facilities: a rack for hanging dirty or clean clothes, a washer and a dryer (each of which can wash one garment at a time), a folding table, and an ironing board.
The attendant who does all of the actual washing and drying is rather dim-witted so the store owner, who takes the dry cleaning orders, takes special care to write out each instruction very carefully and explicitly.
On a typical day these instructions may be something along the lines of:
take the shirt from the rack
wash the shirt
dry the shirt
iron the shirt
fold the shirt
put the shirt back on the rack
take the pants from the rack
wash the pants
dry the pants
fold the pants
put the pants back on the rack
take the coat from the rack
wash the coat
dry the coat
iron the coat
put the coat back on the rack
The attendant follows these instructions to the tee, being very careful not to ever do anything out of order. As you can imagine, it takes a long time to get the day's laundry done because it takes a long time to fully wash, dry, and fold each piece of laundry, and it must all be done one at a time.
However, one day the attendant quits and a new, smarter, attendant is hired who notices that most of the equipment is laying idle at any given time during the day. While the pants were drying neither the ironing board nor the washer were in use. So he decided to make better use of his time. Thus, instead of the above series of steps, he would do this:
take the shirt from the rack
wash the shirt, take the pants from the rack
dry the shirt, wash the pants
iron the shirt, dry the pants
fold the shirt, (take the coat from the rack)
put the shirt back on the rack, fold the pants, (wash the coat)
put the pants back on the rack, (dry the coat)
(iron the coat)
(put the coat back on the rack)
This is pipelining. Sequencing unrelated activities such that they use different components at the same time. By keeping as much of the different components active at once you maximize efficiency and speed up execution time, in this case reducing 16 "cycles" to 9, a speedup of over 40%.
Now, the little dry cleaning shop started to make more money because they could work so much faster, so the owner bought an extra washer, dryer, ironing board, folding station, and even hired another attendant. Now things are even faster, instead of the above, you have:
take the shirt from the rack, take the pants from the rack
wash the shirt, wash the pants, (take the coat from the rack)
dry the shirt, dry the pants, (wash the coat)
iron the shirt, fold the pants, (dry the coat)
fold the shirt, put the pants back on the rack, (iron the coat)
put the shirt back on the rack, (put the coat back on the rack)
This is superscalar design. Multiple sub-components capable of doing the same task simultaneously, but with the processor deciding how to do it. In this case it resulted in a nearly 50% speed boost (in 18 "cycles" the new architecture could run through 3 iterations of this "program" while the previous architecture could only run through 2).
Older processors, such as the 386 or 486, are simple scalar processors, they execute one instruction at a time in exactly the order in which it was received. Modern consumer processors since the PowerPC/Pentium are pipelined and superscalar. A Core2 CPU is capable of running the same code that was compiled for a 486 while still taking advantage of instruction level parallelism because it contains its own internal logic that analyzes machine code and determines how to reorder and run it (what can be run in parallel, what can't, etc.) This is the essence of superscalar design and why it's so practical.
In contrast a vector parallel processor performs operations on several pieces of data at once (a vector). Thus, instead of just adding x and y a vector processor would add, say, x0,x1,x2 to y0,y1,y2 (resulting in z0,z1,z2). The problem with this design is that it is tightly coupled to the specific degree of parallelism of the processor. If you run scalar code on a vector processor (assuming you could) you would see no advantage of the vector parallelization because it needs to be explicitly used, similarly if you wanted to take advantage of a newer vector processor with more parallel processing units (e.g. capable of adding vectors of 12 numbers instead of just 3) you would need to recompile your code. Vector processor designs were popular in the oldest generation of super computers because they were easy to design and there are large classes of problems in science and engineering with a great deal of natural parallelism.
Superscalar processors can also have the ability to perform speculative execution. Rather than leaving processing units idle and waiting for a code path to finish executing before branching a processor can make a best guess and start executing code past the branch before prior code has finished processing. When execution of the prior code catches up to the branch point the processor can then compare the actual branch with the branch guess and either continue on if the guess was correct (already well ahead of where it would have been by just waiting) or it can invalidate the results of the speculative execution and run the code for the correct branch.
Pipelining is what a car company does in the manufacturing of their cars. They break down the process of putting together a car into stages and perform the different stages at different points along an assembly line done by different people. The net result is that the car is manufactured at exactly the speed of the slowest stage alone.
In CPUs the pipelining process is exactly the same. An "instruction" is broken down into various stages of execution, usually something like 1. fetch instruction, 2. fetch operands (registers or memory values that are read), 2. perform computation, 3. write results (to memory or registers). The slowest of this might be the computation part, in which case the overall throughput speed of the instructions through this pipeline is just the speed of the computation part (as if the other parts were "free".)
Super-scalar in microprocessors refers to the ability to run several instructions from a single execution stream at once in parallel. So if a car company ran two assembly lines then obviously they could produce twice as many cars. But if the process of putting a serial number on the car was at the last stage and had to be done by a single person, then they would have to alternate between the two pipelines and guarantee that they could get each done in half the time of the slowest stage in order to avoid becoming the slowest stage themselves.
Super-scalar in microprocessors is similar but usually has far more restrictions. So the instruction fetch stage will typically produce more than one instruction during its stage -- this is what makes super-scalar in microprocessors possible. There would then be two fetch stages, two execution stages, and two write back stages. This obviously generalizes to more than just two pipelines.
This is all fine and dandy but from the perspective of sound execution both techniques could lead to problems if done blindly. For correct execution of a program, it is assumed that the instructions are executed completely one after another in order. If two sequential instructions have inter-dependent calculations or use the same registers then there can be a problem, The later instruction needs to wait for the write back of the previous instruction to complete before it can perform the operand fetch stage. Thus you need to stall the second instruction by two stages before it is executed, which defeats the purpose of what was gained by these techniques in the first place.
There are many techniques use to reduce the problem of needing to stall that are a bit complicated to describe but I will list them: 1. register forwarding, (also store to load forwarding) 2. register renaming, 3. score-boarding, 4. out-of-order execution. 5. Speculative execution with rollback (and retirement) All modern CPUs use pretty much all these techniques to implement super-scalar and pipelining. However, these techniques tend to have diminishing returns with respect to the number of pipelines in a processor before stalls become inevitable. In practice no CPU manufacturer makes more than 4 pipelines in a single core.
Multi-core has nothing to do with any of these techniques. This is basically ramming two micro-processors together to implement symmetric multiprocessing on a single chip and sharing only those components which make sense to share (typically L3 cache, and I/O). However a technique that Intel calls "hyperthreading" is a method of trying to virtually implement the semantics of multi-core within the super-scalar framework of a single core. So a single micro-architecture contains the registers of two (or more) virtual cores and fetches instructions from two (or more) different execution streams, but executing from a common super-scalar system. The idea is that because the registers cannot interfere with each other, there will tend to be more parallelism leading to fewer stalls. So rather than simply executing two virtual core execution streams at half the speed, it is better due to the overall reduction in stalls. This would seem to suggest that Intel could increase the number of pipelines. However this technique has been found to be somewhat lacking in practical implementations. As it is integral to super-scalar techniques, though, I have mentioned it anyway.
Pipelining is simultaneous execution of different stages of multiple instructions at the same cycle. It is based on splitting instruction processing into stages and having specialized units for each stage and registers for storing intermediate results.
Superscaling is dispatching multiple instructions (or microinstructions) to multiple executing units existing in CPU. It is based thus on redundant units in CPU.
Of course, this approaches can complement each other.
Related
I have heard the term "Single Cycle Cpu" and was trying to understand what single cycle cpu actually meant. Is there a clear agreed definition and consensus and what is means?
Some home brew "single cycle cpu's" I've come across seem to use both the rising and the falling edges of the clock to complete a single instruction. Typically, the rising edge acts as fetch/decode and the falling edge as execute.
However, in my reading I came across the reasonable point made here ...
https://zipcpu.com/blog/2017/08/21/rules-for-newbies.html
"Do not transition on any negative (falling) edges.
Falling edge clocks should be considered a violation of the one clock principle,
as they act like separate clocks.".
This rings true to me.
Needing both the rising and falling edges (or high and low phases) is effectively the same as needing the rising edge of two cycles of a single clock that's running twice as fast; and this would be a "two cycle" CPU wouldn't it.
So is it honest to state that a design is a "single cycle CPU" when both the rising and falling edges are actively used for state change?
It would seem that a true single cycle cpu must perform all state changing operations on a single clock edge of a single clock cycle.
I can imagine such a thing is possible providing the data strorage is all synchronous. If we have a synchronous system that has settled then on the next clock edge we can clock the results into a synchronous data store and simultaneously clock the program counter on to the next address.
But if the target data store is for example async RAM then the surely control lines would be changing whilst that data is being stored leading to unintended behaviours.
Am I wrong, are there any examples of a "single cycle cpu" that include async storage in the mix?
It would seem that using async RAM in ones design means one must use at least two logical clock cycles to achive the state change.
Of course, with some more complexity one could perhaps add anhave a cpu that uses a single edge where instructions use solely synchronout components, but relies on an extra cycle when storing to async data; but then that still wouldn't be a single cycle cpu, but rather a a mostly single cycle cpu.
So no CPU that writes to async RAM (or other async component) can honestly be considered a single cycle CPU because the entire instruction cannot be carried out on a single clock edge. The RAM write needs two edges (ie falling and rising) and this breaks the single clock principal.
So is there a commonly accepted single cycle CPU and are we applying the term consistently?
What's the story?
(Also posted in my hackday log https://hackaday.io/project/166922-spam-1-8-bit-cpu/log/181036-single-cycle-cpu-confusion and also on a private group in hackaday)
=====
Update: Looking at simple MIP's it seems the models use synchronous memory and so can probably operate off a single edge ad maybe it does - therefore warrant the category "single cycle".
And perhaps FPGA memory is always synchronous - I don't know about that.
But is the term using inconsistently elsewhere - ie like most Homebrew TTL Computers out there??
Or am I just plain wrong?
====
Update :
Some may have misunderstood my point.
Numerous home brew TTL cpu's claim "single cycle CPU" status (not interested for the purposes of this discussion in more complex beasts that do pipelining or whatever).
By single cycle these CPU's they typically mean that they do something like advancing the PC on one edge of the clock and then the use the opposing edge of the clock to update flipflops with the result. OR they will use the other phase of the clock to update async components like latches and sram.
However, the ZipCPU reference I provided suggests that using the opposing clock edge is akin to using a second clock cycle or even a second clock. BTW Ben Eater in his vids even compares the inverted clock that he uses to update his SRAM to being a second clock.
My objection to the use of "single cycle CPU" with such CPU's (basically most/all home bred TTL CPU's I've seen as they all work that way) is that I agree with ZipCPU that using the opposing edge (or phase) of the clock for the commit is effectively the same as using a second clock and this makes a mockery of the "single cycle" claim.
If the use of oposing edge is effectively the same a using a single edge but of dual clock cycles then I think that makes use of the term questionable. So I take ZipCPU's point to heart and tighten the term to mean use of a single edge.
On the other hand is seems perfectly possible to build a CPU that uses only sync components (ie edge triggered flip flops) and which uses only a single edge, where on each edge, we clock whatever is on the bus into whatever device is selected for write and at the same moment advance the PC.
Between one edge and the next same direction edge, settling occurs.
In this manner we end up with CPI=1 and use of only a single edge - which is very distinctly different to the common TTL CPU pattern of using both edges of the clock.
BTW my impression of FPGA's (which I'm not referring to here) is that the storage elements in FPGA are all synchronous flip flops. I don't know, but that's what my reading suggests. Anyway, if this is true then a trivial FPGA based CPU probabnly has a CPI=1 and uses only say the +ve edge and so these might well meet my narrow definition of "single cycle cpu". Also, my reading suggests that various MIP's impls (educational efforts probably) are probably meeting my definition ok.
This seems mostly a question of definitions and terminology, moreso than how to actually build simple CPUs.
If you insist on that strict definition of "single cycle CPU" meaning to truly use only clock edge to set everything in motion for that instruction, then yes, that would exclude real-world toy/hobby CPUs that use a 2nd clock edge to give a consistent interval for memory access.
But they certainly still fulfil the spirit of a single-cycle CPU, which is that every instruction runs in 1 clock cycle, with no pipelining and no multi-cycle microcode.
A whole clock cycle does have 2 clock edges, and it's normal for real-world (non-single-cycle) CPUs to use the "other" edge for internal timing in some cases, but we still talk about their frequency in whole cycles, not the edge frequency, except in cases like DDR memory where we do talk about the transfer rate = twice the memory clock frequency. What sets that apart is always using both edges, and for approximately equal things, not just some extra timing / synchronization within a clock cycle.
Now could you build a CPU that keeps a store value on a memory bus for some minimum time, without using a clock edge? Maybe.
Perhaps make sure the critical path leading to store-data it is short enough that the data is always ready. And possibly propagate some "data-ready" signal along with your computations (or just from the longest critical path of any instruction), and after a couple gate delays after the data is on the bus, flip the memory clock. (And on the next CPU clock edge, flip it back). If your memory doesn't mind its clock not having a uniform duty cycle, this might be fine as long as each half of the memory clock is long enough.
For loading from memory, you can maybe do something similar by initiating a memory load cycle some gate-delays after the CPU clock edge that starts this "cycle" of your single-cycle CPU. This might even involve building a long chain of gate delays intentionally with inverters dedicated to that purpose. Or perhaps even an analog RC time delay, but either way that's obviously worse than just using the other edge of the main clock, and you'd only ever do this as an exercise in single-cycle dogmatic purity. (It can also be flaky because gate-delay isn't constant, depending on voltage and temperature, so one side of the chip running hotter than the other could change relative timing.)
The definition says that a single cycle CPU takes just one instruction per one cycle. So it's possible to make a conclusion in theory that there are other CPU's that takes more or less instruction per cycle. You can check it out that there are some concepts like multi-cycle processor and pipelined processor (Instruction pipelining). "Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps." acorkcording to Wiki. I don't know how exactly it works, but maybe it just uses available registers (maybe instead of using for example EAX, ECX is used like EAX or maybe it works some other way, but one for sure is 100% true that number of registers in increasing, so maybe it's one of main purposes. Source: https://en.wikipedia.org/wiki/Instruction_pipelining
I think the answer for the question: "is-a-single-cycle-cpu-possible-if-asynchronous-components-are-used" depends on CPU controller that controls both CPU and RAM with opcodes. I found interesing information about this on site: http://people.cs.pitt.edu/~cho/cs1541/current/handouts/lect-pipe_4up.pdf
https://ibb.co/tKy6sR2
CONCLUSION: In my opinion, if we consider the term "single cycle CPU" it should be the simplest possible construction. The term "asynchronous" implements a conclusion, that is more complex than "synchronous". So both terms are not equivalent. It's something like "Can a basic data type be considered as a structure?". In my opinion the word "single" means the simplest possible and "asynchronous" means some modification, so more complex, so just think it's not possible, but maybe the term "are used" can be bypassed by "are used at the time" - if some switch, some controller can turn off asynchronous mode and make this all the simplest possible. But generally just think it's not possible
I've been asked to sketch a multicycle datapath and control unit for the instructions (js, sll, slti) all together. and draw the main controller FSM for these 3 instructions.
I'm struggle with it, I know how to make it with single cycle datapath but not for multicycle.
please, help
If you know the single cycle datapath, and want to take that to multi cycle, generally speaking we subdivide the single cycle into stages.
As only one stage will execute at a time, a relatively simple state machine controls what stage to execute currently/next to activate the appropriate stage, and deactivate the others.
The stages would be similar to the stages of a pipelined processor, namely those corresponding to Instruction Fetch, Decode, Execute, Memory, and Writeback.
In a pipelined machine, all the instructions need to go through all the stages, however, in a multicycle machine, certain instructions can skip certain states, and this should be manifest in the state machine. For example, while all instructions share Instruction Fetch and Decode, only loads and stores interact with the Data Memory, so any other instruction can skip the Mem stage.
The simplest possible state machine simply chooses all the states in succession, for every instruction, without ever consulting the instruction. This would fail to take advantage of the ability to skip stages for certain instructions that don't need those stages, and of course, you're being asked to do more than the simplest state machine.
A better state machine might start with Instruction Fetch as the active state, go to Decode state next, and by then it can use knowledge of the actual instruction to decide which of the remaining stages to skip for that given instruction.
You can imagine that the Control logic fetches some bits that inform the state machine of which stages can be skipped for any given opcode. There are only three states/stages left, so all that is needed from Control is a boolean for each.
For more information, especially on the stages and what they do, I suggest searching on Pipelined/Pipelining MIPS, as I believe there is more material on pipelined processors than on the multicycle design. You won't be concerned with pipeline hazzards, but the break down of the single cycle design into the stages might help.
I'm looking at the five stages MIPS pipeline (ID,IF,EXE,MEM,WB) in H&P 3rd ed. and it seems to me that the branch decision is resolved at the stage of ID so that while the branch instruction reaches its EXE stage, the second instruction after the branch can be executed correctly (can be fetched). But this leaves us the problem of possibly still wasting the 1st instruction soon after the branch instruction.
I also encountered the concept of branch delay slot, which means you want to fill the 1st instruction soon after the branch with something useful as well as "harmless" that whether the branch is taken or not the instruction is executed as desired and the 1st instruction after the branch is not wasted.
My question is, first of all, is my above understanding correct? If it's correct, then the problem comes from the concept of branch prediction, which seems to be trying to fill the first instruction with instruction from the predicted place that the program is going to. But if we can always find some instruction to fill the branch delay slot, we would not need the feature of branch prediction, right?
For the classic MIPS (R2000) pipeline, the branch delay slot makes branch prediction useless as you perceive. (Technically, a design could combine a predictor/indicator of whether the delay slot instruction is a nop with a branch predictor. This would allow the nop to be skipped, modestly improving performance on a correct branch prediction.)
However, processor pipelines are often long and wide enough (and branch condition evaluation sufficiently delayed) that a single delay slot is not sufficient to fill the delay between when the post-branch instruction address is needed and the branch direction and target are known.
For example, a follow-on processor, the MIPS R4000, significantly lengthened the pipeline and as a result could not determine the location of the post-branch instruction early enough. The designers chose to use a simple static predict not-taken strategy.
If one did not care about binary compatibility, one could add more delay slots. However, finding useful instructions to fill such slots increases in difficulty as the number of slots increases. For certain loop-rich code, regularly filling two delay slots might be practical, and I think at least one DSP had two delay slots.
Branch prediction can also be used to decouple fetch from execution so that even if the condition cannot be evaluated (e.g., depending on the result of a high latency operation such as a data cache miss or a division), fetch can continue. Such decoupling could be used to generate instruction cache misses early (hiding some of their latency) and to reduce the impact of variable throughput at different stages (so an earlier stage can continue operating with maximum throughput when a later stage stalls or has reduced throughput and the buffered instructions can then hide later stalls or reduced throughput in the earlier stage).
The fact is that complier may not always find a instruction to fill the delay slot.
What is more, instruction is highly predictable.
Before IF stage, u even not know whether it is branch instruction.( u have to fetch it from instruction memory)
within a mips core like that with zero wait state randomly accessed ram sure. but depending on how the fetching is implemented and caching behind that, you may still want/need the concept of branch prediction to start those fetches earlier. the pipeline is just a small part of a bigger system. system busses are usually not single cycle here is my address I want my data by the end of this cycle, there are address busses and data busses and tags that cross them so you can have multiple transactions in flight at the same time, like a pipeline trying to optimize the bandwidth of the data bus knowing the peripherals and memory on the far side are too slow for that bus.
prediction "could" be used to assist these other features in getting instructions into the pipe faster or more efficiently.
from an academic sense though, the idea of the slot is to give the pipe a cycle to switch gears along another execution path. It only actually saves you if the incoming end of the pipe can be fed any random thing it wants every clock cycle. which isnt real world.
another academic solution is the arm one of conditional execution on every instruction, you can construction execution sequences to keep the pipe full and not have to flush or stall... again so long as what feeds the pipe can keep up...arm dumped the conditional instruction idea in the new 64 bit instruction set. some/newer mips you can disable the branch shadow/delay slot.
As there are some instructions that are being used in MIPS Architecture, which doesn't require all 5 cycles for its successful completion, like the store instruction doesn't need to use 5th stage. So does the instruction will also pass through the stage or does it skip the stage?
In a multi-cycle CPU the each of the instructions can take a differing number of instructions.
As you suggested, one of the ways that this can occur is by having an instruction "skip" a pipe-line stage. This is accomplished by having a control unit direct execution of the CPU by having separate execution paths for necessary instructions.
Maybe take a look here for some further information on how a MIPS multi-cycle machine might be implemented.
Generally speaking though, you should take these sorts of explanations with a grain of salt. The sort of machine architecture we learn as non-hardware experts is often quite quaint in comparison to how complicated these things have gotten to the extent that our understandings are often several decades out of date.
5 R-type instructions are to be executed on our original 5-stage (scalar) pipelined processor. There are no dependencies of any kind among the instructions.
If this same instruction sequence is executed instead on a degree-2 super-pipelined version of our processor, what speedup factor would be provided if no other changes are made?
I know that a degree-2 super-pipelined system splits each stage into two phases so that the time required for the two phases is the same as the original clock cycle time. I think the speedup would be 2, but that seems off.
If the internal clock is multiplied by two, and assuming that there are no dependencies of any kind among the instructions, then you would see a speedup of factor 2 because the pipeline processes two instructions in every stage per outer clock cycle (the one not multiplied by factor 2)
The downside of this approach is that you have to provide an internal clock that is twice as fast, which makes the hardware design more complex.
Check this chapter for more details in the subject.