MIPS pipeline registers length (IF/ID, ID/EX, EX/MEM, MEM/WB) - mips

I am currently studying for my Computer Architecture exam and came across a question that asks to illustrate (bit by bit i would assume) the values contained in the mips pipeline architecture after the 3rd stage of the sub (before the clock commutes) given the following instructions.
add $t0,$t1,$t2
sub $t3,$t3,$t5
beq $t6,$t0,16
add $t0,$t1,$t3
I am not asking for the solution to this problem however after some research i haven't had much success wrapping my mind around it so i am asking for some help/advice.
Firstly i still don't have a clear understanding of the size of the pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB). I do understand that they contain the control unit codes for the next stages and that they contain the result of the previous stage so that it can be passed in to the next one.
So that would be (please correct me if i'm wrong) +9 for ID/EX, +5 for EX/MEM and +2 for MEM/WB but i haven't managed to find a clear schema of the data that we can expect these registers to contain.
Also, i figure that we would need to use HW forwarding to forward the result of the first add to beq (because of $t0) and to forward the result of sub to the last add (because of $t3). Does this factor in to what is contained in the registers?
It would be great if someone could point me in the right direction.
Thanks lots.

The purpose of each of these intermediate registers is to hold data that might be needed in the immediate next stage or in later stages. I'll discuss one possible design, but there are really many possible designs as I'll explain.
In the fetch stage, the next instruction to be execute (to which the current PC points) is fetched from memory and PC is updated to point to the next instruction to fetch. Therefore, IF/ID would include one 4-byte field to hold the fetched instruction. There are two ways to calculate the new PC: current PC + 4 or PC + 4 + offset in case of a branch. If the fetched instruction is itself a branch instruction, then we would need to pass the new PC so that the branch target address can be calculated in the EX stage. We can add a 4-byte field in IF/ID to hold the new PC value to be passed to the EX stage through the ID stage.
In the decode stage, the opcode and its operands are determined. The opcode is at a fixed location in the instruction in MIPS. An MIPS instruction may operate on a single source register, two source registers, one source register and a sign-extended 32-bit immediate value, a sign-extended 32-bit immediate value, or no operands. We can either prepare only the required operands for the EX stage based on the opcode or prepare all the operands that might be required for any opcode. The latter design is simpler, but it requires a larger ID/EX register. In particular, two 4-byte fields are required to hold two possible source register values (the values are read from the register file in the decode stage) and a 4-byte field for the possible sign-extended immediate value. No opcode will require all of these fields, but let's prepare all of them anyway and store them at fixed locations in the ID/EX register. It simplifies the design.
We to also pass the new PC value calculate in the fetch stage to the execute stage just in case the opcode turns out to be a branch. The branch target address is calculated relative to the current PC value (the PC of the instruction following the branch in static program order). There are two possible design here: either add a bus from the new PC field in IF/ID to the EX stage or add a field in ID/EX to hold the new PC value, which can then be accessed in the EX stage. The latter design adds a 4-byte field in ID/EX.
The EX requires the opcode from the ID stage. We can choose to pass only the opcode rather than the whole instruction. But then later stages might require other parts of the instruction. Generally, in RISC pipelines, it preferable to pass to make the whole instruction available to all stages. In this way, all parts of an instruction are already available when changes are made to any stage of the pipeline in the future. So let's add a 4-byte field to ID/EX to hold the instruction.
The EX stage reads the operands and the opcode from the ID/EX register (the opcode is part of the instruction) and performs the operation specified by the opcode. The EX/MEM register has to be big enough to hold all possible results, which might include the following: a 4-byte value computed by the ALU resulting from an arithmetic or logic operation, a 4-byte value representing the calculated effective address for a memory load or store operation, a 4-byte value representing the branch target address in case of a branch instruction, and a 1-bit condition in case of a conditional branch instruction. We can use a single 4-byte field in EX/MEM for the result (whatever it represents) and a 1-bit field for the condition. In addition, as before, we need a 4-byte field to hold the instruction. Also for store instructions, we need another 4-byte field to hold the value to be stored. One possible alternative design here is that rather than storing the 1-bit condition and 4-byte branch target address in EX/MEM, they can be passed directly to the IF stage.
In the MEM stage, in case of a branch instruction, the branch target address and the branch condition are passed back from EX/MEM to the IF fetch to determine the new PC. In case of a memory store operation, the operation is performed and there is no result to be passed to any stage. In case of a memory load operation, the 4-byte value is fetched from memory and stored in a field in the MEM/WB register. In case of an ALU operation, the 4-byte result will be just passed to a field in the MEM/WB register. In addition, as before, we need a 4-byte field in MEM/WB to hold the instruction.
Finally, in the WB stage, the 4-byte result whether loaded from memory or computed by the ALU is stored in the destination register. This only occurs for instructions that produce results. Otherwise, the WB stage can be skipped.
In summary, in the design I've discussed, the sizes of intermediate registers are as follows: IF/ID is 8 bytes in size, ID/EX is 20 bytes in size, EX/MEM is 25 bits in size, and MEM/WB is 8 bytes in size.
The design decision of whether a field is required in an intermediate register to hold some value or whether it can be passed directly in the same stage to the logic that requires the value is a "circuit-level" decision. If the signals can be guaranteed to not be corrupted, and if it feasible or convenient to add a dedicated bus, they can be directly connected.

Related

Why A and B registers are used in multicycle Datapath?

Why are registers A and B whose inputs are ReadData1 and ReadData2 of RegisterFile are necessary? Isn't it possible to use directly the values which are on ReadData1 and ReadData2 outputs of Register File?
Instruction Register is already loaded with an instruction, so the value of IR is fixed, which means that $rs, $rt, $rd reg numbers are obviously the same within a single instruction. Hence, there are always the same values on the ReadRegister1 and ReadRegister2 inputs of the Register File. So the values that are on RD1 and RD2 outputs are the same unless the corresponding registers inside RegisterFile are overwritten.
That means that A and B registers are necessary only for the instructions that require to have the values of $rs or $rd registers that were overwritten on previous cycle. Can anybody give me an example of such an instruction.
The general pattern is that during a clock cycle: at the start of the clock, some register(s) feed values to computational logic which feed values to (the same or other) register(s) by the end of the clock, so that it can start all over again for the next cycle.
In the single cycle datapath, the value in the PC starts the process of the cycle, and by the end of the cycle, the PC register is updated to repeat a new cycle with another value.  Along the way, the register file is both consulted and also (potentially) updated.  You may note that these A & B registers are not present in the single cycle datapath.
You are correct that those values do not change during the execution of any one single instruction on the multicycle datapath.
However, the multicycle processor uses multiple cycles to execute a single instruction (so that it can speed the clock).  In order to support the successive cycles in that processor design, some internal registers are used — they capture the output of a prior cycle in order for a next cycle to do something different.
The problem with the multicycle datapath diagrams is that they don't make it clear what part of the processor runs in what cycles.  Those A & B registers are there to support a cycle boundary, so the decode is happening in one cycle and the arithmetic/ALU in another cycle.  (Without those registers, the processor would have to perform decode again in the subsequent cycle, which would decrease the clock rate and defeat the nature of the multicycle datapath.)
Boundaries are made much more clear in pipeline datapath diagrams.  Search for "MIPS pipeline datapath".  (Note that some pipeline datapath diagrams show registers between stages and others simply outline what's in what stage without showing those registers.)  The large vertical bars are registers and they separate the pipeline stages.  Of course, the pipeline processor executes all pipeline stages in parallel, though in theory, similar boundaries are applicable to the cycles in a multicycle processor.  Note that the ID/EX pipeline register in the pipeline datapath serves the purpose of the A & B registers in the multicycle datapath.

Why MIPS doesn't take additional function arguments in $v0 and $v1

According to the MIPS documentation, functions output is stored in $v0-$v1 (up to 64 bits), and the function arguments are given in $a0-$a3, where any additional arguments are written to the stack.
Since the function is allowed to overwrite the values of $v0-$v1, wouldn't it be better to pass the function fifth argument (if such exist) on $v0?
What is the motivation for using the stack in this case?
You are right that the $v registers are available to be used to pass parameters.
MIPS has, at times, updated the calling convention, for example: the "MIPS EABI 32-bit Calling Convention", redefines 4 of the original $t registers, $8-$11, as additional argument registers, to pass up to 8 integer arguments in total.
We might also consider that $at aka $1 — the assembler temp — is also available at the point for parameter passing.
However, object model invocations, e.g. those involving vtables, thunks and other stubs such as long calls, perhaps cross library (DLL) calls, can require an available register or two that are scratch, so it would not necessarily be best to use every one of the scratch registers for arguments.
Discussion
In general, other than that I'm not sure why they don't just get rid of most of the $t registers (and $v registers) and make them all $a registers — these would only be used when needed, and otherwise those unused argument registers would serve the same purpose as $t registers.  The more parameters, the fewer scratch registers — though in both caller and callee — but I think tradeoff can be made instead of guaranteeing some larger minimum number of scratch registers as in current ABIs.
Still, without some bare minimum number of scratch registers, you would sometimes end up using memory, spilling already computed arguments to memory in order to have free registers to compute the last couple of parameters, only to have to reload those spilled values back into registers.  If that were to happen, might as well have passed some of them in memory in the first place, especially since the callee may also have to store some of the arguments to memory anyway (e.g. the callee is not a leaf function, and parameters are needed after further calls).
8 argument registers is probably already on the tapering end of the curve of usefulness, so past thereabouts adding more probably has negligible returns on real code bases.
Also, a language can invent/define its own calling convention: these calling conventions are the standard for C language interoperability.  Even the C compiler can use custom calling conventions when it is certain that such language interoperability is not required, as we can also do in assembly when we know more details about function implementations (i.e. their internal register usages) than just the function signature.
Nicely collected set details on various calling conventions:
https://www.dyncall.org/docs/manual/manualse11.html
Addendum:
Let's assume a machine with only 2 registers, call them A & B, and it uses both to pass parameters.  Let's say a first parameter is computed into A (using B register as scratch if needed).  In computing the value of the 2nd parameter, for B, it may run out of scratch registers, especially if the expression for that actual argument is complicated.  When out of registers, we spill something to memory, here let's say, the already computed A.  Now the parameter for B can be computed with that extra register. However, the A parameter value, now in memory, needs to return back to the A register before the call.  Thus, this is worse than passing A in memory b/c the caller has to do both a store and a load, whereas passing in memory means just the store.
Now add to that situation that the callee may have to store the parameter to memory as well (various possible reasons).  That means another store to memory.  So, in total, if the above scenario coincides with this one, then a store, a load and another store — contrasted with memory parameter passing, which would have just the one store by the caller.

Mips datapath procedure for executing an AND instruction?

Based on this figure, executing the AND instruction would cause these values to be assigned to the signals labeled in blue:
RegWrite = 1
ALUSrc = 0
ALU operation = 0000
MemRead = 0
MemWrite = 0
MemtoReg = 0
PCSrc =0
However, I am a little confused which inputs will be used in the Registers block? Can anyone describe the overall AND procedure in the MIPS datapath?
Starting from after the instruction is read from instruction memory, you need to know that AND is an r-type instruction and thus uses 3 registers. Which register is actually used is based off of the encoded instruction. (An R-Type has 3 5-bit fields, one for each reg.) rs and rt go to Read register 1 and 2, while rd goes to Write register. From there, Read data 1 and 2 (the bits of registers s and t) go to the ALU where a bitwise AND is performed on them. The result of that is written to the write register. I traced the path in your picture (omitting the PC incrementing part). I'm taking a class that uses that exact book this semester. If you look a little ahead, it goes deeper into what is going on. The explanation of the control (blue) lines helps a lot. The mux blocks are multiplexers, that is they allow alternating the output between two inputs. In this case, the ALUSrc mux will use Read data 2 because AND is an r-type. If it were i-type, it would switch to use the data coming from the sign extend, because that would be the immediate. The other mux is to allow either memory from data to be written to the write register or the result of an ALU operation. In this case, it will be the result of an ALU operation.
To imply answer your question about the register block, keep in mind that the inputs to the register block are the addresses of the registers your instruction will be using, the register block then either fetches the data in the registers who's addresses were given or write data at the end on this register.
However one note you have an inconsistency in your mux design MemtoReg and ALUSrc should have opposite values, so unless one of the 2 muxes is upside down (which is not advisable) then there is a mistake with your controller logic.

How do interpreters load their values?

I mean, interpreters work on a list of instructions, which seem to be composed more or less by sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers, by doing bit-wise operations, for use in a big switch statement where all operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 24 bit integer. If I wanted to store it in the next instruction, I'm limited to a 32 bit value.
Values like Strings require lots more storage than this. How do most interpreters get away with this in an efficient manner?
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is a fair amount different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:
Optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
Absolute (address specified in 64 bits following instruction)
relative (offset specified in or following instruction)
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
int x;
int y;
};
class B {
int a;
int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would mean values in the executable file loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example:
static void string(Compiler* compiler, int allowAssignment)
{
// Define a constant for the literal.
int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
compiler->parser->currentString, compiler->parser->currentStringLength));
// Compile the code to load the constant.
emit(compiler, CODE_CONSTANT);
emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's an example:
CASE_CODE(CONSTANT):
PUSH(frame->fn->constants[READ_ARG()]);
DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.

Designing A Simplified MIPS Processor

So I'm going over some old quizzes for my Computer Organization final and I must have missed this lecture or something. I'm decently proficient in programming MIPS, but this problem has me completely stumped. Could someone help me understand this?
The diagram is missing lines connecting the various parts of the processor as well as a multiplexor for determining if the next instruction address is coming from the PC+4 or from a register as in a jr ra instruction.
There needs to be a line from the ALU to the write data portion of the register file. This is for R-Type instructions, as their result will need to be written back to the destination register. Going into the ALU needs to be lines from Read data 1 and Read data 2, this is how the values from the registers make their way into the ALU for R-type instructions.
A couple lines have been added from the instruction memory to the registers, you're missing the one to the Write Register though (this specifies the destination register for R-type instructions).
For the PC, the line going into the adder goes into the other input (the above one). The 4 for the adder is constant, as each instruction address is 4 bytes after the previous one, so unless we're jumping to an address we will be executing the instruction immediately after the current instruction. The line from the PC to the read address is also necessary as the PC specifies which instruction address to find the current instruction at. The line going into the PC comes from either the result of the PC+4 addition or from the register specified in a jr ra instruction.
To handle this decision a multiplexor is needed. Multiplexors have two inputs and one output, so this one will have one input from the PC+4 adder (for regular R-type instructions) and another from Read Data 1 (for jr ra instructions). The input from Read Data 1 should be visualized as a split from the line between Read Data 1 and the ALU. The output will go right back to the PC, as it determines the next instruction to execute.
I think that's everything that's wanted for that question, as the prof specifies that control signals are already generated (the multiplexor is a type of control unit, but I think it's necessary nonetheless). Hope that helps!