There are a multitude of different instructions in MIPS. I'm currently learning about data and instruction cache.
The instruction cache simply takes what it can, so to speak: depending on the block size, it may exploit spatial locality and fetch multiple instructions at once. For the data cache, though, I have a harder time understanding when it fetches things from main memory and when it doesn't.
For example, the instruction lw $t0, 0x4C($0) fetches the word of data stored at address 0x4C. Depending on the data cache's capacity, number of sets, block size and so forth, that word is temporarily stored in a block in the cache if no valid block with a matching tag for that address is already there.
In my literature, an addi instruction does not fetch from memory. Why? The only time data seems to be fetched from memory is when using the lw instruction. Why?
I also have a question regarding registers in MIPS. If an instruction operates only on registers, then there is no access to main memory, correct? It will not even go to the data cache, correct? Are the registers the highest level in the memory hierarchy?
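As a rough illustration of the hit/miss decision described above, here is a minimal sketch of a direct-mapped data cache. The geometry (16-byte blocks, 8 sets) and the names split_address and DirectMappedCache are assumptions made up for this example; a real cache also stores the data and may be set-associative.

```python
# Hypothetical cache geometry, chosen only for illustration.
BLOCK_SIZE = 16  # bytes per block (4 words)
NUM_SETS = 8     # direct-mapped: one block per set


def split_address(addr):
    """Split a byte address into (tag, set index, block offset)."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        # Each set holds (valid bit, tag); the data itself is omitted.
        self.sets = [(False, 0)] * NUM_SETS

    def access(self, addr):
        """Return True on a hit. On a miss, fill the block and return False."""
        tag, index, _ = split_address(addr)
        valid, stored_tag = self.sets[index]
        if valid and stored_tag == tag:
            return True
        # Miss: the whole block is fetched from main memory into this set.
        self.sets[index] = (True, tag)
        return False


cache = DirectMappedCache()
print(cache.access(0x4C))  # first lw from 0x4C: miss
print(cache.access(0x4C))  # same word again: hit
print(cache.access(0x40))  # same block (0x40..0x4F): hit via spatial locality
```

Only loads and stores ever call something like access(); an addi never reaches the data cache at all.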
The reason addi doesn't "fetch from memory" is that it uses an immediate operand: the value to be added is part of the instruction word itself, which the CPU has already fetched using the program counter. (Technically that is a fetch from memory, since all code resides in some form of memory, but when literature refers to "memory" it typically means data accessed outside the instruction fetch.) When MIPS uses something like lw to load from memory, the CPU has no idea what value the destination register will hold until the load has finished.
Just to illustrate this concept further, the original MIPS I architecture (which was used by the PlayStation 1) actually wouldn't finish loading from memory before the next instruction was already being worked on!
lw $t0,0($a0) ;load from the address pointed to by $a0
addi $t0,$t0,5 ;the value in $t0 hasn't been updated yet so this won't have the desired result.
The easiest solution to this was to put a nop after every lw. Chances are the version of MIPS you're using doesn't have this problem, so don't worry about it.
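To make the load delay concrete, here is a small sketch of the behaviour in Python. It is a simplified model, not a cycle-accurate MIPS I simulator: the function run, the tuple-based instruction format, and the register/memory dictionaries are all assumptions made for this example. The only rule it models is that a loaded value lands in the register file after the following instruction has already read its operands.

```python
def run(program, regs, memory, load_delay_slot):
    """Execute a list of ('lw' | 'addi' | 'nop', ...) tuples on a dict of registers."""
    pending = None  # (register, value) produced by the previous lw
    for instr in program:
        op = instr[0]
        write = None         # result this instruction writes itself
        next_pending = None  # load result that becomes visible one instruction later
        if op == "lw":
            _, rt, offset, base = instr
            value = memory[regs[base] + offset]
            if load_delay_slot:
                next_pending = (rt, value)
            else:
                write = (rt, value)
        elif op == "addi":
            _, rt, rs, imm = instr
            write = (rt, regs[rs] + imm)  # operands are read before the load lands
        # The previous load lands now, then this instruction's own result.
        if pending is not None:
            regs[pending[0]] = pending[1]
        if write is not None:
            regs[write[0]] = write[1]
        pending = next_pending
    if pending is not None:
        regs[pending[0]] = pending[1]
    return regs


memory = {0x100: 10}
lw_addi = [("lw", "t0", 0, "a0"), ("addi", "t0", "t0", 5)]
with_nop = [("lw", "t0", 0, "a0"), ("nop",), ("addi", "t0", "t0", 5)]

print(run(lw_addi, {"a0": 0x100, "t0": 0}, memory, True)["t0"])   # stale $t0 + 5
print(run(with_nop, {"a0": 0x100, "t0": 0}, memory, True)["t0"])  # loaded value + 5
```

With the nop in the delay slot, the addi sees the loaded value; without it, the addi works on the old contents of $t0.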
We learned all the main details about control lines and the general functionality of the MIPS chip in single cycle and also with pipelining.
But in the multicycle design the control lines are not identical, among other changes.
Specifically, what do the TargetWrite (ALUout) and IorD control lines actually modify?
Based on my analysis, TW seems to modify where the PC points to depending on the bits it receives (for Jump, Branch, or standard moving to the next line)... Am I missing something?
Also what exactly does the IorD line do?
I looked at both course textbooks, See MIPS Run and Computer Architecture: A Quantitative Approach by Hennessy and Patterson, and neither seems to mention these lines...
First, let's note that this block diagram does not have separate instruction memory and data memory. That means that it either has a unified cache or goes directly to memory. Most other block diagrams for MIPS have separate, dedicated Instruction Memory (cache) and Data Memory (cache). The advantage of that is that the processor can read instructions and read/write data in parallel. In a simple version of a multicycle processor, there is likely no need to read instructions and data in parallel, so a unified cache simplifies the hardware.
So, what IorD does is select the source of the address provided to the Memory: either the address for an instruction fetch cycle, or the address for a data read/write.
When IorD=0 then the PC provides the address from which to read (i.e. instruction fetch), and, when IorD=1 then the ALU provides the address to read/write data from. For data operations, the ALU is computing a base + displacement addressing mode: Reg[rs] + SignExt32(imm16) as the effective address to use for the data read or write operation.
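The selection described above amounts to a 2-to-1 multiplexer in front of the memory's address input. A minimal sketch, with function names that are assumptions made for this example (memory_address, effective_address, sign_extend16):

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def effective_address(reg_rs, imm16):
    """Base + displacement: Reg[rs] + SignExt32(imm16), wrapped to 32 bits."""
    return (reg_rs + sign_extend16(imm16)) & 0xFFFFFFFF


def memory_address(ior_d, pc, alu_out):
    """IorD mux: 0 selects the PC (instruction fetch), 1 selects the ALU result (data)."""
    return pc if ior_d == 0 else alu_out


pc = 0x00400000
alu_out = effective_address(0x10000000, 0xFFFC)  # displacement of -4
print(hex(memory_address(0, pc, alu_out)))  # fetch cycle uses the PC
print(hex(memory_address(1, pc, alu_out)))  # lw/sw cycle uses the ALU result
```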
Further, let's note that this block diagram does not contain a separate adder for incrementing the PC by 4, whereas most other block diagrams do. Look up any of the first few MIPS single cycle datapath images, and you'll see the dedicated adder for that PC increment. Using a dedicated adder allows the PC to be incremented in parallel with operations done by the ALU, whereas omitting that dedicated adder means that the main ALU must perform the increment of the PC. However, this probably saves transistors in a simple version of a multicycle implementation where the ALU is not in use every cycle, and so can be used otherwise.
Since Target has a control TargetWrite, we might presume this is an internal register that might be useful in buffering the intended branch target address, for example, if the branch target is computed in one cycle, and finally used in another.
(I thought this could be about buffering for branch delay slot implementation (since those branches are delayed one instruction), but were that the case, the J-Type instructions would have also gone through Target, and they don't.)
So, it looks to me like the machinery there for this multicycle processor is to handle the branch instructions, say beq, which has to:
compute the next sequential PC address from PC + 4
compute the branch target address from (PC+4) + SignExt32(imm16)
compute the branch condition (does Reg[rs] == Reg[rt] ?)
But in what order would they be computed? It is clear from the control signals in state 0 that PC+4 is computed first and written back to the PC, for all instructions (i.e. for branches, whether the branch is taken or not).
It seems to me that in the next cycle, (PC+4) + SignExt32(imm16) is computed (by reusing the prior PC+4, which is now in the PC register), and this result is stored in Target to buffer it, since the processor doesn't yet know whether the branch is taken. In the following cycle, the contents of rs and rt are compared for equality. If they are equal, the branch should be taken, so PCSource=1, PCWrite=1 selects the Target from the buffer to update the PC. If not, since the PC has already been updated to PC+4, that PC+4 stands (PCWrite=0, PCSource=don't care) as the start of the next instruction. In either case the next instruction runs from whatever address the PC holds.
Alternately, since the processor is multicycle, the order of computation could be: compute PC+4 and store into the PC. Compute the branch condition, and decide what kind of cycle to run next, namely, for the not-taken condition, go right to the next instruction fetch cycle (with PC+4 in the PC), or, for taken branch condition, compute (PC+4) + SignExt32(imm16) and put that into the PC, and then go on to the next instruction fetch cycle.
This alternative approach would require dynamic alteration of the cycles/states for branches, so it would complicate the multicycle state machine somewhat, and it would not require buffering of a branch Target. So I think the former is more likely than this alternative.
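The former ordering (the one that buffers the branch target) can be sketched as three steps. This is only an illustration of the reasoning above, not the actual state machine of the diagram: the function beq_multicycle and its trace format are assumptions. The sketch also applies the shift-left-by-2 that MIPS performs on the branch immediate, which the shorthand SignExt32(imm16) above leaves out.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def beq_multicycle(pc, reg_rs, reg_rt, imm16):
    """Return (next_pc, trace) for one beq, using a buffered Target register."""
    trace = []

    # Fetch state: the ALU computes PC + 4 and PCWrite stores it, for every instruction.
    pc = (pc + 4) & 0xFFFFFFFF
    trace.append(("fetch: PC <- PC + 4", pc))

    # Next state: the ALU computes the branch target; TargetWrite latches it.
    target = (pc + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    trace.append(("TargetWrite: Target <- PC + offset", target))

    # Final state: compare rs and rt; only a taken branch overwrites the PC.
    if reg_rs == reg_rt:
        pc = target  # PCSource selects Target, PCWrite asserted
    trace.append(("branch resolved", pc))
    return pc, trace


next_pc, steps = beq_multicycle(0x01D0, 5, 5, 0xFFE2)
for name, value in steps:
    print(f"{name}: {value:#06x}")
```

If the comparison fails, the PC simply keeps the PC+4 written in the fetch state, which is why no extra work is needed for the not-taken case.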
I could really use a hand or two with this MIPS CPU exercise.
I have to determine input and output from: ALU(s), Jump-related MUX and from the Register File.
PC is 0x01D0 and the instruction I have to simulate is: beq $3, $7, -120
I have no problem with the ALU(s); my issues are with the MUX and the Register File.
As you can see in the image, for the second jump-related MUX I don't know what to write for jump address [31-0].
The other problem is the Register File: I don't know what to write as its inputs. (The instruction should be: 0x1067FFE2)
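The encoding and the values feeding the Register File and the branch adder can be checked with a few lines of Python. This is a sketch under one assumption: that -120 is a byte offset relative to PC+4, which is consistent with the immediate field 0xFFE2 (-30 words) in the encoding quoted above. The helper names are made up for this example.

```python
def encode_beq(rs, rt, byte_offset):
    """Encode beq as an I-type word: opcode 4, rs, rt, word offset in the low 16 bits."""
    assert byte_offset % 4 == 0, "branch offsets are multiples of 4 bytes"
    imm16 = (byte_offset >> 2) & 0xFFFF
    return (0x04 << 26) | (rs << 21) | (rt << 16) | imm16


def decode_fields(word):
    """Extract the I-type fields that drive the Register File and sign extender."""
    return {
        "opcode": word >> 26,
        "rs": (word >> 21) & 0x1F,  # Read register 1
        "rt": (word >> 16) & 0x1F,  # Read register 2
        "imm16": word & 0xFFFF,     # input to sign-extend, then shift left 2
    }


def branch_target(pc, byte_offset):
    """Branch adder output: (PC + 4) + byte offset, wrapped to 32 bits."""
    return (pc + 4 + byte_offset) & 0xFFFFFFFF


word = encode_beq(3, 7, -120)
print(hex(word))
print(decode_fields(word))
print(hex(branch_target(0x01D0, -120)))
```

Under that assumption, the Register File's two read-register inputs are 3 and 7, and beq writes no register, so the write port is unused for this instruction.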
This is an abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them.
Why do we need to add the result of (PC+4) to the instruction address?
I know that the PC (Program Counter) is a register in a computer processor that contains the address (location) of the instruction being executed at the current time, but I don't understand why there is a second adder in this picture.
Some of the operations that can be performed by the CPU are 'jumps'.
If your operation is a jump, the second block gives you either the address of the new instruction or the length of the jump you have to make.
It's not the instruction address, the output of the instruction memory is an instruction itself.
They've obviously hidden most of the components (there's NO control circuitry). What they probably meant is the data path for branches, though they really should have put at least the link with the ALU output in there. Even so it would be better to explicitly decode the instruction, sign extend and shift left. So it's really inaccurate, but I don't see what else they could mean.
I understand that, given the latencies of, say, IMem, Add, Mux, ALU, Regs, DMem and Control, a specific MIPS instruction such as add, and a specific datapath to work with, I am to find the critical path of the instruction on the datapath and add up the latencies to get the clock cycle time. However, what if I am given only the latencies and the datapath, but no specific MIPS instruction? Do I just go with the longest single instruction and find its critical path? Or can I just add one instance of each individual latency to get a "general" clock cycle time?
Thanks for the help!
You need to use the latency of the slowest single cycle instruction, since the clock must run slow enough to complete that instruction correctly.
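As a small worked example of that rule: compute each instruction's critical-path latency and take the maximum. All numbers and paths below are hypothetical, chosen only to show the arithmetic; the real paths depend on the datapath you are given.

```python
# Hypothetical component latencies in picoseconds (not from any real exercise).
LATENCY_PS = {
    "IMem": 200, "Add": 70, "Mux": 20, "ALU": 90,
    "Regs": 90, "DMem": 250, "Control": 40,
}

# Assumed critical paths through a simplified single-cycle datapath.
CRITICAL_PATH = {
    "add": ["IMem", "Regs", "Mux", "ALU", "Mux"],
    "lw": ["IMem", "Regs", "ALU", "DMem", "Mux"],
    "sw": ["IMem", "Regs", "ALU", "DMem"],
    "beq": ["IMem", "Regs", "Mux", "ALU", "Mux"],
}


def path_latency(instr):
    """Sum the latencies of the units on one instruction's critical path."""
    return sum(LATENCY_PS[unit] for unit in CRITICAL_PATH[instr])


def clock_cycle_time():
    """The single-cycle clock must fit the slowest instruction."""
    return max(path_latency(instr) for instr in CRITICAL_PATH)


for instr in CRITICAL_PATH:
    print(instr, path_latency(instr), "ps")
print("clock cycle time:", clock_cycle_time(), "ps")
print("sum of every latency once:", sum(LATENCY_PS.values()), "ps")
```

Adding one instance of every latency does not correspond to any real path through the datapath (for example, the PC+4 adder works in parallel with the rest), so it generally gives the wrong answer.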
How many stalls do I need to execute the following instructions properly? I am a little confused by what I did, so I am here to see the experts' answers.
lw $1,0($2);
beq $1,$2,Label;
Note that the check for whether the branch is taken is done in the decode stage. But the source register rs of beq, which is $1 in this case, is only updated after the write-back stage of the lw instruction. So do we need to forward the new data from the memory stage of lw to the decode stage of beq?
Here is the data path diagram:
The value that is fetched from memory is written to the register file in the write-back stage of the pipeline. Writes to the register file happen in the first half of the clock cycle, while reads from the register file happen in the second half of the clock cycle.
The value that is written to the register file can thus be read in the same clock cycle in which it is written. Forwarding therefore does not reduce the number of stalls here.
As for the number of stalls needed, you need to insert two bubbles into the pipeline, as the lw instruction should be in the write back stage when the beq instruction is in the decode stage.
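The stall count can be checked by lining up the stages in a classic five-stage pipeline. The sketch below assumes lw is fetched in cycle 1, beq is fetched one cycle later, the register file is written in the first half of a cycle and read in the second half, and forwarded load data is latched at the end of the MEM stage. The function names are made up for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycle(issue_cycle, stage):
    """Cycle in which an instruction fetched at issue_cycle is in the given stage (no stalls)."""
    return issue_cycle + STAGES.index(stage)


def stalls_needed(first_usable_cycle, consumer_issue_cycle, use_stage):
    """Bubbles required so the consumer's use_stage falls on or after first_usable_cycle."""
    natural = stage_cycle(consumer_issue_cycle, use_stage)
    return max(0, first_usable_cycle - natural)


def lw_then_beq_stalls(forwarding):
    lw_issue, beq_issue = 1, 2
    if forwarding:
        # Load data exists only at the end of MEM, so it is usable one cycle later.
        first_usable = stage_cycle(lw_issue, "MEM") + 1
    else:
        # Split-cycle register file: readable in the same cycle as lw's WB.
        first_usable = stage_cycle(lw_issue, "WB")
    return stalls_needed(first_usable, beq_issue, "ID")


print(lw_then_beq_stalls(forwarding=False))
print(lw_then_beq_stalls(forwarding=True))
```

Both cases come out the same under these assumptions, which matches the point above that forwarding does not help when the branch is resolved in the decode stage.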
I hope this answers your question.