I have this exercise related to correlated predictors that states the following:
A: BEQZ R1, D
…
D: BEQZ R1, F
…
F: NOT R1, R1
G: JUMP A
The prediction works as follows:
1. Fetch the current instruction.
2. If it is a branch, determine the current state of the predictor and predict the branch:
   a. The row is determined by the branch address (in this case either A or D).
   b. The column is determined by the current global shift register.
   c. Use the value in the cell to determine the prediction from the state machine (the current state is saved in the cell).
3. Execute the branch and determine the actual decision (Taken: 1, Not Taken: 0):
   a. Update the cell based on the current state and the actual decision.
   b. Update the global shift register (shift left and insert the actual decision bit on the right).
4. Go to step 1.
This is the solution (given as an image of the solved exercise table).
I understand the scheme and know that a 2-bit predictor means fewer errors, but I cannot solve this question and I have trouble seeing how the solution was found. Any help would be appreciated.
This is a variation of the two-level adaptive predictor with global history table briefly described in Agner Fog's microarchitecture paper (page 15).
In this variant, the history register is shared across all branches; however, the pattern history table is local to a branch¹.
The outcome of the last n branches (n = 2 in your case) is remembered (0 = Not taken, 1 = Taken), ordered from left to right in chronological order, forming an n-bit value that is used, along with the branch address², to index a table of 2-bit saturating counters.
Each counter is incremented if the branch is taken and decremented otherwise (this is the canonical implementation, any 4-state FSA will do).
The meaning of each counter value is:
00b (0) Strongly not taken
01b (1) Weakly not taken
10b (2) Weakly taken
11b (3) Strongly taken
Saturating means that 3 + 1 = 3 (a strongly taken branch that is taken again stays strongly taken) and that 0 - 1 = 0 (a strongly not taken branch that is again not taken stays strongly not taken), whereas normal register arithmetic is modulo 2^n.
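To make the counter behaviour concrete, here is a minimal Python sketch of a 2-bit saturating counter (the function names are mine, not part of the exercise):

# 2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken
def update_counter(state, taken):
    # Increment on a taken branch, decrement otherwise, clamping to [0, 3]
    return min(state + 1, 3) if taken else max(state - 1, 0)

def predict_taken(state):
    # States 2 and 3 predict taken, states 0 and 1 predict not taken
    return state >= 2

# Saturation at the ends: 3 + 1 stays 3, 0 - 1 stays 0
assert update_counter(3, True) == 3
assert update_counter(0, False) == 0
assert update_counter(1, True) == 2   # weakly not taken -> weakly taken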
In your exercise the assumptions are:
The pattern history table is given as a 2D-table with rows corresponding to the full address of the branch and columns to the value of the global history register.
All counters start in the state 01b (Weakly not taken).
The global history register is 00b at reset.
R1 is 0 at the beginning.
Let's see the first iteration only.
First iteration
The instruction is BEQZ R1, D (a branch, obviously), its address is A.
Since R1 is 0, the branch will be taken (towards D).
Indexing into the table with a global history of 00b and address A gives us the counter value 01b (Weakly not taken) thus the prediction is not taken.
Once the CPU has executed the branch and flushed the mispredicted stage, the table must be updated.
Since the branch was taken, the counter is incremented from 01b to 10b.
Finally, the global history goes from 00b to 01b since the branch is taken (a 1 is shifted in from the right).
Note that the yellow highlighted entries are those read when the corresponding instruction is executed, while the green ones are those updated by the previous prediction.
Thus to see that the counter value has been incremented you have to look at the next row.
Since the branch was taken, the CPU is now at D (BEQZ R1, F). This is exactly the same as before, except that the global history register now has the value 01b.
After this branch is executed (taken again, since R1 is still 0), the CPU is at F, so R1 becomes 111..11b (the solution just writes it as 1) and the two branches above are re-executed.
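If you want to check the rest of the table, here is a small Python sketch that simulates the whole scheme for this exercise (the variable names and the number of iterations simulated are my own choices, not part of the solution):

def update_counter(state, taken):
    # 2-bit saturating counter, clamped to [0, 3]
    return min(state + 1, 3) if taken else max(state - 1, 0)

# Pattern history table: one row per branch (A, D), one column per value of
# the 2-bit global history; every counter starts at 01b (weakly not taken)
pht = {addr: [1, 1, 1, 1] for addr in "AD"}
history = 0b00   # global shift register, 00b at reset
r1 = 0           # R1 is 0 at the beginning

for iteration in range(4):
    for addr in "AD":                        # the two branches, in program order
        taken = (r1 == 0)                    # BEQZ is taken when R1 == 0
        state = pht[addr][history]
        prediction = state >= 2              # 2 or 3 => predict taken
        print(iteration, addr, format(history, "02b"), format(state, "02b"),
              "predicted", "T" if prediction else "NT",
              "actual", "T" if taken else "NT",
              "OK" if prediction == taken else "MISPREDICT")
        pht[addr][history] = update_counter(state, taken)  # update the cell
        history = ((history << 1) | int(taken)) & 0b11     # shift in the outcome
    r1 = 0 if r1 else 1                      # F: NOT R1, R1 flips R1 every pass

In this simulation only the two branches of the very first pass are mispredicted; after that, the global history distinguishes the R1 = 0 passes from the R1 != 0 passes and every prediction is correct.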
¹ This is a simplification; the table is almost always a cache. It's impractical to have an entry for every possible memory address where a branch can be found.
² Part of the address is used as the index into the cache; once a set has been selected, the address is compared again with the tag in each way of the set.
So I have been trying to find out why, on winmips64, there are 2-cycle stalls each time the branch target buffer mispredicts, but I got nothing. It mispredicts the first time and the last time the bne is running. The first time it says it's 2 cycles of branch-taken stall, and on the last it says 2 cycles of branch-misprediction stall. Any ideas? (R12 is mentioned in another part of the code.)
lw R4,0(R3)
lw R8,0(R2)
dmul R8,R8,R4
daddi R3,R3,-8
daddi R11,R2,-8
dadd R9,R9,R8
daddi R2,R2,-8
bne R11,R12,loop
I don't know the winmips64 architecture specifically, i.e. how it is or isn't different from other MIPS pipelined implementations. So, if someone else knows specifics, please correct me if I'm wrong.
That both the taken branch and the mispredict cost 2 cycles is consistent with a standard 5-stage pipeline, where the branch decision (taken/not taken) is fully resolved in the EX stage and the instructions in the prior stages, IF and ID, therefore have to be flushed when it's wrong.
The first time the branch executes, it is unknown to the branch predictor, and the processor appears to move forward assuming not taken; hence the message about the branch-taken stall.
The last time the branch executes, it is known to the branch predictor; however, the branch is not taken, which doesn't match the prediction, since the predictor assumes it will keep looping back as it has just been doing. Hence the message about misprediction.
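As a rough illustration only (this is my own sketch of a generic "predict taken if the branch is in the BTB" policy, not a description of the winmips64 internals), a simple loop model produces exactly those two mispredictions:

def simulate_loop(iterations, stall_cycles=2):
    # Hypothetical model: a branch is predicted taken once it is in the BTB,
    # and predicted not taken (fall through) when it has never been seen
    btb = set()
    stalls = 0
    for i in range(iterations):
        actually_taken = i < iterations - 1     # the loop branch falls through on the last pass
        predicted_taken = "loop_branch" in btb
        if predicted_taken != actually_taken:
            stalls += stall_cycles              # flush the younger stages and refetch
        if actually_taken:
            btb.add("loop_branch")
    return stalls

# First pass: not in the BTB, predicted not taken but taken -> 2-cycle stall
# Last pass:  in the BTB, predicted taken but falls through -> 2-cycle stall
print(simulate_loop(10))   # 4 stall cycles in total for a 10-iteration loop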
3. Mr. Noob is pondering this strange line of MIPS in his program:
beq $t3, $t9, 0 #0 is the immediate value
Which of the following statements is TRUE?
A. That instruction is an “infinite loop”.
B. That instruction can be removed from the program with NO impact on execution
result.
C. That instruction is equivalent to a branch‐lesser‐or‐equal (ble).
D. That instruction jumps to the instruction at instruction address 0.
E. None of the above.
Can I ask why the answer is B? Thank you.
The beq instruction is an I-Type, which means that part of the instruction encodes a 16-bit signed immediate. The instruction is a conditional branch. The immediate for this instruction is used as the taken branch target. If the condition is:
false, then pc := pc + 4 is the operation it performs (fall through to next instruction). Since the branch is not taken, this advances the pc in sequential manner just like any other instruction, such as add.
true, then branch is taken and the operation is that pc := pc + 4 + sxt32(imm16)*4, which transfers control to the target, usually a label in assembly language. However, since the immediate is zero, this equation evaluates to pc := pc + 4 + sxt32(0)*4, which is pc := pc + 4 + 0 -or- pc := pc + 4.
Thus, whether condition is true or false, whether the branch is taken or not taken, it has the same effect of merely advancing the pc by 1 instruction.
Using labels instead of an immediate:
beq $t3, $t9, next
next:
...
This will also produce an immediate value of 0 in the machine code for the beq.
If the immediate for a taken beq is -1, then it branches to itself, which will cause an infinite loop. If the immediate is -2, then it branches backwards by 1 instruction; if the immediate is 0, it simply goes on to the next instruction; if the immediate is 1, the branch skips one instruction.
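To make the arithmetic concrete, here is a small Python sketch of the taken-branch target calculation from above (sxt32 is written out as a helper; the names are mine):

def sxt32(imm16):
    # Sign-extend a 16-bit immediate (given as an unsigned bit pattern)
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def taken_target(pc, imm16):
    # pc := pc + 4 + sxt32(imm16) * 4 for a taken MIPS branch
    return pc + 4 + sxt32(imm16) * 4

pc = 0x00400020                        # arbitrary example address
print(hex(taken_target(pc, 0x0000)))   # imm  0 -> pc + 4, same as falling through
print(hex(taken_target(pc, 0xFFFF)))   # imm -1 -> pc, the branch targets itself
print(hex(taken_target(pc, 0xFFFE)))   # imm -2 -> pc - 4, one instruction back
print(hex(taken_target(pc, 0x0001)))   # imm  1 -> pc + 8, skips one instruction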
Removing the instruction will reduce code size and could improve performance but will not otherwise affect the logic of the program — assuming the program does not somehow depend on code size or position.
Let's note that code using branches with raw immediates tends to be dependent upon position, and if you remove an instruction, you may have to fix up these branches. That's why we use labels instead: to get the assembler to compute the proper immediate, so we're free to add and remove instructions and just reassemble.
I am working on a pipeline with 6 stages: F D I X0 X1 W. I am asked how many instructions need to be killed when a branch mispredict happens.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like it would require killing 4 instructions that are in the process of flowing through the pipeline. Is that correct?
I am also asked how many need to be killed if the pipeline is a three-wide superscalar. This one I am not sure on. I think that it would be 12 because you can fetch 3 instructions at a time. Is that correct?
kill all the instructions that came after the branch
Not if this is a real MIPS. MIPS has one branch-delay slot: The instruction after a branch always executes whether the branch is taken or not. (jal's return address is the end of the delay slot so it doesn't execute twice.)
This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It managed that 1 cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the 2nd half of a clock cycle. This is why MIPS branch conditions are all "simple" (don't need carry propagation through the whole word), like beq between two registers but only one-operand bgez / bltz against an implicit 0 for signed 2's complement comparisons. That only has to check the sign bit.
If your pipeline was well-designed, you'd expect it to resolve branches after X0 because the MIPS ISA is already limited to make low-latency branch decision easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, defeating the purpose of making it run MIPS code instead of RISC-V or whatever other RISC instruction set.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch.
I think 4 instructions looks right for a generic scalar pipeline without a branch delay slot.
At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages, waiting to move to the next stage on that clock edge. (Assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.
(Unless there was an I-cache miss fetching the delay slot instruction, in which case the delay slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest previous instruction in the pipeline. Delay slots are not free to implement, also complicating exception handling.)
So 0..3 instructions need to be killed in pipeline stages from F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. I-cache miss latency long enough that it's either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on X0 being a bubble or not.)
I think that it would be 12 because you can fetch 3 instructions at a time
No. Remember the branch itself is one of a group of 3 instructions that can go through the pipeline. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.
The worst case is I think when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Then (assuming no bubbles) you're cancelling 13 (or 14) instructions, 3 in each previous stage.
The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).
So for a 3-wide version of this pipeline with no delay slot, depending on bubbles in previous pipeline stages, you're killing 0..14 instructions that are in the pipeline already.
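A quick back-of-the-envelope way to reproduce those numbers (my own sketch; it just counts pipeline slots, assuming no bubbles and a branch resolved at the end of X1):

def kills_on_mispredict(width, branch_pos_in_group=0, delay_slot=False,
                        stages_before_resolve=4):
    # Younger instructions in the branch's own fetch group (in X1 with it)
    same_group = width - 1 - branch_pos_in_group
    # Everything sitting in F, D, I and X0 (width instructions per stage)
    earlier_stages = width * stages_before_resolve
    kills = same_group + earlier_stages
    if delay_slot:
        kills -= 1          # the delay-slot instruction is kept, not killed
    return kills

print(kills_on_mispredict(width=1))                          # scalar, no delay slot: 4
print(kills_on_mispredict(width=3, branch_pos_in_group=0))   # branch oldest in group: 14
print(kills_on_mispredict(width=3, branch_pos_in_group=2))   # branch youngest in group: 12
print(kills_on_mispredict(width=3, branch_pos_in_group=0, delay_slot=True))  # 13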
Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.
I am currently studying for my Computer Architecture exam and came across a question that asks me to illustrate (bit by bit, I would assume) the values contained in the MIPS pipeline after the 3rd stage of the sub (before the clock edge), given the following instructions.
add $t0,$t1,$t2
sub $t3,$t3,$t5
beq $t6,$t0,16
add $t0,$t1,$t3
I am not asking for the solution to this problem; however, after some research I haven't had much success wrapping my mind around it, so I am asking for some help/advice.
Firstly, I still don't have a clear understanding of the size of the pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB). I do understand that they contain the control unit codes for the next stages and the result of the previous stage so that it can be passed on to the next one.
So that would be (please correct me if I'm wrong) +9 bits for ID/EX, +5 for EX/MEM and +2 for MEM/WB, but I haven't managed to find a clear schema of the data we can expect these registers to contain.
Also, I figure that we would need to use hardware forwarding to forward the result of the first add to beq (because of $t0) and the result of sub to the last add (because of $t3). Does this factor into what is contained in the registers?
It would be great if someone could point me in the right direction.
Thanks lots.
The purpose of each of these intermediate registers is to hold data that might be needed in the immediate next stage or in later stages. I'll discuss one possible design, but there are really many possible designs as I'll explain.
In the fetch stage, the next instruction to be executed (the one the current PC points to) is fetched from memory and the PC is updated to point to the next instruction to fetch. Therefore, IF/ID would include one 4-byte field to hold the fetched instruction. There are two ways to calculate the new PC: current PC + 4, or PC + 4 + offset in the case of a branch. If the fetched instruction is itself a branch instruction, then we need to pass the new PC along so that the branch target address can be calculated in the EX stage. We can add a 4-byte field in IF/ID to hold the new PC value to be passed to the EX stage through the ID stage.
In the decode stage, the opcode and its operands are determined. The opcode is at a fixed location in the instruction in MIPS. A MIPS instruction may operate on a single source register, two source registers, one source register and a sign-extended 32-bit immediate value, a sign-extended 32-bit immediate value alone, or no operands. We can either prepare only the required operands for the EX stage based on the opcode, or prepare all the operands that might be required for any opcode. The latter design is simpler, but it requires a larger ID/EX register. In particular, two 4-byte fields are required to hold the two possible source register values (the values are read from the register file in the decode stage) and a 4-byte field for the possible sign-extended immediate value. No opcode will require all of these fields, but let's prepare all of them anyway and store them at fixed locations in the ID/EX register. It simplifies the design.
We also need to pass the new PC value calculated in the fetch stage to the execute stage, just in case the opcode turns out to be a branch. The branch target address is calculated relative to the current PC value (the PC of the instruction following the branch in static program order). There are two possible designs here: either add a bus from the new PC field in IF/ID to the EX stage, or add a field in ID/EX to hold the new PC value, which can then be accessed in the EX stage. The latter design adds a 4-byte field to ID/EX.
The EX stage requires the opcode from the ID stage. We could choose to pass only the opcode rather than the whole instruction, but then later stages might require other parts of the instruction. Generally, in RISC pipelines, it is preferable to make the whole instruction available to all stages. In this way, all parts of an instruction are already available when changes are made to any stage of the pipeline in the future. So let's add a 4-byte field to ID/EX to hold the instruction.
The EX stage reads the operands and the opcode from the ID/EX register (the opcode is part of the instruction) and performs the operation specified by the opcode. The EX/MEM register has to be big enough to hold all possible results, which might include the following: a 4-byte value computed by the ALU resulting from an arithmetic or logic operation, a 4-byte value representing the calculated effective address for a memory load or store operation, a 4-byte value representing the branch target address in case of a branch instruction, and a 1-bit condition in case of a conditional branch instruction. We can use a single 4-byte field in EX/MEM for the result (whatever it represents) and a 1-bit field for the condition. In addition, as before, we need a 4-byte field to hold the instruction. Also for store instructions, we need another 4-byte field to hold the value to be stored. One possible alternative design here is that rather than storing the 1-bit condition and 4-byte branch target address in EX/MEM, they can be passed directly to the IF stage.
In the MEM stage, in the case of a branch instruction, the branch target address and the branch condition are passed back from EX/MEM to the IF stage to determine the new PC. In the case of a memory store operation, the operation is performed and there is no result to be passed to any stage. In the case of a memory load operation, the 4-byte value is fetched from memory and stored in a field of the MEM/WB register. In the case of an ALU operation, the 4-byte result is just passed to a field of the MEM/WB register. In addition, as before, we need a 4-byte field in MEM/WB to hold the instruction.
Finally, in the WB stage, the 4-byte result whether loaded from memory or computed by the ALU is stored in the destination register. This only occurs for instructions that produce results. Otherwise, the WB stage can be skipped.
In summary, in the design I've discussed, the sizes of the intermediate registers are as follows: IF/ID is 8 bytes, ID/EX is 20 bytes, EX/MEM is 12 bytes plus 1 bit, and MEM/WB is 8 bytes.
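As a sketch of the particular design described above (the field names are mine and this is only one of many possible layouts, not necessarily the one your course expects), the four intermediate registers could be written down like this:

from dataclasses import dataclass

@dataclass
class IF_ID:              # 8 bytes
    instruction: int      # the fetched 32-bit instruction
    next_pc: int          # PC + 4, kept for the branch target calculation

@dataclass
class ID_EX:              # 20 bytes
    instruction: int      # whole instruction, passed along for later stages
    next_pc: int          # PC + 4, forwarded from IF/ID
    rs_value: int         # first source register value read in ID
    rt_value: int         # second source register value read in ID
    imm32: int            # 16-bit immediate sign-extended to 32 bits

@dataclass
class EX_MEM:             # 12 bytes + 1 bit
    instruction: int      # whole instruction
    result: int           # ALU result, effective address, or branch target
    store_value: int      # register value to be written to memory on a store
    branch_taken: bool    # the 1-bit branch condition

@dataclass
class MEM_WB:             # 8 bytes
    instruction: int      # whole instruction
    result: int           # loaded value or ALU result, to be written back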
The design decision of whether a field is required in an intermediate register to hold some value, or whether it can be passed directly in the same stage to the logic that requires the value, is a "circuit-level" decision. If the signals can be guaranteed not to be corrupted, and if it is feasible or convenient to add a dedicated bus, they can be directly connected.
I am trying to understand how the verilog branch statement works in an immediate instruction format for the MIPS processor. I am having trouble understanding what the following Verilog code does:
IR is the instruction so IR[31:26] would give the opcode.
reg [31:0] BT = PC + 4 + { {14{IR[15]}}, IR[15:0], 2'b0 };
I can see bits and pieces, such as that we are updating the program counter and that we are taking the low 16 bits of the instruction to get the immediate offset. Then we need a 32-bit word, so we extend with 16 more zeros.
Why is it PC + 4 instead of just PC?
What is 2'b0?
I have read something about sign extension but don't quite understand what is going on here.
Thanks for all the help!
1: Branch offsets in MIPS are calculated relative to the next instruction (since the instruction after the branch is also executed, as the branch delay slot). Thus, we have to use PC + 4 as the base for the address calculation.
2: MIPS uses byte addressing (every byte in memory has a unique address) but 32-bit (4-byte) instruction words, and the specification requires each instruction to be word-aligned, so the last two bits of any instruction address are always 00 (the address points at the bottom byte of the instruction). The branch offset is therefore stored as a count of instructions, and the hardware appends two zero bits (the 2'b0) to turn it back into a byte offset.
In full, the expression calculates the branch target address by taking the program counter, adding 4 to account for the branch delay slot, and then adding the sign-extended offset shifted left by two; the sign extension (replicating the top bit, which is what {14{IR[15]}} does) is needed because the branch offset can be either positive or negative.
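If it helps, here is a small Python sketch of what the concatenation does bit by bit (the names are mine; it just mirrors { {14{IR[15]}}, IR[15:0], 2'b0 } and adds it to PC + 4):

def branch_offset(ir):
    # Mirror { {14{IR[15]}}, IR[15:0], 2'b0 } for a 32-bit instruction word
    imm16 = ir & 0xFFFF                          # IR[15:0]
    sign = (ir >> 15) & 1                        # IR[15]
    replicated = (0x3FFF if sign else 0) << 18   # 14 copies of the sign bit
    return replicated | (imm16 << 2)             # appending 2'b0 = shifting left by 2

def branch_target(pc, ir):
    return (pc + 4 + branch_offset(ir)) & 0xFFFFFFFF

# Example: an instruction whose low 16 bits are 0xFFFE (offset -2 instructions);
# the target is PC + 4 - 8, i.e. the instruction just before the branch.
print(hex(branch_target(0x00400020, 0x1000FFFE)))   # 0x40001c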
Numbers in Verilog can be written in the form <number of bits>'<base><value>:
2'b11; // 2 bit binary
3'd3 ; // 3 bit decimal
4'ha ; // 4 bit hex
The base describes how the following number is written; the resulting bit pattern is not changed by it, i.e. 2'b11 is identical to 2'd3.