MIPS range of jump instruction

I am reading the book 'Computer Organization and Design' by Patterson and Hennessy and got interested in MIPS.
I'm having trouble working out the range of a branch/jump instruction, and also how many branch/jump instructions are needed to reach a specific address.
Can someone explain how this is calculated, i.e. given the PC at a specific address, how do I find the number of branch/jump instructions needed to get to a different address? For example, if the PC is at 0x10001010, what is the range of addresses reachable by branch and jump instructions?
Or can you direct me to some online resource or book which would help me get a better understanding of this?

The following is all for MIPS-32.
Branch instructions (B, BEQ, BNE, etc.) have a 16-bit signed word offset field, so they can reach roughly +/-128 KB around the instruction following the branch (the offset is relative to PC+4, the address of the delay slot). For the example above, a branch at 0x10001010 can reach targets from 0x10001014 - 0x20000 = 0x0FFE1014 up to 0x10001014 + 0x1FFFC = 0x10021010. A jump (J) instruction is not PC-relative: its 26-bit field is shifted left by 2 and combined with the most significant 4 bits of PC+4, so it can reach any word-aligned address in the current 256 MB region (0x10000000 through 0x1FFFFFFF in the example). To reach an arbitrary address anywhere in the 4 GB address space, use JR (jump register), which jumps to the address contained in a general purpose register.
Depending on how far away the target is, it therefore takes either a single branch or jump instruction, or a register load (typically LUI + ORI) followed by a JR, to get to an arbitrary address.
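As a concrete sketch of the three cases (written for a MARS/SPIM-style simulator; the labels, registers and exit syscall are just illustrative scaffolding, not part of the original question):
    .text
    .globl main
    main:
        beq   $t0, $t1, near      # 16-bit signed word offset: target within ~ +/-128 KB of PC+4
        nop                       # (branch delay slot)
        j     near                # 26-bit word index: target in the same 256 MB region as PC+4
        nop
        la    $t2, near           # pseudo-instruction; the assembler expands it to LUI + ORI
        jr    $t2                 # register jump: any word-aligned address in the 4 GB space
        nop
    near:
        li    $v0, 10             # exit (MARS/SPIM convention)
        syscall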
The best book for MIPS programming is still See MIPS Run. You can also find MIPS architecture reference manuals at mips.com (registration required). The most relevant document is MIPS32® Architecture for Programmers Volume II: The MIPS32® Instruction Set.

Related

What do the TargetWrite/IorD Control Lines do on a multicycle MIPS processor

We learned all the main details about the control lines and the general functionality of the MIPS chip in the single-cycle design and also with pipelining.
But in the multicycle design the control lines aren't identical, in addition to other changes.
Specifically, what do the TargetWrite (ALUout) and IorD control lines actually modify?
Based on my analysis, TW seems to modify where the PC points, depending on the bits it receives (for a jump, a branch, or the normal step to the next instruction)... Am I missing something?
Also what exactly does the IorD line do?
I looked at both course textbooks, See MIPS Run and Computer Architecture: A Quantitative Approach by Hennessy and Patterson, which don't seem to mention these lines...
First, let's note that this block diagram does not have separate instruction memory and data memory.  That means that it either has a unified cache or goes directly to memory.  Most other block diagrams for MIPS will have separate dedicated Instruction Memory (cache) and Data Memory (cache).  The advantage of separate memories is that the processor can read instructions and read/write data in parallel.  In a simple version of a multicycle processor there is likely no need to read instructions and data in parallel, so a unified cache simplifies the hardware.
So, what IorD is doing is selecting the source of the address provided to the Memory: the PC when fetching an instruction, or the ALU when reading/writing data.
When IorD=0 then the PC provides the address from which to read (i.e. instruction fetch), and, when IorD=1 then the ALU provides the address to read/write data from.  For data operations, the ALU is computing a base + displacement addressing mode: Reg[rs] + SignExt32(imm16) as the effective address to use for the data read or write operation.
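For example, in a load or store (the only operations that would use IorD=1 here), the fields feeding the ALU in that cycle are the base register rs and the 16-bit displacement. A couple of hypothetical accesses, just to make the address arithmetic concrete (the register names and data are made up):
    .data
    buf:    .word  1, 2, 3        # three words of sample data
    .text
    .globl main
    main:
        la    $t1, buf            # $t1 holds the base address (this is rs)
        lw    $t0, 8($t1)         # IorD=1 cycle: address = Reg[rs] + SignExt32(8) -> buf[2]
        sw    $t0, 4($t1)         # IorD=1 cycle: address = Reg[rs] + SignExt32(4) -> buf[1]
        li    $v0, 10             # exit (MARS/SPIM)
        syscall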
Further, let's note that this block diagram does not contain a separate adder for incrementing the PC by 4, whereas most other block diagrams do.  Look up any of the first few MIPS single-cycle datapath images, and you'll see the dedicated adder for that PC increment.  Using a dedicated adder allows the PC to be incremented in parallel with operations done by the ALU, whereas omitting it means that the main ALU must perform the increment of the PC.  However, omitting it probably saves transistors in a simple multicycle implementation, where the ALU is not in use every cycle and so can be used for the PC increment.
Since Target has a TargetWrite control, we might presume it is an internal register used to buffer the intended branch target address, for example if the branch target is computed in one cycle and finally used in another.
(I thought this could be about buffering for branch delay slot implementation (since those branches are delayed one instruction), but were that the case, the J-Type instructions would have also gone through Target, and they don't.)
So, it looks to me like the Target machinery in this multicycle processor is there to handle a branch instruction, say beq, which has to:
compute the next sequential PC address from PC + 4
compute the branch target address from (PC+4) + SignExt32(imm16) (the immediate is a word offset, so it is shifted left by 2 before the add)
compute the branch condition (does Reg[rs] == Reg[rt] ?)
But in what order would they be computed?  It is clear from the control signals in state 0 that PC+4 is computed first, and written back to the PC, for all instructions (i.e. for branches, whether the branch is taken or not).
It seems to me that in a next cycle, (PC+4) + SignExt32(imm16) is computed (reusing the prior PC+4, which is now in the PC register), and the result is stored in Target to buffer that value, since the machine doesn't yet know whether the branch is taken or not.  In a next cycle, the contents of rs and rt are compared for equality; if they are equal the branch should be taken, so PCSource=1 and PCWrite=1 select the buffered Target to update the PC, and if not taken, since the PC has already been updated to PC+4, that PC+4 stands (PCWrite=0, PCSource=don't care) as the start of the next instruction.  In either case the next instruction runs from whatever address the PC holds.
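Compressed into per-cycle annotations (this just restates the sequence hypothesized above; the control-signal names follow the block diagram under discussion, and the exact cycle numbering is an assumption rather than something the diagram states):
    .text
    .globl main
    main:
        beq   $t0, $t1, taken_target
        # cycle 1 (fetch):   IorD=0, read the instruction at PC; ALU computes PC+4; PCWrite=1
        # cycle 2 (decode):  ALU computes (PC+4) + SignExt32(imm16); TargetWrite=1 buffers it
        # cycle 3 (execute): ALU compares Reg[rs] and Reg[rt];
        #                    taken:     PCSource=1, PCWrite=1  (PC <- Target)
        #                    not taken: PCWrite=0              (PC already holds PC+4)
    taken_target:
        li    $v0, 10             # exit (MARS/SPIM)
        syscall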
Alternatively, since the processor is multicycle, the order of computation could be: compute PC+4 and store it into the PC; compute the branch condition and decide what kind of cycle to run next, namely, for the not-taken condition, go right to the next instruction fetch cycle (with PC+4 in the PC), or, for the taken condition, compute (PC+4) + SignExt32(imm16), put that into the PC, and then go on to the next instruction fetch cycle.
This alternative approach would require dynamically altering the cycles/states used for branches, so it would complicate the multicycle state machine somewhat, though it would not require buffering a branch Target; so I think the former is more likely than this alternative.

MIPS Instruction Set Latency (in cycles) and Pseudo-Instructions

I've been looking for a reference table with the latencies (in clock cycles) of the whole MIPS instruction set and couldn't find one anywhere, just a few instructions here and there, like "add: 1 cycle, load word: 5 cycles, ...". Does anyone have a reference for these instructions (in MIPS) with their latencies?
Also, I'm searching for pseudo-instruction translations and their real implementations! I've found a bit of this in the book "Computer Organization and Design" by Patterson and Hennessy.
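For the pseudo-instruction half, these are the kinds of expansions a MARS/SPIM-style assembler typically produces (a sketch only: exact sequences are assembler-dependent and vary with the size of the constant, and the register names and the var/done labels are made up for illustration):
    .data
    var:    .word  0
    .text
    .globl main
    main:
        # pseudo-instruction              one common expansion
        move  $t0, $t1                    # addu $t0, $zero, $t1
        li    $t2, 0x12345678             # lui  $t2, 0x1234 ; ori $t2, $t2, 0x5678
        la    $t3, var                    # lui  $t3, hi(var) ; ori $t3, $t3, lo(var)
        not   $t4, $t0                    # nor  $t4, $t0, $zero
        blt   $t0, $t2, done              # slt  $at, $t0, $t2 ; bne $at, $zero, done
    done:
        li    $v0, 10                     # exit (MARS/SPIM)
        syscall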

In which pipeline stage is the branch decision made?

In which RISC pipeline stage is the branch decision made? Is it in the "Decode" or "Execute" stage, or some other stage? Assume the pipeline has 5 stages: "IF", "ID", "EX", "MEM" and "WB".
There are a few ways to implement this in a classic 5-stage RISC in general. For unconditional direct (not register) branches, obviously you can detect them in ID and have the target PC ready for the next IF cycle (with 1 cycle of branch latency, i.e. 1 wasted IF cycle if you don't hide that latency somehow, e.g. MIPS's branch delay slot or branch prediction).
Some toy pipelines, like the one described in this answer, do the simplest thing and evaluate branches in the ALU in EX, forwarding the result to a mux between PC+4 and PC+4+rel_offset and eventually on to IF, with 3 cycles of branch latency (end of EX to start of IF).
Actual commercial MIPS I (R2000) evaluated branch conditions in the first half-cycle of EX, forwarding to IF which only needed an address in the second half-cycle. See How does MIPS I handle branching on the previous ALU instruction without stalling? This gives a branch latency of 1 cycle, short enough to be fully hidden by 1 branch-delay slot, even for conditional or indirect jr $reg branches.
This half-cycle speed is why MIPS branch conditions are simple: either a two-register equality/inequality test (which can be done by XORing and checking the result for all-zero) or a sign/zero check of a single register against zero. RISCs with a FLAGS / status register (like PowerPC or ARM) could use a similar strategy of very quickly checking a flags condition.
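Concretely, the conditional branches MIPS I provides are only these forms (register names and the target label are placeholders, and the wrapper exists only so the fragment assembles in a MARS/SPIM-style simulator):
    .text
    .globl main
    main:
        beq   $t0, $t1, target    # two-register equality (XOR, then check for all-zero)
        bne   $t0, $t1, target    # two-register inequality
        bltz  $t0, target         # single register vs. zero: sign bit set
        bgez  $t0, target         # sign bit clear
        blez  $t0, target         # sign bit set, or value all-zero
        bgtz  $t0, target         # sign bit clear, and value not all-zero
        # there is no two-register blt/bgt in hardware; the assembler synthesizes
        # them from slt + bne/beq, as in the slt/bnez sequence discussed below
    target:
        li    $v0, 10             # exit (MARS/SPIM)
        syscall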
(Note that RISC-V allows a full set of branch conditions; as described in RISC-V's design rationale, checking a whole register for all-zeros in modern CMOS designs is apparently not much shorter gate-delay than comparing two registers for equality or even > or < with a good comparator, presumably something smarter than subtract with ripple-carry.
RISC-V assumes branch-prediction will hide branch delays.)
The previous version of this answer incorrectly claimed that MIPS I evaluated branch conditions in ID itself. A toy pipeline in this question does that, but that would require the inputs to be ready earlier than usual. It introduces the problem of a b?? instruction stalling while waiting for the EX result of the previous ALU instruction, as in common sequences like slt $at, $t1, $t2 / bnez $at, target, i.e. the expansion of a pseudo-instruction like blt $t1, $t2, target.
Wikipedia's Classic RISC (5-stage pipeline) article's Instruction Decode section was misleading at best, but has been fixed. It now says "The branch condition is computed in the following cycle (after the register file is read)" - I think that was a bugfix, not just clarification: this is all described in the ID section, implying it happened there without explicit phrasing to the contrary. Also, the still-present claim that "Some architectures made use of the Arithmetic logic unit (ALU) in the Execute stage, at the cost of slightly decreased instruction throughput." makes no sense if it wasn't talking about evaluating them earlier, since nothing else could be using the ALU during that time in a scalar in-order pipeline.
Other sources (like these slides: http://home.deib.polimi.it/santambr/dida/phd/wonderland/2014/doc/PDF/4_BranchHazard_StaticPrediction_V0.pdf) say "Branch Outcome and Branch Target Address are ready at the end of the EX stage (3th stage)" for a classic MIPS beq instruction. That's not how the commercial R2000 worked, but it may be describing a simple MIPS implementation from a textbook or course material that does work that way.
Much discussion of MIPS is actually about hypothetical MIPS-like 5-stage RISC pipelines in general, not real MIPS R2000, or the classic Stanford MIPS CPU that R2000 was based on (but it was a full re-design). So it's hard to know whether something you find about "MIPS" applies to R2000 (gcc -march=mips1) or if it's for a simplified teaching version of MIPS.
Some "MIPS" implementations aren't even the same ISA, e.g. without branch-delay slots (which complicate exception handling significantly).
This originally wasn't a MIPS question at all, just generic classic 5-stage RISC. There were multiple early RISC ISAs, many of them originally designed around a 5-stage pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline). I don't know a lot about their internals:
Different architectures could make different choices, e.g. stall or use branch prediction + speculative fetch/decode if needed while they wait for the branch result to be ready from whatever stage produces it.
And even speculative execution is possible, even with a static prediction like forward not-taken / backward taken. If still in-order, mis-speculation can be caught before it reaches write-back or MEM. You don't want any speculative stores written to cache, but you can definitely catch it by the time the branch reaches EX. All instructions which have a control dependency on the branch are younger and therefore are in earlier pipeline stages (if present at all; IF could have missed in I-cache).

Query about MIPS R3051 pipeline behaviour (MIPS-I architecture)

I am currently implementing a MIPS R3051 in software as part of my university project.
I notice in the programmer's manual from IDT it specifies that computational instructions can access the results of other computational instructions ahead of them in the pipeline at their RD stage, even though the instruction ahead has not yet committed its result to the relevant register in the WB stage. This is done via "special logic within the execution engine" to prevent a stall being necessary.
My query is does this also apply to non-computational instructions (like a jump-type instruction for example)?
An example: if an ADD instruction calculates a value at its ALU stage destined for r1, with a JR [r1] instruction behind it in the pipeline at RD, will the JR instruction get:
(a) the old contents of r1
or
(b) will this "special logic" allow the new value of r1 to be forwarded to it? or
(c) will the pipeline stall until r1 has been committed properly at WB?
Apologies if this is asked elsewhere (I have not spotted it). Many thanks.
Regards,
Phil
The key here is to keep well in mind that this "special logic" is only an optimization: it makes things faster, here bypassing something so as to avoid a stall, but it must still ensure that the result is unchanged. Otherwise it would be impossible, or at least too difficult, to program with this hardware.
So, to answer your question, you will see either case (b) or (c) but never case (a).
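Illustrating with the sequence from the question (the registers are the usual $-register spellings standing in for r1/r2/r3, and the la/li setup plus exit syscall are only there so the fragment runs in a MARS/SPIM-style simulator):
    .text
    .globl main
    main:
        la    $t2, done           # $t2 = a valid code address (stands in for r2)
        li    $t3, 0              # stands in for r3
        add   $t1, $t2, $t3       # new value of "r1" produced in the ALU stage
        jr    $t1                 # consumes "r1" at its RD stage: the hardware either
                                  # forwards the just-computed value (case b) or stalls
                                  # until it is available (case c); it never sees the
                                  # stale register contents (case a)
    done:
        li    $v0, 10             # exit (MARS/SPIM)
        syscall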

Instruction Execution in MIPS

This is an abstract view of the implementation of the MIPS subset, showing the major functional units and the major connections between them.
Why do we need to add the result of (PC+4) to the instruction address?
I know that the PC (Program Counter) is a register in a computer processor that contains the address (location) of the instruction being executed at the current time, but I didn't understand why we need the second adder in this picture.
Some of the operations that can be performed by the CPU are 'jumps'.
If your operation is a jump, the second block gives you either the address of the new instruction OR the length (offset) of the jump you have to make.
It's not the instruction address; the output of the instruction memory is the instruction itself.
They've obviously hidden most of the components (there's NO control circuitry). What they probably meant to show is the datapath for branches, though they really should have put at least the link with the ALU output in there. Even so, it would be better to explicitly decode the instruction, sign-extend and shift left. So it's really inaccurate, but I don't see what else they could mean.
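To make the role of that second adder concrete, here is a minimal branch with the two address computations written out as comments (the 0x00400000 starting address is just the usual MARS/SPIM text-segment base, an assumption rather than something from the original figure):
    .text
    .globl main
    main:
        beq   $t0, $t1, skip      # suppose this beq sits at address 0x00400000
        nop                       # first adder:  PC + 4 = 0x00400004
        nop                       # second adder: (PC + 4) + (SignExt(imm16) << 2)
    skip:                         #             = 0x00400004 + (2 << 2) = 0x0040000C
        li    $v0, 10             # exit (MARS/SPIM)
        syscall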