Why doesn't MIPS take additional function arguments in $v0 and $v1?

According to the MIPS documentation, a function's output is returned in $v0-$v1 (up to 64 bits), and function arguments are passed in $a0-$a3, with any additional arguments written to the stack.
Since the function is allowed to overwrite the values of $v0-$v1, wouldn't it be better to pass the function's fifth argument (if one exists) in $v0?
What is the motivation for using the stack in this case?

You are right that the $v registers are available to be used to pass parameters.
MIPS has, at times, updated the calling convention. For example, the "MIPS EABI 32-bit Calling Convention" redefines 4 of the original $t registers, $8-$11, as additional argument registers, so that up to 8 integer arguments in total can be passed in registers.
We might also consider that $at (aka $1, the assembler temporary) is also available at that point for parameter passing.
However, object model invocations, e.g. those involving vtables, thunks and other stubs such as long calls, perhaps cross library (DLL) calls, can require an available register or two that are scratch, so it would not necessarily be best to use every one of the scratch registers for arguments.
Discussion
In general, other than that, I'm not sure why they don't just get rid of most of the $t registers (and $v registers) and make them all $a registers: these would only be used when needed, and otherwise the unused argument registers would serve the same purpose as $t registers.  The more parameters, the fewer scratch registers (in both caller and callee), but I think that tradeoff can be made instead of guaranteeing some larger minimum number of scratch registers as current ABIs do.
Still, without some bare minimum number of scratch registers, you would sometimes end up using memory, spilling already computed arguments to memory in order to have free registers to compute the last couple of parameters, only to have to reload those spilled values back into registers.  If that were to happen, might as well have passed some of them in memory in the first place, especially since the callee may also have to store some of the arguments to memory anyway (e.g. the callee is not a leaf function, and parameters are needed after further calls).
8 argument registers is probably already on the tapering end of the curve of usefulness, so adding more beyond that probably has negligible returns on real code bases.
Also, a language can invent/define its own calling convention: these calling conventions are the standard for C language interoperability.  Even the C compiler can use custom calling conventions when it is certain that such language interoperability is not required, as we can also do in assembly when we know more details about function implementations (i.e. their internal register usages) than just the function signature.
A nicely collected set of details on various calling conventions:
https://www.dyncall.org/docs/manual/manualse11.html
Addendum:
Let's assume a machine with only 2 registers, call them A & B, and it uses both to pass parameters.  Let's say a first parameter is computed into A (using the B register as scratch if needed).  In computing the value of the 2nd parameter, for B, the code may run out of scratch registers, especially if the expression for that actual argument is complicated.  When out of registers, we spill something to memory, here let's say the already computed A.  Now the parameter for B can be computed with that extra register.  However, the A parameter value, now in memory, needs to be reloaded into the A register before the call.  Thus, this is worse than passing A in memory, because the caller has to do both a store and a load, whereas passing in memory means just the store.
Now add to that situation that the callee may have to store the parameter to memory as well (various possible reasons).  That means another store to memory.  So, in total, if the above scenario coincides with this one, then a store, a load and another store — contrasted with memory parameter passing, which would have just the one store by the caller.

Related

How to define, allocate, and also initialise the values of an array of user defined length

I am quite new to MIPS32 and am working on an assignment that requires me to first ask the user for the length of the array they would like to define, and then ask them what the respective values are. I have written a rough C code which does the same, which is as follows
#include <stdio.h>

int main(void)
{
    int N;
    // The first line of input tells us how many values will follow.
    if (scanf("%d", &N) != 1 || N <= 0)
        return 1;
    int i = 0, A[N]; // (1)
    // Read the values to be sorted into A[]; they are entered as a1, a2, ... and so on.
    while (i != N)
    {
        scanf("%d", &A[i]);
        i++;
    }
    return 0;
}
I am mainly having trouble with how to write the code above in MIPS32, especially line (1). I know that defining the size of the array in the data section itself is not an option, but I am unsure about how to dynamically define an array of size N and then also store values into the array. Any help or advice on what I can do would be really helpful.
Arrays can be stored in global, stack or heap memory.
Global memory
Global memory is essentially fixed-sized at program build time — you put a label in your .data and reserve some constant amount of space, using .space or other data directive.
One approach here is to have a maximum (say 100), so reserve space for that many, and program a limit test to make sure the code doesn't try to use more than the pre-defined maximum.
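The fixed-maximum approach can be sketched in C as follows. This is only an illustration: the names MAX_ELEMS and fill_array are made up for this example, and the cap of 100 mirrors the figure mentioned above (in MIPS it would correspond to reserving 400 bytes with .space).

```c
#include <stddef.h>

#define MAX_ELEMS 100            /* assumed cap, like reserving 400 bytes in .data */

static int A[MAX_ELEMS];         /* global storage, size fixed at build time */

/* Copies n values into the global array. Returns the count stored,
   or -1 if n is outside 0..MAX_ELEMS (the limit test). */
int fill_array(const int *values, int n)
{
    if (n < 0 || n > MAX_ELEMS)
        return -1;               /* refuse to write past the reserved space */
    for (int i = 0; i < n; i++)
        A[i] = values[i];
    return n;
}
```

The same shape carries over to assembly: compare the user's N against the constant maximum and branch to an error path before entering the input loop.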
As an exception, the last global data item can be used to store an array of relatively unknown size.  This happens to work in QtSpim and MARS, because a fair amount of space after the global data is there for us to use.  This approach is not very professional, since the code can't really know at what size this will no longer work, but it is a valid approach for sample toy programs and throwaway assignments.  Put a label at the end of your global data and reserve no space or just one word of space.
Integer element arrays have alignment requirements, so putting such global data after string data often requires explicit alignment (as a separate directive, or by reserving a word, e.g. with .word, which will inject alignment automatically).
Heap memory
Heap memory can be allocated using MARS/QtSpim syscall #9.  If the allocation fails, the size was too large, though if it succeeds you have all the space that was asked for.  The syscall #9 returns a pointer to the newly allocated memory in $v0, and you will generally want to store that value somewhere (register or global) for later use.
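In terms of the C code from the question, the heap approach looks roughly like the sketch below, with malloc standing in for syscall #9 (which takes the byte count in $a0). The helper name alloc_int_array is hypothetical; the point is only that the size is computed at run time as N times the element size, and that the returned pointer must be kept.

```c
#include <stdlib.h>

/* Allocates room for n ints on the heap.
   Returns NULL if n is not positive or the allocation fails. */
int *alloc_int_array(int n)
{
    if (n <= 0)
        return NULL;
    /* n elements of 4 bytes each on MIPS32; sizeof(int) keeps this portable */
    return malloc((size_t)n * sizeof(int));
}
```

The caller keeps the returned pointer (the analogue of saving $v0) and indexes from it, for example p[i] = value.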
Stack memory
The stack grows in the downward direction: stack memory can be allocated by decrementing the stack pointer.  The stack pointer — after a decrement — refers to the newly allocated memory.  You can decrement the stack pointer by a fixed amount or by a variable amount.  In your case, you would use a variable amount.  It is generally required that the stack pointer maintain alignment, so in computing the amount to decrement, we would round up.  If you need multiple entities, you can decrement the stack pointer multiple times, or, sum the sizes together and decrement once, which would be the more common approach.
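The rounding step can be sketched in C as below. It assumes 4-byte elements and an 8-byte stack alignment; the alignment value is an assumption for illustration, so use whatever your ABI or course material requires. The function name is made up for this example.

```c
#include <stdint.h>

/* Number of bytes to subtract from the stack pointer for n 4-byte ints,
   rounded up to the next multiple of 8 (assumed stack alignment). */
uint32_t stack_bytes_for_ints(uint32_t n)
{
    uint32_t bytes = n * 4u;       /* raw size of the array */
    return (bytes + 7u) & ~7u;     /* round up to a multiple of 8 */
}
```

In assembly this is a shift left by 2, an add of 7, and a mask with -8, followed by subtracting the result from $sp.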
Before (or as) a function returns to its caller, the stack pointer must be returned to the value it had upon function entry.  This releases any allocated stack memory and returns to the caller the same stack environment that it had when it made the function call.  It should stand to reason that it would be a logic error to return released memory to a caller, so this approach cannot be used within a function that needs to return an array to its caller.
Any function that uses syscall #10 to terminate the program does not have to honor this requirement, since the program terminates immediately upon that syscall.  This approach is often used to exit the main — MARS requires it, since it doesn't "call" the main, whereas QtSpim, by default, inserts a small startup stub that does "call" main.

Cuda _sync functions, how to handle unknown thread mask?

This question is about adapting to the change in semantics from lock step to independent program counters. Essentially, what can I change calls like int __all(int predicate); into for Volta?
For example, int __all_sync(unsigned mask, int predicate);
with semantics:
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
The docs assume that the caller knows which threads are active and can therefore populate mask accurately.
a mask must be passed that specifies the threads participating in the call
I don't know which threads are active. This is in a function that is inlined into various places in user code. That makes one of the following attractive:
__all_sync(UINT32_MAX, predicate);
__all_sync(__activemask(), predicate);
The first is analogous to a case declared illegal at https://forums.developer.nvidia.com/t/what-does-mask-mean-in-warp-shuffle-functions-shfl-sync/67697, quoting from there:
For example, this is illegal (will result in undefined behavior for warp 0):
if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);
The second choice, this time quoting from __activemask() vs __ballot_sync()
The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.
The operating semantics appear to be:
There is a warp of N threads
M (M <= N) threads are enabled by compile time control flow
D (D subset of M) threads are converged, as a runtime property
__activemask returns which threads happen to be converged
That suggests synchronising threads then using activemask,
__syncwarp();
__all_sync(__activemask(), predicate);
An NVIDIA blog post says that is also undefined: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.
That marks the end of my ideas. That same blog concludes with some guidance on choosing a value for mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
However, I can't compute what the mask should be. It depends on the control flow at the call site that the code containing __all_sync was inlined into, which I don't know. I don't want to change every function to take an unsigned mask parameter.
How do I retrieve semantically correct behaviour without that global transform?
TL;DR: In summary, the correct programming approach will most likely be to do the thing you stated you don't want to do.
Longer:
This blog specifically suggests an opportunistic method for handling an unknown thread mask: precede the desired operation with __activemask() and use that for the desired operation. To wit (excerpting verbatim from the blog):
int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
That should be perfectly legal.
You might ask "what about item 2 mentioned at the end of the blog?" I think if you read that carefully, taking into account the previous usage I just excerpted, it's suggesting "don't just use __activemask()" if you intend something different. That reading seems evident from the full text there. That doesn't abrogate the legality of the previous construct.
You might ask "what about incidental or enforced divergence along the way?" (i.e. during the processing of my function which is called from elsewhere)
I think you have only 2 options:
Grab the value of __activemask() at entry to the function. Use it later when you call the sync operation you desire. That is your best guess as to the intent of the calling environment. CUDA doesn't guarantee that this will be correct; however, this should certainly be legal if you don't have enforced divergence at the point of your sync function call.
Make the intent of the calling environment clear - add a mask parameter to your function and rewrite the code everywhere (which you've stated you don't want to do).
There is no way to deduce the intent of the calling environment from within your function, if you permit the possibility of warp divergence prior to entry to your function, which obscures the calling environment intent. To be clear, CUDA with the Volta execution model permits the possibility of warp divergence at any time. Therefore, the correct approach is to rewrite the code to make the intent at the call site explicit, rather than trying to deduce it from within the called function.

MIPS pipeline registers length (IF/ID, ID/EX, EX/MEM, MEM/WB)

I am currently studying for my Computer Architecture exam and came across a question that asks to illustrate (bit by bit, I would assume) the values contained in the MIPS pipeline registers after the 3rd stage of the sub (before the clock commutes), given the following instructions.
add $t0,$t1,$t2
sub $t3,$t3,$t5
beq $t6,$t0,16
add $t0,$t1,$t3
I am not asking for the solution to this problem; however, after some research I haven't had much success wrapping my mind around it, so I am asking for some help/advice.
Firstly, I still don't have a clear understanding of the size of the pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB). I do understand that they contain the control unit codes for the next stages and that they contain the result of the previous stage so that it can be passed in to the next one.
So that would be (please correct me if I'm wrong) +9 for ID/EX, +5 for EX/MEM and +2 for MEM/WB, but I haven't managed to find a clear schema of the data that we can expect these registers to contain.
Also, I figure that we would need to use HW forwarding to forward the result of the first add to beq (because of $t0) and to forward the result of sub to the last add (because of $t3). Does this factor in to what is contained in the registers?
It would be great if someone could point me in the right direction.
Thanks lots.
The purpose of each of these intermediate registers is to hold data that might be needed in the immediate next stage or in later stages. I'll discuss one possible design, but there are really many possible designs as I'll explain.
In the fetch stage, the next instruction to be executed (to which the current PC points) is fetched from memory and PC is updated to point to the next instruction to fetch. Therefore, IF/ID would include one 4-byte field to hold the fetched instruction. There are two ways to calculate the new PC: current PC + 4 or PC + 4 + offset in case of a branch. If the fetched instruction is itself a branch instruction, then we would need to pass the new PC so that the branch target address can be calculated in the EX stage. We can add a 4-byte field in IF/ID to hold the new PC value to be passed to the EX stage through the ID stage.
In the decode stage, the opcode and its operands are determined. The opcode is at a fixed location in the instruction in MIPS. A MIPS instruction may operate on a single source register, two source registers, one source register and an immediate value sign-extended to 32 bits, a sign-extended immediate value alone, or no operands. We can either prepare only the required operands for the EX stage based on the opcode or prepare all the operands that might be required for any opcode. The latter design is simpler, but it requires a larger ID/EX register. In particular, two 4-byte fields are required to hold two possible source register values (the values are read from the register file in the decode stage) and a 4-byte field for the possible sign-extended immediate value. No opcode will require all of these fields, but let's prepare all of them anyway and store them at fixed locations in the ID/EX register. It simplifies the design.
We also need to pass the new PC value calculated in the fetch stage to the execute stage, just in case the opcode turns out to be a branch. The branch target address is calculated relative to the new PC value (the PC of the instruction following the branch in static program order). There are two possible designs here: either add a bus from the new PC field in IF/ID to the EX stage or add a field in ID/EX to hold the new PC value, which can then be accessed in the EX stage. The latter design adds a 4-byte field in ID/EX.
The EX stage requires the opcode from the ID stage. We can choose to pass only the opcode rather than the whole instruction. But then later stages might require other parts of the instruction. Generally, in RISC pipelines, it is preferable to make the whole instruction available to all stages. In this way, all parts of an instruction are already available when changes are made to any stage of the pipeline in the future. So let's add a 4-byte field to ID/EX to hold the instruction.
The EX stage reads the operands and the opcode from the ID/EX register (the opcode is part of the instruction) and performs the operation specified by the opcode. The EX/MEM register has to be big enough to hold all possible results, which might include the following: a 4-byte value computed by the ALU resulting from an arithmetic or logic operation, a 4-byte value representing the calculated effective address for a memory load or store operation, a 4-byte value representing the branch target address in case of a branch instruction, and a 1-bit condition in case of a conditional branch instruction. We can use a single 4-byte field in EX/MEM for the result (whatever it represents) and a 1-bit field for the condition. In addition, as before, we need a 4-byte field to hold the instruction. Also for store instructions, we need another 4-byte field to hold the value to be stored. One possible alternative design here is that rather than storing the 1-bit condition and 4-byte branch target address in EX/MEM, they can be passed directly to the IF stage.
In the MEM stage, in case of a branch instruction, the branch target address and the branch condition are passed back from EX/MEM to the IF stage to determine the new PC. In case of a memory store operation, the operation is performed and there is no result to be passed to any stage. In case of a memory load operation, the 4-byte value is fetched from memory and stored in a field in the MEM/WB register. In case of an ALU operation, the 4-byte result will be just passed to a field in the MEM/WB register. In addition, as before, we need a 4-byte field in MEM/WB to hold the instruction.
Finally, in the WB stage, the 4-byte result whether loaded from memory or computed by the ALU is stored in the destination register. This only occurs for instructions that produce results. Otherwise, the WB stage can be skipped.
In summary, in the design I've discussed, the sizes of the intermediate registers are as follows: IF/ID is 8 bytes, ID/EX is 20 bytes, EX/MEM is 12 bytes plus the 1-bit condition (97 bits), and MEM/WB is 8 bytes.
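Tallying the fields named in the walk-through above can be written down as C structs, one per pipeline register. This is only a bookkeeping sketch of this one possible design: the struct and field names are invented for illustration, control signals are not included, and the 1-bit branch condition is stored in a whole byte because C has no 1-bit object.

```c
#include <stdint.h>

/* One struct per pipeline register; every uint32_t is one 4-byte field. */
struct if_id  { uint32_t instr, next_pc; };
struct id_ex  { uint32_t rs_val, rt_val, imm, next_pc, instr; };
struct ex_mem { uint32_t result, store_val, instr; uint8_t cond; /* only 1 bit used */ };
struct mem_wb { uint32_t result, instr; };

/* Exact widths in bits, counting the condition as a single bit. */
enum {
    IF_ID_BITS  = 2 * 32,        /* 8 bytes  */
    ID_EX_BITS  = 5 * 32,        /* 20 bytes */
    EX_MEM_BITS = 3 * 32 + 1,    /* 12 bytes + 1 bit */
    MEM_WB_BITS = 2 * 32         /* 8 bytes  */
};
```

A real datapath would also carry the control bits mentioned in the question (+9, +5, +2) alongside these data fields.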
The design decision of whether a field is required in an intermediate register to hold some value or whether it can be passed directly in the same stage to the logic that requires the value is a "circuit-level" decision. If the signals can be guaranteed to not be corrupted, and if it is feasible or convenient to add a dedicated bus, they can be directly connected.

When do we create base pointer in a function - before or after local variables?

I am reading the Programming From Ground Up book. I see two different examples of how the base pointer %ebp is created from the current stack position %esp.
In one case, it is done before the local variables.
_start:
# INITIALIZE PROGRAM
subl $ST_SIZE_RESERVE, %esp # Allocate space for pointers on the
# stack (file descriptors in this
# case)
movl %esp, %ebp
The _start however is not like other functions, it is the entry point of the program.
In another case it is done after.
power:
pushl %ebp # Save old base pointer
movl %esp, %ebp # Make stack pointer the base pointer
subl $4, %esp # Get room for our local storage
So my question is, do we first reserve space for local variables in the stack and create the base pointer or first create the base pointer and then reserve space for local variables?
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc. Does C have a specific convention when it creates the machine code?
My reasoning is that all the code in a function would be relative to the base pointer, so as long as that function follows the convention according to which it created a reference of the stack, it just works?
A few related links for those who are interested:
Function Prologue
In your first case you don't care about preservation - this is the entry point. You are trashing %ebp when you exit the program - who cares about the state of the registers? It doesn't matter any more as your application has ended. But in a function, when you return from that function the caller certainly doesn't want %ebp trashed. Now can you modify %esp first then save %ebp then use %ebp? Sure, so long as you unwind the same way on the other end of the function. You may not need to have a frame pointer at all; often that is just a personal choice.
You just need a relative picture of the world. A frame pointer is usually just there to make the compiler author's job easier, actually it is usually there just to waste a register for many instruction sets. Perhaps because some teacher or textbook taught it that way, and nobody asked why.
For coding sanity, the compiler author's sanity etc, it is desirable if you need to use the stack to have a base address from which to offset into your portion of the stack, FOR THE DURATION of the function. Or at least after the setup and before the cleanup. This can be the stack pointer (sp) itself or it can be a frame pointer; sometimes it is obvious from the instruction set. Some have a stack that grows down (in address space toward zero) and the stack pointer can only have positive offsets in sp-based addressing (sane) or some negative only (insane) (unlikely, but let's say it's there). So you may want a general purpose register. Maybe there are some where you can't use the sp in addressing at all and you have to use a general purpose register.
Bottom line, for sanity you want a reference point to offset items in the stack, the more painful way but uses less memory would be to add and remove things as you go:
x is at sp+4
push a
push b
do stuff
x is at sp+12
pop b
x is at sp+8
call something
pop a
x is at sp+4
do stuff
More work, but you can make a program (compiler) that keeps track and is less error prone than a human doing it by hand; when a human is debugging the compiler output, though, it is harder to follow and keep track. So generally we burn the stack space and have one reference point. A frame pointer can be used to separate the incoming parameters and the local variables, using the base pointer (bp) for example as a static base address within the function and sp as the base address for local variables (although sp could be used for everything if the instruction set provides that much of an offset). So by pushing bp then modifying sp you are creating this two base address situation: sp can move around perhaps for local stuff (although not usually sane) and bp can be used as a static place to grab parameters, if this is a calling convention that dictates all parameters are on the stack (generally when you don't have a lot of general purpose registers). Sometimes you see the parameters copied to a local allocation on the stack for later use, but if you have enough registers you may see that instead a register is saved on the stack and used in the function, instead of needing to access the stack using a base address and offset.
unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int x )
{
unsigned int y;
y = x;
return(more_fun(x+1)+y);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e2800001 add r0, r0, #1
c: ebfffffe bl 0 <more_fun>
10: e0800004 add r0, r0, r4
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Do not take what you see in a text book, white board (or on answers in StackOverflow) as gospel. Think through the problem, and through alternatives.
Are the alternatives functionally broken?
Are they functionally correct?
Are there disadvantages like readability?
Performance?
Is the performance hit universal or does it depend on just how slow/fast the memory is?
Do the alternatives generate more code which is a performance hit but maybe that code is pipelined vs random memory accesses?
If I don't use a frame pointer does the architecture let me regain that register for general purpose use?
In the first example bp is being trashed, that is bad in general but this is the entry point to the program, there is no need to preserve bp (unless the operating system dictates).
In a function though, based on the calling convention one assumes that bp is used by the caller and must be preserved, so you have to save it on the stack to use it. In this case it appears to want to be used to access parameters passed in by the caller on the stack, then sp is moved to make room for (and possibly access but not necessarily required if bp can be used) local variables.
If you were to modify sp first then push bp, you would basically have two pointers one push width away from each other, does that make much sense? Does it make sense to have two frame pointers anyway and if so does it make sense to have them almost the same address?
By pushing bp first, and if the calling convention pushes the first parameter last, then as a compiler author you can make bp+N always (or ideally always) point at the first parameter for a fixed value N; likewise bp+M always points at the second. A bit lazy to me, but if the register is there to be burned then burn it...
In one case, it is done before the local variables.
_start is not a function. It's your entry point. There's no return address, and no caller's value of %ebp to save.
The i386 System V ABI doc suggests (in section 2.3.1 Initial Stack and Register State) that you might want to zero %ebp to mark the deepest stack frame. (i.e. before your first call instruction, so the linked list of saved ebp values has a NULL terminator when that first function pushes the zeroed ebp. See below).
Does C have a specific convention when it creates the machine code?
No, unlike in some other x86 systems, the i386 System V ABI doesn't require much about your stack-frame layout. (Linux uses the System V ABI / calling convention, and the book you're using (PGU) is for Linux.)
In some calling conventions, setting up ebp is not optional, and the function entry sequence has to push ebp just below the return address. This creates a linked list of stack frames which allows an exception handler (or debugger) to backtrace up the stack. (How to generate the backtrace by looking at the stack values?). I think this is required in 32-bit Windows code for SEH (structured exception handling), at least in some cases, but IDK the details.
The i386 SysV ABI defines an alternate mechanism for stack unwinding which makes frame pointers optional, using metadata in another section (.eh_frame and .eh_frame_hdr which contains metadata created by .cfi_... assembler directives, which in theory you could write yourself if you wanted stack-unwinding through your function to work. i.e. if you were calling any C++ code which expected throw to work.)
If you want to use the traditional frame-walking in current gdb, you have to actually do it yourself by defining a GDB function like gdb backtrace by walking frame pointers or Force GDB to use frame-pointer based unwinding. Or apparently if your executable has no .eh_frame section at all, gdb will use the EBP-based stack-walking method.
If you compile with gcc -fno-omit-frame-pointer, your call stack will have this linked-list property, because when C compilers do make proper stack frames, they push ebp first.
IIRC, perf has a mode for using the frame-pointer chain to get backtraces while profiling, and apparently this can be more reliable than the default .eh_frame stuff for correctly accounting which functions are responsible for using the most CPU time. (Or causing the most cache misses, branch mispredicts, or whatever else you're counting with performance counters.)
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc.
Yes, it would work fine. In fact setting up ebp at all is optional, but when writing by hand it's easier to have a fixed base (unlike esp which moves around when you push/pop).
For the same reason, it's easier to stick to the convention of mov %esp, %ebp after one push (of the old %ebp), so the first function arg is always at ebp+8. See What is stack frame in assembly? for the usual convention.
But you could maybe save code size by having ebp point in the middle of some space you reserved, so all the memory addressable with an ebp + disp8 addressing mode is usable. (disp8 is a signed 8-bit displacement: -128 to +124 if we're limiting to 4-byte aligned locations). This saves code bytes vs. needing a disp32 to reach farther. So you might do
bigfunc:
push %ebp
lea -112(%esp), %ebp # first arg at ebp+8+112 = 120(%ebp)
sub $236, %esp # locals from -124(%ebp) ... 108(%ebp)
# saved EBP at 112(%ebp), ret addr at 116(%ebp)
# 236 was chosen to leave %esp 16-byte aligned.
Or delay saving any registers until after reserving space for locals, so we aren't using up any of the locations (other than the ret addr) with saved values we never want to address.
bigfunc2: # first arg at 4(%esp)
sub $252, %esp # first arg at 252+4(%esp)
push %ebp # first arg at 252+4+4(%esp)
lea 140(%esp), %ebp # first arg at 260-140 = 120(%ebp)
push %edi # save the other call-preserved regs
push %esi
push %ebx
# %esp is 16-byte aligned after these pushes, in case that matters
(Remember to be careful how you restore registers and clean up. You can't use leave because esp = ebp isn't right. With the "normal" stack frame sequence, you might restore other pushed registers (from near the saved EBP) with mov, then use leave. Or restore esp to point at the last push (with add), and use pop instructions.)
But if you're going to do this, there's no advantage to using ebp instead of ebx or something. In fact, there's a disadvantage to using ebp: the 0(%ebp) addressing mode requires a disp8 of 0, instead of no displacement, but %ebx wouldn't. So use %ebp for a non-pointer scratch register. Or at least one that you don't dereference without a displacement. (This quirk is irrelevant with a real frame pointer: (%ebp) is the saved EBP value. And BTW, the encoding that would mean (%ebp) with no displacement is how the ModRM byte encodes a disp32 with no base register, like (12345) or my_label)
These examples are pretty artificial; you usually don't need that much space for locals unless it's an array, and then you'd use indexed addressing modes or pointers, not just a disp8 relative to ebp. But maybe you need space for a few 32-byte AVX vectors. In 32-bit code with only 8 vector registers, that's plausible.
AVX512 compressed disp8 mostly defeats this argument for 64-byte AVX512 vectors, though. (But AVX512 in 32-bit mode can still only use 8 vector registers, zmm0-zmm7, so you could easily need to spill some. You only get x/ymm8-15 and zmm8-31 in 64-bit mode.)

How do interpreters load their values?

I mean, interpreters work on a list of instructions, which seem to be composed more or less by sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers, by doing bit-wise operations, for use in a big switch statement where all operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 28 bit integer (32 bits minus the 4 opcode bits). If I wanted to store it in the next instruction, I'm limited to a 32 bit value.
Values like strings require much more storage than this. How do most interpreters get away with this in an efficient manner?
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is quite different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:
Optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
Absolute (address specified in 64 bits following instruction)
relative (offset specified in or following instruction)
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
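As a rough illustration of how fields like these might be packed and unpacked, here is a minimal C sketch. The 16-bit layout (6-bit op-code, 2-bit data type, 3-bit addressing mode) and every name in it are assumptions made up for this example; nothing above fixes a particular encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of a 16-bit instruction word (an assumption for
 * illustration, not any real VM's format):
 *   bits 0-5    op-code         (6 bits)
 *   bits 6-7    data type       (2 bits)
 *   bits 8-10   addressing mode (3 bits)
 *   bits 11-15  unused                                                 */
enum DataType { T_I8, T_I64, T_F32, T_F64 };
enum AddrMode { M_SMALL_IMM, M_LARGE_IMM, M_IMPLIED, M_ABSOLUTE, M_RELATIVE };

typedef struct {
    unsigned      opcode;
    enum DataType type;
    enum AddrMode mode;
} Decoded;

/* Pack the three fields into one instruction word. */
uint16_t encode(unsigned opcode, enum DataType type, enum AddrMode mode)
{
    assert(opcode < 64);                 /* must fit in 6 bits */
    return (uint16_t)(opcode | ((unsigned)type << 6) | ((unsigned)mode << 8));
}

/* Recover the fields with shifts and masks, as an interpreter's decode
 * step would before dispatching on the op-code. */
Decoded decode(uint16_t insn)
{
    Decoded d;
    d.opcode = insn & 0x3Fu;
    d.type   = (enum DataType)((insn >> 6) & 0x3u);
    d.mode   = (enum AddrMode)((insn >> 8) & 0x7u);
    return d;
}
```

With a layout like this, a large-immediate or absolute operand would simply be the 64-bit word that follows the instruction word, so the instruction itself stays small.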
Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
    int x;
    int y;
};
class B {
    int a;
    int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would mean values in the executable file loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
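To make that sequence concrete, here is a small C sketch of a stack machine that can run it. It is only an illustration under assumptions of my own: the op-code numbering, the `OP_HALT` terminator, and the idea that an "address" is an index into a flat array of 64-bit cells are all invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical op-codes, one byte each, matching the byte count above. */
enum { OP_PUSH_IMM, OP_DUPE, OP_LOAD, OP_ADD, OP_STORE, OP_HALT };

/* Run bytecode against `mem`, a flat array of 64-bit cells standing in for
 * the VM's data memory. An "address" is simply an index into `mem`. */
void run(const uint8_t *code, int64_t *mem)
{
    int64_t stack[16];
    int sp = 0;                          /* number of values on the stack */
    const uint8_t *pc = code;

    for (;;) {
        switch (*pc++) {
        case OP_PUSH_IMM: {              /* 64-bit operand follows the op-code */
            int64_t imm;
            memcpy(&imm, pc, sizeof imm);
            pc += sizeof imm;
            assert(sp < 16);
            stack[sp++] = imm;
            break;
        }
        case OP_DUPE:                    /* duplicate the top of the stack */
            assert(sp >= 1 && sp < 16);
            stack[sp] = stack[sp - 1];
            sp++;
            break;
        case OP_LOAD:                    /* replace an address with its value */
            assert(sp >= 1);
            stack[sp - 1] = mem[stack[sp - 1]];
            break;
        case OP_ADD:                     /* pop two values, push their sum */
            assert(sp >= 2);
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_STORE:                   /* stack holds: ..., address, value */
            assert(sp >= 2);
            mem[stack[sp - 2]] = stack[sp - 1];
            sp -= 2;
            break;
        case OP_HALT:
            return;
        default:
            assert(0 && "unknown op-code");
            return;
        }
    }
}

/* Append a push-immediate instruction; returns the new write position. */
uint8_t *emit_push(uint8_t *p, int64_t imm)
{
    *p++ = OP_PUSH_IMM;
    memcpy(p, &imm, sizeof imm);
    return p + sizeof imm;
}

/* Assemble the seven-instruction sequence for `a.x += b.a`.
 * Returns the number of bytes written, not counting the trailing OP_HALT. */
size_t assemble_add(uint8_t *buf, int64_t addr_ax, int64_t addr_ba)
{
    uint8_t *p = buf;
    p = emit_push(p, addr_ax);           /* push immediate a.x */
    *p++ = OP_DUPE;
    *p++ = OP_LOAD;
    p = emit_push(p, addr_ba);           /* push immediate b.a */
    *p++ = OP_LOAD;
    *p++ = OP_ADD;
    *p++ = OP_STORE;
    size_t len = (size_t)(p - buf);
    *p = OP_HALT;                        /* added so the sketch knows when to stop */
    return len;
}
```

If `mem` holds `{1, 2, 3, 4}` for `a.x`, `a.y`, `b.a`, `b.b`, then assembling with addresses 0 and 2 and running the result leaves 4 in `mem[0]`, and the assembled sequence is 7 op-code bytes plus two 8-byte addresses.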
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example:
static void string(Compiler* compiler, int allowAssignment)
{
  // Define a constant for the literal.
  int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
      compiler->parser->currentString, compiler->parser->currentStringLength));
  // Compile the code to load the constant.
  emit(compiler, CODE_CONSTANT);
  emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's an example:
CASE_CODE(CONSTANT):
  PUSH(frame->fn->constants[READ_ARG()]);
  DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.
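For anyone who wants to see both halves in one place, here is a self-contained C sketch of the constant-table idea. It is not the code from the interpreter quoted above: `Chunk`, `add_constant`, `compile_string`, `run_chunk` and the op-code names are hypothetical, and a plain `const char *` stands in for whatever runtime object a real VM would create for the literal.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_CONSTANT, OP_HALT };           /* hypothetical op-codes */

#define MAX_CONSTANTS 256                /* index must fit in a one-byte operand */
#define MAX_CODE      64

/* A compiled chunk: bytecode plus the constant table that rides along. */
typedef struct {
    uint8_t     code[MAX_CODE];
    int         code_len;
    const char *constants[MAX_CONSTANTS];
    int         constant_count;
} Chunk;

/* Compile side: record a literal and return its index in the table. */
int add_constant(Chunk *c, const char *value)
{
    assert(c->constant_count < MAX_CONSTANTS);
    c->constants[c->constant_count] = value;
    return c->constant_count++;
}

/* Append one byte of bytecode. */
void emit(Chunk *c, uint8_t byte)
{
    assert(c->code_len < MAX_CODE);
    c->code[c->code_len++] = byte;
}

/* Compile a string literal: add it to the table, then emit an
 * instruction whose operand is the table index. */
void compile_string(Chunk *c, const char *literal)
{
    int index = add_constant(c, literal);
    emit(c, OP_CONSTANT);
    emit(c, (uint8_t)index);
}

/* Run side: push each referenced constant onto `stack`.
 * Returns the number of values pushed. */
int run_chunk(const Chunk *c, const char **stack, int stack_cap)
{
    int sp = 0;
    const uint8_t *pc = c->code;
    for (;;) {
        switch (*pc++) {
        case OP_CONSTANT:
            assert(sp < stack_cap);
            stack[sp++] = c->constants[*pc++];   /* decode index, fetch object */
            break;
        case OP_HALT:
            return sp;
        default:
            assert(0 && "unknown op-code");
            return sp;
        }
    }
}
```

The point to notice is that the instruction stream only ever carries a one-byte index, no matter how long the string is; the value itself lives in the table next to the bytecode.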