Replacing a constant with a different value determined by the backend - llvm-clang

Consider an architecture (namely, the Infocom Z-machine) with a large, readonly memory region (called "high memory") that is only intended to store strings (and machine code, but that doesn't pose a problem). This region can only be accessed by certain instructions that display text. Of course, this means that pointers to high memory can't be dereferenced.
I'd like to write an LLVM backend for this architecture. In order to do this, I need a way to tell the backend to store certain strings in high memory, and to obtain the "packed addresses" of said strings (also to convert the strings to the Z-Machine string encoding, but that's not the point).
Ideally, I'd be able to define a C function-like macro HIGHMEM_STRING which would take a string literal and expand to an integer constant. Supposing there's a function void print_paddr(uint16_t paddr), I'd like to be able to do:
print_paddr(HIGHMEM_STRING("It is pitch black. You are likely to be eaten by a grue."));
And then the backend would know to place the string in high memory and pass its packed address to print_paddr as a parameter.
My question has three parts:
Can I implement such a macro using LLVM intrinsics, or an asm block with a special directive for the backend, or some other similar way without having to fork Clang? Otherwise, what would I have to change in Clang?
How can I annotate the LLVM IR to convey to the backend that a string should be placed in high memory and replaced with its packed address?
If HIGHMEM_STRING is too hard, or impossible, to implement as a macro, what are the alternatives?

The Hexagon backend does something similar by storing information in a special section whos base address is loaded in the GP register and the referencing instruction has an offset inside that section. Look for CONST64 to get an idea of how these are processed.
Basically when we identify the data in LLVM IR we want to put in this special section, we create a pseudo instruction with the data. When we are writing out the ELF file we switch sections to the GP-rel section, emit the data, then switch back to the text section and emit the instruction to dereference this symbol.
It'll probably be easier if you can identify these strings based on their contents rather than having the user specify them in the program text.

Related

If a CPU's stack is 1MB and a programming language is purely pass by value, how can we pass data bigger than 1MB into functions? What happens exactly?

So I hear that using pass by value, copies of the parameters are added to the call stack. Apparently, the stack size on Windows is often 1MB. Obviously though, we can easily pass around data that is far bigger than 1MB between functions/procedures (arrays of primitives/classes/hashmaps/sets or whatever)
So is my understanding of the situation wrong...or are these languages using pass by value for primitive types, but then pass by reference/pass by object model for these other data structures?
Just with a quick google, both Java & JavaScript are exclusively pass by value...so how are you able to pass around data/objects bigger than the stack size in these languages? For example, in JavaScript, it's stated: "For Array the maximum length is 4GB-1 (2^32-1)"
The difference between "pass-by-reference" and "pass-by-value" is that, in pass-by-reference calling convention, the function can change what the reference points to. In languages that don't support a pass-by-reference calling convention, you won't be able to do this directly; you'd have to simulate it by passing a mutable type that contains a reference to another object.
In pass-by-value, you pass references as values too. So the thing that's allocated on the stack, when you're passing an object that isn't a primitive value, is a copy of the pointer, not a deep copy of the data.
The aspect of this that gets confusing is the thing (or instance) of what you point to, if it's a mutable type, can be mutated.

MIPS functions and variables in stack

I have come in contact with MIPS-32, and I came with the question if a variable, for example $t0 declared having the value in one function can be altered by another and how this does have to do with stack, this is, the location of the variable in memory. Everything that I am talking is in assembly language. And more, I would like some examples concerning this use, this is, a function altering or not, a variable value of another function, and how this variable "survive" or not in terms of if the variable is given as a copy or a reference.
(I hope we can create an environment where conceptual question like that above can be explored more)
$t0 declared having the value in one function can be altered by another
$t0 is known as a call-clobbered register.  It is no different than the other registers as far as the hardware is concerned — being call clobbered vs. call preserved is an aspect of software convention, call the calling convention, which is a subset of an Application Binary Interface (ABI).
The calling convention, when followed, allows a function, F, to call another function, G, knowing only G's signature — name, parameters & their types, return type.  The function, F, would not have to also be changed if G changes, as long as both follow the convention.
Call clobbered doesn't mean it has to be clobbered, though, and when writing your own code you can use it any way you like (unless your coursework says to follow the MIPS32 calling convention, of course).
By the convention, a call-clobbered register can be used without worry: all you have to do use it is put a value into it!
Call preserved registers can also be used, if desired, but they should be presumed to be already in use by some caller (maybe not the immediate caller, but some distant caller), the values they contain must be restored before exiting the function back to return to its caller.  This is, of course, only possible by saving the original value before repurposing the register for a new use.
The two sets of register (call clobbered/preserved) serve two common use cases, namely cheap temporary storage and long term variables.  The former requires no effort to preserve/restore, while the latter both does require this effort, though gives us registers that will survive a function call, which is useful, for example, when a loop has a function call.
The stack comes into play when we need to first preserve, then restore call-preserved registers.  If we want to use call-preserved registers for some reason, then we need to preserve their original values in order to restore them later.  The most reasonable way to do that is to save them in the stack.  In order to do that we allocate some space from the stack.
To allocate some local memory, the stack pointer is decremented to reserve a function some space.  Since the stack pointer, upon return to caller, must have the same value, this space is necessarily deallocated upon return.  Hence the stack is great for local storage.  Original values of preserved registers must be also restored upon return to caller and so using local storage is appropriate.
https://www.dyncall.org/docs/manual/manualse11.html — search for section "MIPS32".
Let's also make the distinction between variables, a logical concept, and storage, a physical concept.
In high level language, variables are named and have scopes (limited lifetimes).  In machine code, we have physical hardware (storage) resources of registers and memory; these simply exist: they have no concept of lifetime.  In and of themselves these hardware resources are not variables, but places that we can use to hold variables for their lifetime/scope.
As assembly language programmers, we keep a mental (or even written) map of our logical variables to physical resources.  The compiler does the same, knowing the scope/lifetime of program variables and creating that "mental" map of variables to machine code storage.  Variables that have overlapping lifetimes cannot share the same hardware resource, of course, but when a variable is out of scope, its (mapped-to) physical resource can be reused for another purpose.
Logical variables can also move around to different physical resources.  A logical variable that is a parameter, may be passed in a CPU register, e.g. $a0, but then be moved into an $s register or into a (stack) memory location.  Such is the business of machine code.
To allocate some hardware storage to a high level language (or pseudo code) variable, we simply initialize the storage!  Hardware resources are necessarily constantly being repurposed to hold a different logical variable.
See also:
How a recursive function works in MIPS? — for discussion on variable analysis.
Mips/assembly language exponentiation recursivley
What's the difference between caller-saved and callee-saved in RISC-V

Function Call: Labels into memory addresses

I have difficulty to understand the correct sequence of events.
When a program written in abstract language is compiled, it is translated into machine code.
Subsequently, only after the program has run, it is loaded into the ram, in the code segment.
At this point, each instruction in the program will be on a specific memory address.
When a function is called in assembly, the Call statement is typically followed by a label.
I assume this label will be replaced with the function's memory address by the compiler.
And this is where I absolutely can't understand.
If the instructions are loaded into memory only when the program is running, thus obtaining each instruction its own memory address, how does the compiler know the memory address to which the label corresponds?
If the function is not yet in memory, how can the program, compiled in binary code where the labels are nomore available, know the memory address, corrisponding to that label, where the function will be loaded at the moment of execution? I am a bit confused. Help me.
A program contains several "sections" (some are optional):
a section that holds code, usually referred to as the Text section
a section that holds initial values for mutable global data
a section that holds immutable constants, usually called rodata
a section that has a set of relocation records
A section is stored as a contiguous chunk or block of memory in the program file on disc.
The loader creates memory chunks and loads the code, data, rodata into those; a stack will have been created, depending on the os, either by the loader, but also possibly by the forking of the parent process that creates the child process.
Knowing the final addresses, the loader also processes the relocation records.  These relocations describe where in the text and data sections updates are needed for the final addresses of the sections loaded into memory.
The relocation mechanism is general purpose, as: code can refer to code, code can refer to data, data can refer to code, and data can refer to data.
A single relocation record describes a reference that needs to be updated.  Each record describes:
a referring source — at what offset in the text or data section to make an address update
a referring target — which section is being referred to: code or data
what kind of update to make (some architectures have complex instruction encodings)
Some updates are for ordinary pointers, while others are for instructions.  Instruction set architectures that have complex instruction offset/immediate encodings, like MIPS, RISC V, HP-PA, need to inform of the immediate encoding method.
Usually the referrer already has an offset, so the update is a matter of addition/summation of the base of the section being referred to, to the offset already in place at the referring source.
Other metadata in the program describes where to start, e.g. the initial the program counter, which would be as an offset into the text section.
Most processors today support (as fuz describes) position independent code (PIC).  This is typically done via pc-relative addressing.  The processor performs branches and calls within the single text section using pc-relative addressing modes, and thus, no relocation records are required for these instructions.
Dynamically loaded libraries add complexity since each DLL, and the main program to run, each have the format of a program, i.e. they each will have their own sections; each has its own text section.  The relocations will also be capable of describing references to symbol imports, supported by additional sections holding symbol names, imports, and exports.
Object files (compiler output, pre-linking) typically follow this format as well.  A single object file has these sections, with relocation records, symbol names, imports, exports.  The linker's job is to merge object files into a single program or larger object file.  During merge the linker resolves some relocations, but it cannot necessarily resolve all of them, so some may remain for the os loader to resolve.
Let's imagine that, on a system using PIC, there is a reference: a call (code-to-code), from one object file to another, and that the linker merges these object files.  There will be a relocation record in the caller that refers to an imported symbol name (and in the other object file, an export of a symbol defined as some offset with its text section).  Once the two object files' sections are merged (e.g. by simply concatenating them into one larger text section), the call there is now an intra-section reference, and the linker can compute the delta between the addresses of the caller and callee, and these will not change by future linking or loading.  The linker will adjust the offset/immediate in the call instruction with that delta, and, knowing this reference is now resolved, omits this relocation record in the merge.
For reference, see:
ELF (Executable and Linkable Format)
COFF (Common Object File Format)
Windows Portable Executable Format
Relocation
Object file
Executable
Position Independent Code
Addressing Mode
TL:DR: the distance from a call to its target is a link-time constant.
The .o object files you get from assembling asm have relocation records for symbols that aren't defined in that file.
When you link these .o files into an executable or library, the linker lays out the .text section from each .o into one big .text section for the executable and calculates the relative distance for each call to reach its target. It encodes that relative displacement right into the machine code for each call.
At run time no further relocations are needed: wherever the whole executable is loaded in memory, distances between instructions don't change. So no runtime relocations are ever needed for relative calls.
Related: Why are global variables in x86-64 accessed relative to the instruction pointer?

Do scripting languages use types behind the scenes?

I read that it is important to use types , as they have specific size of memory, which helps fetch the necessary chunk of memory and to also know how to interpret the data.
Is this true? and if yes, this means that scripting languages use this mechanism behind the scenes even though they are dynamically typed?
Correct there is a type behind the scenes. Most untyped languages are implemented with all data wrapped up in a generic VALUE type (Names vary by author and language). Every piece of data is somehow representing as an instance of the VALUE data-structure.
Among the things carried in this data structure are the size in memory and some way of describing "what am I".

Garbage collection and runtime type information

The fixnum question brought my mind to an other question I've wondered for a long time.
Many online material about garbage collection does not tell about how runtime type information can be implemented. Therefore I know lots about all sorts of garbage collectors, but not really about how I can implement them.
The fixnum solution is actually quite nice, it's very clear which value is a pointer and which isn't. What other commonly used solutions for storing type information there is?
Also, I wonder about fixnum -thing. Doesn't that mean that you are being limited to fixnums on every array index? Or is there some sort of workaround for getting full 64-bit integers?
Basically to achieve accurate marking you need meta-data indicating which words are used as pointers and which are not.
This meta-data could be stored per reference, as emacs does. If for your language/implementation you don't care much about memory use, you could even make references bigger than words (perhaps twice as big), so that every reference can carry type information as well as its one-word data. That way you could have a fixnum the full size of a 32 bit pointer, at the cost of references all being 64 bit.
Alternatively, the meta-data could be stored along with other type information. So for example a class could contain, as well as the usual function pointer table, one bit per word of the data layout indicating whether or not the word contains a reference that should be followed by the garbage collector. If your language has virtual calls then you must already have a means of working out from an object what function addresses to use, so the same mechanism will allow you to work out what marking data to use - typically you add an extra, secret pointer at the start of every single object, pointing to the class which constitutes its runtime type. Obviously with certain dynamic languages the type data pointed to would need to be copy-on-write, since it is modifiable.
The stack can do similar - store the accurate marking information in data sections of the code itself, and have the garbage collector examine the stored program counter, and/or link pointers on the stack, and/or other information placed on the stack by the code for the purpose, to determine which code each bit of stack relates to and hence which words are pointers. Lightweight exception mechanisms tend to do a similar thing to store information about where try/catch occurs in the code, and of course debuggers need to be able to interpret the stack too, so this can quite possibly be folded in with a bunch of other stuff you'd already be doing to implement any language, including ones with built-in garbage collection.
Note that garbage collection doesn't necessarily need accurate marking. You could treat every word as a pointer, regardless of whether it really is or not, look it up in your garbage collector's "big list of everything" to decide whether it plausibly could refer to an object that has not yet been marked, and if so treat it as a reference to that object. This is simple, but the cost of course is that it's somewhere between "quite slow" and "very slow", depending on what data structures your gc uses for the lookup. Furthermore, sometimes an integer just so happens to have the same value as the address of an unreferenced object, and causes you to keep a whole bunch of objects which should have been collected. So such a garbage collector cannot offer strong guarantees about unreferenced objects ever being collected. This might be fine for a toy implementation or first working version, but is unlikely to be popular with users.
A mixed approach might, say, do accurate marking of objects, but not of regions of the stack where things get particularly hairy. For example if you write a JIT which can create code where a referenced object address appears only in registers, not in your usual stack slots, then you might need to non-accurately follow the region of the stack where the OS stored the registers when it descheduled the thread in question to run the garbage collector. Which is probably quite fiddly, so a reasonable approach (potentially resulting in slower code) would be to require the JIT to always keep a copy of all pointer values it's using on the accurately marked stack.
In Squeak (also Scheme and many others dynamic languages I guess) you have SmallInteger, the class of signed 31-bit integers, and classes for arbitrarily big integers, e.g. LargePositiveInteger. There could very well be other representations, 64-something-bit integers either as full objects or with a couple bits as "I'm not a pointer" flags.
But arithmetic methods are coded to handle over/under-flows, such that if you add one to SmallInteger maxVal, you get 2^30 + 1 as an instance of LargePositiveInteger, and if you subtract one back from it, you get back 2^30 as a SmallInteger.