DWARF offsets and shared objects vs executables - backtrace

OK, I've used the Linux DWARF library libdw to convert backtrace_symbols output to source code and line numbers, but I've hit a snag. backtrace_symbols gives offsets in memory, from which I subtract the base address (obtained using dladdr()) before using them as input to DWARF. But it seems that for the parent executable I should NOT subtract the base address, because the DWARF offsets seem to include it.
So how do I either distinguish between an EXE and an SO in my code (I'm hoping there's something better than 'look for a .so suffix'), or is there a different function I can call that will return the base address for shared objects and zero for the parent EXE?

Yes, you are right. If the executable is ET_EXEC (not ET_DYN, i.e. it is not a position-independent executable), then the virtual addresses in the DWARF are the real virtual addresses in your program image. For ET_DYN, the addresses in the DWARF are offsets from the base address of the module.
This is explained in section 7.3 of the DWARF spec:
The relocated addresses in the debugging information for an executable object are virtual addresses and the relocated addresses in the debugging information for a shared object are offsets relative to the start of the lowest region of memory loaded from that shared object.
You should use e_type in the ELF header to distinguish them.

Not sure if this is the best way, but libdw has a function, dwarf_getelf(), that can get you to the ELF information; from there use elf32_getehdr()/elf64_getehdr() and look at the e_type field. If e_type is ET_DYN then it's a shared object, and you should carry on using dladdr to find the base address to subtract from the addresses; otherwise just use the addresses generated by backtrace directly.
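A minimal sketch of that check, assuming elfutils' libelf is available (link with -lelf -ldl; dwarf_address() is a hypothetical helper name):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <gelf.h>
#include <stdint.h>
#include <unistd.h>

/* Return the address to feed to libdw for the module containing 'pc'. */
uintptr_t dwarf_address(void *pc)
{
    Dl_info info;
    if (!dladdr(pc, &info) || !info.dli_fname)
        return (uintptr_t)pc;              /* no module info: pass through */

    int fd = open(info.dli_fname, O_RDONLY);
    if (fd < 0)
        return (uintptr_t)pc;

    elf_version(EV_CURRENT);               /* required before any libelf call */
    Elf *elf = elf_begin(fd, ELF_C_READ, NULL);
    GElf_Ehdr ehdr;
    int is_dyn = elf && gelf_getehdr(elf, &ehdr) && ehdr.e_type == ET_DYN;
    if (elf)
        elf_end(elf);
    close(fd);

    /* ET_DYN: DWARF holds offsets from the load base, so rebase.
       ET_EXEC: DWARF already holds absolute virtual addresses. */
    return is_dyn ? (uintptr_t)pc - (uintptr_t)info.dli_fbase : (uintptr_t)pc;
}

Note that a position-independent executable is itself ET_DYN, so this check also handles a PIE main executable correctly: its DWARF addresses need the base subtracted too.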

Related

MIPS functions and variables in stack

I have come into contact with MIPS-32, and my question is whether a register such as $t0, given a value in one function, can be altered by another function, and what this has to do with the stack, that is, the location of the variable in memory. Everything I am talking about is in assembly language. I would also like some examples concerning this: a function altering (or not) a value set by another function, and how this value "survives" (or not), in terms of whether it is given as a copy or a reference.
(I hope we can create an environment where conceptual questions like the one above can be explored more.)
$t0 declared having the value in one function can be altered by another
$t0 is known as a call-clobbered register.  It is no different from the other registers as far as the hardware is concerned; being call-clobbered vs. call-preserved is an aspect of software convention, called the calling convention, which is a subset of an Application Binary Interface (ABI).
The calling convention, when followed, allows a function, F, to call another function, G, knowing only G's signature — name, parameters & their types, return type.  The function, F, would not have to also be changed if G changes, as long as both follow the convention.
Call clobbered doesn't mean it has to be clobbered, though, and when writing your own code you can use it any way you like (unless your coursework says to follow the MIPS32 calling convention, of course).
By the convention, a call-clobbered register can be used without worry: all you have to do to use it is put a value into it!
Call-preserved registers can also be used, if desired, but they should be presumed to be already in use by some caller (maybe not the immediate caller, but some distant caller), so the values they contain must be restored before exiting the function back to its caller.  This is, of course, only possible by saving the original value before repurposing the register for a new use.
The two sets of registers (call-clobbered/call-preserved) serve two common use cases, namely cheap temporary storage and longer-term variables.  The former requires no effort to preserve/restore, while the latter does require this effort but gives us registers that will survive a function call, which is useful, for example, when a loop body contains a function call.
The stack comes into play when we need to first preserve, then restore, call-preserved registers.  If we want to use call-preserved registers for some reason, we need to preserve their original values in order to restore them later, and the most reasonable way to do that is to save them on the stack.  In order to do that we allocate some space from the stack.
To allocate some local memory, the stack pointer is decremented to reserve the function some space.  Since the stack pointer must have the same value upon return to the caller, this space is necessarily deallocated upon return.  Hence the stack is great for local storage: the original values of call-preserved registers must also be restored upon return to the caller, so local storage is just what's needed.
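As a hedged illustration in C (the register names in the comments describe the conventional MIPS mapping a compiler would choose, not something the C source dictates), here is the classic case where a call-preserved register earns its keep:

/* Sketch: 'i' must survive the call to g(), so a MIPS compiler would
   typically map it to a call-preserved register such as $s0.  The
   function prologue then decrements $sp and stores the caller's $s0
   (and $ra); the epilogue reloads both and restores $sp, so the
   caller sees every call-preserved register unchanged. */
extern void g(int n);

void f(void)
{
    for (int i = 0; i < 10; i++)   /* i lives in $s0 across the call */
        g(i);                      /* g is free to clobber $t0-$t9 */
}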
https://www.dyncall.org/docs/manual/manualse11.html — search for section "MIPS32".
Let's also make the distinction between variables, a logical concept, and storage, a physical concept.
In a high-level language, variables are named and have scopes (limited lifetimes).  In machine code, we have physical hardware (storage) resources of registers and memory; these simply exist: they have no concept of lifetime.  In and of themselves these hardware resources are not variables, but places that we can use to hold variables for their lifetime/scope.
As assembly language programmers, we keep a mental (or even written) map of our logical variables to physical resources.  The compiler does the same, knowing the scope/lifetime of program variables and creating that "mental" map of variables to machine code storage.  Variables that have overlapping lifetimes cannot share the same hardware resource, of course, but when a variable is out of scope, its (mapped-to) physical resource can be reused for another purpose.
Logical variables can also move around to different physical resources.  A logical variable that is a parameter may be passed in a CPU register, e.g. $a0, but then be moved into an $s register or into a (stack) memory location.  Such is the business of machine code.
To allocate some hardware storage to a high level language (or pseudo code) variable, we simply initialize the storage!  Hardware resources are necessarily constantly being repurposed to hold a different logical variable.
See also:
How a recursive function works in MIPS? — for discussion on variable analysis.
Mips/assembly language exponentiation recursively
What's the difference between caller-saved and callee-saved in RISC-V

Function Call: Labels into memory addresses

I have difficulty understanding the correct sequence of events.
When a program written in a high-level language is compiled, it is translated into machine code.
Subsequently, when the program is run, it is loaded into RAM, in the code segment.
At this point, each instruction in the program will be on a specific memory address.
When a function is called in assembly, the Call statement is typically followed by a label.
I assume this label will be replaced with the function's memory address by the compiler.
And this is what I absolutely can't understand.
If the instructions are loaded into memory only when the program is running, with each instruction thus obtaining its own memory address, how does the compiler know the memory address to which the label corresponds?
If the function is not yet in memory, how can the program, compiled into binary code where the labels are no longer available, know the memory address, corresponding to that label, at which the function will be loaded at the moment of execution? I am a bit confused. Help me.
A program contains several "sections" (some are optional):
a section that holds code, usually referred to as the Text section
a section that holds initial values for mutable global data
a section that holds immutable constants, usually called rodata
a section that has a set of relocation records
A section is stored as a contiguous chunk or block of memory in the program file on disk.
The loader creates memory chunks and loads the code, data, and rodata into them; a stack will have been created as well, depending on the OS, either by the loader or possibly by the forking of the parent process that creates the child process.
Knowing the final addresses, the loader also processes the relocation records.  These relocations describe where in the text and data sections updates are needed for the final addresses of the sections loaded into memory.
The relocation mechanism is general purpose, as: code can refer to code, code can refer to data, data can refer to code, and data can refer to data.
A single relocation record describes a reference that needs to be updated.  Each record describes:
a referring source — at what offset in the text or data section to make an address update
a referring target — which section is being referred to: code or data
what kind of update to make (some architectures have complex instruction encodings)
Some updates are for ordinary pointers, while others are for instructions.  Instruction set architectures that have complex instruction offset/immediate encodings, like MIPS, RISC V, HP-PA, need to inform of the immediate encoding method.
Usually the referrer already has an offset in place, so the update is a matter of adding the base address of the section being referred to, to the offset already stored at the referring source.
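A minimal sketch of applying such a record (the record layout here is hypothetical; real formats such as ELF's rel/rela entries are richer, but the arithmetic is the same):

#include <stdint.h>

/* Hypothetical relocation record. */
struct reloc {
    uint32_t offset;   /* where in the referring section to patch        */
    int      target;   /* which section is referred to: 0=text, 1=data   */
};

/* Add the load base of the referred-to section to the offset already
   stored at the referring location. */
static void apply_reloc(uint8_t *section, const struct reloc *r,
                        const uintptr_t section_base[2])
{
    uint32_t *where = (uint32_t *)(section + r->offset);
    *where += (uint32_t)section_base[r->target];
}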
Other metadata in the program describes where to start, e.g. the initial program counter, which would be given as an offset into the text section.
Most processors today support (as fuz describes) position independent code (PIC).  This is typically done via pc-relative addressing.  The processor performs branches and calls within the single text section using pc-relative addressing modes, and thus, no relocation records are required for these instructions.
Dynamically loaded libraries add complexity since each DLL, and the main program to run, each have the format of a program, i.e. they each will have their own sections; each has its own text section.  The relocations will also be capable of describing references to symbol imports, supported by additional sections holding symbol names, imports, and exports.
Object files (compiler output, pre-linking) typically follow this format as well.  A single object file has these sections, with relocation records, symbol names, imports, exports.  The linker's job is to merge object files into a single program or larger object file.  During merge the linker resolves some relocations, but it cannot necessarily resolve all of them, so some may remain for the os loader to resolve.
Let's imagine that, on a system using PIC, there is a reference: a call (code-to-code) from one object file to another, and that the linker merges these object files.  There will be a relocation record in the caller that refers to an imported symbol name (and, in the other object file, an export of a symbol defined at some offset within its text section).  Once the two object files' sections are merged (e.g. by simply concatenating them into one larger text section), the call is now an intra-section reference, and the linker can compute the delta between the addresses of the caller and callee; this delta will not change with future linking or loading.  The linker adjusts the offset/immediate in the call instruction by that delta and, knowing this reference is now resolved, omits the relocation record from the merge.
For reference, see:
ELF (Executable and Linkable Format)
COFF (Common Object File Format)
Windows Portable Executable Format
Relocation
Object file
Executable
Position Independent Code
Addressing Mode
TL;DR: the distance from a call to its target is a link-time constant.
The .o object files you get from assembling asm have relocation records for symbols that aren't defined in that file.
When you link these .o files into an executable or library, the linker lays out the .text section from each .o into one big .text section for the executable and calculates the relative distance for each call to reach its target. It encodes that relative displacement right into the machine code for each call.
At run time no further fixups are needed: wherever the whole executable is loaded in memory, the distances between its instructions don't change, so runtime relocations are never needed for relative calls.
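A sketch of the arithmetic the linker performs (assuming the x86-64 encoding, where the 4-byte displacement of an E8 call is measured from the end of the 5-byte instruction):

#include <stdint.h>

/* Patch the rel32 field of an x86-64 'call' (opcode E8) at link time. */
static void patch_call(uint8_t *text, uint32_t call_site, uint32_t target)
{
    int32_t rel32 = (int32_t)(target - (call_site + 5));
    /* bytes 1..4 after the E8 opcode hold the little-endian displacement */
    text[call_site + 1] = (uint8_t)rel32;
    text[call_site + 2] = (uint8_t)(rel32 >> 8);
    text[call_site + 3] = (uint8_t)(rel32 >> 16);
    text[call_site + 4] = (uint8_t)(rel32 >> 24);
}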
Related: Why are global variables in x86-64 accessed relative to the instruction pointer?

Replacing a constant with a different value determined by the backend

Consider an architecture (namely, the Infocom Z-machine) with a large, readonly memory region (called "high memory") that is only intended to store strings (and machine code, but that doesn't pose a problem). This region can only be accessed by certain instructions that display text. Of course, this means that pointers to high memory can't be dereferenced.
I'd like to write an LLVM backend for this architecture. In order to do this, I need a way to tell the backend to store certain strings in high memory, and to obtain the "packed addresses" of said strings (also to convert the strings to the Z-Machine string encoding, but that's not the point).
Ideally, I'd be able to define a C function-like macro HIGHMEM_STRING which would take a string literal and expand to an integer constant. Supposing there's a function void print_paddr(uint16_t paddr), I'd like to be able to do:
print_paddr(HIGHMEM_STRING("It is pitch black. You are likely to be eaten by a grue."));
And then the backend would know to place the string in high memory and pass its packed address to print_paddr as a parameter.
My question has three parts:
Can I implement such a macro using LLVM intrinsics, or an asm block with a special directive for the backend, or some other similar way without having to fork Clang? Otherwise, what would I have to change in Clang?
How can I annotate the LLVM IR to convey to the backend that a string should be placed in high memory and replaced with its packed address?
If HIGHMEM_STRING is too hard, or impossible, to implement as a macro, what are the alternatives?
The Hexagon backend does something similar by storing information in a special section whose base address is loaded in the GP register; the referencing instruction has an offset inside that section. Look for CONST64 to get an idea of how these are processed.
Basically when we identify the data in LLVM IR we want to put in this special section, we create a pseudo instruction with the data. When we are writing out the ELF file we switch sections to the GP-rel section, emit the data, then switch back to the text section and emit the instruction to dereference this symbol.
It'll probably be easier if you can identify these strings based on their contents rather than having the user specify them in the program text.
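Not a full answer to the intrinsics part, but one alternative worth sketching: Clang and GCC can already place each string in a dedicated section via the section attribute and a GNU statement expression, leaving a backend pass (or a post-link tool) to rewrite references to that section into packed addresses. Everything here, including the section name ".highmem", is an assumption for illustration:

#include <stdint.h>

/* Sketch only: relies on GNU statement expressions and the section
   attribute (both supported by Clang and GCC).  A custom backend pass
   would then recognize references to the hypothetical ".highmem"
   section and replace each with the string's packed address. */
#define HIGHMEM_STRING(s)                                             \
    ({                                                                \
        static const char __attribute__((section(".highmem")))       \
            hm_str[] = (s);                                           \
        (uintptr_t)hm_str;  /* backend rewrites this into a paddr */  \
    })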

How come the macro is used as a function, but is not implemented anywhere?

The following code is in MySQL 5.5 storage/example/ha_example.cc:
MYSQL_READ_ROW_START(table_share->db.str, table_share->table_name.str, TRUE);
rc= HA_ERR_END_OF_FILE;
MYSQL_READ_ROW_DONE(rc);
I searched for the MYSQL_READ_ROW_START definition in the whole project, and found it in include/probes_mysql_nodtrace.h:
#define MYSQL_READ_ROW_START(arg0, arg1, arg2)
#define MYSQL_READ_ROW_START_ENABLED() (0)
#define MYSQL_READ_ROW_DONE(arg0)
#define MYSQL_READ_ROW_DONE_ENABLED() (0)
It is just an empty macro definition here.
My question is: how come this macro MYSQL_READ_ROW_START is not associated with any function, but is used as a function in the above code?
Thanks.
These aren't traditional macros: they're probe points for DTrace, an observability framework for Solaris, OS X, FreeBSD and various other operating systems.
DTrace revolves around the notion that different providers offer certain probes at which one can observe running executables or even the operating system itself. Some providers are time-based; by firing at regular intervals the probes can, for example, be used to profile the use of a CPU. Other providers are code-based, and their probes might, for example, fire at the entrance to and exit from functions.
The code you highlight is an example of the USDT (User-land Statically Defined Tracing) provider. The canonical use of the USDT provider is to expose meaningful events within transactions. For example, the beginning and end of a transaction might well occur somewhere deep within different functions; in this case it's best for the developer to identify exactly what he wants to reveal and when.
A USDT probe is more than a switchable printf(), although it can of course be used to reveal information, e.g. some local value such as the intermediate result of a transaction. A USDT probe can also be used to trigger behaviour. For example, one might want to activate some network probes for only the duration of a certain transaction.
Returning to your question, USDT probes are implemented by writing macros in the code that correspond to a description of the provider in a ".d" file elsewhere. This is parsed by the dtrace(1) utility, which generates a header file that is suitable for compilation. On a system that lacks DTrace it would make sense to define a header file in which the USDT macros become no-ops, and judging by the given filename (probes_mysql_nodtrace.h) this is what you are observing.
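A minimal sketch of that arrangement (the file names and the HAVE_DTRACE guard are assumptions; dtrace -h generates the real header from the provider's ".d" description):

/* Sketch of the usual build pattern.  With DTrace available, the
   generated header's macros emit real probe sites; without it, the
   same macros compile to nothing. */
#ifdef HAVE_DTRACE
#include "probes_mysql_dtrace.h"   /* generated by: dtrace -h -s probes_mysql.d */
#else
#define MYSQL_READ_ROW_START(arg0, arg1, arg2)   /* no-op */
#define MYSQL_READ_ROW_DONE(arg0)                /* no-op */
#endif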
See http://dev.mysql.com/tech-resources/articles/getting_started_dtrace_saha.html.
To quote:
DTrace probes are implemented by kernel modules called providers, each of which performs a particular kind of instrumentation to create probes. Providers can thus be described as publishers of probes that can be consumed by DTrace consumers (see below). Providers can be used for instrumenting kernel and user-level code. For user-level code, there are two ways in which probes can be defined: User-Level Statically Defined Tracing (USDT) or the PID provider.
So it appears to be up to DTrace providers to implement such a macro.

cudaMemcpy() vs cudaMemcpyFromSymbol()

I'm trying to figure out why cudaMemcpyFromSymbol() exists. It seems that everything the 'symbol' function can do, the non-symbol calls can do.
The symbol function appears to make it easy to move part of an array or an indexed element, but this could just as easily be done with the non-symbol function. I suspect the non-symbol approach will run faster, as there is no symbol lookup needed. (It is not clear whether the symbol lookup is done at compile time or at run time.)
Why would I use cudaMemcpyFromSymbol() vs cudaMemcpy()?
cudaMemcpyFromSymbol is the canonical way to copy from any statically defined variable in device memory.
cudaMemcpy can't be used directly to copy to or from a statically defined device variable because it requires a device pointer, and that pointer isn't known to host code at runtime. Therefore, an API call which can interrogate the device context's symbol table is required. The two choices are cudaMemcpyFromSymbol, which does the symbol lookup and the copy in one operation, or cudaGetSymbolAddress, which returns an address that can be passed to cudaMemcpy. The former is probably more efficient if you only want to do one copy; the latter, if you want to use the address multiple times in host code.
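A minimal sketch of both approaches, assuming a statically defined __device__ array (compile with nvcc; error checking omitted for brevity):

#include <cuda_runtime.h>
#include <stdio.h>

__device__ int devData[4];   /* statically defined device variable */

int main(void)
{
    int host[4] = {0};

    /* Option 1: symbol lookup and copy in one call. */
    cudaMemcpyFromSymbol(host, devData, sizeof(host), 0,
                         cudaMemcpyDeviceToHost);

    /* Option 2: resolve the symbol's device address once, then use
       plain cudaMemcpy (useful if the address is needed repeatedly). */
    void *devPtr = NULL;
    cudaGetSymbolAddress(&devPtr, devData);
    cudaMemcpy(host, devPtr, sizeof(host), cudaMemcpyDeviceToHost);

    printf("%d %d %d %d\n", host[0], host[1], host[2], host[3]);
    return 0;
}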