Looking for a _one byte_ invalid opcode with x86 - exception

I need an invalid opcode with x86 (not x64!) that's exactly one byte in length to overwrite some code in a foreign process. Currently I'm using INT3 (0xCC) but it would be nicer to trap an invalid opcode separately since the foreign process contains a lot of valid INT3.

According to http://ref.x86asm.net/coder32.html, there aren't any in 32-bit mode guaranteed to #UD. Anything that wasn't nailed down has been reused as building material for new extensions.
The ones that exist in 64-bit mode are reserved and not guaranteed to fault on future CPUs; only ud2 is truly guaranteed future-proof. Assuming x86-64 lasts long enough, likely some vendor will make use of that 64-bit-only coding space and stop wasting code-size to also cater to increasingly obsolete 32-bit mode.
If you don't need #UD, you can raise #GP(0) with some privileged instructions in user-space, assuming you're never going to be running in kernel mode.
F4 hlt will always #GP(0) in user-space, not enabled by IOPL, only true CPL=0. (Or #UD if used with a lock prefix). Even if it somehow gets executed in a kernel context, it just stops and waits for the next interrupt, so typically no effect on correctness unless executed with interrupts disabled. (In which case you're stuck until the next NMI).
A similar but worse option is FB sti. But it can execute successfully in a program that's used Linux iopl(), like an X11 server. Unless interrupts were supposed to be disabled, though, that's still not going to lock up your system, it just won't trigger the exception you were looking for. (Unlike cli which could get that CPU stuck, or in al, dx which could do wild IO and even be allowed by ioperm not just iopl, depending on what value is in DX.)
Depending what comes next in memory, 9A callf ptr16:32 might fault on trying to load an invalid value into CS. That value would come from the 2 bytes of machine code 5 and 6 bytes after this one (i.e. after a 32-bit new EIP, since ptr16:32 is stored little-endian). Unlike call rel32 or whatever, it may fault before actually pushing anything and overwriting the current CS:EIP. (But if not, in theory your debugger could simulate popping that far-return address back into CS:EIP after catching the fault.)
Just to be clear, I'm suggesting overwriting a byte with 9A, and leaving the later bytes of machine code unmodified, after checking that the bytes that would be the new CS value are in fact invalid. e.g. by making sure a far call to that address segfaults. Or if this is near the end of a page, and the next is unmapped, it can #PF.
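To make the byte layout concrete, here is a minimal sketch in C of which existing bytes a patched-in 9A would consume as its operands. The struct and function names are hypothetical, and the sketch only decodes the bytes; whether the resulting CS value is actually invalid still has to be checked as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operands of a would-be `9A callf ptr16:32` at `site` (hypothetical helper).
 * Layout: site[0] = 0x9A, site[1..4] = new EIP, site[5..6] = new CS,
 * both stored little-endian. */
struct far_target {
    uint32_t eip;
    uint16_t cs;
};

struct far_target far_call_target(const uint8_t *site)
{
    assert(site != NULL);
    struct far_target t;
    t.eip = (uint32_t)site[1] | ((uint32_t)site[2] << 8) |
            ((uint32_t)site[3] << 16) | ((uint32_t)site[4] << 24);
    t.cs = (uint16_t)(site[5] | (site[6] << 8));
    return t;
}
```

A patching tool could call this on the unmodified bytes at the target address and only write 9A if the selector it reports is known to be unusable in that process.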
The F0 lock prefix faults with #UD if used on things other than a memory-destination RMW operation, so it can also work if later context would decode as any other instruction. But you can't always use it; you need to check that you aren't creating a valid atomic RMW instruction. e.g. if the ModRM byte was 00 or 01, replacing the opcode with a lock prefix creates a memory-destination add.
@ecm points out that f1 on some CPUs is icebp / int1, but on other CPUs where it isn't, it's undefined but doesn't raise #UD. (http://ref.x86asm.net/coder32.html#xF1)
If the following byte is 0, D4 00 aam 0 is guaranteed to #DE (divide exception). But any other value does immediate 8-bit division of AL.
Depending what byte comes next, CD int n can be used. But not for all following bytes, e.g. int 0x80 won't fault under Linux (unless your kernel is built without CONFIG_IA32_EMULATION). And you might not want some of the other random interrupt numbers. e.g. CD 03 int 3 is pretty much like CC int3.

When do we create base pointer in a function - before or after local variables?

I am reading the Programming From Ground Up book. I see two different examples of how the base pointer %ebp is created from the current stack position %esp.
In one case, it is done before the local variables.
_start:
# INITIALIZE PROGRAM
subl $ST_SIZE_RESERVE, %esp # Allocate space for pointers on the
# stack (file descriptors in this
# case)
movl %esp, %ebp
The _start however is not like other functions, it is the entry point of the program.
In another case it is done after.
power:
pushl %ebp # Save old base pointer
movl %esp, %ebp # Make stack pointer the base pointer
subl $4, %esp # Get room for our local storage
So my question is, do we first reserve space for local variables in the stack and create the base pointer or first create the base pointer and then reserve space for local variables?
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc. Does C have a specific convention when it creates the machine code?
My reasoning is that all the code in a function would be relative to the base pointer, so as long as that function follows the convention according to which it created a reference of the stack, it just works?
A few related links for those who are interested:
Function Prologue
In your first case you don't care about preservation - this is the entry point. You are trashing %ebp when you exit the program - who cares about the state of the registers? It doesn't matter any more as your application has ended. But in a function, when you return from that function the caller certainly doesn't want %ebp trashed. Now can you modify %esp first then save %ebp then use %ebp? Sure, so long as you unwind the same way on the other end of the function, you may not need to have a frame pointer at all, often that is just a personal choice.
You just need a relative picture of the world. A frame pointer is usually just there to make the compiler author's job easier, actually it is usually there just to waste a register for many instruction sets. Perhaps because some teacher or textbook taught it that way, and nobody asked why.
For coding sanity, the compiler author's sanity etc, it is desirable if you need to use the stack to have a base address from which to offset into your portion of the stack, FOR THE DURATION of the function. Or at least after the setup and before the cleanup. This can be the stack pointer (sp) itself or it can be a frame pointer; sometimes it is obvious from the instruction set. Some have a stack that grows down (in address space toward zero) and the stack pointer can only have positive offsets in sp-based addressing (sane) or negative only (insane; unlikely, but let's say it's there). So you may want a general purpose register. Maybe there are some where you can't use the sp in addressing at all and you have to use a general purpose register.
Bottom line, for sanity you want a reference point to offset items in the stack, the more painful way but uses less memory would be to add and remove things as you go:
x is at sp+4
push a
push b
do stuff
x is at sp+12
pop b
x is at sp+8
call something
pop a
x is at sp+4
do stuff
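The bookkeeping in the listing above can be sketched in C. This is only an illustration with hypothetical names, assuming 4-byte pushes and a stack that grows down, so every push moves x four bytes further from sp and every pop brings it back.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for addressing one stack slot relative to a
 * moving sp (4-byte pushes, stack grows toward lower addresses). */
struct sp_tracker {
    int32_t x_offset;   /* current offset of x from sp, in bytes */
};

/* A push lowers sp by 4, so x is now 4 bytes further above sp. */
void track_push(struct sp_tracker *t) { t->x_offset += 4; }

/* A pop raises sp by 4, so x is now 4 bytes closer to sp. */
void track_pop(struct sp_tracker *t)  { t->x_offset -= 4; }
```

Starting from an offset of 4, two pushes give 12, one pop gives 8, and the final pop returns to 4, matching the annotations in the listing.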
More work, but you can make a program (a compiler) that keeps track and is less error prone than a human doing it by hand. When a human debugs the compiler output, though, it is harder to follow and keep track. So generally we burn the stack space and have one reference point. A frame pointer can be used to separate the incoming parameters from the local variables, using the base pointer (bp), for example, as a static base address within the function and sp as the base address for local variables (although sp could be used for everything if the instruction set provides that much of an offset). So by pushing bp and then modifying sp you are creating this two-base-address situation: sp can move around, perhaps for local stuff (although that is not usually sane), and bp can be used as a static place to grab parameters if this is a calling convention that dictates all parameters are on the stack (generally when you don't have a lot of general purpose registers). Sometimes you see the parameters copied to a local allocation on the stack for later use, but if you have enough registers you may instead see a register saved on the stack and used in the function, with no need to access the stack using a base address and offset.
unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int x )
{
    unsigned int y;
    y = x;
    return(more_fun(x+1)+y);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e2800001 add r0, r0, #1
c: ebfffffe bl 0 <more_fun>
10: e0800004 add r0, r0, r4
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Do not take what you see in a text book, white board (or on answers in StackOverflow) as gospel. Think through the problem, and through alternatives.
Are the alternatives functionally broken?
Are they functionally correct?
Are there disadvantages like readability?
Performance?
Is the performance hit universal or does it depend on just how slow/fast the memory is?
Do the alternatives generate more code which is a performance hit but maybe that code is pipelined vs random memory accesses?
If I don't use a frame pointer does the architecture let me regain that register for general purpose use?
In the first example bp is being trashed, that is bad in general but this is the entry point to the program, there is no need to preserve bp (unless the operating system dictates).
In a function though, based on the calling convention one assumes that bp is used by the caller and must be preserved, so you have to save it on the stack to use it. In this case it appears to want to be used to access parameters passed in by the caller on the stack, then sp is moved to make room for (and possibly access but not necessarily required if bp can be used) local variables.
If you were to modify sp first then push bp, you would basically have two pointers one push width away from each other, does that make much sense? Does it make sense to have two frame pointers anyway and if so does it make sense to have them almost the same address?
By pushing bp first, and if the calling convention pushes the first parameter last, then as a compiler author you can make bp+N always (or ideally always) point at the first parameter for a fixed value N; likewise bp+M always points at the second. A bit lazy to me, but if the register is there to be burned then burn it...
In one case, it is done before the local variables.
_start is not a function. It's your entry point. There's no return address, and no caller's value of %ebp to save.
The i386 System V ABI doc suggests (in section 2.3.1 Initial Stack and Register State) that you might want to zero %ebp to mark the deepest stack frame. (i.e. before your first call instruction, so the linked list of saved ebp values has a NULL terminator when that first function pushes the zeroed ebp. See below).
Does C have a specific convention when it creates the machine code?
No, unlike in some other x86 systems, the i386 System V ABI doesn't require much about your stack-frame layout. (Linux uses the System V ABI / calling convention, and the book you're using (PGU) is for Linux.)
In some calling conventions, setting up ebp is not optional, and the function entry sequence has to push ebp just below the return address. This creates a linked list of stack frames which allows an exception handler (or debugger) to backtrace up the stack. (How to generate the backtrace by looking at the stack values?). I think this is required in 32-bit Windows code for SEH (structured exception handling), at least in some cases, but IDK the details.
The i386 SysV ABI defines an alternate mechanism for stack unwinding which makes frame pointers optional, using metadata in another section (.eh_frame and .eh_frame_hdr which contains metadata created by .cfi_... assembler directives, which in theory you could write yourself if you wanted stack-unwinding through your function to work. i.e. if you were calling any C++ code which expected throw to work.)
If you want to use the traditional frame-walking in current gdb, you have to actually do it yourself by defining a GDB function like gdb backtrace by walking frame pointers or Force GDB to use frame-pointer based unwinding. Or apparently if your executable has no .eh_frame section at all, gdb will use the EBP-based stack-walking method.
If you compile with gcc -fno-omit-frame-pointer, your call stack will have this linked-list property, because when C compilers do make proper stack frames, they push ebp first.
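As a rough illustration of that linked-list property, the sketch below walks a chain of frames in C. The struct and function names are hypothetical, and it operates on simulated frames passed in by the caller rather than reading a real EBP; it assumes every function in the chain used the push ebp / mov esp, ebp sequence and that the outermost frame stored a zeroed EBP.

```c
#include <stddef.h>

/* What `push %ebp; mov %esp, %ebp` leaves at the address held in EBP:
 * the caller's saved EBP, with the return address just above it. */
struct frame {
    const struct frame *saved_ebp;  /* NULL marks the outermost frame */
    const void *return_address;
};

/* Copy up to `max` return addresses into `out`, innermost first.
 * Returns the number of frames visited. */
size_t walk_frames(const struct frame *fp, const void **out, size_t max)
{
    size_t n = 0;
    while (fp != NULL && n < max) {
        out[n++] = fp->return_address;
        fp = fp->saved_ebp;
    }
    return n;
}
```

The `max` limit matters in practice: a single function that uses EBP as a scratch register breaks the chain, and a walker without a bound could follow garbage pointers indefinitely.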
IIRC, perf has a mode for using the frame-pointer chain to get backtraces while profiling, and apparently this can be more reliable than the default .eh_frame stuff for correctly accounting which functions are responsible for using the most CPU time. (Or causing the most cache misses, branch mispredicts, or whatever else you're counting with performance counters.)
Wouldn't both just work even if I mix them up in different functions of a program? One function does it before, the other does it after etc.
Yes, it would work fine. In fact setting up ebp at all is optional, but when writing by hand it's easier to have a fixed base (unlike esp which moves around when you push/pop).
For the same reason, it's easier to stick to the convention of mov %esp, %ebp after one push (of the old %ebp), so the first function arg is always at ebp+8. See What is stack frame in assembly? for the usual convention.
But you could maybe save code size by having ebp point in the middle of some space you reserved, so all the memory addressable with an ebp + disp8 addressing mode is usable. (disp8 is a signed 8-bit displacement: -128 to +124 if we're limiting to 4-byte aligned locations). This saves code bytes vs. needing a disp32 to reach farther. So you might do
bigfunc:
push %ebp
lea -112(%esp), %ebp # first arg at ebp+8+112 = 120(%ebp)
sub $236, %esp # locals from -124(%ebp) ... 108(%ebp)
# saved EBP at 112(%ebp), ret addr at 116(%ebp)
# 236 was chosen to leave %esp 16-byte aligned.
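The displacement arithmetic behind that layout can be checked with a small C sketch. The helper names are hypothetical; the only facts used are that disp8 is a signed 8-bit value and that the slots of interest are 4-byte aligned.

```c
#include <stdbool.h>

/* True if `offset` can be encoded as a signed 8-bit displacement. */
bool fits_disp8(long offset)
{
    return offset >= -128 && offset <= 127;
}

/* Number of 4-byte-aligned slots reachable with base + disp8,
 * i.e. offsets -128, -124, ..., +124. */
int disp8_dword_slots(void)
{
    int count = 0;
    for (long off = -128; off <= 124; off += 4)
        if (fits_disp8(off))
            count++;
    return count;
}
```

With EBP biased into the middle of the reserved space, the first argument at 120(%ebp) and the lowest local at -124(%ebp) both pass fits_disp8, whereas a conventional frame can only reach the slots at negative offsets for its locals.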
Or delay saving any registers until after reserving space for locals, so we aren't using up any of the locations (other than the ret addr) with saved values we never want to address.
bigfunc2: # first arg at 4(%esp)
sub $252, %esp # first arg at 252+4(%esp)
push %ebp # first arg at 252+4+4(%esp)
lea 140(%esp), %ebp # first arg at 260-140 = 120(%ebp)
push %edi # save the other call-preserved regs
push %esi
push %ebx
# %esp is 16-byte aligned after these pushes, in case that matters
(Remember to be careful how you restore registers and clean up. You can't use leave because esp = ebp isn't right. With the "normal" stack frame sequence, you might restore other pushed registers (from near the saved EBP) with mov, then use leave. Or restore esp to point at the last push (with add), and use pop instructions.)
But if you're going to do this, there's no advantage to using ebp instead of ebx or something. In fact, there's a disadvantage to using ebp: the 0(%ebp) addressing mode requires a disp8 of 0, instead of no displacement, but %ebx wouldn't. So use %ebp for a non-pointer scratch register. Or at least one that you don't dereference without a displacement. (This quirk is irrelevant with a real frame pointer: (%ebp) is the saved EBP value. And BTW, the encoding that would mean (%ebp) with no displacement is how the ModRM byte encodes a disp32 with no base register, like (12345) or my_label)
These examples are pretty artificial; you usually don't need that much space for locals unless it's an array, and then you'd use indexed addressing modes or pointers, not just a disp8 relative to ebp. But maybe you need space for a few 32-byte AVX vectors. In 32-bit code with only 8 vector registers, that's plausible.
AVX512 compressed disp8 mostly defeats this argument for 64-byte AVX512 vectors, though. (But AVX512 in 32-bit mode can still only use 8 vector registers, zmm0-zmm7, so you could easily need to spill some. You only get x/ymm8-15 and zmm8-31 in 64-bit mode.)

Merit of inline-ASM rounding via putting float into int variable

I have inherited a pretty interesting piece of code:
inline int round(float a)
{
    int i;
    __asm {
        fld a
        fistp i
    }
    return i;
}
My first impulse was to discard it and replace calls with (int)std::round (pre-C++11, would use std::lround if it happened today), but after a while I started to wonder if it might have some merit after all...
The use case for this function are all values in [-100, 100], so even int8_t would be wide enough to hold the result. fistp requires at least a 32 bit memory variable, however, so less than int32_t is just as wasted as more.
Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards. C++11 offers the std::lround function, which alleviates this particular issue, but still does seem to be more wasteful, considering that the value passes float->long->int instead of directly arriving where it should.
On the other hand, with inline-ASM in the function, the compiler cannot optimise away i into a register (and even if it could, fistp expects a memory variable), so std::lround does not seem too much worse...
The most pressing question I have is however how safe it is to assume (as this function does), that the rounding mode will always be round-to-nearest, as it obviously does (no checks). As std::lround has to guarantee a certain behaviour independent of rounding mode, this assumption, as long as it holds, always seems to make the inline-ASM round the better option.
It is furthermore highly unclear to me whether the rounding mode set by std::fesetround and used by the std::lround alternative std::lrint and the rounding mode employed in the fistp ASM-instruction are guaranteed to be the same or at least synchronous.
These are my considerations, aka what I do not know to make an informed decision on retaining or replacing the function.
Now to the questions:
Following a more informed view of these considerations or such which I have not thought of, does it seem advisable to use this function?
How great is the risk, if any?
Does reasoning exist for why it would not be faster than std::lround or std::lrint?
Can it be further improved without performance cost?
Does any of this reasoning change if the program were compiled for x86-64?
TL;DR: use lrintf(x) or (int)nearbyintf(x), depending on which one your compiler likes better.
Check the asm to see which one inlines when SSE4.1 is available (e.g. -march=nehalem or penryn, or later), with or without -ffast-math. You may need -fno-math-errno to get GCC to inline sometimes, but clang inlines anyway. This is 100% safe unless you actually expect lrintf or sqrtf or other math functions to set errno, and is generally recommended along with -fno-trapping-math.
Don't use inline asm when you can possibly avoid it. Compilers don't "understand" what it does, so they can't optimize through it. e.g. If that function is inlined somewhere that makes its argument a compile-time constant, it will still fld a constant and fistp it to memory, then load that back into an integer register. Pure C will let the compiler propagate the constant and just mov r32, imm32, or further propagate the constant and fold it into something else. Not to mention CSE, and hoisting the conversion out of a loop. (MSVC inline asm doesn't let you specify that an asm block is a pure function, and only needs to be run if the output value is needed, and that it doesn't depend on a global. GNU C inline asm does allow that part, but it's still a bad choice for this because it's not transparent to the compiler).
The GCC wiki even has a page on this subject, explaining the same things as my previous paragraph (and more), so inline asm should definitely be a last resort.
In this case, we can get the compiler to emit good code from pure C, so we should absolutely do that.
Float->int with the current rounding mode only takes a single machine instruction (see below), but the trick is to get a compiler to emit it (and only it). Getting math-library functions to inline can be tricky, because some of them have to set errno and/or raise an inexact exception in certain cases. (-fno-math-errno can help, if you can't use the full -ffast-math or the MSVC equivalent)
With some compilers (gcc but not clang), lrintf is good. It isn't ideal, though: float->long->int isn't the same as directly to int when they're not the same size. The x86-64 SystemV ABI (used by everything except Windows) has 64bit long.
64bit long changes the overflow semantics for lrint: instead of getting 0x80000000 (on x86 with SSE instructions), you'll get the low 32bits of the long (which will be all-zero if the value was outside the range of a long).
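The narrowing step can be illustrated on plain integers, without touching the FP unit. This is a sketch with a hypothetical helper name; it uses an unsigned result to show the bit pattern, because narrowing an out-of-range value to a signed int is implementation-defined in C.

```c
#include <stdint.h>

/* Keep only the low 32 bits of a 64-bit result, as happens when a 64-bit
 * long returned by lrintf is narrowed to a 32-bit int on typical x86-64
 * targets. */
uint32_t low32(int64_t wide)
{
    return (uint32_t)(uint64_t)wide;
}
```

The out-of-range result of a 64-bit cvtss2si is 0x8000000000000000, which is INT64_MIN, so low32 yields 0 for it. A value that fits in a long but not in an int, such as 2^32 + 5, silently becomes 5.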
This lrintf won't auto-vectorize (unless maybe the compiler can prove that the floats will be in-range), because there are only scalar, not SIMD, instructions to convert floats or double to packed 64bit integers (until AVX512DQ). IDK of a C math library function to convert directly to int, but you can use (int)nearbyintf(x), which does auto-vectorize more easily in 64bit code. See the section below for how well gcc and clang do with that.
Other than defeating auto-vectorization, though, there's no direct speed penalty for cvtss2si rax, xmm0 on any modern microarchitecture (see Agner Fog's insn tables). It just costs an extra instruction byte for the REX prefix.
On AArch64 (aka ARM64), gcc4.8 compiles lround into a single fcvtas x0, s0 instruction, so I guess ARM64 provides that funky rounding mode in hardware (but x86 doesn't). Strangely, -ffast-math makes fewer functions inline, but that's with clunky old gcc4.8. For ARM (not 64), gcc4.8 doesn't inline anything, even with -mfloat-abi=hard -mhard-float -march=armv7-a. Maybe those aren't the right options; IDK ARM very well :/
If you have a lot of conversions to do, you can manually vectorize for x86 with SSE / AVX intrinsics, like _mm_cvtps_epi32 (cvtps2dq), and even pack the resulting 32bit integer elements down to 16 or 8 bit (with packssdw). However, using pure C that the compiler can auto-vectorize is a good plan, because it's portable.
lrintf
#include <math.h>
int round_to_nearest(float f) { // default mode is always nearest
    return lrintf(f);
}
Compiler output from the Godbolt Compiler explorer:
########### Without -ffast-math #############
cvtss2si eax, xmm0 # gcc 6.1 (-O3 -mx32, so long is 32bit)
cvtss2si rax, xmm0 # gcc 4.4 through 6.1 (-O3). can't auto-vectorize, though.
jmp lrintf # clang 3.8 (-O3 -msse4.1), still tail-calls the function :/
###### With -ffast-math #########
jmp lrintf # clang 3.8 (-O3 -msse4.1 -ffast-math)
So clearly clang doesn't do well with it, but even ancient gcc is great, and does a good job even without -ffast-math.
Don't use roundf/lroundf: it has non-standard rounding semantics (halfway cases away from 0, instead of to even). This leads to worse x86 asm, but actually better ARM64 asm. So maybe do use it for ARM? It does have fixed rounding behaviour, though, instead of using the current rounding mode.
If you want the return value as a float, instead of converting to int, it may be better to use nearbyintf. rint has to raise the FP inexact exception when output != input. (But SSE4.1 roundss can implement either behaviour with bit 3 of its immediate control byte).
truncating nearbyint() to int directly.
#include <math.h>
int round_to_nearest(float f) {
    return nearbyintf(f);
}
Compiler output from the Godbolt Compiler explorer.
######## With -ffast-math ############
cvtss2si eax, xmm0 # gcc 4.8 through 6.1 (-O3 -ffast-math)
# clang is dumb and won't fold the roundss into the cvt. Without sse4.1, it's a function call
roundss xmm0, xmm0, 12 # clang 3.5 to 3.8 (-O3 -ffast-math -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1 -ffast-math)
cvtss2si eax, xmm1
######## WITHOUT -ffast-math ############
sub rsp, 8
call nearbyintf # gcc 6.1 (-O3 -msse4.1)
add rsp, 8 # and clang without -msse4.1
cvttss2si eax, xmm0
roundss xmm0, xmm0, 12 # clang3.2 and later (-O3 -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1)
cvtss2si eax, xmm1
Gcc 4.7 and earlier: Just cvttss2si without -msse4.1, but emits a roundss if SSE4.1 is available. Its nearbyint definition must be using inline asm, because the asm syntax is broken in intel-syntax output. Probably this is how it gets inserted and then not optimized away when it realizes it's converting to int.
How it works in asm
Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards.
That's only true if you're targeting 20-year-old CPUs without SSE. (You said float, not double, so we only need SSE, not SSE2. The newest CPUs without SSE2 are Athlon XP).
Modern systems do floating point in xmm registers. SSE has instructions to convert a scalar float to signed int with truncation (cvttss2si) or with the current rounding mode (cvtss2si). (Note the extra t for Truncate in the first one. The rest of the mnemonic is Convert Scalar Single-precision To Signed Integer.) There are similar instructions for double, and x86-64 allows the destination to be a 64bit integer register.
See also the x86 tag wiki.
cvttss2si basically exists because of C's default behaviour for casting float to int, which truncates. Changing the rounding mode is slow, so Intel provided a way to do it that doesn't suck.
I think even 32bit versions of modern Windows require hardware new enough to have SSE2, in case that matters to anyone. (SSE2 is part of the AMD64 ISA, and the 64bit calling conventions even pass float / double args in xmm registers).

How to diag imprecise bus fault after config of priority bit allocation, Cortex M3 STM32F10x w uC/OS-III

I have an issue in an app written for the ST Microelectronics STM32F103 (ARM Cortex-M3 r1p1). RTOS is uC/OS-III; dev environment is IAR EWARM v. 6.44; it also uses the ST Standard Peripheral Library v. 1.0.1.
The app is not new; it's been in development and in the field for at least a year. It makes use of two UARTs, I2C, and one or two timers. Recently I decided to review interrupt priority assignments, and I rearranged priorities as part of the review (things seemed to work just fine).
I discovered that there was no explicit allocation of group and sub-priority bits in the initialization code, including the RTOS, and so to make the app consistent with another app (same product, different processor) and with the new priority scheme, I added a call to NVIC_PriorityGroupConfig(), passing in NVIC_PriorityGroup_2. This sets the PRIGROUP value in the Application Interrupt and Reset Control Register (AIRCR) to 5, allocating 2 bits for group (preemption) priority and 2 bits for subpriority.
After doing this, I get an imprecise bus fault exception on execution, not immediately but very quickly thereafter. (More on where I suspect it occurs in a moment.) Since it's imprecise (BFSR.IMPRECISERR asserted), there's nothing of use in BFAR (BFSR.BFARVALID clear).
The STM32F family group implements 4 bits of priority. While I've not found this mentioned explicitly anywhere, it's apparently the most significant nybble of the priority. This assumption seems to be validated by the PRIGROUP table given in documentation (STM32F10xxx/20xxx/21xxx/L1xxx Cortex-M3 Programming Manual (Doc 15491, Rev 5), sec. 4.4.5, Application interrupt and control register (SCB_AIRCR), Table 45, Priority grouping, p. 134).
In the ARM scheme, priority values comprise some number of group or preemption priority bits and some number of subpriority bits. Group priority bits are upper bits; subpriority are lower. The 3-bit AIRCR.PRIGROUP value controls how bit allocation for each are defined. PRIGROUP = 0 configures 7 bits of group priority and 1 bit of subpriority; PRIGROUP = 7 configures 0 bits of group priority and 8 bits of subpriority (thus priorities are all subpriority, and no preemption occurs for exceptions with settable priorities).
The reset value of AIRCR.PRIGROUP is defined to be 0.
For the STM32F10x, since only the upper 4 bits are implemented, it seems to follow that PRIGROUP = 0, 1, 2, 3 should all be equivalent, since they all correspond to >= 4 bits of group priority.
Given that assumption, I also tried calling NVIC_PriorityGroupConfig() with a value of NVIC_PriorityGroup_4, which corresponds to a PRIGROUP value of 3 (4 bits group priority, no subpriority).
This change also results in the bus fault exception.
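The bit allocation reasoning above can be written down as a short C sketch. The struct and function names are hypothetical; the rule it encodes is the one described earlier, where PRIGROUP = n makes bits [7:n+1] group priority and bits [n:0] subpriority, and only the top implemented bits of the 8-bit field have any effect.

```c
#include <stdint.h>

struct prio_split {
    uint8_t group_bits;  /* preemption priority bits that take effect */
    uint8_t sub_bits;    /* subpriority bits that take effect */
};

/* Split an 8-bit priority field for a given PRIGROUP (0..7) when only the
 * top `implemented` bits exist (4 on the STM32F10x, per the text above). */
struct prio_split split_priority(unsigned prigroup, unsigned implemented)
{
    unsigned group = 7u - (prigroup & 7u);   /* architectural group bits */
    if (group > implemented)
        group = implemented;                 /* unimplemented low bits drop out */
    struct prio_split s = { (uint8_t)group, (uint8_t)(implemented - group) };
    return s;
}
```

For 4 implemented bits this gives 2 and 2 for PRIGROUP = 5, and 4 and 0 for every PRIGROUP from 0 to 3, which is the equivalence the question relies on.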
Unfortunately, the STM32F103 is, I believe, r1p1, and so does not implement the Auxiliary Control Register (ACTLR; introduced in r2p0), so I can't try out the DISDEFWBUF bit (disables use of the write buffer during default memory map accesses, making all bus faults precise at the expense of some performance reduction).
I'm almost certain that the bus fault occurs in an ISR, and most likely in a UART ISR. I've set a breakpoint at a particular place in code, started the app, and had the bus fault before execution hit the breakpoint; however, if I step through code in the debugger, I can get to and past that breakpoint's location, and if I allow it to execute from there, I'll see the bus fault some small amount of time after I continue.
The next step will be to attempt to pin down what ISR is generating the bus fault, so that I can instrument it and/or attempt to catch its invocation and step through it.
So my questions are:
1) Anyone have any suggestions as to how to go about identifying the origin of imprecise bus fault exceptions more intelligently?
2) Why would setting PRIGROUP = 3 change the behavior of the system when PRIGROUP = 0 is the reset default? (PRIGROUP=0 means 7 bits group, 1 bit sub priority; PRIGROUP=3 means 4 bits group, 4 bits sub priority; STM32F10x only implements upper 4 bits of priority.)
Many, many thanks to everyone in advance for any insight or non-NULL pointers!
(And of course if I figure it out beforehand, I'll update this post with any information that might be useful to others encountering the same sort of scenario.)
Even if BFAR is not valid, you can still read other related registers within your bus-fault ISR:
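#include <stdio.h>
/* SCB is declared by the CMSIS device header (stm32f10x.h with the ST library). */

/* hardfault_args points at the exception stack frame: R0-R3, R12, LR, PC, xPSR. */
void HardFault_Handler_C(unsigned int* hardfault_args)
{
    printf("R0 = 0x%.8X\r\n", hardfault_args[0]);
    printf("R1 = 0x%.8X\r\n", hardfault_args[1]);
    printf("R2 = 0x%.8X\r\n", hardfault_args[2]);
    printf("R3 = 0x%.8X\r\n", hardfault_args[3]);
    printf("R12 = 0x%.8X\r\n", hardfault_args[4]);
    printf("LR = 0x%.8X\r\n", hardfault_args[5]);
    /* For an imprecise fault the stacked PC is only near the offending store, not at it. */
    printf("PC = 0x%.8X\r\n", hardfault_args[6]);
    printf("PSR = 0x%.8X\r\n", hardfault_args[7]);
    /* Fault status and address registers in the System Control Block. */
    printf("BFAR = 0x%.8X\r\n", *(volatile unsigned int*)0xE000ED38);
    printf("CFSR = 0x%.8X\r\n", *(volatile unsigned int*)0xE000ED28);
    printf("HFSR = 0x%.8X\r\n", *(volatile unsigned int*)0xE000ED2C);
    printf("DFSR = 0x%.8X\r\n", *(volatile unsigned int*)0xE000ED30);
    printf("AFSR = 0x%.8X\r\n", *(volatile unsigned int*)0xE000ED3C);
    printf("SHCSR = 0x%.8X\r\n", (unsigned int)SCB->SHCSR);
    while (1);
}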
If you can't use printf at the point in the execution when this specific Hard-Fault interrupt occurs, then save all the above data in a global buffer instead, so you can view it after reaching the while (1).
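A minimal sketch of that global-buffer variant follows. The struct, variable, and function names are hypothetical. Only the eight stacked words are captured here so that the sketch also runs on a host; on the target, the fault status registers (CFSR, HFSR, and so on) could be copied into extra fields in the same way.

```c
#include <stdint.h>

/* Snapshot of the eight words the Cortex-M hardware stacks on exception
 * entry, in stacking order. */
struct fault_snapshot {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
    volatile uint32_t valid;   /* set last, so a debugger can trust the rest */
};

struct fault_snapshot g_fault_snapshot;

/* Copy the stacked frame into the global snapshot for later inspection. */
void save_fault_frame(const uint32_t *stacked)
{
    g_fault_snapshot.r0  = stacked[0];
    g_fault_snapshot.r1  = stacked[1];
    g_fault_snapshot.r2  = stacked[2];
    g_fault_snapshot.r3  = stacked[3];
    g_fault_snapshot.r12 = stacked[4];
    g_fault_snapshot.lr  = stacked[5];
    g_fault_snapshot.pc  = stacked[6];
    g_fault_snapshot.psr = stacked[7];
    g_fault_snapshot.valid = 1;
}
```

The handler would call save_fault_frame with the same pointer it receives as hardfault_args and then spin, leaving g_fault_snapshot available in the debugger's watch window.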
Here is the complete description of how to connect this ISR to the interrupt vector (although, as I understand from your question, you already have it implemented):
Jumping from one firmware to another in MCU internal FLASH
You might be able to find additional information on top of what you already know at:
http://www.keil.com/appnotes/files/apnt209.pdf

How do interpreters load their values?

I mean, interpreters work on a list of instructions, which seem to be composed more or less by sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers, by doing bit-wise operations, for use in a big switch statement where all operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 28 bit integer. If I wanted to store it in the next instruction, I'm limited to a 32 bit value.
Values like Strings require lots more storage than this. How do most interpreters get away with this in an efficient manner?
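To make the assumed layout concrete, here is a rough sketch of the bit-wise packing and unpacking. It assumes the opcode sits in the 4 most significant bits of the 32-bit word, which leaves 32 - 4 = 28 bits for an inline operand. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: the opcode occupies the 4 most significant bits. */
#define OPCODE_BITS   4u
#define OPERAND_BITS  (32u - OPCODE_BITS)          /* 28 bits left for data */
#define OPERAND_MASK  ((1u << OPERAND_BITS) - 1u)  /* 0x0FFFFFFF */

/* Pack an opcode and an inline operand into one instruction word. */
static inline uint32_t insn_encode(uint32_t opcode, uint32_t operand)
{
    return (opcode << OPERAND_BITS) | (operand & OPERAND_MASK);
}

/* Extract the opcode (top 4 bits). */
static inline uint32_t insn_opcode(uint32_t insn)
{
    return insn >> OPERAND_BITS;
}

/* Extract the inline operand (low 28 bits). */
static inline uint32_t insn_operand(uint32_t insn)
{
    return insn & OPERAND_MASK;
}
```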
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is quite different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:
optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
absolute (address specified in 64 bits following instruction)
relative (offset specified in or following instruction)
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
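As a rough illustration of the field split described above, here is one possible way to pack a 2-bit data type and a 6-bit operation into a single instruction byte. The bit positions and all names are assumptions for the sketch. Addressing-mode and condition bits are left out and would need extra bits, for example in a second byte.

```c
#include <stdint.h>

/* The four basic data types assumed above, one per 2-bit code. */
enum vm_type { VM_I8 = 0, VM_I64 = 1, VM_F32 = 2, VM_F64 = 3 };

/* Hypothetical layout: bits 7..6 = data type, bits 5..0 = operation. */
static inline uint8_t vm_pack(enum vm_type type, unsigned op)
{
    return (uint8_t)(((unsigned)type << 6) | (op & 0x3Fu));
}

/* Recover the data type from an instruction byte. */
static inline enum vm_type vm_type_of(uint8_t insn)
{
    return (enum vm_type)(insn >> 6);
}

/* Recover the operation number (0..63) from an instruction byte. */
static inline unsigned vm_op_of(uint8_t insn)
{
    return insn & 0x3Fu;
}
```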
Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
public:
    int x;
    int y;
};
class B {
public:
    int a;
    int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would amount to values stored in the executable file being loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
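The sequence above can be tried out with a very small interpreter. What follows is only a sketch under several assumptions: memory is an array of 64-bit words, an "address" is an index into that array, the immediate after a push is stored as 8 little-endian bytes, and the opcode numbering and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_PUSH_IMM, OP_DUPE, OP_LOAD, OP_ADD, OP_STORE };

#define STACK_MAX 16

/* Read the 64-bit little-endian immediate that follows a push. */
static uint64_t read_imm64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}

/* Run `len` bytes of code against `mem` (an array of mem_words words). */
void vm_run(const uint8_t *code, size_t len, int64_t *mem, size_t mem_words)
{
    int64_t stack[STACK_MAX];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;

    while (pc < len) {
        switch (code[pc++]) {
        case OP_PUSH_IMM:   /* push the 8-byte immediate */
            assert(sp < STACK_MAX && pc + 8 <= len);
            stack[sp++] = (int64_t)read_imm64(code + pc);
            pc += 8;
            break;
        case OP_DUPE:       /* duplicate the top of stack */
            assert(sp >= 1 && sp < STACK_MAX);
            stack[sp] = stack[sp - 1];
            sp++;
            break;
        case OP_LOAD:       /* replace address on top with mem[address] */
            assert(sp >= 1 && (uint64_t)stack[sp - 1] < mem_words);
            stack[sp - 1] = mem[stack[sp - 1]];
            break;
        case OP_ADD:        /* pop two values, push their sum */
            assert(sp >= 2);
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_STORE:      /* pop value, pop address, mem[address] = value */
            assert(sp >= 2 && (uint64_t)stack[sp - 2] < mem_words);
            mem[stack[sp - 2]] = stack[sp - 1];
            sp -= 2;
            break;
        default:
            assert(!"unknown opcode");
            return;
        }
    }
}

/* a.x += b.a, with a.x at word 0 and b.a at word 2. */
static const uint8_t demo_code[] = {
    OP_PUSH_IMM, 0, 0, 0, 0, 0, 0, 0, 0,   /* push &a.x */
    OP_DUPE,
    OP_LOAD,
    OP_PUSH_IMM, 2, 0, 0, 0, 0, 0, 0, 0,   /* push &b.a */
    OP_LOAD,
    OP_ADD,
    OP_STORE,
};

/* Initialize a = {1, 2} and b = {3, 4}, run the code, return a.x. */
int64_t demo_run(void)
{
    int64_t mem[4] = {1, 2, 3, 4};   /* a.x, a.y, b.a, b.b */
    vm_run(demo_code, sizeof demo_code, mem, 4);
    return mem[0];
}
```

The demo_code array is 7 opcode bytes plus two 8-byte addresses, which matches the 23-byte estimate, and demo_run() leaves 1 + 3 = 4 in a.x.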
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example of the compiler side, taken from Wren:
static void string(Compiler* compiler, int allowAssignment)
{
    // Define a constant for the literal.
    int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
        compiler->parser->currentString, compiler->parser->currentStringLength));
    // Compile the code to load the constant.
    emit(compiler, CODE_CONSTANT);
    emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's the corresponding case in the interpreter loop:
CASE_CODE(CONSTANT):
    PUSH(frame->fn->constants[READ_ARG()]);
    DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.
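For readers who want something self-contained to experiment with, here is a stripped-down sketch of the same constant-table idea. It is not Wren's code: the Chunk type, the function names and the opcode value are invented for illustration, constants are plain C string pointers rather than runtime objects, and the one-byte index limits a chunk to 256 constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CONSTANTS 256
#define MAX_CODE      256

enum { OP_LOAD_CONSTANT = 1 };

/* Bytecode plus the constant table that rides along with it. */
typedef struct {
    const char *constants[MAX_CONSTANTS];
    int         constant_count;
    uint8_t     code[MAX_CODE];
    int         code_len;
} Chunk;

/* Append a value to the constant table and return its index. */
static int add_constant(Chunk *c, const char *value)
{
    assert(c->constant_count < MAX_CONSTANTS);
    c->constants[c->constant_count] = value;
    return c->constant_count++;
}

/* Append one byte to the bytecode. */
static void emit_byte(Chunk *c, uint8_t byte)
{
    assert(c->code_len < MAX_CODE);
    c->code[c->code_len++] = byte;
}

/* Compiler side: a string literal becomes LOAD_CONSTANT <index>. */
void compile_string_literal(Chunk *c, const char *text)
{
    int index = add_constant(c, text);
    emit_byte(c, OP_LOAD_CONSTANT);
    emit_byte(c, (uint8_t)index);
}

/* Runtime side: decode the index argument and fetch the constant.
   Advances *pc past the two-byte instruction. */
const char *exec_load_constant(const Chunk *c, int *pc)
{
    assert(c->code[*pc] == OP_LOAD_CONSTANT);
    uint8_t index = c->code[*pc + 1];
    *pc += 2;
    return c->constants[index];
}
```

However long the string is, the instruction stream only ever carries the small index, which is how the size limit of a single instruction word is sidestepped.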

Getting a wrong EPC on MIPS

I know that when an exception occurs in a branch delay slot, MIPS sets EPC to the address of the branch instruction, i.e. epc = fault_address - 4.
But now I often get an EPC value that is not even in the .text segment, such as 0xb6000000. What is going wrong in this case?
Thanks in advance.
The CPU does not know anything about the boundaries of the .text region in your program. It simply implements a 2^32 byte address space.
It is possible for an incorrectly programmed jump to go to any address within the 2^32 byte address space. The jump instruction itself will not cause any sort of exception - in fact the MIPS32® Architecture for Programmers Volume II: The MIPS32® Instruction Set explicitly states that jump (J, JR, JALR) instructions do not trigger any exceptions.
When the processor starts executing from the destination of an incorrectly programmed jump, in presumably uninitialized memory, what happens next depends on the contents of that memory. If uninitialized memory is filled with "random" data, that data will be interpreted as instructions which the processor will execute until an illegal instruction is found, or until an instruction triggers some other exception.
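When triaging such a case in an exception handler, it may help to first apply the branch-delay adjustment and then compare the result against the bounds of .text. The sketch below assumes the MIPS32 convention that bit 31 of the Cause register (BD) is set when the exception was taken in a delay slot. The function names are hypothetical, and the .text bounds are passed in as parameters (on a real system they would typically come from linker-script symbols).

```c
#include <stdint.h>

/* Cause.BD (bit 31): exception was taken in a branch delay slot. */
#define CAUSE_BD (1u << 31)

/* Address of the instruction that actually faulted: EPC points at the
   branch when BD is set, so the delay-slot instruction is at EPC + 4. */
static inline uint32_t mips_fault_address(uint32_t epc, uint32_t cause)
{
    return (cause & CAUSE_BD) ? epc + 4u : epc;
}

/* Nonzero if addr lies in [text_start, text_end). */
static inline int address_in_text(uint32_t addr,
                                  uint32_t text_start, uint32_t text_end)
{
    return addr >= text_start && addr < text_end;
}
```

If the adjusted address still falls outside .text, as 0xb6000000 would for a typical image, the delay-slot offset is not the explanation and a stray jump (for example through a corrupted return address or function pointer) is the more likely cause.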