I know that illegal operation (such as write on an read-only area) on the memory would cause a data abort exception on ARM CPU. But I wander whether operations on the stack memory cause a data abort exception.
For example, I set the sp register points to a read-only memory, and I use a push instruction to store some value on the stack. Would the push operation cause a data abort exception?
Related
I am developing an UEFI application for ARM64 (ARMv8-A) and I have come across the issue: "Synchronous Exceptions at 0xFF1BB0B8."
This value (0x0FF1BB0B8) is exception link register (ELR). ELR holds the exception return address.
There are a number of sources of Synchronous exceptions (https://developer.arm.com/documentation/den0024/a/AArch64-Exception-Handling/Synchronous-and-asynchronous-exceptions):
Instruction aborts from the MMU. For example, by reading an
instruction from a memory location marked as Execute Never.
Data Aborts from the MMU. For example, Permission failure or alignment checking.
SP and PC alignment checking.
Synchronous external aborts. For example, an abort when reading translation table.
Unallocated instructions.
Debug exceptions.
I can't update BIOS firmware to output more debug information. Is there any way to detect more precisely what causes Synchronous Exception?
Can I use the value of ELR (0x0FF1BB0B8) to locate the issue? I compile with -fno-pic -fno-pie options.
Taken from the Intel 80386 Programmer's Reference Manual, CH9:
Exceptions are classified as faults, traps, or aborts depending on the way
they are reported and whether restart of the instruction that caused the
exception is supported.
Aborts: An abort is an exception that permits neither precise location
of the instruction causing the exception nor restart of the program
that caused the exception. Aborts are used to report severe errors,
such as hardware errors and inconsistent or illegal values in system
tables.
(Emphasis mine)
What does it exactly mean that aborts are exceptions that do not permit precise location of the instruction causing the exception?
There's only 2 "abort" exceptions:
a) Double fault exception. This happens when trying to start one exception caused a second exception; where the instruction pointer (saved by CPU when starting the double fault exception handler) could be from the original instruction, the first exception, the second, something else or nothing. In this case, because the CPU couldn't start the first or second exception handler you can't return from the double fault to the second (or first) exception handler anyway.
b) Machine check exception. This is purely for hardware faults where you probably don't want to assume that memory, or caches, or the CPU actually work. You can't expect "guaranteed behavior" from faulty hardware.
Note 1: Technically, for some causes of machine check, you can return to the precise location of whatever happened to be interrupted by the exception; you just need to be very careful about determining if you can/can't return (and need some way to fix/work-around the hardware problem/s so that returning from the machine check exception handler won't just trigger a second machine check exception).
Note 2: It's perfectly possible for the exceptions preceding a double fault, or machine check, to be caused by something that isn't an instruction at all (e.g. caused by an IRQ). In these cases "the precise location of the instruction causing the exception" would be "the precise location of something that doesn't exist". Intel's words should be interpreted as being conditional (like "an exception that permits neither precise location of the instruction causing the exception if there is one, nor restart of the program that caused the exception if there is one").
I have read that some CPUs will produce an exception if they tried to access an unaligned data.
Based on a testing I made, an x86 CPU did not produce an exception when trying to access an unaligned data, but I am wondering is there a situation where an unaligned data will cause an x86 CPU to produce an exception?
On x86, if you set the AM flag in the CR0 register and set the AC flag in the EFLAGS register, then any unaligned memory access at CPL 3 (user priv level) will cause an #AC exception (interrupt 17). Since normally these bits are clear, and access to them is privileged you'd need to go to some effort to enable them (which might be impossible on some OSes).
I've been studying assembly lately and i can't seem to understand how the exceptions work exactly. More specific, i get the message Exception 6 occurred and ignored. Can someone please explain what exactly does this mean? I am using qtspim.
Exceptions may be caused by hardware or software. An exception is like an unscheduled function call that jumps to a new address.
The program may encounter an error condition such as
an undefined instruction. The program then jumps to code in the operating system (OS), which may choose to terminate the program. Other causes of exceptions are division by zero, attempts to read some nonexistent memory, hardware malfunctions, debugger breakpoints, and arithmetic overflow.
The processor records the cause of an exception and the value of the PC
at the time the exception occurs. It then jumps to the exception handler function. The exception handler is code (usually in the OS) that examines the
cause of the exception and responds appropriately, It then returns to the program that
was executing before the exception took place.
In MIPS, the exception handler is always located at 0x80000180. When an exception occurs, the processor always jumps to this instruction address, regardless of the cause.
The MIPS architecture uses a special-purpose register, called the Cause
register, to record the cause of the exception.
MIPS uses another special-purpose register called the Exception
Program Counter (EPC) to store the value of the PC at the time an
exception takes place. The processor returns to the address in EPC after
handling the exception. This is analogous to using $ra to store the old
value of the PC during a jal instruction.
CUDA document is not clear on how memory data changes after CUDA applications throws an exception.
For example, a kernel launch(dynamic) encountered an exception (e.g. Warp Out-of-range Address), current kernel launch will be stopped. After this point, will data (e.g. __device__ variables) on device still kept or they are removed along with the exceptions?
A concrete example would be like this:
CPU launches a kernel
The kernel updates the value of __device__ variableA to be 5 and then crashes
CPU memcpy the value of variableA from device to host, what is the value the CPU gets in this case, 5 or something else?
Can someone show the rationale behind this?
The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.
This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.
Non-sticky errors are cleared automatically after they are returned by a cuda API call (with the exception of cudaPeekAtLastError). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the result of the cudaMemcpy call to transfer variableA from device to host, so the results of the cudaMemcpy operation are undefined and unreliable -- it is as if the cudaMemcpy operation also failed in some unspecified way.
Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine after such an error.
An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) cuda API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the cuda context is exactly the same as if the invalid operation had never been requested.
This distinction between sticky and non-sticky error is called out in many of the documented error code descriptions, for example:
synchronous, non-sticky, non-cuda-context-corrupting:
cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
asynchronous, sticky, cuda-context-corrupting:
cudaErrorMisalignedAddress = 74
The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.