Page fault: non-present page or access violation? - exception

It is known that accessing a page which is not present in memory leads to a page fault, but writing to a read-only page can also cause a page fault. How can the exception handler tell these two types of page fault apart?

You read the exception error code that the CPU places on the stack before invoking your page fault handler. This error code contains 5 bits, of which you're interested in these 4:
P=0: The fault was caused by a non-present page.
P=1: The fault was caused by a page-level protection violation.
W/R=0: The access causing the fault was a read.
W/R=1: The access causing the fault was a write.
U/S=0: The access causing the fault originated when the processor was executing in supervisor mode.
U/S=1: The access causing the fault originated when the processor was executing in user mode.
I/D=0: The fault was not caused by an instruction fetch.
I/D=1: The fault was caused by an instruction fetch.
If you get P=0, the page isn't present.
If you get P=1, the page is present but the access violated its page-level protection. U/S tells you whether the access came from the kernel or from an application. I/D tells you whether it was an instruction fetch or a data access (read/write). W/R tells you whether it was a read or a write that couldn't be done.
This is described in the Interrupt 14—Page-Fault Exception (#PF) section of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3: System Programming Guide.
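For a concrete picture, here is a minimal sketch (in C) of decoding those bits inside a handler. The bit positions are the ones documented in the SDM; the function name, the fault_addr parameter and the use of printf as a stand-in for a kernel print routine are assumptions for illustration, not a fixed API.

```c
#include <stdint.h>
#include <stdio.h>

/* Page-fault error code bits pushed by the CPU (Intel SDM Vol. 3, Interrupt 14). */
#define PF_PRESENT (1u << 0) /* 0 = non-present page, 1 = protection violation */
#define PF_WRITE   (1u << 1) /* 0 = read access,      1 = write access         */
#define PF_USER    (1u << 2) /* 0 = supervisor mode,  1 = user mode            */
#define PF_IFETCH  (1u << 4) /* 1 = instruction fetch                          */

/* Hypothetical C-level handler: an interrupt stub would pass in the error
 * code the CPU pushed and the faulting linear address it read from CR2.
 * printf() stands in for whatever kernel print routine you have. */
void page_fault_handler(uint32_t error_code, uint32_t fault_addr)
{
    printf("#PF at %#x: %s, %s, from %s mode, %s\n",
           fault_addr,
           (error_code & PF_PRESENT) ? "protection violation" : "page not present",
           (error_code & PF_WRITE)   ? "write"                : "read",
           (error_code & PF_USER)    ? "user"                 : "supervisor",
           (error_code & PF_IFETCH)  ? "instruction fetch"    : "data access");
}
```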

Alex's answer is perfectly correct, however you also need to combine that information with some information of your own (i.e. by looking at the memory manager data). For example, some operating systems don't allocate backing memory for pages until they're referenced for the first time, so if you get a read or a write to a page which is not present, you may find that the reason it is not present is simply that you haven't allocated it yet; you should allocate it and continue from the exception. Similarly, a write to a read-only page can occur as part of a copy-on-write mechanism (a number of systems do this, most notably POSIX-style systems when performing fork()). So you detect the write to a read-only page, check the memory manager tables and see that the page should be copied, copy the page, update the page tables and continue.
I've found that usually the only flag from the list Alex mentions that is interesting is the one that says whether it was a read or a write. Beyond that you need to check everything else from the MM tables anyway.
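A hedged sketch of that decision flow, reusing the PF_* masks from the snippet above and treating vma_lookup(), vma_is_cow(), map_zero_page(), do_copy_on_write() and kill_current() as hypothetical placeholders for your own memory-manager structures:

```c
#include <stdint.h>

/* Sketch only: every helper below is a placeholder for your own MM code. */
struct vma;                                    /* your per-region bookkeeping   */
extern struct vma *vma_lookup(uint32_t addr);  /* find the region, NULL if none */
extern int  vma_is_cow(struct vma *v);         /* marked copy-on-write?         */
extern void map_zero_page(struct vma *v, uint32_t addr);
extern void do_copy_on_write(struct vma *v, uint32_t addr);
extern void kill_current(int sig);             /* e.g. deliver SIGSEGV          */

void page_fault_dispatch(uint32_t error_code, uint32_t fault_addr)
{
    struct vma *region = vma_lookup(fault_addr);

    if (region == NULL) {                      /* address was never mapped      */
        kill_current(11 /* SIGSEGV */);
        return;
    }
    if (!(error_code & PF_PRESENT)) {          /* lazy allocation: back it now  */
        map_zero_page(region, fault_addr);
        return;                                /* iret retries the instruction  */
    }
    if ((error_code & PF_WRITE) && vma_is_cow(region)) {
        do_copy_on_write(region, fault_addr);  /* private copy, mark writable   */
        return;
    }
    kill_current(11 /* SIGSEGV */);            /* genuine protection violation  */
}
```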

Trying to write to read-only memory will usually cause a segmentation fault (SIGSEGV).
http://en.wikipedia.org/wiki/Segmentation_fault
I think it's called an access violation exception (memory access violation) in x86 parlance.

Related

Synchronous External Abort on ARM

I was building a bare-metal application on an ARM Cortex-A9 Pandaboard, and I frequently get an Instruction Fetch Abort. When I dump the IFSR register I get 0x1008. I've read the reference manual, and I understand that 0x1008 means Synchronous External Abort. The problem is: what does a synchronous external abort mean, and where does it come from? Thanks for your help.
The ARMv7 ARM section "VMSA Memory aborts" covers this as thoroughly as one would expect (given that it's the authoritative definition of the architecture), but to summarise in slightly less than 14 pages:
An abort means the CPU tried to make a memory access, which for whatever reason, couldn't be completed so raises an exception.
An external abort is one from, well, externally to the processor, i.e. something on the bus. In other words, the access didn't fault in the MMU, went out onto the bus, and either some device or the interconnect itself came back and said "hey, I can't deal with this".
A synchronous external abort means you're rather fortunate, in that it's not going to be utterly hideous to debug - in the case of a prefetch abort, it means the IFAR is going to contain a valid VA for the faulting instruction, so you know exactly what caused it. The unpleasant alternative is an asynchronous external abort, which is little more than an interrupt to say "hey, something you did a while ago didn't actually work. No, I don't know what it was either."
So, you're trying to execute instructions from something that you think is memory, but isn't. Without any further details, the actual cause could be anything from a typoed hard-coded address, to dodgy page tables, stale TLB entries, cache coherency, etc. etc.
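If it helps to see it in code, here is a hedged sketch of dumping and decoding the IFSR/IFAR from a prefetch-abort handler on ARMv7 (the CP15 encodings are the ones in the ARMv7 ARM; my_printf is a made-up placeholder for whatever bare-metal print routine you have):

```c
/* Sketch for ARMv7-A (e.g. Cortex-A9): read the Instruction Fault Status
 * and Instruction Fault Address registers from CP15 and decode the FS bits. */
extern void my_printf(const char *fmt, ...);   /* hypothetical print routine */

static inline unsigned int read_ifsr(void)
{
    unsigned int v;
    __asm__ volatile("mrc p15, 0, %0, c5, c0, 1" : "=r"(v)); /* IFSR */
    return v;
}

static inline unsigned int read_ifar(void)
{
    unsigned int v;
    __asm__ volatile("mrc p15, 0, %0, c6, c0, 2" : "=r"(v)); /* IFAR */
    return v;
}

void prefetch_abort_dump(void)
{
    unsigned int ifsr = read_ifsr();
    unsigned int ifar = read_ifar();
    /* FS is bits [3:0] plus bit [10] (short-descriptor format); bit [12]
     * (ExT) is an implementation-defined classification of external aborts,
     * which is why the asker sees 0x1008 rather than just 0x8. */
    unsigned int fs = (ifsr & 0xFu) | ((ifsr >> 6) & 0x10u);

    my_printf("IFSR=%08x IFAR=%08x FS=%02x%s\n", ifsr, ifar, fs,
              (fs == 0x08) ? " (synchronous external abort)" : "");
}
```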

ARM Asynchronous external abort

I am programming a bare-metal application on a Cortex-A9 and I am frequently getting Data Abort Exceptions. When I look for the reason for this exception in the Data Fault Status Register (DFSR) of CP15, the value of the Fault Status bits (FS) is b10110. I looked at the specification and b10110 means "Asynchronous external abort". What does this mean? I can't find any useful information about this kind of abort.
For example, I also get alignment faults sometimes, but I know what those mean, so I can track down those kinds of faults comparatively easily. But I don't know how to handle asynchronous external aborts, since I don't know what they mean or why they occur. Thanks for your help.
Wild guess... You have unaligned writes; they get buffered and the core moves on to subsequent instructions, then the writes start to happen and fail. The core has no idea where they originated, so it fails with an async data abort.
Reading Chapter 11.1 Types of Exceptions, Aborts from the Cortex-A Series Programmer's Guide might give you some ideas.
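One common debugging trick (a hedged sketch, not a guarantee on every interconnect) is to sprinkle DSB barriers around suspect writes: the barrier forces buffered writes to complete before execution moves on, so a pending external abort tends to be taken much closer to the store that caused it, which narrows the search. This assumes asynchronous aborts are unmasked (CPSR.A clear); poke_device and write_barrier are names made up for this sketch.

```c
/* Sketch for ARMv7 (e.g. Cortex-A9): force outstanding writes to complete
 * before moving on, so an asynchronous external abort is recognized near
 * the offending store instead of many instructions later. */
static inline void write_barrier(void)
{
    __asm__ volatile("dsb" ::: "memory");
}

void poke_device(volatile unsigned int *reg, unsigned int value)
{
    *reg = value;      /* suspect write */
    write_barrier();   /* if this write aborts, the abort shows up around here */
}
```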

OS development: How to avoid an infinite loop after an exception routine

For some months I've been working on a "home-made" operating system.
Currently, it boots and goes into 32-bit protected mode.
I've loaded the interrupt table, but haven't set up paging (yet).
Now while writing my exception routines I've noticed that when an instruction throws an exception, the exception routine is executed, but then the CPU jumps back to the instruction which threw the exception! This does not apply to every exception (for example, a div by zero exception will jump back to the instruction AFTER the division instruction), but let's consider the following general protection exception:
MOV EAX, 0x8
MOV CS, EAX
My routine is simple: it calls a function that displays a red error message.
The result: MOV CS, EAX fails -> My error message is displayed -> CPU jumps back to MOV CS -> infinite loop spamming the error message.
I've talked about this issue with a teacher in operating systems and unix security.
He told me he knows Linux has a way around it, but he doesn't know which one.
The naive solution would be to parse the throwing instruction from within the routine, in order to get the length of that instruction.
That solution is pretty complex, and I feel a bit uncomfortable adding a call to a relatively heavy function in every affected exception routine...
Therefore, I was wondering if there is another way around the problem. Maybe there's a "magic" register that contains a bit that can change this behaviour?
--
Thank you very much in advance for any suggestion/information.
--
EDIT: It seems many people wonder why I want to skip over the problematic instruction and resume normal execution.
I have two reasons for this:
First of all, killing a process would be a possible solution, but not a clean one. That's not how it's done in Linux, for example, where (AFAIK) the kernel sends a signal (I think SIGSEGV) but does not immediately break execution. It makes sense, since the application can block or ignore the signal and resume its own execution. It's a very elegant way to tell the application it did something wrong IMO.
Another reason: what if the kernel itself performs an illegal operation? Could be due to a bug, but could also be due to a kernel extension. As I've stated in a comment: what should I do in that case? Shall I just kill the kernel and display a nice blue screen with a smiley?
That's why I would like to be able to jump over the instruction. "Guessing" the instruction size is obviously not an option, and parsing the instruction seems fairly complex (not that I mind implementing such a routine, but I need to be sure there is no better way).
Different exceptions have different causes. Some exceptions are normal, and the exception only tells the kernel what it needs to do before allowing the software to continue running. Examples of this include a page fault telling the kernel it needs to load data from swap space, an undefined instruction exception telling the kernel it needs to emulate an instruction that the CPU doesn't support, or a debug/breakpoint exception telling the kernel it needs to notify a debugger. For these it's normal for the kernel to fix things up and silently continue.
Some exceptions indicate abnormal conditions (e.g. that the software crashed). The only sane way of handling these types of exceptions is to stop running the software. You may save information (e.g. core dump) or display information (e.g. "blue screen of death") to help with debugging, but in the end the software stops (either the process is terminated, or the kernel goes into a "do nothing until user resets computer" state).
Ignoring abnormal conditions just makes it harder for people to figure out what went wrong. For example, imagine instructions to go to the toilet:
enter bathroom
remove pants
sit
start generating output
Now imagine that step 2 fails because you're wearing shorts (a "can't find pants" exception). Do you want to stop at that point (with a nice easy to understand error message or something), or ignore that step and attempt to figure out what went wrong later on, after all the useful diagnostic information has gone?
If I understand correctly, you want to skip the instruction that caused the exception (e.g. mov cs, eax) and continue executing the program at the next instruction.
Why would you want to do this? Normally, shouldn't the rest of the program depend on the effects of that instruction being successfully executed?
Generally speaking, there are three approaches to exception handling:
Treat the exception as an unrepairable condition and kill the process. For example, division by zero is usually handled this way.
Repair the environment and then execute the instruction again. For example, page faults are sometimes handled this way.
Emulate the instruction using software and skip over it in the instruction stream. For example, complicated arithmetic instructions are sometimes handled this way.
What you're seeing is the characteristic of the General Protection Exception. The Intel System Programming Guide clearly states that (6.15 Exception and Interrupt Reference / Interrupt 13 - General Protection Exception (#GP)) :
Saved Instruction Pointer
The saved contents of CS and EIP registers point to the instruction that generated the exception.
Therefore, you need to write an exception handler that will skip over that instruction (which would be kind of weird), or just simply kill the offending process with "General Protection Exception at $SAVED_EIP" or a similar message.
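For illustration, a hedged sketch of the "kill the offender" route, assuming a 32-bit stub that does PUSHA on top of the frame the CPU pushed and hands the C handler a pointer to it; the struct layout follows that assumption, and kprintf/kill_current_process are hypothetical helpers, not a fixed API:

```c
#include <stdint.h>

/* What the stack looks like when the C handler runs, assuming a stub that
 * did PUSHA after the CPU pushed the error code, EIP, CS and EFLAGS.
 * (ESP/SS are only pushed by the CPU on a privilege-level change.) */
struct gp_frame {
    uint32_t edi, esi, ebp, esp_dummy, ebx, edx, ecx, eax; /* from PUSHA   */
    uint32_t error_code;                                   /* pushed by CPU */
    uint32_t eip, cs, eflags;                              /* pushed by CPU */
};

extern void kprintf(const char *fmt, ...);   /* hypothetical kernel printf  */
extern void kill_current_process(void);      /* hypothetical process killer */

void gp_fault_handler(struct gp_frame *f)
{
    /* The saved CS:EIP points AT the faulting instruction (Intel SDM), so
     * returning without fixing anything just re-executes it forever. */
    kprintf("General Protection Exception at %04x:%08x, error code %08x\n",
            f->cs, f->eip, f->error_code);
    kill_current_process();                  /* don't iret back into the loop */
}
```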
I can imagine a few situations in which one would want to respond to a GPF by parsing the failed instruction, emulating its operation, and then returning to the instruction after. The normal pattern would be to set things up so that the instruction, if retried, would succeed, but one might e.g. have some code that expects to access some hardware at addresses 0x000A0000-0x000AFFFF and wish to run it on a machine that lacks such hardware. In such a situation, one might not want to ever bank in "real" memory in that space, since every single access must be trapped and dealt with separately. I'm not sure whether there's any way to handle that without having to decode whatever instruction was trying to access that memory, although I do know that some virtual-PC programs seem to manage it pretty well.
Otherwise, I would suggest that you should have for each thread a jump vector which should be used when the system encounters a GPF. Normally that vector should point to a thread-exit routine, but code which was about to do something "suspicious" with pointers could set it to an error handler that was suitable for that code (the code should unset the vector when leaving the region where the error handler would have been appropriate).
I can imagine situations where one might want to emulate an instruction without executing it, and cases where one might want to transfer control to an error-handler routine, but I can't imagine any where one would want to simply skip over an instruction that would have caused a GPF.

Could ARM9 Prefetch Abort Exception be a software problem?

So I'm getting a "prefetch abort" exception on our arm9 system. This system does not have an MMU, so is there anyway this could be a software problem? All the registers seem correct to me, and the code looks right (not corrupted) from the JTAG point of view.
Right now I'm thinking this is some kind of hardware issue (although I hate to say it - the hardware has been fine until now).
What exactly is the exception you're getting?
Last time this happened to me, I went up the wrong creek for a while because I didn't realize an ARM "prefetch abort" meant the instruction prefetch, not data prefetch, and I'd just been playing with data prefetch instructions. It simply means that the program has attempted to jump to a memory location that doesn't exist. (The actual problem was that I'd mistyped "go 81000000" as "go 81000" in the bootloader.)
See also:
http://www.keil.com/support/docs/3080.htm (KB entry on debugging data aborts)
http://www.ethernut.de/en/documents/arm-exceptions.html (list of ARM exceptions)
What's the address that the prefetch abort is triggering on? It can occur because the program counter (PC or R15) is being set to an address that isn't valid on your microcontroller (this can happen even if you're not using an MMU - the microcontroller's address space likely has 'holes' in it that will trigger the prefetch abort). It could also occur if you try to prefetch an address that would be improperly aligned, but I think this depends on the microcontroller implementation (the ARM ARM lists the behavior as 'UNPREDICTABLE').
Is the CPU actually in Abort mode? If it's executing the Prefetch handler but isn't in abort mode that would mean that some code is branching through the prefetch abort vector, generally through address 0x0000000c but controllers often allow the vector addresses to be remapped.
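To answer the "what address is it triggering on" question from inside the handler itself, here is a hedged sketch for an ARM9 in ARM state: on a prefetch abort the banked LR in Abort mode holds the aborting instruction's address + 4, so a tiny assembly stub can compute it and pass it to C. The print routine and the vector wiring are assumptions for your particular setup.

```c
/* Sketch: the prefetch-abort vector (typically at 0x0000000C) branches to a
 * small assembly stub that computes the faulting address and calls into C:
 *
 *     prefetch_abort_stub:
 *         sub   r0, lr, #4        @ LR_abt - 4 = address that aborted (ARM state)
 *         b     prefetch_abort_c
 */
extern void my_printf(const char *fmt, ...);   /* hypothetical bare-metal print */

void prefetch_abort_c(unsigned long fault_pc)
{
    my_printf("Prefetch abort: tried to execute from %08lx\n", fault_pc);

    /* On an MMU-less ARM9 this usually means the PC was sent into a hole in
     * the address map: a corrupted return address, a bad function pointer,
     * or a mistyped jump address (as in the "go 81000" story above). */
    for (;;)
        ;   /* park here so the JTAG debugger can inspect the registers */
}
```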

How to determine why a task is destroyed, VxWorks?

I have a VxWorks application running on ARM uC.
First let me summarize the application:
The application consists of a 3rd-party stack and a gateway application.
We have implemented an operating system abstraction layer to support OS independence.
The underlying stack has its own memory management & control facility, which holds memory blocks in a doubly linked list.
For instance, we don't directly call malloc/new or free/delete. Instead we call the OSA layer's routines (XXAlloc, XXFree, XXReAlloc); they get the memory from the OS, put it in a list and then return this memory to the application.
And when freeing the memory we again use XXFree.
In fact this block is a struct which has:
- magic numbers indicating the beginning and end of the memory block
- the size the user requested
- the actual size (due to alignment)
- previous and next pointers
- a pointer to the piece of memory given back to the application
- the link register value that shows where in the application xxAlloc was called
With this block structure the stack can check whether a block is corrupted or not.
We also have a pthread library ported from Linux that we use to:
- create/terminate threads (currently there are 22 threads)
- create synchronization objects (events, mutexes, ...)
There is a main task created by taskSpawn, and later this task created the other threads.
That was a description of the application and its VxWorks interface.
The problem is :
One of the tasks suddenly gets destroyed by VxWorks, giving no information about what's wrong.
I also have a JTAG debugger and it hits the VxWorks taskDestroy() routine, but the call stack doesn't give any information, neither PC nor r14.
I'm suspicious of a specific routine in the code where a huge xxAlloc is done, but the problem occurs very sporadically, giving no clue that I can map to the source code.
I think the OS detects an exception and handles it silently.
Any help would be great.
Regards
It's resolved.
I did an isolated test: I allocated 20 MB with malloc, filled it with 0x55 using memset, and stopped the thread of my application.
Then I wrote another thread which checks my 20 MB to see whether anything other than 0x55 has been written.
And guess what!! Some other thread, belonging to other components on the CPU (someone else developed them), writes into my allocated space.
Thanks for your help.
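For anyone hitting something similar, here is a hedged sketch of that kind of isolated poison-buffer test; the names and sizes are made up, and the scan would normally run from its own task/thread with a delay between passes:

```c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define GUARD_SIZE    (20 * 1024 * 1024)   /* 20 MB canary region */
#define GUARD_PATTERN 0x55

static unsigned char *guard;

/* Fill a large buffer with a known pattern... */
void guard_init(void)
{
    guard = malloc(GUARD_SIZE);
    if (guard != NULL)
        memset(guard, GUARD_PATTERN, GUARD_SIZE);
}

/* ...and periodically scan it: any byte that changed means someone is
 * writing into memory they don't own. */
void guard_check_task(void)
{
    if (guard == NULL)
        return;
    for (;;) {
        for (size_t i = 0; i < GUARD_SIZE; i++) {
            if (guard[i] != GUARD_PATTERN)
                printf("corruption at %p (offset %zu): 0x%02x\n",
                       (void *)&guard[i], i, guard[i]);
        }
        /* sleep between scans, e.g. taskDelay() on VxWorks */
    }
}
```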
If your task exits, taskDestroy() is called. If you are suspicious of huge xxAlloc, verify that the allocation code is not calling exit() when memory is exhausted. I've been bitten by this behavior in a third party OSAL before.
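A hedged sketch of how the OSAL wrapper should behave instead (xxAlloc here is only a placeholder shape, not the third-party code itself):

```c
#include <stdlib.h>
#include <stdio.h>

/* Sketch: on exhaustion the wrapper should report and return NULL so the
 * caller can decide what to do -- not call exit(), which ends the calling
 * task and lands you in taskDestroy() with no obvious cause. */
void *xxAlloc(size_t size)
{
    void *p = malloc(size);   /* or memPartAlloc() on a private partition */
    if (p == NULL)
        printf("xxAlloc: out of memory requesting %u bytes\n", (unsigned)size);
    return p;                 /* never exit() here */
}
```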
Sounds like you are debugging after integration; this can be a hell of a job.
I suggest breaking the problem into smaller pieces.
Process
1) You can get more insight by instrumenting the code and/or using VxWorks instrumentation (depending on which version). This allows you to get more visibility into what happens. Be sure to log everything to a file, so you can move back in time from the point where the task ends. Instrumentation is a worthwhile investment as it will be handy on more occasions. Interesting hooks in VxWorks: taskHookLib (see the sketch after this list).
2) Memory allocation/deallocation is very fundamental functionality. It would be my first candidate for thorough (unit) testing in a well-defined multi-threaded environment. If you have done this and no errors are found, I'd first start to look at why the task has ended.
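As mentioned under 1), taskHookLib is a handy starting point. A hedged sketch of a delete hook that logs every task deletion so you can see which task dies and from whose context it is deleted; the exact hook prototype (WIND_TCB * in classic VxWorks) and the header that declares WIND_TCB vary between VxWorks versions, so treat this as a template to adapt:

```c
#include <vxWorks.h>
#include <taskLib.h>
#include <taskHookLib.h>
#include <logLib.h>

static void myDeleteHook(WIND_TCB *pTcb)
{
    /* Assumption for classic VxWorks: a task ID is its TCB pointer, so
     * taskName() can be applied to it directly. Verify for your version. */
    logMsg("task %s (tcb %#x) deleted from context of %s\n",
           (int)taskName((int)pTcb), (int)pTcb,
           (int)taskName(taskIdSelf()), 0, 0, 0);
}

STATUS installTaskDeleteLogger(void)
{
    return taskDeleteHookAdd((FUNCPTR)myDeleteHook);
}
```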
Other possible causes
A task will also end when the work is done... so it may be a return caused by a not-so-endless loop. Especially if it is always the same task, this would be my guess.
And some versions of VxWorks have MMU support which must be considered.