Synchronous External Abort on ARM

Synchronous External Abort on ARM - exception

I was building a bare metal application on ARM Cortex A9 Pandaboard, and I got Instruction Fetch Abort frequently. When I dump IFSR Register I got 0x1008. I've read the reference manual, and I understand that 1008 was Synchronous External Abort. The problem is what synchronous external abort means and where does it come from? Thanks for your help.

The ARMv7 ARM section "VMSA Memory aborts" covers this as thoroughly as one would expect (given that it's the authoritative definition of the architecture), but to summarise in slightly less than 14 pages;
An abort means the CPU tried to make a memory access, which for whatever reason, couldn't be completed so raises an exception.
An external abort is one from, well, externally to the processor, i.e. something on the bus. In other words, the access didn't fault in the MMU, went out onto the bus, and either some device or the interconnect itself came back and said "hey, I can't deal with this".
A synchronous external abort means you're rather fortunate, in that it's not going to be utterly hideous to debug - in the case of a prefetch abort, it means the IFAR is going to contain a valid VA for the faulting instruction, so you know exactly what caused it. The unpleasant alternative is an asynchronous external abort, which is little more than an interrupt to say "hey, something you did a while ago didn't actually work. No I don't know what is was either."
So, you're trying to execute instructions from something that you think is memory, but isn't. Without any further details, the actual cause could be anything from a typoed hard-coded address, to dodgy page tables, stale TLB entries, cache coherency, etc. etc.

Related

ARM Asynchronous external abort

I am programming a bare-metal application on a Cortex-A9 and I am frequently getting Data Abort Exceptions. When I look for the reason for this exception in the Data Fault Status Register (DFSR) of CP15, the value of the Fault Status bits (FS) is b10110. I looked at the specification and b10110 means "Asynchronous external abort". What does this mean? I can't find any useful information about this kind of abort.
For example I also get Alignment faults sometimes, but I know what this means and so I can track down those kinds of faults comparatively easy. But I don't know how to handle asynchronous external aborts, since I don't know what they mean or why they occur. Thanks for your help.

Wild guess... You have unaligned writes, they get buffered and core moves on to subsequential instructions, writes start to happen and fails. Core has no idea where they are originated from, fails with an async data abort.
Reading Chapter 11.1 Types of Exceptions, Aborts from Cortex-A Series Programmers Guide might give you some ideas.

OS development: How to avoid an infinite loop after an exception routine

For some months I've been working on a "home-made" operating system.
Currently, it boots and goes into 32-bit protected mode.
I've loaded the interrupt table, but haven't set up the pagination (yet).
Now while writing my exception routines I've noticed that when an instruction throws an exception, the exception routine is executed, but then the CPU jumps back to the instruction which threw the exception! This does not apply to every exception (for example, a div by zero exception will jump back to the instruction AFTER the division instruction), but let's consider the following general protection exception:
MOV EAX, 0x8
MOV CS, EAX
My routine is simple: it calls a function that displays a red error message.
The result: MOV CS, EAX fails -> My error message is displayed -> CPU jumps back to MOV CS -> infinite loop spamming the error message.
I've talked about this issue with a teacher in operating systems and unix security.
He told me he knows Linux has a way around it, but he doesn't know which one.
The naive solution would be to parse the throwing instruction from within the routine, in order to get the length of that instruction.
That solution is pretty complex, and I feel a bit uncomfortable adding a call to a relatively heavy function in every affected exception routine...
Therefore, I was wondering if the is another way around the problem. Maybe there's a "magic" register that contains a bit that can change this behaviour?
--
Thank you very much in advance for any suggestion/information.
--
EDIT: It seems many people wonder why I want to skip over the problematic instruction and resume normal execution.
I have two reasons for this:
First of all, killing a process would be a possible solution, but not a clean one. That's not how it's done in Linux, for example, where (AFAIK) the kernel sends a signal (I think SIGSEGV) but does not immediately break execution. It makes sense, since the application can block or ignore the signal and resume its own execution. It's a very elegant way to tell the application it did something wrong IMO.
Another reason: what if the kernel itself performs an illegal operation? Could be due to a bug, but could also be due to a kernel extension. As I've stated in a comment: what should I do in that case? Shall I just kill the kernel and display a nice blue screen with a smiley?
That's why I would like to be able to jump over the instruction. "Guessing" the instruction size is obviously not an option, and parsing the instruction seems fairly complex (not that I mind implementing such a routine, but I need to be sure there is no better way).

Different exceptions have different causes. Some exceptions are normal, and the exception only tells the kernel what it needs to do before allowing the software to continue running. Examples of this include a page fault telling the kernel it needs to load data from swap space, an undefined instruction exception telling the kernel it needs to emulate an instruction that the CPU doesn't support, or a debug/breakpoint exception telling the kernel it needs to notify a debugger. For these it's normal for the kernel to fix things up and silently continue.
Some exceptions indicate abnormal conditions (e.g. that the software crashed). The only sane way of handling these types of exceptions is to stop running the software. You may save information (e.g. core dump) or display information (e.g. "blue screen of death") to help with debugging, but in the end the software stops (either the process is terminated, or the kernel goes into a "do nothing until user resets computer" state).
Ignoring abnormal conditions just makes it harder for people to figure out what went wrong. For example, imagine instructions to go to the toilet:
enter bathroom
remove pants
sit
start generating output
Now imagine that step 2 fails because you're wearing shorts (a "can't find pants" exception). Do you want to stop at that point (with a nice easy to understand error message or something), or ignore that step and attempt to figure out what went wrong later on, after all the useful diagnostic information has gone?

If I understand correctly, you want to skip the instruction that caused the exception (e.g. mov cs, eax) and continue executing the program at the next instruction.
Why would you want to do this? Normally, shouldn't the rest of the program depend on the effects of that instruction being successfully executed?
Generally speaking, there are three approaches to exception handling:
Treat the exception as an unrepairable condition and kill the process. For example, division by zero is usually handled this way.
Repair the environment and then execute the instruction again. For example, page faults are sometimes handled this way.
Emulate the instruction using software and skip over it in the instruction stream. For example, complicated arithmetic instructions are sometimes handled this way.

What you're seeing is the characteristic of the General Protection Exception. The Intel System Programming Guide clearly states that (6.15 Exception and Interrupt Reference / Interrupt 13 - General Protection Exception (#GP)) :
Saved Instruction Pointer
The saved contents of CS and EIP registers point to the instruction that generated the
exception.
Therefore, you need to write an exception handler that will skip over that instruction (which would be kind of weird), or just simply kill the offending process with "General Protection Exception at $SAVED_EIP" or a similar message.

I can imagine a few situations in which one would want to respond to a GPF by parsing the failed instruction, emulating its operation, and then returning to the instruction after. The normal pattern would be to set things up so that the instruction, if retried, would succeed, but one might e.g. have some code that expects to access some hardware at addresses 0x000A0000-0x000AFFFF and wish to run it on a machine that lacks such hardware. In such a situation, one might not want to ever bank in "real" memory in that space, since every single access must be trapped and dealt with separately. I'm not sure whether there's any way to handle that without having to decode whatever instruction was trying to access that memory, although I do know that some virtual-PC programs seem to manage it pretty well.
Otherwise, I would suggest that you should have for each thread a jump vector which should be used when the system encounters a GPF. Normally that vector should point to a thread-exit routine, but code which was about to do something "suspicious" with pointers could set it to an error handler that was suitable for that code (the code should unset the vector when laving the region where the error handler would have been appropriate).
I can imagine situations where one might want to emulate an instruction without executing it, and cases where one might want to transfer control to an error-handler routine, but I can't imagine any where one would want to simply skip over an instruction that would have caused a GPF.

Could ARM9 Prefetch Abort Exception be a software problem?

So I'm getting a "prefetch abort" exception on our arm9 system. This system does not have an MMU, so is there anyway this could be a software problem? All the registers seem correct to me, and the code looks right (not corrupted) from the JTAG point of view.
Right now I'm thinking this is some kind of hardware issue (although I hate to say it - the hardware has been fine until now).

What exactly is the exception you're getting?
Last time this happened to me, I went up the wrong creek for a while because I didn't realize an ARM "prefetch abort" meant the instruction prefetch, not data prefetch, and I'd just been playing with data prefetch instructions. It simply means that the program has attempted to jump to a memory location that doesn't exist. (The actual problem was that I'd mistyped "go 81000000" as "go 81000" in the bootloader.)
See also:
http://www.keil.com/support/docs/3080.htm (KB entry on debugging data aborts)
http://www.ethernut.de/en/documents/arm-exceptions.html (list of ARM exceptions)

What's the address that the prefetch abort is triggering on. It can occur because the program counter (PC or R15) is being set to an address that isn't valid on your microcontroller (this can happen even if you're not using an MMU - the microcontroller's address space likely has 'holes' in it that will trigger the prefetch abort). It could also occur if you try to prefetch an address that would be improperly aligned, but I think this dpends on the microcontroller implementation (the ARM ARM lists the behavior as 'UPREDICTABLE').
Is the CPU actually in Abort mode? If it's executing the Prefetch handler but isn't in abort mode that would mean that some code is branching through the prefetch abort vector, generally through address 0x0000000c but controllers often allow the vector addresses to be remapped.

How to determine why a task destroys , VxWorks?

I have a VxWorks application running on ARM uC.
First let me summarize the application;
Application consists of a 3rd party stack and a gateway application.
We have implemented an operating system abstraction layer to support OS in-dependency.
The underlying stack has its own memory management&control facility which holds memory blocks in a doubly linked list.
For instance ; we don't directly perform malloc/new , free/delege .Instead we call OSA layer's routines and it gets the memory from OS and puts it in a list then returns this memory to application.(routines : XXAlloc , XXFree,XXReAlloc)
And when freeing the memory we again use XXFree.
In fact this block is a struct which has
-magic numbers indication the beginning and end of memory block
-size that user requested allocated
-size in reality due to alignment issue previous and next pointers
-pointer to piece of memory given back to application. link register that shows where in the application xxAlloc is called.
With this block structure stack can check if a block is corrupted or not.
Also we have pthread library which is ported from Linux that we use to
-create/terminate threads(currently there are 22 threads)
-synchronization objects(events,mutexes..)
There is main task called by taskSpawn and later this task created other threads.
this was a description of application and its VxWorks interface.
The problem is :
one of tasks suddenly gets destroyed by VxWorks giving no information about what's wrong.
I also have a jtag debugger and it hits the VxWorks taskDestoy() routine but call stack doesn't give any information neither PC or r14.
I'm suspicious of specific routine in code where huge xxAlloc is done but problem occurs
very sporadic giving no clue that I can map it to source code.
I think OS detects and exception and performs its handling silently.
any help would be great
regards

It resolved.
I did an isolated test. Allocated 20MB with malloc and memset with 0x55 and stopped thread of my application.
And I wrote another thread which checks my 20MB if any data else than 0x55 is written.
And quess what!! some other thread which belongs other components in CPU (someone else developed them) write my allocated space.
Thanks 4 your help

If your task exits, taskDestroy() is called. If you are suspicious of huge xxAlloc, verify that the allocation code is not calling exit() when memory is exhausted. I've been bitten by this behavior in a third party OSAL before.

Sounds like you are debugging after integration; this can be a hell of a job.
I suggest breaking the problem into smaller pieces.
Process
1) you can get more insight by instrumenting the code and/or using VxWorks intrumentation (depending on which version). This allows you to get more visibility in what happens. Be sure to log everything to a file, so you move back in time from the point where the task ends. Instrumentation is a worthwile investment as it will be handy in more occasions. Interesting hooks in VxWorks: Taskhooklib
2) memory allocation/deallocation is very fundamental functionality. It would be my first candidate for thorough (unit) testing in a well-defined multi-thread environment. If you have done this and no errors are found, I'd first start to look why the tas has ended.
other possible causes
A task will also end when the work is done.. so it may be a return caused by a not-so-endless loop. Especially if it is always the same task, this would be my guess.
And some versions of VxWorks have MMU support which must be considered.

Is "Out Of Memory" A Recoverable Error?

I've been programming a long time, and the programs I see, when they run out of memory, attempt to clean up and exit, i.e. fail gracefully. I can't remember the last time I saw one actually attempt to recover and continue operating normally.
So much processing relies on being able to successfully allocate memory, especially in garbage collected languages, it seems that out of memory errors should be classified as non-recoverable. (Non-recoverable errors include things like stack overflows.)
What is the compelling argument for making it a recoverable error?

It really depends on what you're building.
It's not entirely unreasonable for a webserver to fail one request/response pair but then keep on going for further requests. You'd have to be sure that the single failure didn't have detrimental effects on the global state, however - that would be the tricky bit. Given that a failure causes an exception in most managed environments (e.g. .NET and Java) I suspect that if the exception is handled in "user code" it would be recoverable for future requests - e.g. if one request tried to allocate 10GB of memory and failed, that shouldn't harm the rest of the system. If the system runs out of memory while trying to hand off the request to the user code, however - that kind of thing could be nastier.

In a library, you want to efficiently copy a file. When you do that, you'll usually find that copying using a small number of big chunks is much more effective than copying a lot of smaller ones (say, it's faster to copy a 15MB file by copying 15 1MB chunks than copying 15'000 1K chunks).
But the code works with any chunk size. So while it may be faster with 1MB chunks, if you design for a system where a lot of files are copied, it may be wise to catch OutOfMemoryError and reduce the chunk size until you succeed.
Another place is a cache for Object stored in a database. You want to keep as many objects in the cache as possible but you don't want to interfere with the rest of the application. Since these objects can be recreated, it's a smart way to conserve memory to attach the cache to an out of memory handler to drop entries until the rest of the app has enough room to breathe, again.
Lastly, for image manipulation, you want to load as much of the image into memory as possible. Again, an OOM-handler allows you to implement that without knowing in advance how much memory the user or OS will grant your code.
[EDIT] Note that I work under the assumption here that you've given the application a fixed amount of memory and this amount is smaller than the total available memory excluding swap space. If you can allocate so much memory that part of it has to be swapped out, several of my comments don't make sense anymore.

Users of MATLAB run out of memory all the time when performing arithmetic with large arrays. For example if variable x fits in memory and they run "x+1" then MATLAB allocates space for the result and then fills it. If the allocation fails MATLAB errors and the user can try something else. It would be a disaster if MATLAB exited whenever this use case came up.

OOM should be recoverable because shutdown isn't the only strategy to recovering from OOM.
There is actually a pretty standard solution to the OOM problem at the application level.
As part of you application design determine a safe minimum amount of memory required to recover from an out of memory condition. (Eg. the memory required to auto save documents, bring up warning dialogs, log shutdown data).
At the start of your application or at the start of a critical block, pre-allocate that amount of memory. If you detect an out of memory condition release your guard memory and perform recovery. The strategy can still fail but on the whole gives great bang for the buck.
Note that the application need not shut down. It can display a modal dialog until the OOM condition has been resolved.
I'm not 100% certain but I'm pretty sure 'Code Complete' (required reading for any respectable software engineer) covers this.
P.S. You can extend your application framework to help with this strategy but please don't implement such a policy in a library (good libraries do not make global decisions without an applications consent)

I think that like many things, it's a cost/benefit analysis. You can program in attempted recovery from a malloc() failure - although it may be difficult (your handler had better not fall foul of the same memory shortage it's meant to deal with).
You've already noted that the commonest case is to clean up and fail gracefully. In that case it's been decided that the cost of aborting gracefully is lower than the combination of development cost and performance cost in recovering.
I'm sure you can think of your own examples of situations where terminating the program is a very expensive option (life support machine, spaceship control, long-running and time-critical financial calculation etc.) - although the first line of defence is of course to ensure that the program has predictable memory usage and that the environment can supply that.

I'm working on a system that allocates memory for IO cache to increase performance. Then, on detecting OOM, it takes some of it back, so that the business logic could proceed, even if that means less IO cache and slightly lower write performance.
I also worked with an embedded Java applications that attempted to manage OOM by forcing garbage collection, optionally releasing some of non-critical objects, like pre-fetched or cached data.
The main problems with OOM handling are:
1) being able to re-try in the place where it happened or being able to roll back and re-try from a higher point. Most contemporary programs rely too much on the language to throw and don't really manage where they end up and how to re-try the operation. Usually the context of the operation will be lost, if it wasn't designed to be preserved
2) being able to actually release some memory. This means a kind of resource manager that knows what objects are critical and what are not, and the system be able to re-request the released objects when and if they later become critical
Another important issue is to be able to roll back without triggering yet another OOM situation. This is something that is hard to control in higher level languages.
Also, the underlying OS must behave predictably with regard to OOM. Linux, for example, will not, if memory overcommit is enabled. Many swap-enabled systems will die sooner than reporting the OOM to the offending application.
And, there's the case when it is not your process that created the situation, so releasing memory does not help if the offending process continues to leak.
Because of all this, it's often the big and embedded systems that employ this techniques, for they have the control over OS and memory to enable them, and the discipline/motivation to implement them.

It is recoverable only if you catch it and handle it correctly.
In same cases, for example, a request tried to allocate a lot memory. It is quite predictable and you can handle it very very well.
However, in many cases in multi-thread application, OOE may also happen on background thread (including created by system/3rd-party library).
It is almost imposable to predict and you may unable to recover the state of all your threads.

No.
An out of memory error from the GC is should not generally be recoverable inside of the current thread. (Recoverable thread (user or kernel) creation and termination should be supported though)
Regarding the counter examples: I'm currently working on a D programming language project which uses NVIDIA's CUDA platform for GPU computing. Instead of manually managing GPU memory, I've created proxy objects to leverage the D's GC. So when the GPU returns an out of memory error, I run a full collect and only raise an exception if it fails a second time. But, this isn't really an example of out of memory recovery, it's more one of GC integration. The other examples of recovery (caches, free-lists, stacks/hashes without auto-shrinking, etc) are all structures that have their own methods of collecting/compacting memory which are separate from the GC and tend not to be local to the allocating function.
So people might implement something like the following:
T new2(T)( lazy T old_new ) {
T obj;
try{
obj = old_new;
}catch(OutOfMemoryException oome) {
foreach(compact; Global_List_Of_Delegates_From_Compatible_Objects)
compact();
obj = old_new;
}
return obj;
}
Which is a decent argument for adding support for registering/unregistering self-collecting/compacting objects to garbage collectors in general.

In the general case, it's not recoverable.
However, if your system includes some form of dynamic caching, an out-of-memory handler can often dump the oldest elements in the cache (or even the whole cache).
Of course, you have to make sure that the "dumping" process requires no new memory allocations :) Also, it can be tricky to recover the specific allocation that failed, unless you're able to plug your cache dumping code directly at the allocator level, so that the failure isn't propagated up to the caller.

It depends on what you mean by running out of memory.
When malloc() fails on most systems, it's because you've run out of address-space.
If most of that memory is taken by cacheing, or by mmap'd regions, you might be able to reclaim some of it by freeing your cache or unmmaping. However this really requires that you know what you're using that memory for- and as you've noticed either most programs don't, or it doesn't make a difference.
If you used setrlimit() on yourself (to protect against unforseen attacks, perhaps, or maybe root did it to you), you can relax the limit in your error handler. I do this very frequently- after prompting the user if possible, and logging the event.
On the other hand, catching stack overflow is a bit more difficult, and isn't portable. I wrote a posixish solution for ECL, and described a Windows implementation, if you're going this route. It was checked into ECL a few months ago, but I can dig up the original patches if you're interested.

Especially in garbage collected environments, it's quote likely that if you catch the OutOfMemory error at a high level of the application, lots of stuff has gone out of scope and can be reclaimed to give you back memory.
In the case of single excessive allocations, the app may be able to continue working flawlessly. Of course, if you have a gradual memory leak, you'll just run into the problem again (more likely sooner than later), but it's still a good idea to give the app a chance to go down gracefully, save unsaved changes in the case of a GUI app, etc.

Yes, OOM is recoverable. As an extreme example, the Unix and Windows operating systems recover quite nicely from OOM conditions, most of the time. The applications fail, but the OS survives (assuming there is enough memory for the OS to properly start up in the first place).
I only cite this example to show that it can be done.
The problem of dealing with OOM is really dependent on your program and environment.
For example, in many cases the place where the OOM happens most likely is NOT the best place to actually recover from an OOM state.
Now, a custom allocator could possibly work as a central point within the code that can handle an OOM. The Java allocator will perform a full GC before is actually throws a OOM exception.
The more "application aware" that your allocator is, the better suited it would be as a central handler and recovery agent for OOM. Using Java again, it's allocator isn't particularly application aware.
This is where something like Java is readily frustrating. You can't override the allocator. So, while you could trap OOM exceptions in your own code, there's nothing saying that some library you're using is properly trapping, or even properly THROWING an OOM exception. It's trivial to create a class that is forever ruined by a OOM exception, as some object gets set to null and "that never happen", and it's never recoverable.
So, yes, OOM is recoverable, but it can be VERY hard, particularly in modern environments like Java and it's plethora of 3rd party libraries of various quality.

The question is tagged "language-agnostic", but it's difficult to answer without considering the language and/or the underlying system. (I see several toher hadns
If memory allocation is implicit, with no mechanism to detect whether a given allocation succeeded or not, then recovering from an out-of-memory condition may be difficult or impossible.
For example, if you call a function that attempts to allocate a huge array, most languages just don't define the behavior if the array can't be allocated. (In Ada this raises a Storage_Error exception, at least in principle, and it should be possible to handle that.)
On the other hand, if you have a mechanism that attempts to allocate memory and is able to report a failure to do so (like C's malloc() or C++'s new), then yes, it's certainly possible to recover from that failure. In at least the cases of malloc() and new, a failed allocation doesn't do anything other than report failure (it doesn't corrupt any internal data structures, for example).
Whether it makes sense to try to recover depends on the application. If the application just can't succeed after an allocation failure, then it should do whatever cleanup it can and terminate. But if the allocation failure merely means that one particular task cannot be performed, or if the task can still be performed more slowly with less memory, then it makes sense to continue operating.
A concrete example: Suppose I'm using a text editor. If I try to perform some operation within the editor that requires a lot of memory, and that operation can't be performed, I want the editor to tell me it can't do what I asked and let me keep editing. Terminating without saving my work would be an unacceptable response. Saving my work and terminating would be better, but is still unnecessarily user-hostile.

This is a difficult question. On first sight it seems having no more memory means "out of luck" but, you must also see that one can get rid of many memory related stuff if one really insist. Let's just take the in other ways broken function strtok which on one hand has no problems with memory stuff. Then take as counterpart g_string_split from the Glib library, which heavily depends on allocation of memory as nearly everything in glib or GObject based programs. One can definitly say in more dynamic languages memory allocation is much more used as in more inflexible languages, especially C. But let us see the alternatives. If you just end the program if you run out of memory, even careful developed code may stop working. But if you have a recoverable error, you can do something about it. So the argument, making it recoverable means that one can choose to "handle" that situation differently (e.g putting aside a memory block for emergencies, or degradation to a less memory extensive program).
So the most compelling reason is. If you provide a way of recovering one can try the recoverying, if you do not have the choice all depends on always getting enough memory...
Regards

It's just puzzling me now.
At work, we have a bundle of applications working together, and memory is running low. While the problem is either make the application bundle go 64-bit (and so, be able to work beyond the 2 Go limits we have on a normal Win32 OS), and/or reduce our use of memory, this problem of "How to recover from a OOM" won't quit my head.
Of course, I have no solution, but still play at searching for one for C++ (because of RAII and exceptions, mainly).
Perhaps a process supposed to recover gracefully should break down its processing in atomic/rollback-able tasks (i.e. using only functions/methods giving strong/nothrow exception guarantee), with a "buffer/pool of memory" reserved for recovering purposes.
Should one of the task fails, the C++ bad_alloc would unwind the stack, free some stack/heap memory through RAII. The recovering feature would then salvage as much as possible (saving the initial data of the task on the disk, to use on a later try), and perhaps register the task data for later try.
I do believe the use of C++ strong/nothrow guanrantees can help a process to survive in low-available-memory conditions, even if it would be akin memory swapping (i.e. slow, somewhat unresponding, etc.), but of course, this is only theory. I just need to get smarter on the subject before trying to simulate this (i.e. creating a C++ program, with a custom new/delete allocator with limited memory, and then try to do some work under those stressful condition).
Well...

Out of memory normally means you have to quit whatever you were doing. If you are careful about cleanup, though, it can leave the program itself operational and able to respond to other requests. It's better to have a program say "Sorry, not enough memory to do " than say "Sorry, out of memory, shutting down."

Out of memory can be caused either by free memory depletion or by trying to allocate an unreasonably big block (like one gig). In "depletion" cases memory shortage is global to the system and usually affects other applications and system services and the whole system might become unstable so it's wise to forget and reboot. In "unreasonably big block" cases no shortage actually occurs and it's safe to continue. The problem is you can't automatically detect which case you're in. So it's safer to make the error non-recoverable and find a workaround for each case you encounter this error - make your program use less memory or in some cases just fix bugs in code that invokes memory allocation.

There are already many good answers here. But I'd like to contribute with another perspective.
Depletion of just about any reusable resource should be recoverable in general. The reasoning is that each and every part of a program is basically a sub program. Just because one sub cannot complete to it's end at this very point in time, does not mean that the entire state of the program is garbage. Just because the parking lot is full of cars does not mean that you trash your car. Either you wait a while for a booth to be free, or you drive to a store further away to buy your cookies.
In most cases there is an alternative way. Making an out of error unrecoverable, effectively removes a lot of options, and none of us like to have anyone decide for us what we can and cannot do.
The same applies to disk space. It's really the same reasoning. And contrary to your insinuation about stack overflow is unrecoverable, i would say that it's and arbitrary limitation. There is no good reason that you should not be able to throw an exception (popping a lot of frames) and then use another less efficient approach to get the job done.
My two cents :-)

If you are really out of memory you are doomed, since you can not free anything anymore.
If you are out of memory, but something like a garbage collector can kick in and free up some memory you are non dead yet.
The other problem is fragmentation. Although you might not be out of memory (fragmented), you might still not be able to allocate the huge chunk you wanna have.

I know you asked for arguments for, but I can only see arguments against.
I don't see anyway to achieve this in a multi-threaded application. How do you know which thread is actually responsible for the out-of-memory error? One thread could allocating new memory constantly and have gc-roots to 99% of the heap, but the first allocation that fails occurs in another thread.
A practical example: whenever I have occurred an OutOfMemoryError in our Java application (running on a JBoss server), it's not like one thread dies and the rest of the server continues to run: no, there are several OOMEs, killing several threads (some of which are JBoss' internal threads). I don't see what I as a programmer could do to recover from that - or even what JBoss could do to recover from it. In fact, I am not even sure you CAN: the javadoc for VirtualMachineError suggests that the JVM may be "broken" after such an error is thrown. But maybe the question was more targeted at language design.

uClibc has an internal static buffer of 8 bytes or so for file I/O when there is no more memory to be allocated dynamically.

What is the compelling argument for making it a recoverable error?
In Java, a compelling argument for not making it a recoverable error is because Java allows OOM to be signalled at any time, including at times where the result could be your program entering an inconsistent state. Reliable recoery from an OOM is therefore impossible; if you catch the OOM exception, you can not rely on any of your program state. See
No-throw VirtualMachineError guarantees

I'm working on SpiderMonkey, the JavaScript VM used in Firefox (and gnome and a few others). When you're out of memory, you may want to do any of the following things:
Run the garbage-collector. We don't run the garbage-collector all the time, as it would kill performance and battery, so by the time you're reaching out of memory error, some garbage may have accumulated.
Free memory. For instance, get rid of some of the in-memory cache.
Kill or postpone non-essential tasks. For instance, unload some tabs that haven't be used in a long time from memory.
Log things to help the developer troubleshoot the out-of-memory error.
Display a semi-nice error message to let the user know what's going on.
...
So yes, there are many reasons to handle out-of-memory errors manually!

I have this:
void *smalloc(size_t size) {
void *mem = null;
for(;;) {
mem = malloc(size);
if(mem == NULL) {
sleep(1);
} else
break;
}
return mem;
}
Which has saved a system a few times already. Just because you're out of memory now, doesn't mean some other part of the system or other processes running on the system have some memory they'll give back soon. You better be very very careful before attempting such tricks, and have all control over every memory you do allocate in your program though.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008