Real time system exception handling

As always, after some research I was unable to find anything of real value. My question is: how does one go about handling exceptions in a real-time system? Program failure is generally not acceptable in such systems, e.g. a nuclear reactor or a heart monitor.
OK, since everyone got lost on the second piece of this, which had NOTHING to do with the main question: I had it in there to show how I normally escape code blocks.

Exception handling in real-time/embedded systems has several layers: not just the language-supported options, but also the MMU, CPU exceptions, and one of my favorites: watchdogs.
Language exceptions (C/C++)
- not often used, because it is hard to prove that all exceptions are handled at the right level. It is also pretty hard to determine which thread/process should be responsible. Instead, programming by contract is preferred.
Programming style:
- e.g. programming by contract. Additional constraints: MISRA C, MISRA C++. Code can be checked to ensure that all possible cases are somehow handled (e.g. no if without else).
Hardware support:
- MMU : use of multiple processes which are protected against each other. This allows
- watchdog
- CPU exceptions
- multi core: use of multiple cores to separate critical processes from the rest. This also allows voting mechanisms (you want this and more for your nuclear reactor).
- multi-system
Most important is to define a strategy, and it depends on the other nonfunctional requirements (safety, reliability, security). It can range from graceful degradation to a partial system reboot.

In a 'real-time', 'nuclear reactor' type system, chances are the exception handling allows the system to do the next best thing instead of failing.
Let's say that we have a heart monitor. If it isn't receiving a signal, that might trigger an exception. In that case, the heart monitor might handle the exception by waiting a few seconds and trying again.
In a nuclear reactor, getting to a certain temperature might trigger an exception. In that case, the handling might shut off various parts of the reactor to start to cool it down, and then start them back up when it gets to a reasonable temperature.
Exceptions are meant to have a lower-level system say that it doesn't know what to do, and to have a higher level system handle it. Like in the nuclear reactor, the system that measures temperature probably doesn't know how to turn on parts of the reactor, so it triggers an exception so that some higher-level system can handle it.
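The layering described above (a low level that only reports the problem, a higher level that decides what to do) can be sketched in a few lines of Python. This is only an illustration, not code from any real monitor: `SignalLost`, `read_with_retry`, and the "alarm" fallback are hypothetical names chosen for the example.

```python
import time


class SignalLost(Exception):
    """Raised by the low-level sensor layer when no signal is received."""


def read_with_retry(read_sensor, retries=3, delay_s=0.0):
    """Higher-level policy: retry a few times, then do the next best thing.

    `read_sensor` is any callable that returns a reading or raises SignalLost.
    Returns (reading, status); reading is None when every attempt failed.
    """
    for _attempt in range(retries):
        try:
            return read_sensor(), "ok"
        except SignalLost:
            time.sleep(delay_s)  # wait a little before trying again
    # Every attempt failed: degrade (e.g. sound an alarm) instead of crashing.
    return None, "alarm"
```

The sensor code never needs to know about alarms or retries; it only raises, and the policy lives in one place above it.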

A critical system is like any other system, except that it is specified more clearly, passes through more testing phases, and is generally designed to fail safe.
Regarding your form, yes, it's pretty bad. I do mind the lack of {} very much; it's been said so often that this is just plain bad style, and it leads to confusion when adding new code.

Related

Precise exception

I was going through the book The Design and Implementation of the FreeBSD operating system and I came across this:
This ability to restart an instruction is called a precise exception.
The CPU may implement restarting by saving enough state when an
instruction begins that the state can be restored when a fault is
discovered. Alternatively, instructions could delay any modifications
or side effects until after any faults would be discovered so that the
instruction execution does not need to back up before restarting.
I could not understand what "modifications or side effects" refers to in the passage. Can anyone elaborate?

That description from the FreeBSD book is very OS centric. I even disagree with its definition, "This ability to restart an instruction is called a precise exception." You don't restart an instruction after a power failure exception. So rather than try to figure out what McKusick meant, I'm gonna suggest going elsewhere for a better description of exceptions.
BTW, I prefer Smith's definition:
An interrupt is precise if the saved process state corresponds with
the sequential model of program execution where one instruction completes before the next begins.
https://lagunita.stanford.edu/c4x/Engineering/CS316/asset/smith.precise_exceptions.pdf
Almost all modern processors support precise exceptions. So the hard work is already done. What an OS must do is register a trap handler for the exceptions that the hardware takes. For example, there will be a page fault handler, floating point, .... To figure out what is necessary for these handlers you'd have to read the processor theory of operations.
Despite what seems like gritty systems detail, that is fairly high level. It doesn't say anything about what the hardware is doing and so the FreeBSD description is shorthanding a lot.
To really understand precise exceptions, you need to read about that in the context of pipelining, out of order, superscalar, ... in a computer architecture book. I'd recommend Computer Architecture A Quantitative Approach 6th ed. There's a section Dealing With Exceptions p C-38 that presents a taxonomy of the different types of exceptions. The FreeBSD description is only describing some exceptions. It then gets into how each exception type is handled by the pipeline.
Also, The Linux Programming Interface has 3 long chapters on the POSIX signal interface. I know it's not FreeBSD but it covers what an application will see when, for example, a floating point exception is taken and a SIGFPE signal is sent to the process.

Ada Exceptions in Safety Critical Embedded Systems

I started learning Ada for its potential use in an embedded device which is safety critical. So far, I'm really liking it. However, in my research on embedded programming, I came across the hot topic of whether to use exception handling in embedded systems. I think I understand why some people seem to avoid it:
- depending on its implementation, it can introduce either run-time overhead or larger code size (mentioned here under "Implementation")
- the time it takes to execute exceptions can be non-deterministic (one of several sources I saw)
Now my question is: does the Ada language or the GNAT compiler address these concerns? My understanding of safety critical code is that non-deterministic code size and execution time are often not acceptable.
Due Diligence: I am having a bit of trouble finding out exactly how deterministic Ada exceptions can be, but my understanding is their original implementation called for more run-time overhead in exchange for reduced code size impact (above first link mentions Ada explicitly). Beyond the above first link, I have looked into profiles mentioning determinism of code, like the Ravenscar profile and this paper, but nothing seems to mention exception handling determinism. To be fair, I may be looking in the wrong places, as this topic seems quite deep.

There are embedded systems that are safety- or mission-critical, embedded systems that are hard real time, and embedded systems that are both.
Embedded systems that are hard real time may be constrained or not. Colleagues worked on a missile guidance system in the 70s that had about 4 instructions' worth of headroom in its main loop! (As you can imagine, it was written in assembler and used a tuned executive, not an RTOS. Exceptions weren't supported.) On the other hand, the last one I worked on, on a 1 GHz PowerPC board, had a 2 millisecond deadline for the response to a particular interrupt, and our measured worst case was 1.3 milliseconds (and it was a soft real time requirement anyway, you just didn't have to miss too many in a row).
That system also had safety requirements (I know, I know, safe missile systems, huh) and although we were permitted to use exceptions, an unhandled exception meant that the system had to be shut down, missile in flight or no, resulting in loss of missile. And we were strictly forbidden to say when others => null; to swallow an exception, so any exception we didn't handle would be 'unhandled' and would bounce up to the top level.
The argument is, if an unhandled exception happens, you can no longer know the state of the system, so you can't justify continuing. Of course, the wider safety engineering has to consider what action the overall system should take (for example, perhaps this processor should restart in a recovery mode).
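The "unhandled exception means shut down" policy can be sketched as a single top-level handler. The following Python is only an illustration of the idea, not the system described above; `run_until_failure` and `enter_safe_state` are hypothetical names, and what the safe state actually is (shutdown, restart in recovery mode, ...) is left to the wider safety engineering.

```python
def run_until_failure(step, enter_safe_state, max_steps):
    """Call step(i) for each cycle; stop at the first exception of any kind.

    Returns True if all cycles completed, False if the safe state was entered.
    """
    try:
        for i in range(max_steps):
            step(i)
    except Exception as exc:
        # Nothing below handled this, so the system state is unknown.
        # Do not swallow it and carry on; hand over to the safety action.
        enter_safe_state(exc)
        return False
    return True
```

The point is that there is deliberately no handler that discards the exception, so anything not handled at a lower level ends up here.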
Sometimes people use exceptions as part of their control flow; indeed, for handling random text inputs, a commonly used method is, rather than checking for end of file, to just carry on until you get an End_Error:
loop
   begin
      -- read input
      -- process input
   exception
      when End_Error => exit;
   end;
end loop;
Jacob's answer discusses using SPARK. You don't have to use SPARK to not handle exceptions, though of course it would be nice to be able to prove to yourself (and your safety auditor!) that there won't be any. Handling exceptions is very tricky, and some RTSs (e.g Cortex GNAT RTS) don't; the configuration pragma
pragma Restrictions (No_Exception_Propagation);
means that exceptions can't be propagated out of the scope where they're raised (the program will crash out with a call to a Last_Chance_Handler).
Propagating exceptions only within the scope where they're raised isn't, IMO, that useful:
begin
   -- do something
   if some error condition then
      raise Err;
   end if;
   -- do more
exception
   when Err =>
      null;
end;
would be a rather confusing way of avoiding the "do more" code. Better to use a label!

Exceptions are deterministic in Ada. (But some checks which can raise an exception have some freedom. If the compiler can provide a correct answer, it doesn't always have to raise an exception, if an intermediate result is out of bounds for the type in question.)
At least one Ada compiler (GNAT) has a "zero cost" exception implementation. This doesn't make exceptions completely free, but you don't pay a run-time cost until you actually raise an exception. You still pay a cost in terms of code space. How large that cost is depends on the architecture.
I haven't worked on safety critical systems myself, but I know for sure that the run-time used for the software in the Ariane 4 inertial navigation system included exceptions.
If you don't want exceptions, one option is to use SPARK (a language derived from Ada). You can still use any Ada compiler you like, but you use the SPARK tools to prove that the program can't raise any exceptions. You should note that SPARK isn't magic. You have to help the tools, by inserting assertions, which the tools can use as intermediate steps for the proofs.

Applications of explicitly raising exceptions

What are the applications and advantages of explicitly raising exceptions in a program? For example, the Ada language provides a statement to raise exceptions in the program. Example:
raise <Exception>;
But what are the advantages and application areas where we would need to raise exceptions explicitly?
For example, in a procedure which accepts one of the parameters as string:
function Fixed_Str_To_Chr_Ptr (Source_String : String) return C.Strings.Chars_Ptr is
   ...
begin
   ...
   -- Check whether source string is of acceptable length
   if Source_String'Length <= 100 then
      ...
   else
      ...
      raise Constraint_Error;
   end if;
   return Ptr;
exception
   when Constraint_Error =>
      .. Do Something..
end Fixed_Str_To_Chr_Ptr;
Is there any advantage or good practice if I raise an exception in the above function and handle it when the passed string length bound exceeds the tolerable limits? Or a simple If-else handler logic should do the business?

I'll make my 2 cents an answer in order to bundle the various aspects. Let's start with the general question
But what are the advantages and application areas where we would need to raise exceptions explicitly?
There are a few typical reasons for raising exceptions. Most of them are not Ada-specific.
First of all there may be a general design decision to use or not use exceptions. Some general criteria:
- Exception handlers may incur a run time cost even if an exception is actually never thrown (cf. e.g. https://gcc.gnu.org/onlinedocs/gnat_ugn/Exception-Handling-Control.html). That may be unacceptable.
- Issues of inter-operability with other languages may preclude the use of exceptions, or at least require that none leave the part programmed in Ada.
- To a degree the decision is also a matter of taste. A programmer coming from a language without exceptions may feel more confident with a design which just relies on checking return values.
- Some programs will benefit from exceptions more than others. If traditional error handling obscures the actual program structure it may be time for exceptions. If, on the other hand, potential errors are few, easily detected and can be handled locally, exceptions may obscure potential execution paths more than handling errors traditionally would.
Once the general decision to use exceptions has been made, the question arises of when it is and is not appropriate to raise them in your code. I mentioned one general criterion in my comment. What comes to mind:
- Exceptions should not be part of normal, expected program flow (they are called exceptions, not expectations ;-) ). This is partly because the control flow is harder to see and partly because of the potential run time cost.
- Errors which can be handled locally don't need exceptions. (It can still be useful to raise one in order to have a uniform error handling though. I'll discuss that below when I get to your code snippet.)
- On the other hand, exceptions are great if a function has no idea how to handle errors. This is particularly true for utility and library functions which can be called from a variety of contexts (GUI, console program, embedded, server ...). Exceptions allow the error to propagate up the call chain until somebody can handle it, without any error handling code in the intervening layers.
- Some people say that a library should only expose custom exceptions, at least for any anticipated errors. E.g. when an I/O exception occurs, wrap it in a custom exception and explicitly raise that custom exception instead.
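Since most of these reasons are not Ada-specific, here is what "wrap it in a custom exception" might look like in Python. This is just a sketch: `ConfigError` and `load_config` are hypothetical names for an imaginary library, not part of any real API.

```python
class ConfigError(Exception):
    """The only exception this hypothetical library exposes for anticipated errors."""


def load_config(path):
    """Read a configuration file, translating low-level I/O failures."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # Wrap the I/O error; `from exc` keeps the original as __cause__,
        # so no debugging information is lost.
        raise ConfigError(f"cannot read configuration {path!r}") from exc
```

Callers then only need to know about `ConfigError`, while the original `OSError` is still reachable through `__cause__` for diagnostics.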
Now to your specific code question:
Is there any advantage or good practice if I raise an exception in the above function and handle it when the passed string length bound exceeds the tolerable limits? Or a simple If-else handler logic should do the business?
I don't particularly like that (although I don't find it terrible) because my general argument above ("if you can handle it locally, don't raise") would indicate that a simple if/else is clearer.[1] For example, if the function is long the exception handler will be far away from the error location, so one may wonder where exactly the exception could occur (and finding one raise location is no guarantee that one has found them all, so the reviewer must scrutinize the whole function!).
It depends a bit on the specific circumstances though. Raising an exception may be elegant if an error can happen in several places. For example, if several strings can be too short it may be nice to have a centralized error handling through the exception handler, instead of scattering if/then/elses (nested??) across the function body. The situation is so common that a legitimate case can be made for using goto constructs in languages without exceptions. An exception is then clearly superior.
[1] But in all reality, how do you handle that error there? Do you have a guaranteed logging facility? What do you return? Does the caller know the result can be invalid? Maybe you should throw and not catch.

There are two problems with the given example:
- It's simple enough that control flow doesn't need the exception. That won't always be the case, however, and I'll come back to that in a moment.
- Constraint_Error is a spectacularly bad exception to raise, to detect a string length error. The standard exceptions Program_Error, Constraint_Error, Storage_Error ought to be reserved for programming error conditions, and in most circumstances ought to bring down the executable before it can do any damage, with enough debugging information (a stack traceback at the very least) to let you find the mistake and guarantee it never happens again.
It's remarkably satisfying to get a Constraint_Error pointing spookily close to your mistake, instead of whatever undefined behaviour happens much later on... (It's useful to learn how to turn on stack tracebacks, which aren't generally on by default).
Instead, you probably want to define your own String_Size_Error exception, raise that and handle it. Then, anything else in your unshown code that raises Constraint_Error will be properly debugged instead of silently generating a faulty Chars_Ptr.
For a valid use case for raising exceptions, consider a circuit simulator such as SPICE (or a CFD simulator for gas flow, etc). These tools, even when working properly, are prone to failures thanks to numerical problems that happen in matrix computations. (Two terms cancel, producing zero +/- rounding error, which causes infeasibly large numbers or divide-by-zero later on). It's often an iterative approximation, where the error should reduce in each step until it's an acceptably low value. But if a failure occurs, the error term will start growing...
Typically the simulation happens step by step, where each step is a sufficiently small time step, maybe 1 us or 1 ns. The main loop requests a step, and this request is passed to thousands of agents in the simulation representing components in a circuit, or triangles in a CFD mesh.
Any one of those agents may fail to compute a solution, and the cleanest way to handle a failure is to raise an exception, maybe Convergence_Error. There may be thousands of possible points where an exception can be raised.
Testing thousands of return codes would get ugly fast. But with exceptions, the main loop only needs one handler, which takes some corrective action such as reducing the simulation step size and running the step again.
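The structure described above can be sketched as follows. This is a toy in Python, not how SPICE or any real simulator is written: `ConvergenceError`, `advance`, and `simulate` are names made up for the illustration, the agents are plain callables, and a real simulator would also have to roll back any agents that had already advanced before the failure.

```python
class ConvergenceError(Exception):
    """Raised by any agent that fails to compute a solution for a step."""


def advance(agents, t, dt):
    """Ask every agent to advance by dt; any one of them may raise."""
    for agent in agents:
        agent(t, dt)


def simulate(agents, t_end, dt, min_dt=1e-12):
    """Main loop with a single handler: on failure, halve the step and retry."""
    t = 0.0
    while t < t_end:
        step = min(dt, t_end - t)
        try:
            advance(agents, t, step)
        except ConvergenceError:
            dt = step / 2.0
            if dt < min_dt:
                raise  # no sensible corrective action left
            continue  # run the same step again with the smaller step size
        t += step
    return t, dt
```

However many places can raise `ConvergenceError`, the corrective action is written once, in the main loop.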
Sanitizing user text input in a browser may be another good use case, closer to the example code.
One word on the runtime cost of exceptions: the GNAT compiler and its RTS support a "Zero Cost Exception" (ZCX) model, at least for some targets. There's a larger penalty when an exception is raised, as a tradeoff against eliminating the penalty in the normal case. If the penalty matters to you, refer to the documentation to see if it's worthwhile (or even possible) in your case.

You raise an exception explicitly to control which exception is reported to the user of a subprogram, or in some cases just to control the message associated with the raised exception.
In very special cases you may also raise an exception as a form of program flow control.

Exceptions should stay true to their name, which is to represent exceptional situations.

Design by Contract and Fail Fast

Fail Fast -
Fail-fast is a property of a system or module with respect to its
response to failures. A fail-fast system is designed to immediately
report at its interface any failure or condition that is likely to
lead to failure. Fail-fast systems are usually designed to stop normal
operation rather than attempt to continue a possibly flawed process.
Such designs often check the system's state at several points in an
operation, so any failures can be detected early. A fail-fast module
passes the responsibility for handling errors, but not detecting them,
to the next-higher system design level.
Design by Contract -
Design by contract (DbC), also known as contract programming,
programming by contract and design-by-contract programming, is an
approach for designing software. It prescribes that software designers
should define formal, precise and verifiable interface specifications
for software components, which extend the ordinary definition of
abstract data types with preconditions, postconditions and invariants.
These specifications are referred to as "contracts", in accordance
with a conceptual metaphor with the conditions and obligations of
business contracts.
My question is: what are the similarities and differences between the two terms?
I think that both are for software design.
Fail fast is more of a response to a system failure, and Design by Contract is more of the guarantee, the minimum, and the expectation of a system.
But how do I actually define the differences and the similarities between them?
Thanks for helping!

They are not mutually exclusive. A Java iterator is fail fast but also design by contract. Fail fast just means: bomb out in the hope that nothing worse will happen (e.g. throw an exception). Whereas something like fail safe would usually mean: when failure happens, make sure nothing worse happens. You can do this by isolating system components or by having something that will handle the case of failure so that nothing bad will happen (e.g. session replication / failover).
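A single function can show both ideas at once, as the iterator example suggests. The Python below is a made-up illustration (`withdraw` is a hypothetical function, not from any library): the docstring states the contract, and the checks make the function fail fast when the contract is broken.

```python
def withdraw(balance, amount):
    """Return the new balance after withdrawing `amount`.

    Contract (design by contract):
      precondition:  0 < amount <= balance
      postcondition: 0 <= result < balance
    """
    # Fail fast: a violated precondition is reported immediately at the
    # interface, instead of letting a bad value travel further.
    if not (0 < amount <= balance):
        raise ValueError(
            f"precondition violated: amount={amount}, balance={balance}"
        )
    result = balance - amount
    # Postcondition: a failure here would be a bug in this function itself.
    assert 0 <= result < balance, "postcondition violated"
    return result
```

The contract says what must hold; fail fast says what happens, and how soon, when it does not.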

Similarities:
- Both can be implemented via assertions
- Both are intrinsic to the design of XML
Differences:
- Design by Contract doesn't handle unexpected errors
- Fail fast doesn't handle redundant checks
- Design by Contract doesn't handle bad requirements
- Fail fast doesn't handle requirements mapping
References
The Liskov Substitution Principle and Test-Driven Development | Effective Software Design

Is "Out Of Memory" A Recoverable Error?

I've been programming a long time, and the programs I see, when they run out of memory, attempt to clean up and exit, i.e. fail gracefully. I can't remember the last time I saw one actually attempt to recover and continue operating normally.
So much processing relies on being able to successfully allocate memory, especially in garbage collected languages, it seems that out of memory errors should be classified as non-recoverable. (Non-recoverable errors include things like stack overflows.)
What is the compelling argument for making it a recoverable error?

It really depends on what you're building.
It's not entirely unreasonable for a webserver to fail one request/response pair but then keep on going for further requests. You'd have to be sure that the single failure didn't have detrimental effects on the global state, however - that would be the tricky bit. Given that a failure causes an exception in most managed environments (e.g. .NET and Java) I suspect that if the exception is handled in "user code" it would be recoverable for future requests - e.g. if one request tried to allocate 10GB of memory and failed, that shouldn't harm the rest of the system. If the system runs out of memory while trying to hand off the request to the user code, however - that kind of thing could be nastier.

In a library, you want to efficiently copy a file. When you do that, you'll usually find that copying using a small number of big chunks is much more effective than copying a lot of smaller ones (say, it's faster to copy a 15MB file by copying 15 1MB chunks than copying 15'000 1K chunks).
But the code works with any chunk size. So while it may be faster with 1MB chunks, if you design for a system where a lot of files are copied, it may be wise to catch OutOfMemoryError and reduce the chunk size until you succeed.
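The same idea, sketched in Python (where the corresponding exception is `MemoryError`). The function names and the `allocate` parameter are hypothetical, introduced only so the fallback can be exercised; and on an OS that overcommits memory the process may be killed rather than ever seeing the exception.

```python
def allocate_copy_buffer(preferred_size, min_size=4096, allocate=bytearray):
    """Return the largest buffer that can be allocated, halving on MemoryError."""
    size = preferred_size
    while True:
        try:
            return allocate(size)
        except MemoryError:
            if size <= min_size:
                raise  # even the smallest useful buffer does not fit
            size = max(min_size, size // 2)


def copy_stream(src, dst, preferred_size=1024 * 1024):
    """Copy src to dst using the largest buffer that could be allocated."""
    buffer = memoryview(allocate_copy_buffer(preferred_size))
    total = 0
    while True:
        count = src.readinto(buffer)
        if not count:
            break
        dst.write(buffer[:count])
        total += count
    return total
```

The copy loop itself does not care which size it ended up with, which is exactly why reducing the chunk size is a safe way to recover.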
Another place is a cache for objects stored in a database. You want to keep as many objects in the cache as possible but you don't want to interfere with the rest of the application. Since these objects can be recreated, it's a smart way to conserve memory to attach the cache to an out of memory handler that drops entries until the rest of the app has enough room to breathe again.
Lastly, for image manipulation, you want to load as much of the image into memory as possible. Again, an OOM-handler allows you to implement that without knowing in advance how much memory the user or OS will grant your code.
[EDIT] Note that I work under the assumption here that you've given the application a fixed amount of memory and this amount is smaller than the total available memory excluding swap space. If you can allocate so much memory that part of it has to be swapped out, several of my comments don't make sense anymore.

Users of MATLAB run out of memory all the time when performing arithmetic with large arrays. For example if variable x fits in memory and they run "x+1" then MATLAB allocates space for the result and then fills it. If the allocation fails MATLAB errors and the user can try something else. It would be a disaster if MATLAB exited whenever this use case came up.

OOM should be recoverable because shutdown isn't the only strategy for recovering from OOM.
There is actually a pretty standard solution to the OOM problem at the application level.
As part of your application design, determine a safe minimum amount of memory required to recover from an out of memory condition (e.g. the memory required to auto save documents, bring up warning dialogs, log shutdown data).
At the start of your application or at the start of a critical block, pre-allocate that amount of memory. If you detect an out of memory condition release your guard memory and perform recovery. The strategy can still fail but on the whole gives great bang for the buck.
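A rough Python sketch of the guard-memory idea follows. `MemoryGuard` is a hypothetical class written for this illustration, and whether releasing a `bytearray` really makes that memory available to the recovery code depends on the runtime and the OS, so treat it as a picture of the control flow rather than a guarantee.

```python
class MemoryGuard:
    """Holds a pre-allocated reserve that is released when memory runs out."""

    def __init__(self, reserve_bytes):
        self._reserve = bytearray(reserve_bytes)  # allocated up front

    def release(self):
        """Give the reserve back so that recovery code has room to run."""
        self._reserve = None

    def run(self, operation, recover):
        """Run `operation`; on MemoryError free the reserve, then call `recover`."""
        try:
            return operation()
        except MemoryError:
            self.release()
            return recover()
```

Here `recover` would be the auto-save / warning-dialog / logging code whose memory needs were estimated up front.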
Note that the application need not shut down. It can display a modal dialog until the OOM condition has been resolved.
I'm not 100% certain but I'm pretty sure 'Code Complete' (required reading for any respectable software engineer) covers this.
P.S. You can extend your application framework to help with this strategy, but please don't implement such a policy in a library (good libraries do not make global decisions without an application's consent).

I think that like many things, it's a cost/benefit analysis. You can program in attempted recovery from a malloc() failure - although it may be difficult (your handler had better not fall foul of the same memory shortage it's meant to deal with).
You've already noted that the commonest case is to clean up and fail gracefully. In that case it's been decided that the cost of aborting gracefully is lower than the combination of development cost and performance cost in recovering.
I'm sure you can think of your own examples of situations where terminating the program is a very expensive option (life support machine, spaceship control, long-running and time-critical financial calculation etc.) - although the first line of defence is of course to ensure that the program has predictable memory usage and that the environment can supply that.

I'm working on a system that allocates memory for IO cache to increase performance. Then, on detecting OOM, it takes some of it back, so that the business logic could proceed, even if that means less IO cache and slightly lower write performance.
I also worked with an embedded Java application that attempted to manage OOM by forcing garbage collection, optionally releasing some non-critical objects, like pre-fetched or cached data.
The main problems with OOM handling are:
1) being able to re-try in the place where it happened or being able to roll back and re-try from a higher point. Most contemporary programs rely too much on the language to throw and don't really manage where they end up and how to re-try the operation. Usually the context of the operation will be lost, if it wasn't designed to be preserved
2) being able to actually release some memory. This means a kind of resource manager that knows what objects are critical and what are not, and the system be able to re-request the released objects when and if they later become critical
Another important issue is to be able to roll back without triggering yet another OOM situation. This is something that is hard to control in higher level languages.
Also, the underlying OS must behave predictably with regard to OOM. Linux, for example, will not, if memory overcommit is enabled. Many swap-enabled systems will die sooner than reporting the OOM to the offending application.
And, there's the case when it is not your process that created the situation, so releasing memory does not help if the offending process continues to leak.
Because of all this, it's often the big and embedded systems that employ these techniques, for they have the control over the OS and memory to enable them, and the discipline/motivation to implement them.

It is recoverable only if you catch it and handle it correctly.
In some cases, for example when a request tries to allocate a lot of memory, it is quite predictable and you can handle it very well.
However, in many cases in a multi-threaded application, OOM may also happen on a background thread (including ones created by the system or a 3rd-party library).
That is almost impossible to predict, and you may be unable to recover the state of all your threads.

No.
An out of memory error from the GC should not generally be recoverable inside of the current thread. (Recoverable thread (user or kernel) creation and termination should be supported, though.)
Regarding the counter examples: I'm currently working on a D programming language project which uses NVIDIA's CUDA platform for GPU computing. Instead of manually managing GPU memory, I've created proxy objects to leverage D's GC. So when the GPU returns an out of memory error, I run a full collect and only raise an exception if it fails a second time. But this isn't really an example of out of memory recovery, it's more one of GC integration. The other examples of recovery (caches, free-lists, stacks/hashes without auto-shrinking, etc.) are all structures that have their own methods of collecting/compacting memory which are separate from the GC and tend not to be local to the allocating function.
So people might implement something like the following:
T new2(T)( lazy T old_new ) {
    T obj;
    try{
        obj = old_new;
    }catch(OutOfMemoryException oome) {
        foreach(compact; Global_List_Of_Delegates_From_Compatible_Objects)
            compact();
        obj = old_new;
    }
    return obj;
}
Which is a decent argument for adding support for registering/unregistering self-collecting/compacting objects to garbage collectors in general.

In the general case, it's not recoverable.
However, if your system includes some form of dynamic caching, an out-of-memory handler can often dump the oldest elements in the cache (or even the whole cache).
Of course, you have to make sure that the "dumping" process requires no new memory allocations :) Also, it can be tricky to recover the specific allocation that failed, unless you're able to plug your cache dumping code directly at the allocator level, so that the failure isn't propagated up to the caller.
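Plugging the cache dumping in at the allocator level might look roughly like this in Python. `EvictingAllocator` and the injected `allocate` callable are hypothetical, and note that a Python handler cannot truly guarantee it needs no new memory itself, so this only illustrates where the retry sits.

```python
from collections import OrderedDict


class EvictingAllocator:
    """Wraps an allocation function and sheds cache entries when it fails."""

    def __init__(self, allocate, cache):
        self._allocate = allocate
        self._cache = cache  # an OrderedDict, oldest entries first

    def alloc(self, size):
        """Allocate `size` bytes, dropping cached entries until it succeeds."""
        while True:
            try:
                return self._allocate(size)
            except MemoryError:
                if not self._cache:
                    raise  # nothing left to drop: propagate to the caller
                # Drop the oldest cache entry, then retry the same request.
                self._cache.popitem(last=False)
```

Because the retry happens inside `alloc`, the caller only ever sees a failure once the cache is completely empty.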

It depends on what you mean by running out of memory.
When malloc() fails on most systems, it's because you've run out of address-space.
If most of that memory is taken by caching, or by mmap'd regions, you might be able to reclaim some of it by freeing your cache or unmapping. However, this really requires that you know what you're using that memory for, and as you've noticed, either most programs don't, or it doesn't make a difference.
If you used setrlimit() on yourself (to protect against unforeseen attacks, perhaps, or maybe root did it to you), you can relax the limit in your error handler. I do this very frequently, after prompting the user if possible, and logging the event.
On the other hand, catching stack overflow is a bit more difficult, and isn't portable. I wrote a posixish solution for ECL, and described a Windows implementation, if you're going this route. It was checked into ECL a few months ago, but I can dig up the original patches if you're interested.

Especially in garbage collected environments, it's quite likely that if you catch the OutOfMemory error at a high level of the application, lots of stuff has gone out of scope and can be reclaimed to give you back memory.
In the case of single excessive allocations, the app may be able to continue working flawlessly. Of course, if you have a gradual memory leak, you'll just run into the problem again (more likely sooner than later), but it's still a good idea to give the app a chance to go down gracefully, save unsaved changes in the case of a GUI app, etc.

Yes, OOM is recoverable. As an extreme example, the Unix and Windows operating systems recover quite nicely from OOM conditions, most of the time. The applications fail, but the OS survives (assuming there is enough memory for the OS to properly start up in the first place).
I only cite this example to show that it can be done.
The problem of dealing with OOM is really dependent on your program and environment.
For example, in many cases the place where the OOM happens most likely is NOT the best place to actually recover from an OOM state.
Now, a custom allocator could possibly work as a central point within the code that can handle an OOM. The Java allocator will perform a full GC before it actually throws an OOM exception.
The more "application aware" your allocator is, the better suited it would be as a central handler and recovery agent for OOM. Using Java again, its allocator isn't particularly application aware.
This is where something like Java is really frustrating. You can't override the allocator. So, while you could trap OOM exceptions in your own code, there's nothing saying that some library you're using is properly trapping, or even properly THROWING, an OOM exception. It's trivial to create a class that is forever ruined by an OOM exception, as some object gets set to null in a case where "that never happens", and it's never recoverable.
So, yes, OOM is recoverable, but it can be VERY hard, particularly in modern environments like Java and its plethora of 3rd party libraries of various quality.

The question is tagged "language-agnostic", but it's difficult to answer without considering the language and/or the underlying system.
If memory allocation is implicit, with no mechanism to detect whether a given allocation succeeded or not, then recovering from an out-of-memory condition may be difficult or impossible.
For example, if you call a function that attempts to allocate a huge array, most languages just don't define the behavior if the array can't be allocated. (In Ada this raises a Storage_Error exception, at least in principle, and it should be possible to handle that.)
On the other hand, if you have a mechanism that attempts to allocate memory and is able to report a failure to do so (like C's malloc() or C++'s new), then yes, it's certainly possible to recover from that failure. In at least the cases of malloc() and new, a failed allocation doesn't do anything other than report failure (it doesn't corrupt any internal data structures, for example).
Whether it makes sense to try to recover depends on the application. If the application just can't succeed after an allocation failure, then it should do whatever cleanup it can and terminate. But if the allocation failure merely means that one particular task cannot be performed, or if the task can still be performed more slowly with less memory, then it makes sense to continue operating.
A concrete example: Suppose I'm using a text editor. If I try to perform some operation within the editor that requires a lot of memory, and that operation can't be performed, I want the editor to tell me it can't do what I asked and let me keep editing. Terminating without saving my work would be an unacceptable response. Saving my work and terminating would be better, but is still unnecessarily user-hostile.
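The "performed more slowly with less memory" case can be sketched in C as follows. This is only an illustration under assumptions: alloc_work_buffer is a hypothetical helper, and halving the request is just one possible fallback policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Try to get a work buffer of `preferred` bytes; on failure keep halving
 * the request, but never go below `minimum`. The size actually obtained
 * is stored in *got. Returns NULL (with *got == 0) only if even `minimum`
 * bytes cannot be allocated, in which case the caller should report
 * "cannot do that right now" and carry on with everything else. */
void *alloc_work_buffer(size_t preferred, size_t minimum, size_t *got)
{
    assert(minimum > 0 && minimum <= preferred);

    size_t size = preferred;
    for (;;) {
        void *buf = malloc(size);
        if (buf != NULL) {
            *got = size;
            return buf;
        }
        if (size == minimum)
            break;                  /* already tried the smallest size */
        size /= 2;
        if (size < minimum)
            size = minimum;
    }
    *got = 0;
    return NULL;
}
```

A text editor could use the returned size to decide how many chunks to process an operation in, and show an error message instead of terminating when the result is NULL.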

This is a difficult question. At first sight, having no more memory seems to mean "out of luck", but you can get rid of a lot of memory-related work if you really insist. Take strtok, a function that is broken in other ways but has no problems with memory, because it allocates nothing. As a counterpart, take g_strsplit from the GLib library, which depends heavily on memory allocation, as does nearly everything in GLib- or GObject-based programs. One can definitely say that memory allocation is used much more in dynamic languages than in less flexible ones, especially C. But let us look at the alternatives. If you simply end the program when you run out of memory, even carefully developed code may stop working. If the error is recoverable, you can do something about it. So the argument is that making it recoverable lets you choose to "handle" the situation differently (e.g. by putting aside a memory block for emergencies, or by degrading to a less memory-intensive mode).
So the most compelling reason is this: if you provide a way of recovering, one can attempt the recovery; if you do not, everything depends on always getting enough memory...
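The idea of putting aside a memory block for emergencies might look roughly like this in C. All names here (reserve_init, reserve_release, malloc_with_reserve) are hypothetical. Releasing the reserve does not guarantee that the retry succeeds, for instance when the heap is fragmented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static void *emergency_reserve = NULL;

/* Call once at startup, while memory is still plentiful.
 * Returns 0 on success, -1 if the reserve could not be allocated. */
int reserve_init(size_t bytes)
{
    assert(emergency_reserve == NULL);   /* initialise only once */
    emergency_reserve = malloc(bytes);
    return emergency_reserve != NULL ? 0 : -1;
}

/* Hand the reserve back to the allocator.
 * Returns 1 if it was released, 0 if it had already been spent. */
int reserve_release(void)
{
    if (emergency_reserve == NULL)
        return 0;
    free(emergency_reserve);
    emergency_reserve = NULL;
    return 1;
}

/* malloc() that spends the reserve once before reporting failure.
 * *low_memory is set to 1 when the reserve had to be used, so the caller
 * can switch to a degraded mode (save work, refuse new tasks, ...). */
void *malloc_with_reserve(size_t size, int *low_memory)
{
    void *mem = malloc(size);
    if (mem == NULL && reserve_release()) {
        *low_memory = 1;
        mem = malloc(size);
    }
    return mem;
}
```

On systems that overcommit memory, the reserve may need to be written to after allocation (for example with memset) before it is actually backed by physical memory.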
Regards

It's just puzzling me now.
At work, we have a bundle of applications working together, and memory is running low. While the real fix is either to make the application bundle 64-bit (and so be able to work beyond the 2 GB limit we have on a normal Win32 OS) and/or to reduce our memory use, this question of "how to recover from an OOM" won't leave my head.
Of course, I have no solution, but still play at searching for one for C++ (because of RAII and exceptions, mainly).
Perhaps a process that is supposed to recover gracefully should break its processing down into atomic/rollback-able tasks (i.e. using only functions/methods giving the strong/nothrow exception guarantee), with a "buffer/pool of memory" reserved for recovery purposes.
Should one of the tasks fail, the C++ bad_alloc would unwind the stack and free some stack/heap memory through RAII. The recovery feature would then salvage as much as possible (saving the initial data of the task to disk) and perhaps register the task data for a later try.
I do believe the use of the C++ strong/nothrow guarantees can help a process survive in low-memory conditions, even if it would be akin to memory swapping (i.e. slow, somewhat unresponsive, etc.), but of course, this is only theory. I just need to get smarter on the subject before trying to simulate this (i.e. creating a C++ program with a custom new/delete allocator with limited memory, and then trying to do some work under those stressful conditions).
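For the simulation mentioned above, a limited-memory allocator could be sketched like this. It is written in C rather than as C++ new/delete overloads, and the names budget_set, budget_malloc and budget_free are assumptions for this example. It refuses requests once a fixed byte budget is used up, which makes the out-of-memory paths easy to trigger in a test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static size_t budget_left = 0;

/* Header stored in front of every block so budget_free() knows its size.
 * The max_align_t member keeps the user pointer suitably aligned. */
union block_header {
    size_t size;
    max_align_t align;
};

/* Set how many payload bytes may be outstanding at once. */
void budget_set(size_t bytes)
{
    budget_left = bytes;
}

void *budget_malloc(size_t size)
{
    if (size > budget_left || size > SIZE_MAX - sizeof(union block_header))
        return NULL;                        /* simulated out-of-memory */

    union block_header *hdr = malloc(sizeof *hdr + size);
    if (hdr == NULL)
        return NULL;                        /* real out-of-memory */

    hdr->size = size;
    budget_left -= size;
    return hdr + 1;                         /* payload starts after the header */
}

void budget_free(void *ptr)
{
    if (ptr == NULL)
        return;
    union block_header *hdr = (union block_header *)ptr - 1;
    assert(hdr->size <= SIZE_MAX - budget_left);   /* accounting must not overflow */
    budget_left += hdr->size;
    free(hdr);
}
```

Setting a small budget and running the task code against budget_malloc shows whether each task really rolls back cleanly when an allocation fails halfway through.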
Well...

Out of memory normally means you have to quit whatever you were doing. If you are careful about cleanup, though, it can leave the program itself operational and able to respond to other requests. It's better to have a program say "Sorry, not enough memory to do " than say "Sorry, out of memory, shutting down."

Out of memory can be caused either by free memory depletion or by trying to allocate an unreasonably big block (like one gig). In "depletion" cases memory shortage is global to the system and usually affects other applications and system services and the whole system might become unstable so it's wise to forget and reboot. In "unreasonably big block" cases no shortage actually occurs and it's safe to continue. The problem is you can't automatically detect which case you're in. So it's safer to make the error non-recoverable and find a workaround for each case you encounter this error - make your program use less memory or in some cases just fix bugs in code that invokes memory allocation.

There are already many good answers here. But I'd like to contribute with another perspective.
Depletion of just about any reusable resource should be recoverable in general. The reasoning is that each and every part of a program is basically a subprogram. Just because one subprogram cannot run to its end at this very point in time does not mean that the entire state of the program is garbage. Just because the parking lot is full of cars does not mean that you trash your car. Either you wait a while for a space to be free, or you drive to a store further away to buy your cookies.
In most cases there is an alternative way. Making an out-of-memory error unrecoverable effectively removes a lot of options, and none of us likes to have someone else decide for us what we can and cannot do.
The same applies to disk space. It's really the same reasoning. And contrary to your insinuation that stack overflow is unrecoverable, I would say that it's an arbitrary limitation. There is no good reason that you should not be able to throw an exception (popping a lot of frames) and then use another, less efficient approach to get the job done.
My two cents :-)

If you are really out of memory you are doomed, since you cannot free anything anymore.
If you are out of memory, but something like a garbage collector can kick in and free up some memory, you are not dead yet.
The other problem is fragmentation. Although you might not be out of memory, the free memory may be fragmented, so you might still not be able to allocate the huge chunk you want.

I know you asked for arguments for, but I can only see arguments against.
I don't see any way to achieve this in a multi-threaded application. How do you know which thread is actually responsible for the out-of-memory error? One thread could be allocating new memory constantly and hold GC roots to 99% of the heap, but the first allocation that fails occurs in another thread.
A practical example: whenever I have encountered an OutOfMemoryError in our Java application (running on a JBoss server), it's not as if one thread dies and the rest of the server continues to run: no, there are several OOMEs, killing several threads (some of which are JBoss' internal threads). I don't see what I as a programmer could do to recover from that, or even what JBoss could do to recover from it. In fact, I am not even sure you CAN: the javadoc for VirtualMachineError suggests that the JVM may be "broken" after such an error is thrown. But maybe the question was more targeted at language design.

uClibc has an internal static buffer of 8 bytes or so for file I/O when there is no more memory to be allocated dynamically.

What is the compelling argument for making it a recoverable error?
In Java, a compelling argument for not making it a recoverable error is that Java allows OOM to be signalled at any time, including at times where the result could be your program entering an inconsistent state. Reliable recovery from an OOM is therefore impossible; if you catch the OOM exception, you cannot rely on any of your program state. See
No-throw VirtualMachineError guarantees

I'm working on SpiderMonkey, the JavaScript VM used in Firefox (and GNOME and a few others). When you're out of memory, you may want to do any of the following things:
- Run the garbage collector. We don't run the garbage collector all the time, as it would kill performance and battery, so by the time you reach an out-of-memory error, some garbage may have accumulated.
- Free memory. For instance, get rid of some of the in-memory cache.
- Kill or postpone non-essential tasks. For instance, unload from memory some tabs that haven't been used in a long time.
- Log things to help the developer troubleshoot the out-of-memory error.
- Display a semi-nice error message to let the user know what's going on.
- ...
So yes, there are many reasons to handle out-of-memory errors manually!
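To make the escalation above concrete, here is a small C sketch that tries recovery steps in order, cheapest first, and retries the allocation after each step that freed something. This is not how SpiderMonkey implements it; the step functions are hypothetical placeholders that do nothing except count how often they were called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A recovery step tries to release memory and returns nonzero if it freed anything. */
typedef int (*recovery_step)(void);

static int steps_run = 0;

/* Hypothetical placeholders: a real program would run its collector, drop
 * caches, or unload idle tasks here, and return 1 when memory was freed. */
static int collect_garbage(void)   { steps_run++; return 0; }
static int drop_caches(void)       { steps_run++; return 0; }
static int unload_idle_tasks(void) { steps_run++; return 0; }

const recovery_step default_steps[] = {
    collect_garbage, drop_caches, unload_idle_tasks,
};

/* Try malloc(); while it keeps failing, run the next step and retry if that
 * step reports that it freed something. Returns NULL once every step has
 * been tried, at which point the caller can log and tell the user. */
void *malloc_with_recovery(size_t size, const recovery_step *steps, size_t nsteps)
{
    assert(steps != NULL || nsteps == 0);

    void *mem = malloc(size);
    for (size_t i = 0; mem == NULL && i < nsteps; i++) {
        if (steps[i]())
            mem = malloc(size);
    }
    return mem;
}
```

A call such as malloc_with_recovery(n, default_steps, 3) leaves the steps untouched in the normal case and only pays for them when an allocation actually fails.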

I have this:
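#include <stdlib.h>   /* malloc */
#include <unistd.h>   /* sleep */

/* Retry malloc() once per second until it succeeds; never returns NULL.
 * Note: loops forever if size is 0 and malloc(0) returns NULL. */
void *smalloc(size_t size) {
    void *mem = NULL;
    for (;;) {
        mem = malloc(size);
        if (mem == NULL) {
            sleep(1);   /* wait for memory to be released elsewhere */
        } else
            break;
    }
    return mem;
}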
Which has saved a system a few times already. Just because you're out of memory now doesn't mean that some other part of the system, or other processes running on the system, won't give some memory back soon. You had better be very, very careful before attempting such tricks, though, and have full control over all the memory you allocate in your program.
