Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?

Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive? - lock-free

I wrote some lock-free code that works fine with local
reads, under most conditions.
Does local spinning on a memory read necessarily imply I
have to ALWAYS insert a memory barrier before the spinning
read?
(To validate this, I managed to produce a reader/writer
combination which results in a reader never seeing the
written value, under certain very specific
conditions--dedicated CPU, process attached to CPU,
optimizer turned all the way up, no other work done in the
loop--so the arrows do point in that direction, but I'm not
entirely sure about the cost of spinning through a memory
barrier.)
What is the cost of spinning through a memory barrier if
there is nothing to be flushed in the cache's store buffer?
i.e., all the process is doing (in C) is
while ( 1 ) {
__sync_synchronize();
v = value;
if ( v != 0 ) {
... something ...
}
}
Am I correct to assume that it's free and it won't encumber
the memory bus with any traffic?
Another way to put this is to ask: does a memory barrier do
anything more than: flush the store buffer, apply the
invalidations to it, and prevent the compiler from
reordering reads/writes across its location?
Disassembling, __sync_synchronize() appears to translate into:
lock orl
From the Intel manual (similarly nebulous for the neophyte):
Volume 3A: System Programming Guide, Part 1 -- 8.1.2
Bus Locking
Intel 64 and IA-32 processors provide a LOCK# signal that
is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other
processors or bus agents for control of the bus are
blocked.
[...]
For the P6 and more recent processor families, if the
memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted;
instead, locking is only applied to the processor’s caches
(see Section 8.1.4, “Effects of a LOCK Operation on
Internal Processor Caches”).
My translation: "when you say LOCK, this would be expensive, but we're
only doing it where necessary."
#BlankXavier:
I did test that if the writer does not explicitly push out the write from the store buffer and it is the only process running on that CPU, the reader may never see the effect of the writer (I can reproduce it with a test program, but as I mentioned above, it happens only with a specific test, with specific compilation options and dedicated core assignments--my algorithm works fine, it's only when I got curious about how this works and wrote the explicit test that I realized it could potentially have a problem down the road).
I think by default simple writes are WB writes (Write Back), which means they don't get flushed out immediately, but reads will take their most recent value (I think they call that "store forwarding"). So I use a CAS instruction for the writer. I discovered in the Intel manual all these different types of write implementations (UC, WC, WT, WB, WP), Intel vol 3A chap 11-10, still learning about them.
My uncertainty is on the reader's side: I understand from McKenney's paper that there is also an invalidation queue, a queue of incoming invalidations from the bus into the cache. I'm not sure how this part works. In particular, you seem to imply that looping through a normal read (i.e., non-LOCK'ed, without a barrier, and using volatile only to insure the optimizer leaves the read once compiled) will check into the "invalidation queue" every time (if such a thing exists). If a simple read is not good enough (i.e. could read an old cache line which still appears valid pending a queued invalidation (that sounds a bit incoherent to me too, but how do invalidation queues work then?)), then an atomic read would be necessary and my question is: in this case, will this have any impact on the bus? (I think probably not.)
I'm still reading my way through the Intel manual and while I see a great discussion of store forwarding, I haven't found a good discussion of invalidation queues. I've decided to convert my C code into ASM and experiment, I think this is the best way to really get a feel for how this works.

The "xchg reg,[mem]" instruction will signal its lock intention over the LOCK pin of the core. This signal weaves its way past other cores and caches down to the bus-mastering buses (PCI variants etc) which will finish what they are doing and eventually the LOCKA (acknowledge) pin will signal the CPU that the xchg may complete. Then the LOCK signal is shut off. This sequence can take a long time (hundreds of CPU cycles or more) to complete. Afterwards the appropriate cache lines of the other cores will have been invalidated and you will have a known state, i e one that has ben synchronized between the cores.
The xchg instruction is all that is neccessary to implement an atomic lock. If the lock itself is successful you have access to the resource that you have defined the lock to control access to. Such a resource could be a memory area, a file, a device, a function or what have you. Still, it is always up to the programmer to write code that uses this resource when it's been locked and doesn't when it hasn't. Typically the code sequence following a successful lock should be made as short as possible such that other code will be hindered as little as possible from acquiring access to the resource.
Keep in mind that if the lock wasn't successful you need to try again by issuing a new xchg.
"Lock free" is an appealing concept but it requires the elimination of shared resources. If your application has two or more cores simultaneously reading from and writing to a common memory address "lock free" is not an option.

I may well not properly have understood the question, but...
If you're spinning, one problem is the compiler optimizing your spin away. Volatile solves this.
The memory barrier, if you have one, will be issued by the writer to the spin lock, not the reader. The writer doesn't actually have to use one - doing so ensures the write is pushed out immediately, but it'll go out pretty soon anyway.
The barrier prevents for a thread executing that code re-ordering across it's location, which is its other cost.

Keep in mind that barriers typically are used to order sets of memory accesses, so your code could very likely also need barriers in other places. For example, it wouldn't be uncommon for the barrier requirement to look like this instead:
while ( 1 ) {
v = pShared->value;
__acquire_barrier() ;
if ( v != 0 ) {
foo( pShared->something ) ;
}
}
This barrier would prevent loads and stores in the if block (ie: pShared->something) from executing before the value load is complete. A typical example is that you have some "producer" that used a store of v != 0 to flag that some other memory (pShared->something) is in some other expected state, as in:
pShared->something = 1 ; // was 0
__release_barrier() ;
pShared->value = 1 ; // was 0
In this typical producer consumer scenario, you'll almost always need paired barriers, one for the store that flags that the auxiliary memory is visible (so that the effects of the value store aren't seen before the something store), and one barrier for the consumer (so that the something load isn't started before the value load is complete).
Those barriers are also platform specific. For example, on powerpc (using the xlC compiler), you'd use __isync() and __lwsync() for the consumer and producer respectively. What barriers are required may also depend on the mechanism that you use for the store and load of value. If you've used an atomic intrinsic that results in an intel LOCK (perhaps implicit), then this will introduce an implicit barrier, so you may not need anything. Additionally, you'll likely also need to judicious use of volatile (or preferably use an atomic implementation that does so under the covers) in order to get the compiler to do what you want.

Related

Is Pinned memory non-atomic read/write safe on Xavier devices?

Double posted here, since I did not get a response I will post here as well.
Cuda Version 10.2 (can upgrade if needed)
Device: Jetson Xavier NX/AGX
I have been trying to find the answer to this across this forum, stack overflow, etc.
So far what I have seen is that there is no need for a atomicRead in cuda because:
“A properly aligned load of a 64-bit type cannot be “torn” or partially modified by an “intervening” write. I think this whole question is silly. All memory transactions are performed with respect to the L2 cache. The L2 cache serves up 32-byte cachelines only. There is no other transaction possible. A properly aligned 64-bit type will always fall into a single L2 cacheline, and the servicing of that cacheline cannot consist of some data prior to an extraneous write (that would have been modified by the extraneous write), and some data after the same extraneous write.” - Robert Crovella
However I have not found anything about cache flushing/loading for the iGPU on a tegra device. Is this also on “32-byte cachelines”?
My use case is to have one kernel writing to various parts of a chunk of memory (not atomically i.e. not using atomic* functions), but also have a second kernel only reading those same bytes in a non-tearing manner. I am okay with slightly stale data in my read (given the writing kernel flushes/updates the memory such that proceeding read kernels/processes get the update within a few milliseconds). The write kernel launches and completes after 4-8 ms or so.
At what point in the life cycle of the kernel does the iGPU update the DRAM with the cached values (given we are NOT using atomic writes)? Is it simply always at the end of the kernel execution, or at some other point?
Can/should pinned memory be used for this use case, or would unified be more appropriate such that I can take advantage of the cache safety within the iGPU?
According to the Memory Management section here we see that the iGPU access to pinned memory is Uncached. Does this mean we cannot trust the iGPU to still have safe access like Robert said above?
If using pinned, and a non-atomic write and read occur at the same time, what is the outcome? Is this undefined/segfault territory?
Additionally if using pinned and an atomic write and read occur at the same time, what is the outcome?
My goal is to remove the use of cpu side mutexing around the memory being used by my various kernels since this is causing a coupling/slow-down of two parts of my system.
Any advice is much appreciated. TIA.

Most generally correct way of updating a vertex buffer in Vulkan

Assume a vertex buffer in device memory and a staging buffer that's host coherent and visible. Also assume a desktop system with a discrete GPU (so separate memories). And lastly, assume correct inter-frame synchronization.
I see two general possible ways of updating a vertex buffer:
Map + memcpy + unmap into the staging buffer, followed by a transient (single command) command buffer that contains a vkCmdCopyBuffer, submit it to the graphics queue and wait for the queue to idle, then free the transient command buffer. After that submit the regular frame draw queue to the graphics queue as usual. This is the code used on https://vulkan-tutorial.com (for example, this .cpp file).
Similar to above, only instead use additional semaphores to signal after the staging buffer copy submit, and wait in the regular frame draw submit, thus skipping the "wait-for-idle" command.
#2 sort of makes sense to me, and I've repeatedly read not to do any "wait-for-idle" operations in Vulkan because it synchronizes the CPU with the GPU, but I've never seen it used in any tutorial or example online. What do the pros usually do if the vertex buffer has to be updated relatively often?

First, if you allocated coherent memory, then you almost certainly did so in order to access it from the CPU. Which requires mapping it. Vulkan is not OpenGL; there is no requirement that memory be unmapped before it can be used (and OpenGL doesn't even have that requirement anymore).
Unmapping memory should only ever be done when you are about to delete the memory allocation itself.
Second, if you think of an idea that involves having the CPU wait for a queue or device to idle before proceeding, then you have come up with a bad idea and should use a different one. The only time you should wait for a device to idle is when you want to destroy the device.
Tutorial code should not be trusted to give best practices. It is often intended to be simple, to make it easy to understand a concept. Simple Vulkan code often gets in the way of performance (and if you don't care about performance, you shouldn't be using Vulkan).
In any case, there is no "most generally correct way" to do most things in Vulkan. There are lots of definitely incorrect ways, but no "generally do this" advice. Vulkan is a low-level, explicit API, and the result of that is that you need to apply Vulkan's tools to your specific circumstances. And maybe profile on different hardware.
For example, if you're generating completely new vertex data every frame, it may be better to see if the implementation can read vertex data directly from coherent memory, so that there's no need for a staging buffer at all. Yes, the reads may be slower, but the overall process may be faster than a transfer followed by a read.
Then again, it may not. It may be faster on some hardware, and slower on others. And some hardware may not allow you to use coherent memory for any buffer that has the vertex input usage at all. And even if it's allowed, you may be able to do other work during the transfer, and thus the GPU spends minimal time waiting before reading the transferred data. And some hardware has a small pool of device-local memory which you can directly write to from the CPU; this memory is meant for these kinds of streaming applications.
If you are going to do staging however, then your choices are primarily about which queue you submit the transfer operation on (assuming the hardware has multiple queues). And this primarily relates to how much latency you're willing to endure.
For example, if you're streaming data for a large terrain system, then it's probably OK if it takes a frame or two for the vertex data to be usable on the GPU. In that case, you should look for an alternative, transfer-only queue on which to perform the copy from the staging buffer to the primary memory. If you do, then you'll need to make sure that later commands which use the eventual results synchronize with that queue, which will need to be done via a semaphore.
If you're in a low-latency scenario where the data being transferred needs to be used this frame, then it may be better to submit both to the same queue. You could use an event to synchronize them rather than a semaphore. But you should also endeavor to put some kind of unrelated work between the transfer and the rendering operation, so that you can take advantage of some degree of parallelism in operations.

Cache Implementation in Pipelined Processor

I have recently started coding in verilog. I have completed my first project, prototyping a MIPS 32 processor using 5 stage pipelining. Now my next task is to implement a single level cache hiearchy on the instruction set memory.
I have sucessfully implemented a 2-way set associative cache.
Previously I had declared the instruction set memory as a array of registers, so whenever I need to access the next instruction in IF stage, the data(instruction) gets instantaneously allotted to the register for further decoding (since blocking/non_blocking assignment is instantaneous from any memory location).
But now since I have a single level cache added on top of it, it takes a few more cycles for the cache FSM to work (like data searching, and replacement policies in case of cache miss). Max. delay is about 5 cycles when there is a cache miss.
Since my pipelined stage proceeds to the next stage within just a single cycle, hence whenever there is a cache miss, the cache fails to deliver the instruction before the pipeline stage moves to the next stage. So desired output is always wrong.
To counteract this , I have increased the clock of the cache by 5 times as compared the processor pipelined clock. This does do the work, since the cache clock is much faster, it need not to worry about the processor clock.
But is this workaround legit?? I mean i haven't heard of multiple clocks in a processor system. How does the processors in real world overcome this issue.
Yes ofc, there is an another way of using stall cycles in pipeline until the data is readily made available in cache (hit). But just wondering is making memory system more faster by increasing clock is justified??
P.S. I am newbie to computer architecture and verilog. I dont know about VLSI much. This is my first question ever, because whatever questions strikes, i get it readily available in webpages, but i cant find much details about this problem, so i am here.
I also asked my professor, she replied me to research more in this topic, bcs none of my colleague/ senior worked much on pipelined processors.

But is this workaround legit??
No, it isn't :P You're not only increasing the cache clock, but also apparently the memory clock. And if you can run your cache 5x faster and still make the timing constraints, that means you should clock your whole CPU 5x faster if you're aiming for max performance.
A classic 5-stage RISC pipeline assumes and is designed around single-cycle latency for cache hits (and simultaneous data and instruction cache access), but stalls on cache misses. (Data load/store address calculation happens in EX, and cache access in MEM, which is why that stage exists)
A stall is logically equivalent to inserting a NOP, so you can do that on cache miss. The program counter needs to not increment, but otherwise it should be a pretty local change.
If you had hardware performance counters, you'd maybe want to distinguish between real instructions vs. fake stall NOPs so you could count real instructions executed.
You'll need to implement pipeline interlocks for other stages that stall to wait for their inputs to be ready, e.g. a cache-miss load followed by an add that uses the result.
MIPS I had load-delay slots (you can't use the result of a load in the following instruction, because the MEM stage is after EX). So that ISA rule hides the 1 cycle latency of a cache hit without requiring the HW to detect the dependency and stall for it.
But a cache miss still had to be detected. Probably it stalled the whole pipeline whether there was a dependency or not. (Again, like inserting a NOP for the rest of the pipeline while holding on to the incoming instruction. Except this isn't the first stage, so it has to signal to the previous stage that it's stalling.)
Later versions of MIPS removed the load delay slot to avoid bloating code with NOPs when compilers couldn't fill the slot. Simple HW then had to detect the dependency and stall if needed, but smarter hardware probably tracked loads anyway so they could do hit under miss and so on. Not stalling the pipeline until an instruction actually tried to read a load result that wasn't ready.
MIPS = "Microprocessor without Interlocked Pipeline Stages" (i.e. no data-hazard detection). But it still had to stall for cache misses.
An alternate expansion for the acronym (which still fits MIPS II where the load delay slot as removed, requiring HW interlocks to detect that data hazard) would be "Minimally Interlocked Pipeline Stages" but apparently I made that up in my head, thanks #PaulClayton for catching that.

How are mutexes implemented?

Are some implementations better than others for specific applications? Is there anything to earn by rolling out your own?

Check out the description of the Test-and-set machine instruction on Wikipedia, which alludes to how atomic operations are achieved at the machine level. I can imagine most language-level mutex implementations rely on machine-level support such as Test-and-set.

Building on Adamski's test-and-set suggestion, you should also look at the concept of "fast user-space mutexes" or futexes.
Futexes have the desirable property that they do not require a kernel system call in the common cases of locking or unlocking an uncontended mutex. In these cases, the user-mode code successfully uses an atomic compare and swap (CAS) operation to lock or unlock the mutex.
If CAS fails, the mutex is contended and a kernel system call -- sys_futex under Linux -- must be used either to wait for the mutex (in the lock case) or to wake other threads (in the unlock case).
If you're serious about implementing this yourself, make sure you also read Ulrich Drepper's paper.

A mutex preferably runs in the kernel of the operating system while keeping the amount of code around it as short as possible, so it can avoid being cut-off while task-switching to another process. The exact implementation is therefore a bit of a secret. It's not complex though. It's basically an object that has a boolean field, which it gets and sets.
When using a counter, it can become a Semaphore.
A mutex is the starting point for a critical section, which uses a mutex internally to see if it can enter a section of code. If the mutex is free, it sets the mutex and executes the code, only to release the mutex when done. When a critical section notices that a mutex is locked, it can wait for the mutex to be released.
Around the basic mutex logic there are wrappers to wrap it in an object.. Then more wrapper objects to make it available outside the kernel. And then another wrapper to make it available in .NET. And then several programmers will write their own wrapper code around this all for their own logical needs. The wrappers around wrappers really make them a murky territory.
Now, with this basic knowledge about the internals of mutexes, all I hope is that you're going to use one implementation that relies on the kernel and the hardware underneath. These would be the most reliable. (If the hardware supports these.) If the mutex that you're using doesn't work at this kernel/hardware level then it can still be reliable but I would advise to not use it, unless there's no alternative.
As far as I know, Windows, Linux and .NET will all use mutexes at kernel/hardware level.
The Wikipedia page that I've linked to explains more about the internal logic and possible implementations. Preferably, a mutex is controlled by the hardware, thus making the whole getting/setting of the mutex an indivisible step. (Just to make sure the system doesn't switch tasks in-between.)

A bit of assembly to demonstrate locking atomically:
; BL is the mutex id
; shared_val, a memory address
CMP [shared_val],BL ; Perhaps it is locked to us anyway
JZ .OutLoop2
.Loop1:
CMP [shared_val],0xFF ; Free
JZ .OutLoop1 ; Yes
pause ; equal to rep nop.
JMP .Loop1 ; Else, retry
.OutLoop1:
; Lock is free, grab it
MOV AL,0xFF
LOCK CMPXCHG [shared_val],BL
JNZ .Loop1 ; Write failed
.OutLoop2: ; Lock Acquired

Interlocked.CompareExchange is enough to implement spinlocks. It's pretty difficult to do right though. See for Joe Duffy's blog for an example of the subtleties involved.

I used Reflector.NET to decompile the source for System.Threading.ReaderWriterLockSlim, which was added to a recent version of the .NET framework.
It mostly uses Interlocked.CompareExchange, Thread.SpinWait and Thread.Sleep to achieve synchronisation. There are a few EventWaitHandle (kernel object) instances that are used under some circumstances.
There's also some complexity added to support reentrancy on a single thread.
If you're interested in this area and working in .NET (or at least, can read it) then you might find it quite interesting to check this class out.

Is "Out Of Memory" A Recoverable Error?

I've been programming a long time, and the programs I see, when they run out of memory, attempt to clean up and exit, i.e. fail gracefully. I can't remember the last time I saw one actually attempt to recover and continue operating normally.
So much processing relies on being able to successfully allocate memory, especially in garbage collected languages, it seems that out of memory errors should be classified as non-recoverable. (Non-recoverable errors include things like stack overflows.)
What is the compelling argument for making it a recoverable error?

It really depends on what you're building.
It's not entirely unreasonable for a webserver to fail one request/response pair but then keep on going for further requests. You'd have to be sure that the single failure didn't have detrimental effects on the global state, however - that would be the tricky bit. Given that a failure causes an exception in most managed environments (e.g. .NET and Java) I suspect that if the exception is handled in "user code" it would be recoverable for future requests - e.g. if one request tried to allocate 10GB of memory and failed, that shouldn't harm the rest of the system. If the system runs out of memory while trying to hand off the request to the user code, however - that kind of thing could be nastier.

In a library, you want to efficiently copy a file. When you do that, you'll usually find that copying using a small number of big chunks is much more effective than copying a lot of smaller ones (say, it's faster to copy a 15MB file by copying 15 1MB chunks than copying 15'000 1K chunks).
But the code works with any chunk size. So while it may be faster with 1MB chunks, if you design for a system where a lot of files are copied, it may be wise to catch OutOfMemoryError and reduce the chunk size until you succeed.
Another place is a cache for Object stored in a database. You want to keep as many objects in the cache as possible but you don't want to interfere with the rest of the application. Since these objects can be recreated, it's a smart way to conserve memory to attach the cache to an out of memory handler to drop entries until the rest of the app has enough room to breathe, again.
Lastly, for image manipulation, you want to load as much of the image into memory as possible. Again, an OOM-handler allows you to implement that without knowing in advance how much memory the user or OS will grant your code.
[EDIT] Note that I work under the assumption here that you've given the application a fixed amount of memory and this amount is smaller than the total available memory excluding swap space. If you can allocate so much memory that part of it has to be swapped out, several of my comments don't make sense anymore.

Users of MATLAB run out of memory all the time when performing arithmetic with large arrays. For example if variable x fits in memory and they run "x+1" then MATLAB allocates space for the result and then fills it. If the allocation fails MATLAB errors and the user can try something else. It would be a disaster if MATLAB exited whenever this use case came up.

OOM should be recoverable because shutdown isn't the only strategy to recovering from OOM.
There is actually a pretty standard solution to the OOM problem at the application level.
As part of you application design determine a safe minimum amount of memory required to recover from an out of memory condition. (Eg. the memory required to auto save documents, bring up warning dialogs, log shutdown data).
At the start of your application or at the start of a critical block, pre-allocate that amount of memory. If you detect an out of memory condition release your guard memory and perform recovery. The strategy can still fail but on the whole gives great bang for the buck.
Note that the application need not shut down. It can display a modal dialog until the OOM condition has been resolved.
I'm not 100% certain but I'm pretty sure 'Code Complete' (required reading for any respectable software engineer) covers this.
P.S. You can extend your application framework to help with this strategy but please don't implement such a policy in a library (good libraries do not make global decisions without an applications consent)

I think that like many things, it's a cost/benefit analysis. You can program in attempted recovery from a malloc() failure - although it may be difficult (your handler had better not fall foul of the same memory shortage it's meant to deal with).
You've already noted that the commonest case is to clean up and fail gracefully. In that case it's been decided that the cost of aborting gracefully is lower than the combination of development cost and performance cost in recovering.
I'm sure you can think of your own examples of situations where terminating the program is a very expensive option (life support machine, spaceship control, long-running and time-critical financial calculation etc.) - although the first line of defence is of course to ensure that the program has predictable memory usage and that the environment can supply that.

I'm working on a system that allocates memory for IO cache to increase performance. Then, on detecting OOM, it takes some of it back, so that the business logic could proceed, even if that means less IO cache and slightly lower write performance.
I also worked with an embedded Java applications that attempted to manage OOM by forcing garbage collection, optionally releasing some of non-critical objects, like pre-fetched or cached data.
The main problems with OOM handling are:
1) being able to re-try in the place where it happened or being able to roll back and re-try from a higher point. Most contemporary programs rely too much on the language to throw and don't really manage where they end up and how to re-try the operation. Usually the context of the operation will be lost, if it wasn't designed to be preserved
2) being able to actually release some memory. This means a kind of resource manager that knows what objects are critical and what are not, and the system be able to re-request the released objects when and if they later become critical
Another important issue is to be able to roll back without triggering yet another OOM situation. This is something that is hard to control in higher level languages.
Also, the underlying OS must behave predictably with regard to OOM. Linux, for example, will not, if memory overcommit is enabled. Many swap-enabled systems will die sooner than reporting the OOM to the offending application.
And, there's the case when it is not your process that created the situation, so releasing memory does not help if the offending process continues to leak.
Because of all this, it's often the big and embedded systems that employ this techniques, for they have the control over OS and memory to enable them, and the discipline/motivation to implement them.

It is recoverable only if you catch it and handle it correctly.
In same cases, for example, a request tried to allocate a lot memory. It is quite predictable and you can handle it very very well.
However, in many cases in multi-thread application, OOE may also happen on background thread (including created by system/3rd-party library).
It is almost imposable to predict and you may unable to recover the state of all your threads.

No.
An out of memory error from the GC is should not generally be recoverable inside of the current thread. (Recoverable thread (user or kernel) creation and termination should be supported though)
Regarding the counter examples: I'm currently working on a D programming language project which uses NVIDIA's CUDA platform for GPU computing. Instead of manually managing GPU memory, I've created proxy objects to leverage the D's GC. So when the GPU returns an out of memory error, I run a full collect and only raise an exception if it fails a second time. But, this isn't really an example of out of memory recovery, it's more one of GC integration. The other examples of recovery (caches, free-lists, stacks/hashes without auto-shrinking, etc) are all structures that have their own methods of collecting/compacting memory which are separate from the GC and tend not to be local to the allocating function.
So people might implement something like the following:
T new2(T)( lazy T old_new ) {
T obj;
try{
obj = old_new;
}catch(OutOfMemoryException oome) {
foreach(compact; Global_List_Of_Delegates_From_Compatible_Objects)
compact();
obj = old_new;
}
return obj;
}
Which is a decent argument for adding support for registering/unregistering self-collecting/compacting objects to garbage collectors in general.

In the general case, it's not recoverable.
However, if your system includes some form of dynamic caching, an out-of-memory handler can often dump the oldest elements in the cache (or even the whole cache).
Of course, you have to make sure that the "dumping" process requires no new memory allocations :) Also, it can be tricky to recover the specific allocation that failed, unless you're able to plug your cache dumping code directly at the allocator level, so that the failure isn't propagated up to the caller.

It depends on what you mean by running out of memory.
When malloc() fails on most systems, it's because you've run out of address-space.
If most of that memory is taken by cacheing, or by mmap'd regions, you might be able to reclaim some of it by freeing your cache or unmmaping. However this really requires that you know what you're using that memory for- and as you've noticed either most programs don't, or it doesn't make a difference.
If you used setrlimit() on yourself (to protect against unforseen attacks, perhaps, or maybe root did it to you), you can relax the limit in your error handler. I do this very frequently- after prompting the user if possible, and logging the event.
On the other hand, catching stack overflow is a bit more difficult, and isn't portable. I wrote a posixish solution for ECL, and described a Windows implementation, if you're going this route. It was checked into ECL a few months ago, but I can dig up the original patches if you're interested.

Especially in garbage collected environments, it's quote likely that if you catch the OutOfMemory error at a high level of the application, lots of stuff has gone out of scope and can be reclaimed to give you back memory.
In the case of single excessive allocations, the app may be able to continue working flawlessly. Of course, if you have a gradual memory leak, you'll just run into the problem again (more likely sooner than later), but it's still a good idea to give the app a chance to go down gracefully, save unsaved changes in the case of a GUI app, etc.

Yes, OOM is recoverable. As an extreme example, the Unix and Windows operating systems recover quite nicely from OOM conditions, most of the time. The applications fail, but the OS survives (assuming there is enough memory for the OS to properly start up in the first place).
I only cite this example to show that it can be done.
The problem of dealing with OOM is really dependent on your program and environment.
For example, in many cases the place where the OOM happens most likely is NOT the best place to actually recover from an OOM state.
Now, a custom allocator could possibly work as a central point within the code that can handle an OOM. The Java allocator will perform a full GC before is actually throws a OOM exception.
The more "application aware" that your allocator is, the better suited it would be as a central handler and recovery agent for OOM. Using Java again, it's allocator isn't particularly application aware.
This is where something like Java is readily frustrating. You can't override the allocator. So, while you could trap OOM exceptions in your own code, there's nothing saying that some library you're using is properly trapping, or even properly THROWING an OOM exception. It's trivial to create a class that is forever ruined by a OOM exception, as some object gets set to null and "that never happen", and it's never recoverable.
So, yes, OOM is recoverable, but it can be VERY hard, particularly in modern environments like Java and it's plethora of 3rd party libraries of various quality.

The question is tagged "language-agnostic", but it's difficult to answer without considering the language and/or the underlying system. (I see several toher hadns
If memory allocation is implicit, with no mechanism to detect whether a given allocation succeeded or not, then recovering from an out-of-memory condition may be difficult or impossible.
For example, if you call a function that attempts to allocate a huge array, most languages just don't define the behavior if the array can't be allocated. (In Ada this raises a Storage_Error exception, at least in principle, and it should be possible to handle that.)
On the other hand, if you have a mechanism that attempts to allocate memory and is able to report a failure to do so (like C's malloc() or C++'s new), then yes, it's certainly possible to recover from that failure. In at least the cases of malloc() and new, a failed allocation doesn't do anything other than report failure (it doesn't corrupt any internal data structures, for example).
Whether it makes sense to try to recover depends on the application. If the application just can't succeed after an allocation failure, then it should do whatever cleanup it can and terminate. But if the allocation failure merely means that one particular task cannot be performed, or if the task can still be performed more slowly with less memory, then it makes sense to continue operating.
A concrete example: Suppose I'm using a text editor. If I try to perform some operation within the editor that requires a lot of memory, and that operation can't be performed, I want the editor to tell me it can't do what I asked and let me keep editing. Terminating without saving my work would be an unacceptable response. Saving my work and terminating would be better, but is still unnecessarily user-hostile.

This is a difficult question. On first sight it seems having no more memory means "out of luck" but, you must also see that one can get rid of many memory related stuff if one really insist. Let's just take the in other ways broken function strtok which on one hand has no problems with memory stuff. Then take as counterpart g_string_split from the Glib library, which heavily depends on allocation of memory as nearly everything in glib or GObject based programs. One can definitly say in more dynamic languages memory allocation is much more used as in more inflexible languages, especially C. But let us see the alternatives. If you just end the program if you run out of memory, even careful developed code may stop working. But if you have a recoverable error, you can do something about it. So the argument, making it recoverable means that one can choose to "handle" that situation differently (e.g putting aside a memory block for emergencies, or degradation to a less memory extensive program).
So the most compelling reason is. If you provide a way of recovering one can try the recoverying, if you do not have the choice all depends on always getting enough memory...
Regards

It's just puzzling me now.
At work, we have a bundle of applications working together, and memory is running low. While the problem is either make the application bundle go 64-bit (and so, be able to work beyond the 2 Go limits we have on a normal Win32 OS), and/or reduce our use of memory, this problem of "How to recover from a OOM" won't quit my head.
Of course, I have no solution, but still play at searching for one for C++ (because of RAII and exceptions, mainly).
Perhaps a process supposed to recover gracefully should break down its processing in atomic/rollback-able tasks (i.e. using only functions/methods giving strong/nothrow exception guarantee), with a "buffer/pool of memory" reserved for recovering purposes.
Should one of the task fails, the C++ bad_alloc would unwind the stack, free some stack/heap memory through RAII. The recovering feature would then salvage as much as possible (saving the initial data of the task on the disk, to use on a later try), and perhaps register the task data for later try.
I do believe the use of C++ strong/nothrow guanrantees can help a process to survive in low-available-memory conditions, even if it would be akin memory swapping (i.e. slow, somewhat unresponding, etc.), but of course, this is only theory. I just need to get smarter on the subject before trying to simulate this (i.e. creating a C++ program, with a custom new/delete allocator with limited memory, and then try to do some work under those stressful condition).
Well...

Out of memory normally means you have to quit whatever you were doing. If you are careful about cleanup, though, it can leave the program itself operational and able to respond to other requests. It's better to have a program say "Sorry, not enough memory to do " than say "Sorry, out of memory, shutting down."

Out of memory can be caused either by free memory depletion or by trying to allocate an unreasonably big block (like one gig). In "depletion" cases memory shortage is global to the system and usually affects other applications and system services and the whole system might become unstable so it's wise to forget and reboot. In "unreasonably big block" cases no shortage actually occurs and it's safe to continue. The problem is you can't automatically detect which case you're in. So it's safer to make the error non-recoverable and find a workaround for each case you encounter this error - make your program use less memory or in some cases just fix bugs in code that invokes memory allocation.

There are already many good answers here. But I'd like to contribute with another perspective.
Depletion of just about any reusable resource should be recoverable in general. The reasoning is that each and every part of a program is basically a sub program. Just because one sub cannot complete to it's end at this very point in time, does not mean that the entire state of the program is garbage. Just because the parking lot is full of cars does not mean that you trash your car. Either you wait a while for a booth to be free, or you drive to a store further away to buy your cookies.
In most cases there is an alternative way. Making an out of error unrecoverable, effectively removes a lot of options, and none of us like to have anyone decide for us what we can and cannot do.
The same applies to disk space. It's really the same reasoning. And contrary to your insinuation about stack overflow is unrecoverable, i would say that it's and arbitrary limitation. There is no good reason that you should not be able to throw an exception (popping a lot of frames) and then use another less efficient approach to get the job done.
My two cents :-)

If you are really out of memory you are doomed, since you can not free anything anymore.
If you are out of memory, but something like a garbage collector can kick in and free up some memory you are non dead yet.
The other problem is fragmentation. Although you might not be out of memory (fragmented), you might still not be able to allocate the huge chunk you wanna have.

I know you asked for arguments for, but I can only see arguments against.
I don't see anyway to achieve this in a multi-threaded application. How do you know which thread is actually responsible for the out-of-memory error? One thread could allocating new memory constantly and have gc-roots to 99% of the heap, but the first allocation that fails occurs in another thread.
A practical example: whenever I have occurred an OutOfMemoryError in our Java application (running on a JBoss server), it's not like one thread dies and the rest of the server continues to run: no, there are several OOMEs, killing several threads (some of which are JBoss' internal threads). I don't see what I as a programmer could do to recover from that - or even what JBoss could do to recover from it. In fact, I am not even sure you CAN: the javadoc for VirtualMachineError suggests that the JVM may be "broken" after such an error is thrown. But maybe the question was more targeted at language design.

uClibc has an internal static buffer of 8 bytes or so for file I/O when there is no more memory to be allocated dynamically.

What is the compelling argument for making it a recoverable error?
In Java, a compelling argument for not making it a recoverable error is because Java allows OOM to be signalled at any time, including at times where the result could be your program entering an inconsistent state. Reliable recoery from an OOM is therefore impossible; if you catch the OOM exception, you can not rely on any of your program state. See
No-throw VirtualMachineError guarantees

I'm working on SpiderMonkey, the JavaScript VM used in Firefox (and gnome and a few others). When you're out of memory, you may want to do any of the following things:
Run the garbage-collector. We don't run the garbage-collector all the time, as it would kill performance and battery, so by the time you're reaching out of memory error, some garbage may have accumulated.
Free memory. For instance, get rid of some of the in-memory cache.
Kill or postpone non-essential tasks. For instance, unload some tabs that haven't be used in a long time from memory.
Log things to help the developer troubleshoot the out-of-memory error.
Display a semi-nice error message to let the user know what's going on.
...
So yes, there are many reasons to handle out-of-memory errors manually!

I have this:
void *smalloc(size_t size) {
void *mem = null;
for(;;) {
mem = malloc(size);
if(mem == NULL) {
sleep(1);
} else
break;
}
return mem;
}
Which has saved a system a few times already. Just because you're out of memory now, doesn't mean some other part of the system or other processes running on the system have some memory they'll give back soon. You better be very very careful before attempting such tricks, and have all control over every memory you do allocate in your program though.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008