How do sites like codepad.org and ideone.com sandbox your program? - language-agnostic

I need to compile and run user-submitted scripts on my site, similar to what codepad and ideone do. How can I sandbox these programs so that malicious users don't take down my server?
Specifically, I want to lock them inside an empty directory and prevent them from reading or writing anywhere outside of that, from consuming too much memory or CPU, or from doing anything else malicious.
I will need to communicate with these programs via pipes (over stdin/stdout) from outside the sandbox.

codepad.org has something based on geordi, which runs everything in a chroot (i.e. restricted to a subtree of the filesystem) with resource restrictions, and uses the ptrace API to restrict the untrusted program's use of system calls. See http://codepad.org/about .
I've previously used Systrace, another utility for restricting system calls.
If the policy is set up properly, the untrusted program would be prevented from breaking anything in the sandbox or accessing anything it shouldn't, so there might be no need to put programs in separate chroots and create and delete them for each run. That said, separate chroots would provide another layer of protection, which probably wouldn't hurt.

Some time ago I was searching for a sandbox solution to use in an automated assignment evaluation system for CS students. Much like everything else, there is a trade-off between the various properties:
Isolation and access control granularity
Performance and ease of installation/configuration
I eventually decided on a multi-tiered architecture, based on Linux:
Level 0 - Virtualization:
By using one or more virtual machine snapshots for all assignments within a specific time range, it was possible to gain several advantages:
Clear separation of sensitive from non-sensitive data.
At the end of the period (e.g. once per day or after each session) the VM is shut down and restarted from the snapshot, thus removing any remnants of malicious or rogue code.
A first level of computer resource isolation: each VM has limited disk, CPU and memory resources and the host machine is not directly accessible.
Straightforward network filtering: By having the VM on an internal interface, the firewall on the host can selectively filter the network connections.
For example, a VM intended for testing students of an introductory programming course could have all incoming and outgoing connections blocked, since students at that level would not have network programming assignments. At higher levels the corresponding VMs could, for example, have all outgoing connections blocked and allow incoming connections only from within the faculty.
It would also make sense to have a separate VM for the Web-based submission system - one that could upload files to the evaluation VMs, but do little else.
Level 1 - Basic operating-system constraints:
On a Unix OS that would contain the traditional access and resource control mechanisms:
Each sandboxed program could be executed as a separate user, perhaps in a separate chroot jail.
Strict user permissions, possibly with ACLs.
ulimit resource limits on processor time and memory usage.
Execution under nice to reduce priority over more critical processes. On Linux you could also use ionice and cpulimit - I am not sure what equivalents exist on other systems.
Disk quotas.
Per-user connection filtering.
You would probably want to run the compiler as a slightly more privileged user, with more memory and CPU time and with access to compiler tools, header files, etc.
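As a rough illustration of the ulimit and nice items above, here is a minimal Python sketch that starts a child process with CPU-time and address-space limits and talks to it over stdin/stdout pipes. The function name `run_limited` and the default limit values are assumptions made for this example. It deliberately leaves out the separate user and the chroot jail, since both require root, so on its own it is not a sandbox.

```python
import os
import resource
import subprocess
import sys


def _cap(kind, value):
    """Lower one resource limit, never asking for more than the hard limit."""
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_limited(argv, stdin_text, cpu_seconds=2,
                memory_bytes=1024 * 1024 * 1024, wall_seconds=10):
    """Run argv under resource limits, feeding stdin_text through a pipe."""

    def apply_limits():
        # Runs in the child between fork() and exec(); the limits are
        # inherited by anything the untrusted program spawns.
        os.nice(10)                                   # lower scheduling priority
        _cap(resource.RLIMIT_CPU, cpu_seconds)        # CPU time, in seconds
        _cap(resource.RLIMIT_AS, memory_bytes)        # virtual address space

    return subprocess.run(
        argv,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=wall_seconds,    # wall-clock guard for programs that just sleep
        preexec_fn=apply_limits,
    )
```

For example, `run_limited([sys.executable, "-c", "print(input().upper())"], "hello\n")` returns a completed process whose `stdout` holds the child's reply. A program that exceeds the CPU limit is killed by a signal, which shows up as a negative `returncode`.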
Level 2 - Advanced operating-system constraints:
On Linux I consider that to be the use of a Linux Security Module, such as AppArmor or SELinux to limit access to specific files and/or system calls. Some Linux distributions offer some sandboxing security profiles, but it can still be a long and painful process to get something like this working correctly.
Level 3 - User-space sandboxing solutions:
I have successfully used Systrace on a small scale, as mentioned in this older answer of mine. There are several other sandboxing solutions for Linux, such as libsandbox. Such solutions may provide more fine-grained control over the system calls that may be used than LSM-based alternatives, but can have a measurable impact on performance.
Level 4 - Preemptive strikes:
Since you will be compiling the code yourself, rather than executing existing binaries, you have a few additional tools in your hands:
Restrictions based on code metrics; e.g. a simple "Hello World" program should never be larger than 20-30 lines of code.
Selective access to system libraries and header files; if you don't want your users to call connect() you might just restrict access to socket.h.
Static code analysis; disallow assembly code, "weird" string literals (i.e. shell-code) and the use of restricted system functions.
A competent programmer might be able to get around such measures, but as the cost-to-benefit ratio increases they would be far less likely to persist.
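The code-metric and static-analysis checks above could start out as a plain textual scan, sketched below in Python for C submissions. The deny-list, the line limit and the name `check_submission` are all hypothetical. As the answer itself says, this is a deterrent rather than a security boundary, and a purely textual check like this one is easy to evade.

```python
import re

# Hypothetical deny-list; a real policy would be tuned per assignment.
BANNED_CALLS = {"system", "fork", "execve", "connect", "socket"}
ASM_PATTERN = re.compile(r"\b(asm|__asm__|__asm)\b")
CALL_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def check_submission(source, max_lines=30):
    """Return a list of human-readable policy violations for C source text."""
    problems = []
    lines = source.splitlines()
    if len(lines) > max_lines:
        problems.append(f"too long: {len(lines)} lines (limit {max_lines})")
    for number, line in enumerate(lines, start=1):
        if ASM_PATTERN.search(line):
            problems.append(f"line {number}: inline assembly")
        for name in CALL_PATTERN.findall(line):
            if name in BANNED_CALLS:
                problems.append(f"line {number}: call to {name}()")
    return problems
```

A submission is accepted for compilation only if the returned list is empty. The other levels still have to contain whatever slips past this scan.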
Level 0-5 - Monitoring and logging:
You should be monitoring the performance of your system and logging all failed attempts. Not only would you be more likely to interrupt an in-progress attack at a system level, but you might be able to make use of administrative means to protect your system, such as:
calling whatever security officials are in charge of such issues.
finding that persistent little hacker of yours and offering them a job.
The degree of protection that you need and the resources that you are willing to expend to set it up are up to you.

I am the developer of libsandbox mentioned by @thkala, and I do recommend it for use in your project.
Some additional comments on @thkala's answer:
it is fair to classify libsandbox as a user-land tool, but libsandbox does integrate standard OS-level security mechanisms (i.e. chroot, setuid, and resource quota);
restricting access to C/C++ headers, or static analysis of users' code, does NOT prevent system functions like connect() from being called. This is because user code can (1) declare function prototypes itself without including system headers, or (2) invoke the underlying kernel-land system calls without touching the wrapper functions in libc;
compile-time protection also deserves attention, because malicious C/C++ code can exhaust your CPU with infinite template recursion or preprocessor macro expansion.

In Autosar, how do we decide which SWC/BSW should go to trusted OS Application and which SWC/BSW should go to Non Trusted OS application?

I am trying to design an AUTOSAR system for a Cluster application. In the OS, I can see that there are Trusted OS applications and Non-Trusted OS applications. I am not able to understand the difference between these. I am also unable to decide which SWCs and BSW modules should be made Trusted and which ones Non-Trusted. Kindly help.
I believe these partitions are there to save you development costs, since it's easier to develop QM components.
You should place only ASIL-D (trusted) SwCs in the trusted partition; SwCs with lower safety requirements should be allocated to the non-trusted partition.
If a trusted SwC has a connector to a non-trusted component, it must be prepared for the physical or internal constraints of that interface's input data being violated.
From AUTOSAR_SWS_OS:
Trusted: An OS-Application that may be executed in privileged mode and
may have unrestricted access to the API and hardware
resources. Only trusted applications can provide trusted
functions.
Non-trusted: An OS-Application that is executed in non-privileged mode has
restricted access to the API and hardware resources.
This gets interesting if you have a mixed criticality application, usually because of different ASILs assigned by functional safety according to ISO26262.
Either you would have to develop the whole AUTOSAR stack and all SWCs according to the highest ASIL, or you would have to ensure the freedom from interference between them.
I am assuming the case that you want to use this freedom from interference way now:
To get there, first you need the functional safety requirements for your software, including their ASIL level.
Then you distribute the implementation of these functional safety requirements across your SWCs, including Runnables.
The ASIL is assigned at SWC level (unless another possibility came up recently that I missed).
Then you split the SWCs with different ASILs into different applications, each having memory and/or timing protection (to be defined together with your functional safety manager).
If, according to the mapping in your software architecture, the BSW modules used also carry functional safety responsibility, they also need to be developed according to the respective ASIL. Whether they are, you can learn from the functional safety case provided with your AUTOSAR stack. If there is none, your AUTOSAR stack is just for QM and not suited for functional safety applications.
Most of the stuff can theoretically run in non-trusted applications, because there are guarded APIs.
Check your BSW/MCAL documentation for where this does not work.
Also you might need to change this to reduce resource consumption.
Depending on your processor and the chosen protections, this can take a lot of resources or introduce delays.

Are there any Python 2.7 alternatives to ZeroMQ that are released under the BSD or MIT license?

I am seeking Python 2.7 alternatives to ZeroMQ that are released under the BSD or MIT license. I am looking for something that supports request-reply and pub-sub messaging patterns. I can serialize the data myself if necessary. I found Twisted from Twisted Matrix Labs but it appears to require a blocking event loop, i.e. reactor.run(). I need a library that will run in the background and let my application check messages upon certain events. Are there any other alternatives?
Give nanomsg, a ZeroMQ younger sister, a try - same father, same beauty
Yes, it is licensed under MIT/X11 license.
Yes, REQ/REP - allows to build clusters of stateless services to process user requests
Yes, PUB/SUB - distributes messages to large sets of interested subscribers
Has several Python bindings available
https://github.com/tonysimpson/nanomsg-python (recommended)
https://github.com/sdiehl/pynanomsg
https://github.com/djc/nnpy
Differences between nanomsg and ZeroMQ
(State as of 2014/11, v0.5-beta; courtesy of nanomsg.org.)
Licensing
nanomsg library is MIT-licensed. What it means is that, unlike with ZeroMQ, you can modify the source code and re-release it under a different license, as a proprietary product, etc. More reasoning about the licensing can be found here.
POSIX Compliance
ZeroMQ API, while modeled on BSD socket API, doesn't match the API fully. nanomsg aims for full POSIX compliance.
Sockets are represented as ints, not void pointers.
Contexts, as known in ZeroMQ, don't exist in nanomsg. This means simpler API (sockets can be created in a single step) as well as the possibility of using the library for communication between different modules in a single process (think of plugins implemented in different languages speaking each to another). More discussion can be found here.
Sending and receiving functions ( nn_send, nn_sendmsg, nn_recv and nn_recvmsg ) fully match POSIX syntax and semantics.
Implementation Language
The library is implemented in C instead of C++.
From user's point of view it means that there's no dependency on C++ runtime (libstdc++ or similar) which may be handy in constrained and embedded environments.
From nanomsg developer's point of view it makes life easier.
Number of memory allocations is drastically reduced as intrusive containers are used instead of C++ STL containers.
The above also means less memory fragmentation, less cache misses, etc.
More discussion on the C vs. C++ topic can be found here and here.
Pluggable Transports and Protocols
In ZeroMQ there was no formal API for plugging in new transports (think WebSockets, DCCP, SCTP) and new protocols (counterparts to REQ/REP, PUB/SUB, etc.) As a consequence there were no new transports added since 2008. No new protocols were implemented either. The formal internal transport API (see transport.h and protocol.h) are meant to mitigate the problem and serve as a base for creating and experimenting with new transports and protocols.
Please, be aware that the two APIs are still new and may experience some tweaking in the future to make them usable in wide variety of scenarios.
nanomsg implements a new SURVEY protocol. The idea is to send a message ("survey") to multiple peers and wait for responses from all of them. For more details check the article here. Also look here.
In financial services it is quite common to use "deliver messages from anyone to everyone else" kind of messaging. To address this use case, there's a new BUS protocol implemented in nanomsg. Check the details here.
Threading Model
One of the big architectural blunders I've done in ZeroMQ is its threading model. Each individual object is managed exclusively by a single thread. That works well for async objects handled by worker threads, however, it becomes a trouble for objects managed by user threads. The thread may be used to do unrelated work for arbitrary time span, e.g. an hour, and during that time the object being managed by it is completely stuck. Some unfortunate consequences are: inability to implement request resending in REQ/REP protocol, PUB/SUB subscriptions not being applied while application is doing other work, and similar. In nanomsg the objects are not tightly bound to particular threads and thus these problems don't exist.
REQ socket in ZeroMQ cannot be really used in real-world environments, as they get stuck if message is lost due to service failure or similar. Users have to use XREQ instead and implement the request re-trying themselves. With nanomsg, the re-try functionality is built into REQ socket.
In nanomsg, both REQ and REP support cancelling the ongoing processing. Simply send a new request without waiting for a reply (in the case of REQ socket) or grab a new request without replying to the previous one (in the case of REP socket).
In ZeroMQ, due to its threading model, bind-first-then-connect-second scenario doesn't work for inproc transport. It is fixed in nanomsg.
For similar reasons auto-reconnect doesn't work for inproc transport in ZeroMQ. This problem is fixed in nanomsg as well.
Finally, nanomsg attempts to make nanomsg sockets thread-safe. While using a single socket from multiple threads in parallel is still discouraged, the way in which ZeroMQ sockets failed randomly in such circumstances proved to be painful and hard to debug.
State Machines
Internal interactions inside the nanomsg library are modeled as a set of state machines. The goal is to avoid the incomprehensible shutdown mechanism as seen in ZeroMQ and thus make the development of the library easier.
For more discussion see here and here.
IOCP Support
One of the long-standing problems in ZeroMQ was that internally it uses BSD socket API even on Windows platform where it is a second class citizen. Using IOCP instead, as appropriate, would require major rewrite of the codebase and thus, in spite of multiple attempts, was never implemented. IOCP is supposed to have better performance characteristics and, even more importantly, it allows to use additional transport mechanisms such as NamedPipes which are not accessible via BSD socket API. For these reasons nanomsg uses IOCP internally on Windows platforms.
Level-triggered Polling
One of the aspects of ZeroMQ that proved really confusing for users was the ability to integrate ZeroMQ sockets into an external event loops by using ZMQ_FD file descriptor. The main source of confusion was that the descriptor is edge-triggered, i.e. it signals only when there were no messages before and a new one arrived. nanomsg uses level-triggered file descriptors instead that simply signal when there's a message available irrespective of whether it was available in the past.
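The difference between the two triggering modes can be seen without either library. The sketch below is a Linux-only illustration using Python's `select.epoll` on an ordinary pipe. It is not how ZeroMQ or nanomsg are implemented, and the helper name `count_ready_events` is made up for this example. It writes one message, never reads it, and counts how many consecutive polls report the descriptor as readable.

```python
import os
import select


def count_ready_events(edge_triggered, polls=3):
    """Write once to a pipe, then poll several times without reading."""
    read_fd, write_fd = os.pipe()
    poller = select.epoll()
    flags = select.EPOLLIN
    if edge_triggered:
        flags |= select.EPOLLET      # report only the transition to "readable"
    poller.register(read_fd, flags)
    os.write(write_fd, b"msg")

    hits = 0
    for _ in range(polls):
        if poller.poll(timeout=0.1):  # list of (fd, event) pairs, empty on timeout
            hits += 1

    poller.close()
    os.close(read_fd)
    os.close(write_fd)
    return hits
```

In level-triggered mode every poll reports the unread message. In edge-triggered mode only the first poll does, so an event loop that does not drain the socket on that first notification is never woken for that message again.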
Routing Priorities
nanomsg implements priorities for outbound traffic. You may decide that messages are to be routed to a particular destination in preference, and fall back to an alternative destination only if the primary one is not available.
For more discussion see here.
TCP Transport Enhancements
There's a minor enhancement to TCP transport. When connecting, you can optionally specify the local interface to use for the connection, like this:
nn_connect (s, "tcp://eth0;192.168.0.111:5555").
Asynchronous DNS
DNS queries (e.g. converting hostnames to IP addresses) are done in asynchronous manner. In ZeroMQ such queries were done synchronously, which meant that when DNS was unavailable, the whole library, including the sockets that haven't used DNS, just hung.
Zero-Copy
While ZeroMQ offers a "zero-copy" API, it's not true zero-copy. Rather it's "zero-copy till the message gets to the kernel boundary". From that point on data is copied as with standard TCP. nanomsg, on the other hand, aims at supporting true zero-copy mechanisms such as RDMA (CPU bypass, direct memory-to-memory copying) and shmem (transfer of data between processes on the same box by using shared memory). The API entry points for zero-copy messaging are nn_allocmsg and nn_freemsg functions in combination with NN_MSG option passed to send/recv functions.
Efficient Subscription Matching
In ZeroMQ, simple tries are used to store and match PUB/SUB subscriptions. The subscription mechanism was intended for up to 10,000 subscriptions where simple trie works well. However, there are users who use as much as 150,000,000 subscriptions. In such cases there's a need for a more efficient data structure. Thus, nanomsg uses memory-efficient version of Patricia trie instead of simple trie.
For more details check this article.
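For readers unfamiliar with the "simple trie" mentioned above, here is a minimal Python sketch of prefix-subscription matching with an uncompressed trie. It is illustrative only and is not code from ZeroMQ or nanomsg; the class and method names are assumptions.

```python
class SubscriptionTrie:
    """Plain (uncompressed) byte trie holding PUB/SUB prefix subscriptions."""

    def __init__(self):
        self.children = {}       # next byte -> child node
        self.terminal = False    # True if a subscription ends at this node

    def subscribe(self, prefix: bytes):
        node = self
        for byte in prefix:
            node = node.children.setdefault(byte, SubscriptionTrie())
        node.terminal = True

    def matches(self, message: bytes) -> bool:
        """True if any stored subscription is a prefix of message."""
        node = self
        if node.terminal:
            return True          # the empty subscription matches everything
        for byte in message:
            node = node.children.get(byte)
            if node is None:
                return False
            if node.terminal:
                return True
        return False
```

This layout spends one node per byte of every subscription. A Patricia trie collapses chains of single-child nodes into one edge, which is where the memory saving for very large subscription sets comes from.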
Unified Buffer Model
ZeroMQ has a strange double-buffering behaviour. Both the outgoing and incoming data is stored in a message queue and in TCP's tx/rx buffers. What it means, for example, is that if you want to limit the amount of outgoing data, you have to set both ZMQ_SNDBUF and ZMQ_SNDHWM socket options. Given that there's no semantic difference between the two, nanomsg uses only TCP's (or equivalent's) buffers to store the data.
Scalability Protocols
Finally, on philosophical level, nanomsg aims at implementing different "scalability protocols" rather than being a generic networking library. Specifically:
Different protocols are fully separated, you cannot connect REQ socket to SUB socket or similar.
Each protocol embodies a distributed algorithm with well-defined prerequisites (e.g. "the service has to be stateless" in case of REQ/REP) and guarantees (if REQ socket stays alive request will be ultimately processed).
Partial failure is handled by the protocol, not by the user. In fact, it is transparent to the user.
The specifications of the protocols are in /rfc subdirectory.
The goal is to standardise the protocols via IETF.
There's no generic UDP-like socket (ZMQ_ROUTER), you should use L4 protocols for that kind of functionality.

Realtime synchronization of live data over network

How do you sync data between two processes (say client and server) in real time over network?
I have various documents/datasets constructed on the server, which are downloaded and displayed by clients. Once downloaded, the document receives continuous updates in order to remain fresh.
It seems to be a simple and commonly occurring concept, but I cannot find any tools that provide this level of abstraction. I am not even sure what I am looking for. Perhaps there is a similar concept with solid tool support? Perhaps there is a chain of different tools that must be put together? Here's what I have considered so far:
I am required to propagate every change in a single hop (0.5 RTT), which rules out polling (typically >10 RTT) and cache invalidation techniques (1.5 RTT).
Data replication and simple notification broadcasts are not an option, because there is too much data and too many changes. Clients must be able to select specific documents to download and monitor for changes.
I am currently using a message-passing pattern, which does the job, but it is hopelessly unproductive. It works at far too low a level of abstraction. It is laborious, error-prone, and it doesn't scale well with increasing application complexity.
HTTP and other RPC-like techniques are good for the initial fetch, but they encourage polling for subsequent synchronization. When performing reverse requests (from data source to data consumer), change notifications are possible, but it's even more complicated than message passing.
Combining RPC (for the initial fetch) with message passing (for updates) turned out to be a nightmare due to the complexity involved in coordinating communication over the two parallel connections as well as due to the impedance mismatch between the two paradigms. I need something unified.
WebSocket & Comet are popular methods to implement change notification, but they need additional libraries to be productive and I am not aware of any libraries suitable for my application.
Message queues merely put an intermediary on the network while maintaining the basic message-passing pattern. Custom message filters/routers allow me to get closer to the live document concept, but I feel like I am implementing a custom middleware layer on top of the MQ.
I have tons of additional requirements (native observable data structure API on both ends, incremental updates, custom message filters, custom connection routing, cross-platform, robustness & scalability), but before considering those requirements, I need to find some tools that at least attempt to do what I need. I am trying to avoid in-house frameworks for the standard reasons - cost, time to market, long-term maintenance, and keeping developers happy.
My conclusion at the moment is that there is no such live document synchronization framework. An in-house solution is the way to go, but many existing components can be used as part of the solution.
It is pretty simple to layer live document logic on top of WebSocket or any other message-passing platform. The server just sends the document as a separate message when the connection is initiated and then again after every change. Automated reconnection and some connection monitoring must be added to handle network failures.
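A minimal in-process sketch of that layering, in Python, might look as follows. The `send` callback stands in for "write one message to this client's WebSocket connection"; the class name, the JSON encoding and the full-snapshot-per-change policy are assumptions for illustration. Reconnection, monitoring and incremental updates are left out.

```python
import json


class LiveDocument:
    """Server-side document that pushes a snapshot on subscribe and on change."""

    def __init__(self, data):
        self._data = dict(data)
        self._subscribers = []

    def subscribe(self, send):
        # The initial fetch travels over the same channel as later updates,
        # so there is no separate RPC connection to coordinate with.
        self._subscribers.append(send)
        send(self._snapshot())

    def update(self, **changes):
        self._data.update(changes)
        message = self._snapshot()
        for send in list(self._subscribers):
            send(message)         # pushed immediately: one hop, no polling

    def _snapshot(self):
        return json.dumps(self._data, sort_keys=True)
```

With `received = []`, calling `doc.subscribe(received.append)` delivers the current state at once, and each later `doc.update(...)` appends one more message to `received`.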
Serialization at both ends is a separate problem targeted by many existing libraries. Detecting changes in server-side data structures (needed to initiate push) is yet another separate problem that has its own set of patterns and tools. Incremental updates and many other issues can be solved by intermediaries intercepting the connection.
This approach will work with current technology at the cost of extensive in-house glue code. It can be incrementally substituted with standard components as they become available.
WebSocket already includes resource URIs, routing, and a few other nice features. Useful intermediaries and libraries will likely emerge in the future. HTTP with text/event-stream MIME type is a possible future alternative to WebSocket. The advantage of HTTP is that existing tools can be reused with little modification.
I've completely thrown away the pattern of combining RPC pull with separate push channel despite rich tool support. Pushing everything in 0.5 RTT requires the push channel to use exactly the same technology as the pull channel, i.e. reverse RPC. Reverse RPC is like message passing except it introduces redundant returns, throws away useful connection semantics, and makes it hard to insert content-agnostic intermediaries into the stream.

What tools do distributed programmers lack?

I have a dream to improve the world of distributed programming :)
In particular, I'm feeling a lack of necessary tools for debugging, monitoring, understanding and visualizing the behavior of distributed systems (heck, I had to write my own logger and visualizers to satisfy my requirements), and I'm writing a couple of such tools in my free time.
Community, what tools do you lack with this regard? Please describe one per answer, with a rough idea of what the tool would be supposed to do. Others can point out the existence of such tools, or someone might get inspired and write them.
OK, let me start.
A distributed logger with a high-precision global time axis: one that can register events from different machines in a distributed system with high precision, independent of clock offset and drift, and with sufficient scalability to handle the load of several hundred machines and several thousand logging processes. Such a logger makes it possible to find transport-level latency bottlenecks in a distributed system by seeing, for example, how many milliseconds it actually takes for a message to travel from the publisher to the subscriber through a message queue.
Syslog is not ok because it's not scalable enough - 50000 logging events per second will be too much for it, and timestamp precision will suffer greatly under such load.
Facebook's Scribe is not ok because it doesn't provide a global time axis.
Actually, both syslog and Scribe register events under arrival timestamps, not under occurrence timestamps.
Honestly, I don't lack such a tool - I've written one for myself, I'm greatly pleased with it and I'm going to open-source it. But others might.
P.S. I've open-sourced it: http://code.google.com/p/greg
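One common building block for a global time axis is an NTP-style offset estimate between each logging client and the log server. The Python sketch below shows only that arithmetic. It is not taken from Greg and may differ from what Greg actually does; the function names are assumptions, and the estimate is only as good as the assumption that network delay is roughly symmetric.

```python
def estimate_offset(t_send, t_server_recv, t_server_send, t_recv):
    """NTP-style offset of the server clock relative to the local clock.

    t_send and t_recv are read from the local clock, the other two from the
    server clock. Assumes the outbound and return delays are about equal.
    """
    return ((t_server_recv - t_send) + (t_server_send - t_recv)) / 2.0


def round_trip_delay(t_send, t_server_recv, t_server_send, t_recv):
    """Time spent on the wire, excluding the server's processing time."""
    return (t_recv - t_send) - (t_server_send - t_server_recv)


def to_global_time(local_timestamp, offset):
    """Map a local occurrence timestamp onto the server's time axis."""
    return local_timestamp + offset
```

A client would repeat the exchange periodically to track drift, prefer samples with a small round-trip delay, and stamp each event at the moment it occurs using `to_global_time` rather than relying on the arrival time at the server.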
Dear Santa, I would like visualizations of the interactions between components in the distributed system.
I would like a visual representation showing:
The interactions among components, either as a UML collaboration diagram or sequence diagram.
Component shutdown and startup times as self-interactions.
On which hosts components are currently running.
Location of those hosts, if available, within a building or geographically.
Host shutdown and startup times.
I would like to be able to:
Filter the components and/or interactions displayed to show only those of interest.
Record interactions.
Display a desired range of time in a static diagram.
Play back the interactions in an animation, with typical video controls for playing, pausing, rewinding, fast-forwarding.
I've been a good developer all year, and would really like this.
Then again, see this question - How to visualize the behavior of many concurrent multi-stage processes?.
(I'm shamelessly referring to my own stuff, but that's because the problems solved by this stuff were important for me, and the current question is precisely about problems that are important for someone).
You could have a look at some of the tools that come with Erlang/OTP. It doesn't have all the features other people suggested, but some of them are quite handy, and built with a lot of experience. Some of these are, for instance:
Debugger that can debug concurrent processes, also remotely, AFAIR
Introspection tools for mnesia/ets tables as well as process heaps
Message tracing
Load monitoring on local and remote nodes
Distributed logging and error report system
Profiler which works for distributed scenarios
Process/task/application manager for distributed systems
These come of course in addition to the base features the platform provides, like node discovery, IPC protocol, RPC protocols & services, transparent distribution, distributed built-in database storage, global and node-local registry for process names and all the other underlying stuff that makes the platform tick.
I think this is a great question and here's my 0.02 on a tool I would find really useful.
One of the challenges I find with distributed programming is in the deployment of code to multiple machines. Quite often these machines may have slightly varying configuration or worse have different application settings.
The tool I have in mind would be one that could on demand reach out to all the machines on which the application is deployed and provide system information. If one specifies a settings file or a resource like a registry, it would provide the list for all the machines. It could also look at the user access privileges for the users running the application.
A refinement would be to provide indications when settings are not matching a master list provided by the developer. It could also indicate servers that have differing configurations and provide diff functionality.
This would be really useful for .NET applications since there are so many configurations (machine.config, application.config, IIS Settings, user permissions, etc) that the chances of varying configurations are high.
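The comparison step of such a tool is straightforward once the settings have been collected. The Python sketch below assumes each machine's settings have already been gathered into a dictionary (the remote collection itself is out of scope here); the function name `find_drift` and the report shape are made up for illustration.

```python
def find_drift(master, machines):
    """Compare each machine's settings against a master list.

    master:   {setting: expected_value}
    machines: {hostname: {setting: actual_value}}
    Returns   {hostname: {setting: (expected, actual)}} for mismatches only;
              a setting that is absent on a machine is reported as None.
    """
    report = {}
    for host, settings in machines.items():
        diffs = {
            key: (expected, settings.get(key))
            for key, expected in master.items()
            if settings.get(key) != expected
        }
        if diffs:
            report[host] = diffs
    return report
```

Machines that match the master list do not appear in the report at all, so an empty result means the deployment is consistent with what the developer specified.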
In my opinion, what is missing is a distributed programming platform...a platform that makes application programming over distributed systems as transparent as non-distributed programming is now.
Isn't it a bit early to work on tools when we don't even agree on a platform? We have several flavors of actor models, virtual shared memory, UMA, NUMA, synchronous dataflow, tagged token dataflow, multi-hierarchical memory vector processors, clusters, message passing mesh or network-on-a-chip, PGAS, DGAS, etc.
Feel free to add more.
To contribute:
I find myself writing a lot of distributed programs by constructing a DAG, which gets transformed into platform-specific code. Every platform optimization is a different set of transformation rules on this DAG. You can see the same happening in Microsoft's Accelerator and Dryad, Intel's Concurrent Collections, MIT's StreamIt, etc.
A language-agnostic library that collects all these DAG transformations would save re-inventing the wheel every time.
You can also take a look at Akka:
http://akka.io
Let me notify those who've favourited this question by pointing to the Greg logger - http://code.google.com/p/greg . It is the distributed logger with a high-precision global time axis that I've talked about in the other answer in this thread.
Apart from the mentioned tool for "visualizing the behavior of many concurrent multi-stage processes" (splot), I've also written "tplot" which is appropriate for displaying quantitative patterns in logs.
A large presentation about both tools, with lots of pretty pictures here.

How to get your code ready for Loadbalancing

As we did this in the past, I'd like to gather useful information for everyone moving to load balancing, as there are issues which your code must be aware of.
We moved from one Apache server to Squid as reverse proxy/load balancer with three Apache servers behind it.
We are using PHP/MySQL, so issues may differ.
Things we had to solve:
Sessions
We moved from "default" php sessions (files) to distributed memcached-sessions. Simple solution, has to be done. This way, you also don't need "sticky sessions" on your loadbalancer.
Caching
To our non-distributed apc-cache per webserver, we added another memcached layer for distributed object caching, and replaced all old/outdated file-caching systems with it.
Uploads
Uploads go to a shared (nfs) folder.
Things we optimized for speed:
Static Files
Our main NFS runs a lighttpd, serving (also user-uploaded) images. Squid is aware of that and never queries our apache-nodes for images, which gave a nice performance boost. Squid is also configured to cache those files in ram.
What did you do to get your code/project ready for loadbalancing, any other concerns for people thinking about this move, and which platform/language are you using?
When doing this:
For http nodes, I push hard for a single system image (ocfs2 is good for this) and use either pound or crossroads as a load balancer, depending on the scenario. Nodes should have a small local disk for swap and to avoid most (but not all) headaches of CDSLs.
Then I bring Xen into the mix. If you place a small, temporal amount of information on Xenbus (i.e. how much virtual memory Linux has actually promised to processes per VM aka Committed_AS) you can quickly detect a brain dead load balancer and adjust it. Oracle caught on to this too .. and is now working to improve the balloon driver in Linux.
After that I look at the cost of splitting the database usage for any given app across sqlite3 and whatever db the app wants natively, while realizing that I need to split the db so posix_fadvise() can do its job and not pollute kernel buffers needlessly. Since most DBMS services want to do their own buffering, you must also let them do their own clustering. This really dictates the type of DB cluster that I use and what I do to the balloon driver.
Memcache servers then boot from a skinny initrd, again while the privileged domain watches their memory and CPU use so it knows when to boot more.
The choice of heartbeat / takeover really depends on the given network and the expected usage of the cluster. It's hard to generalize that one.
The end result is typically 5 or 6 physical nodes with quite a bit of memory booting a virtual machine monitor + guests while attached to mirrored storage.
Storage is also hard to describe in general terms.. sometimes I use cluster LVM, sometimes not. The not will change when LVM2 finally moves away from its current string based API.
Finally, all of this coordination results in something like Augeas updating configurations on the fly, based on events communicated via Xenbus. That includes ocfs2 itself, or any other service where configurations just can't reside on a single system image.
This is really an application specific question .. can you give an example? I love memcache, but not everyone can benefit from using it, for instance. Are we reviewing your configuration or talking about best practices in general?
Edit:
Sorry for being so Linux-centric ... it's typically what I use when designing a cluster.