How to robustly, but minimally, distribute items across a peer-to-peer system - language-agnostic

If one has a peer-to-peer system that can be queried, one would like to
reduce the total number of queries across the network (by distributing "popular" items widely and "similar" items together)
avoid excess storage at each node
ensure good availability of even moderately rare items in the face of client downtime, hardware failure, and users leaving (possibly detecting rare items for archivists/historians)
avoid queries failing to find matches in the event of network partitions
Given these requirements:
Are there any standard approaches? If not, is there any respected, but experimental, research? I'm somewhat familiar with distribution schemes, but I haven't seen anything that really addresses learning for robustness.
Am I missing any obvious criteria?
Is anybody interested in working on/solving this problem? (If so, I'm happy to open-source part of a very lame simulator I threw together this weekend, and generally offer unhelpful advice).
#cdv: I've now watched the video and it is very good, and although I don't feel it quite gets to a pluggable distribution strategy, it's definitely 90% of the way there. The questions, however, highlight useful differences from this approach that address some of my further concerns, and give me some references to follow up on. Thus, I'm provisionally accepting your answer, although I consider the question open.

There are multiple systems out there with various aspects of what you seek, each making different compromises, including but not limited to:
Amazon's Dynamo: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
Kai: http://www.slideshare.net/takemaru/kai-an-open-source-implementation-of-amazons-dynamo-472179
Hadoop: http://hadoop.apache.org/core/docs/current/hdfs_design.html
Chord: http://pdos.csail.mit.edu/chord/
Beehive: http://www.cs.cornell.edu/People/egs/beehive/
and many others. After building a custom system along those lines, I let some of the building blocks out in open source form as well: http://code.google.com/p/distributerl/
(that's not a whole system, but a few libraries useful in building one)
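Most of the systems above (Dynamo, Kai, Chord) lean on some form of consistent hashing to spread items across peers while keeping reshuffling small when nodes join or leave, and Dynamo-style designs add replication on top for availability. As a rough illustration of the idea only, and not of how any of those systems actually implement it, here is a minimal Python sketch; the peer names, virtual-node count, and replication factor are made up for the example:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map an arbitrary string onto a large integer ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes and simple replication."""

    def __init__(self, nodes, vnodes=64, replicas=3):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((_hash(f"{node}#{i}"), node))
        self._ring.sort()

    def nodes_for(self, item_key: str):
        """Walk clockwise from the item's hash, collecting distinct nodes."""
        start = bisect.bisect(self._ring, (_hash(item_key),))
        chosen = []
        for i in range(len(self._ring)):
            _, node = self._ring[(start + i) % len(self._ring)]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.replicas:
                break
        return chosen

# Hypothetical usage: the node names are placeholders.
ring = ConsistentHashRing(["peer-a", "peer-b", "peer-c", "peer-d"])
print(ring.nodes_for("some-item-id"))  # e.g. ['peer-c', 'peer-a', 'peer-d']
```

Adding or removing a peer only moves the items whose ring positions fall between the affected virtual nodes, which is what keeps redistribution (and therefore query churn) small.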


So was that Data Structures & Algorithms course really useful after all?

I remember when I was in DSA I was like, wtf is O(n), and wondering where I would use it other than in grad school, or unless you're a PhD like Bloch. Somehow uses for it do pop up in business analysis, so I was wondering when you guys have had to call up your Big O skills to see how to write an algorithm, which data structure you used to fit, or whether you had to actually create a new data structure (like your own implementation of a splay tree or trie).
Understanding Data Structures has been fundamental to many of the projects I've worked on, and that goes beyond the ten minute song 'n dance one does when asked such a question in an interview situation.
Granted that modern environments with all sorts of collection classes can make light work of storing and accessing large amounts of data, but having an understanding that a particular problem is best solved with a particular data structure can be a great timesaver. And by "timesaver" I mean "the difference between something working and not working".
Honestly, being able to answer that stuff is my biggest criterion for taking interviewees seriously in an interview. Knowing how basic data structures work, basic O(n) analysis, and some light theory is really crucial to being able to write large applications successfully.
It's important in the interview because it's important in the job. I've worked with techs in the past who were self-taught, without taking the data structures course or reading a data structures book, and their code is occasionally bad in ways they should have seen coming.
If you don't know that n² is going to run slowly compared to n log n, you've got more to learn.
As for the latter half of the data structures course, it isn't generally applicable to most tech jobs, but if you ever do wind up needing it, you'll wish you had paid more attention.
Big-O notation is one of the basic notations used when describing algorithms implemented by a particular library. For example, all documentation on STL that I've seen describes various operations in terms of big-O, so naturally you have to e.g. understand the difference between O(1), O(log n) and O(n) to understand the implications of your choice of STL containers and algorithms. MSDN also does that for .NET classes, and IIRC Java documentation does that for standard Java classes. So, I'd say that knowing the notation is pretty much a requirement for understanding documentation of most popular frameworks out there.
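To make the documentation point concrete, here is a small, hedged Python illustration of the same reasoning you would apply to STL, .NET, or Java collections: membership testing is documented as roughly O(n) for a list and O(1) on average for a set, and that difference dominates once the data gets large.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case for the list: the probe value isn't present, so the whole
# list is scanned; the set does a constant-time hash lookup instead.
probe = -1
list_time = timeit.timeit(lambda: probe in as_list, number=100)
set_time = timeit.timeit(lambda: probe in as_set, number=100)

print(f"list membership: {list_time:.4f}s   set membership: {set_time:.6f}s")
```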
Sure (even though I'm a humble MS in EE -- no PhD, no CS, unlike my colleague Joshua Bloch), I write a lot of stuff that needs to be highly scalable (or components that may need to be reused in highly scalable apps), so big-O considerations are almost always at work in my design (and it's not hard to take them into account). The data structures I use are almost always from Python's simple but rich supply (which I did lend a hand in developing ;-); rarely is a totally custom one needed (rather than building on top of list, dict, etc.), but when it does happen (e.g. the bitvectors in my open source project gmpy), it's no big deal.
I was able to use B-Trees right when I learned about them in my algorithms class (that was about 15 years ago, when there were far fewer open source implementations available). And even later, knowledge about the differences between e.g. container classes came in handy...
Absolutely: even though stacks, queues, etc. are pretty straightforward, it helps to have been introduced to them in a disciplined fashion.
B-Trees and more advanced sorting are a bit more difficult, so learning them early was a big benefit, and I have indeed had to implement each of them at various points.
Finally, a few years back I created an algorithm for single-connected components that was significantly better than the one our signal-processing team was using, but I couldn't convince them it was better until I could show that it was O(n) complexity rather than O(n log n) (a rough sketch of that flavor of algorithm appears after this answer).
...just to name a few examples.
Of course, if you are content to remain a CRUD-system hacker with no real desire to do more than collect a paycheck, then it may not be necessary...
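For the connected-components point above, the original algorithm isn't described, but the standard way to get (near-)linear behaviour on this kind of problem is union-find with path compression. This is a hedged sketch of that general technique in Python, not the author's actual algorithm:

```python
def connected_components(num_nodes, edges):
    """Label connected components with union-find (near-linear in nodes + edges)."""
    parent = list(range(num_nodes))

    def find(x):
        # Path compression: point nodes closer to the root as we walk up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for a, b in edges:
        union(a, b)

    # Nodes with the same root label belong to the same component.
    return [find(x) for x in range(num_nodes)]

# Tiny made-up example: two components, {0, 1, 2} and {3, 4}.
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```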
I found my knowledge of data structures very useful when I needed to implement a customizable event-driven system about ten years ago. That's the biggie, but I use that sort of knowledge fairly frequently in lesser ways.
For me, knowing the exact algorithms has been... nice as background knowledge. However, the thing that's been the most useful is the more general background of having to pay attention to how different pieces of an algorithm interact. For instance, there can be places in code where moving one piece of code (e.g., hoisting it outside a loop) can make a huge difference in both time and space.
It's less the specific knowledge the course taught and more that it acted like several years of experience. The course took something whose variations might take years of pure "real world experience" to encounter (and have drilled into you) and condensed it.
The title of your question asks about data structures and algorithms, but the body of your question focuses on complexity analysis, so I'll focus on that too:
There are lots of programming jobs where being able to do complexity analysis is at least occasionally useful. See What career can I hope for if I like algorithms? for some examples of these.
I can think of several instances in my career where either I or a co-worker have discovered a piece of code where the (usually time, sometimes space) complexity was higher than it should have been, e.g. something that was quadratic or cubic when it could have been linear or n log(n). Such code would work fine when given small inputs, but on larger inputs would quickly become really slow or consume all available memory. Knowing alternative algorithms and data structures, their complexities, and also how to analyze the complexity to build new algorithms is vital in being able to correct these problems (or avoid them in the first place).
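A typical (hypothetical) instance of that quadratic-that-should-be-linear pattern looks something like this in Python; the fix is usually a better data structure rather than cleverer code:

```python
# Roughly O(n * m): for every record we scan the whole blocked list.
def flag_blocked_slow(records, blocked_ids):
    return [r for r in records if r["id"] in blocked_ids]  # blocked_ids is a list

# Roughly O(n + m): build a set once, then each lookup is ~O(1) on average.
def flag_blocked_fast(records, blocked_ids):
    blocked = set(blocked_ids)
    return [r for r in records if r["id"] in blocked]
```

Both versions return the same result on small inputs; only the second keeps working when the inputs grow.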
Networking is all I've used it for: in an implementation of traveling salesman.
Unfortunately I do a lot of "line of business" and "forms over data" apps, so most problems I work on can be solved by hammering together arrays, linked lists, and hash tables. However, I've had the chance to work my data structures magic here and there:
Due to weird complex business rules, I worked on an application which used a custom thread pool implemented as a leftist-heap.
My dev team struggled to write a complex multithreaded app. It was plagued with race conditions, deadlocks, and lousy performance due to very fine-grained locking. We re-worked the code so that threads no longer shared mutable state, opting instead to write a very light-weight wrapper to facilitate message passing. Once we converted our linked lists and hash tables to immutable stacks and immutable red-black trees, we had no more problems with thread safety or performance. The resulting code was immaculate and surprisingly readable.
Frequently, a business rules engine requires you to roll your own state machine, which is very naturally modelled as a graph where vertices are states and edges are transitions between states (a minimal sketch follows below).
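As a hedged illustration of that last point (the states and events are invented for the example, not from any real rules engine), such a state machine can be as plain as a dictionary-shaped graph:

```python
# Hypothetical order-processing states (vertices) and transitions (labelled edges).
TRANSITIONS = {
    ("draft", "submit"): "pending_approval",
    ("pending_approval", "approve"): "approved",
    ("pending_approval", "reject"): "draft",
    ("approved", "ship"): "shipped",
}

def apply_event(state, event):
    """Follow the edge labelled `event` out of `state`, or fail loudly."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for event {event!r} in state {state!r}")

state = "draft"
for event in ["submit", "approve", "ship"]:
    state = apply_event(state, event)
print(state)  # shipped
```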
If for no other reason, I'm glad I took the time to read about data structures and algorithms simply to be able to picture novel problems a little differently, especially combinatorial problems and graph problems. Graph theory is no longer a synonym for "scary".

Under what circumstances is it reasonable to use a vendor-specific API?

I'm working with WebSphere Portal and Oracle. On two occasions last week, I found that the most direct way to code a particular solution involved using APIs provided by IBM and Oracle. However, I realize every line of code written using these vendor APIs renders us a little bit more locked in to these systems (particularly the portal, which we only implemented recently).
How would you decide whether the benefits of using a vendor API outweigh the costs of being tied (Bound? Shackled? Fettered? Pinioned?) to a particular product?
Obviously a very personal / specific question...
One thing to remember, however, is that you can offset some of the risks of being "held prisoner" by wrapping whatever proprietary API is supplied in your own defined API. In doing so you will not only decouple your own code from the 3rd party API, but you may also make it easier to write your own logic, by streamlining the 3rd party API down to the minimal set of features you desire. A small risk with this minimalist approach, however, is that you may sometimes miss out on some cool features supplied by the 3rd party library/API.
If other vendors are likely to provide similar functionality with a different API, you can define your own interface for the operations, and create a vendor-specific implementation. This helps you avoid lock-in.
Whether this is worthwhile depends on several factors: how much application code will depend on the API, how complex is the API (and the wrapper it will require), how likely are you to switch vendors.
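In code terms, the wrapping usually looks something like the hedged Python sketch below. The class and method names are invented for illustration (they are not any real IBM or Oracle API): application code imports only your interface, and a single module knows about the vendor.

```python
from abc import ABC, abstractmethod

class DocumentStore(ABC):
    """The only interface the rest of the application is allowed to import."""

    @abstractmethod
    def save(self, doc_id: str, payload: bytes) -> None: ...

    @abstractmethod
    def load(self, doc_id: str) -> bytes: ...

class VendorDocumentStore(DocumentStore):
    """Vendor-specific implementation; the proprietary calls live only here."""

    def __init__(self, vendor_client):
        # `vendor_client` stands in for whatever proprietary handle the vendor gives you.
        self._client = vendor_client

    def save(self, doc_id, payload):
        self._client.put_blob(doc_id, payload)  # hypothetical vendor call

    def load(self, doc_id):
        return self._client.get_blob(doc_id)  # hypothetical vendor call
```

Switching vendors then means writing one new implementation of the interface rather than touching every call site.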
The cost-benefit ratio of being tightly bound to a custom/vendor-specific API varies a lot depending on what you're working with and on who will use your product. Vendor lock-in may be an impediment for some customers and may go unnoticed by others. Perhaps you may find these guidelines useful when deciding what to do:
If your product will target a specific company only, and they're already bound to some technology stack because they have contracts that make them pay big bucks for support and there's no way they'll change that stack in the near or far future, you should go with the lock-in.
If your product will target a variety of companies in the same sector and you verify that most of them have the same kind of systems, then you may go with the lock-in for now because it's faster or easier, and later on you may adapt your code to the minority.
If you will target a broad range of customers, try to avoid lock-in. However, if you find that lock-in will play a fundamental role in your product's time-to-market, you may again use vendor APIs for your first customers and then adapt the code to the needs of the others, but only as they come.
This is a pretty subjective question. My take is that if you're using Oracle, and don't have immediate plans to move away from Oracle, I can't think of a good reason to deny yourself the power and flexibility of APIs provided by the vendor. Abstracting your application to the Nth degree in pursuit of some holy grail of portability often causes more trouble than it's worth, and definitely increases your cost. It can also lead to less-than-optimal performance. Even SQL has been enhanced and tweaked by every DB vendor to the point that you're probably going to be using some proprietary extension in a query at some point; may as well get all of the benefits while you're at it.

Benefits of cross-platform development?

Are there benefits to developing an application on two or more different platforms? Does using a different compiler on even the same platform have benefits?
Yes, especially if you plan to distribute your code for multiple platforms.
But even if you don't, cross-platform development is a form of future-proofing; if it runs on multiple (diverse) platforms today, it's more likely to run on future platforms than something that was tuned, tweaked, and specialized to work on a version 7.8.3 clean install of vendor X's Q-series boxes (patch level 1452) and nothing else.
There seems to be a benefit in finding, and simply preventing, bugs by using a different compiler and a different OS. Different CPUs can pin down endianness issues early. There is pain at the GUI level, though, if you want to stay native at that level.
Short answer: Yes.
Short of cloning a disk, it is almost impossible to make two systems exactly alike, so you are going to end up running on "different platforms" whether you meant to or not. By specifically confronting and solving the "what if system A doesn't do things like B?" problem head on you are much more likely to find those key assumptions your code makes.
That said, I would say you should get a good chunk of your base code working on system A, and then take a day (or a week or ...) and get it running on system B. It can be very educational.
My education came back in the 80's when I ported a source level C debugger to over 100 flavors of U*NX. Gack!
Are there benefits to developing an application on two or more different platforms?
If this is production software, the obvious reason is the lure of a larger client base. Your product's appeal is magnified the moment the client hears that you support multiple platforms. Remember, most enterprises do not use a single OS or even a single version of the OS. It is fairly typical to find one section using Windows, another using Macs, and a smaller one some flavor of Linux.
It is also often the case that customizing a product for a single platform is far more tedious than making it run on multiple platforms. The law of diminishing returns kicks in before you know it.
Of course, all of this makes little sense, if you are doing customization work for an existing product for the client's proprietary hardware. But even then, keep an eye out for the entire range of hardware your client has in his repertoire -- you never know when he might ask for it.
Does using a different compiler on even the same platform have benefits?
Yes, again. Different compilers implement different extensions. See to it that you are not dependent on a particular version of a particular compiler.
Further, there may be a bug or two in the compiler itself. Using multiple compilers helps sort these out.
I have further seen bits of a (cross-platform) product using two different compilers -- one was used in those modules where floating-point manipulation required a very high level of accuracy. (It's been a while since I've heard of anyone else doing that, but ...)
I've ported a large C++ program, originally Win32, to Linux. It wasn't very difficult. Mostly dealing with compiler incompatibilities, because the MS C++ compiler at the time was non-compliant in various ways. I expect that problem has mostly gone away now (until C++0x features start gradually appearing). I also wrote a simple platform abstraction library to centralize the platform-specific code in one place. It depends to what extent you are dependent on services from the OS that would be hard to mimic on a new platform.
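The "platform abstraction library" idea translates to any language. As a hedged, Python-flavoured sketch (the module and function are invented for illustration), the point is simply that exactly one place in the codebase is allowed to ask which OS it is running on:

```python
# platform_abstraction.py -- the only module that inspects sys.platform.
import subprocess
import sys

def open_in_default_browser(url: str) -> None:
    """Launch the system browser in a platform-specific way."""
    if sys.platform.startswith("win"):
        subprocess.run(["cmd", "/c", "start", "", url], check=False)
    elif sys.platform == "darwin":
        subprocess.run(["open", url], check=False)
    else:
        subprocess.run(["xdg-open", url], check=False)
```

Everything else calls open_in_default_browser() and never mentions the platform, which is what keeps a later port contained.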
You don't have to build portability in from the ground up. That's why "porting" is often described as an activity you can perform in one shot after an initial release on your most important platform. You don't have to do it continuously from the very start. Purely for economic reasons, if you can avoid doing work that may never pay off, obviously you should. The cost of porting later on, when really necessary, turns out to be not that bad.
Mostly, there is an existing platform that the application is written for (custom software). But you reach more developers (on both platforms) if you decide to use a platform-independent language.
Also, products (standard software) for SMEs can be sold better if they run on different platforms! You can gain access to both markets, Windows & Linux! (and Mac OS X and so on...)
Big companies mostly buy only hardware that is supported/certified by the product vendor, just to deploy the specified product.
If you develop on multiple platforms at the same time you get the advantage of being able to use different tools. For example, I once had a memory overwrite (I still swear I didn't need the +1 for the null byte!) that caused "free" to crash. I brought the code up to speed on Windows and found the overwrite in about 1 minute with Rational Purify... I had spent a week chasing it under Linux (valgrind might have found it... but I didn't know about it at the time).
Different compilers on the same or different platforms is, to me, a must as each compiler will report different things, and sometimes the report from one compiler about an error will be gibberish but the other compiler makes it very clear.
Using things like multiple databases while developing means you are much less likely to tie yourself to a particular database, which means you can swap out the database if there is a reason to do so. If you want to integrate something that uses Oracle into an existing infrastructure that uses SQL Server, for example, it can really suck - much better if the Oracle or SQL Server pieces can be moved to the other system (I know of some places that have 3 different databases for their financial systems... ick).
In general, always developing for two or three things means that the odds of you finding mistakes are better, and the odds of the system being more flexible are better.
On the other hand all of that can take time and effort that, at the immediate time, is seen as an unneeded expense.
Some platforms have really dreadful development tools. I once worked in an IB where, rather than use Sun's ghastly toolset, people developed code in VC++ and then ported to Solaris.

Redundancy vs dependencies: which is worse?

When working on developing new software I normally get stuck with the redundancy vs. dependencies problem. That is, either accept a 3rd party library that I take a huge dependency on, or code it myself, duplicating all the effort but reducing the dependencies.
I've recently been trying to come up with a metric for weighing up redundancy versus dependencies in the code. For the most part, I've concluded that reducing redundancy increases the dependencies in your code, and reducing the dependencies in your code increases redundancy, so the two very much counter each other out.
So my question is:
What's a good metric that you've used in the past, or use now, to weigh up dependencies versus redundancy in your code?
One thing I think is so important is that if you choose the dependencies route, you need the tool sets in place so you can quickly examine all the routines and functions that use a specified function. Without those tool sets, it seems like redundancy wins.
P.S. Following on from an article:
Article
I would definitely recommend reading Joel's essay on this:
"In Defense of Not-Invented-Here Syndrome"
For a dependency, the best metric I can think of would be "would the world grind to a halt if this disappeared". For example if the STL of C++ magically went away, tons of programs would stop working. If .Net or Java disappeared, our economy would probably take a beating due to the number of products that stopped working...
I would think in those terms. Of course many things are a shade of gray between "end of world" and "meh" if they disappeared. The closer the dependency is to potentially causing the end of the world if it disappeared, the more likely it is to be stable, have an active user base, have its issues well known, etc. The more users the better.
It's analogous to being a small consumer of some hardware component. Sometimes hardware goes obsolete. Why? Because no one uses it. If you're a small company and need to get a component for your product, you will pick what is commonly available--what the "big players" order in large quantities, hoping this means that (a) the component won't disappear, (b) the problems with the component are well known, (c) there's a large, informed user base and (d) it might cost less and be more readily available.
Sometimes though, you need that special flux-capacitor part and you take the risk that the company selling it to you might not care to keep producing flux-capacitors if you are only ordering 20 a year, and no one seems to care :). In this case, it might be worth developing your own flux capacitor instead of relying on that unreliable Doc Brown Inc. Just don't buy Plutonium from the Libyans.
If you've dealt in manufacturing something (especially when you're making far fewer than millions of them per year), you've had to deal with this problem. Software dependencies, I believe, need to be understood in very similar terms.
How to quantify this into a real metric? Roughly count how many people depend on something. If it's high, the risk of the dependency hurting you is much lower, and you can decide at what point the risk is too much.
I hadn't really considered the criteria I use to decide whether to employ a third party library or not before, but having thought about it, probably the following, in no particular order:
How widespread the library is (am I going to be able to find support if I need it)
How likely am I to need to do unexpected things with it (am I going to end up embarking on a three week mission to add the functionality I need?)
How much of it do I need to use (am I going to spend days learning the ins and outs just to make use of one feature)
How stable it appears to be (more trouble than it's worth?)
How stable the interface is (are things likely to change in the next year or two?)
How interesting is the problem (would I be personally better off implementing it myself? will I learn anything useful?)
A possible alternative is to use the external software if it provides a lot of value to your project, but hide it behind a simplified interface (one that is more consistent with your project).
This allows you to leverage the power of a third party library, but with much reduced complexity (and as such redundancy) in calling the library. The interface ensures that you don't let the specific style of the third party library bleed into your project and allows you to easily replace it with an internal implementation as and when you think that might be necessary.
A sign of when this is happening might be that the interface you want to support is hampered by the third party library.
The significant downside to this is that it does require extra development and add a certain maintenance impact (this increases with the amount of functionality you need from the library), but allows you to leverage the third party library without too much coupling and without all of your developers needing to understand it.
A good example of this would be the use of an object-relational mapper (Hibernate/NHibernate) behind a set of repositories or data access objects, or factories being implemented with a dependency injection framework.
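As a hedged Python sketch of that repository idea (the mapper calls are stand-ins, not real Hibernate/NHibernate API), callers depend on the repository interface, and only one class knows which mapper sits underneath; an in-memory implementation then doubles as a test double or an escape hatch if the library is ever dropped:

```python
from abc import ABC, abstractmethod

class CustomerRepository(ABC):
    """What the rest of the application codes against."""

    @abstractmethod
    def by_id(self, customer_id: int): ...

    @abstractmethod
    def add(self, customer) -> None: ...

class OrmCustomerRepository(CustomerRepository):
    """Thin adapter over a third-party mapper; its API never leaks past here."""

    def __init__(self, orm_session):
        self._session = orm_session  # e.g. whatever session object the mapper provides

    def by_id(self, customer_id):
        return self._session.get("Customer", customer_id)  # hypothetical mapper call

    def add(self, customer):
        self._session.save(customer)  # hypothetical mapper call

class InMemoryCustomerRepository(CustomerRepository):
    """Drop-in replacement for tests, or if the third-party library is replaced."""

    def __init__(self):
        self._customers = {}

    def by_id(self, customer_id):
        return self._customers.get(customer_id)

    def add(self, customer):
        self._customers[customer.id] = customer
```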

How to measure usability to get hard data?

There are a few posts on usability but none of them was useful to me.
I need a quantitative measure of usability of some part of an application.
I need to estimate it in hard numbers to be able to compare it with future versions (e.g. for reporting purposes). The simplest way is to count clicks and keystrokes, but this seems too simple (for example, is the cost of filling a text field simply the sum of typing all the letters? I guess it is more complicated).
I need some mathematical model for that so I can estimate the numbers.
Does anyone know anything about this?
P.S. I don't need links to resources about designing user interfaces. I already have them. What I need is a mathematical apparatus to measure existing applications interface usability in hard numbers.
Thanks in advance.
http://www.techsmith.com/morae.asp
This is what Microsoft used in part when they spent millions redesigning Office 2007 with the ribbon toolbar.
Here is how Office 2007 was analyzed:
http://cs.winona.edu/CSConference/2007proceedings/caty.pdf
Be sure to check out the references at the end of the PDF too, there's a ton of good stuff there. Look up how Microsoft did Office 2007 (regardless of how you feel about it), they spent a ton of money on this stuff.
The main ideas to approach this with are Effectiveness and Efficiency (and, in some cases, Efficacy). The basic points to remember are outlined on this webpage.
What you really want to look at doing is 'inspection' methods of measuring usability. These are typically more expensive to set up (both in terms of time, and finance), but can yield significant results if done properly. These methods include things like heuristic evaluation, which is simply comparing the system interface, and the usage of the system interface, with your usability heuristics (though, from what you've said above, this probably isn't what you're after).
More suited to your use, however, will be 'testing' methods, whereby you observe users performing tasks on your system. This is partially related to the point of effectiveness and efficiency, but can include various things, such as the "Think Aloud" concept (which works really well in certain circumstances, depending on the software being tested).
Jakob Nielsen has a decent (short) article on his website. There's another one, but it's more related to how to test in order to be representative, rather than how to perform the testing itself.
Consider measuring the time to perform critical tasks (using a new user and an experienced user) and the number of data entry errors for performing those tasks.
First you want to define goals: for example increasing the percentage of users who can complete a certain set of tasks, and reducing the time they need for it.
Then, get two cameras and a few users (5-10), give them a list of tasks to complete, and ask them to think out loud. Half of the users should use the "old" system, the rest should use the new one.
Review the tapes, measure the time it took, measure success rates, discuss endlessly about interpretations.
Alternatively, you can develop a system for bucket-testing -- it works the same way, though it makes it far more difficult to find out something new. On the other hand, it's much cheaper, so you can do many more iterations. Of course that's limited to sites you can open to public testing.
That obviously implies you're trying to get comparative data between two designs. I can't think of a way of expressing usability as a value.
You might want to look into the GOMS model (Goals, Operators, Methods, and Selection rules). It is a very difficult research tool to use in my opinion, but it does provide a "mathematical" basis to measure performance in a strictly controlled environment. It is best used with "expert" users. See this very interesting case study of Project Ernestine for New England Telephone operators.
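If you want a feel for how a GOMS-style estimate turns clicks and keystrokes into seconds, the Keystroke-Level Model variant assigns a standard time to each low-level operator and sums them. The operator times below are the commonly cited ballpark values, and the task breakdown is a made-up example, so treat the result as an estimate for comparing designs rather than a hard measurement:

```python
# Commonly cited KLM operator times, in seconds (ballpark values).
KLM_TIMES = {
    "K": 0.28,  # keystroke or button press (average typist)
    "P": 1.10,  # point at a target with the mouse
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def klm_estimate(operators):
    """Sum the standard times for a sequence of KLM operators."""
    return sum(KLM_TIMES[op] for op in operators)

# Invented task: click a text field, type a 10-character value,
# then move back to the mouse and click "Save".
task = ["M", "P", "K", "H"] + ["K"] * 10 + ["H", "M", "P", "K"]
print(f"estimated time: {klm_estimate(task):.1f}s")
```

This is exactly the kind of model where "counting clicks and keystrokes" stops being too simple: the M operators capture the fact that thinking costs more than typing.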
Measuring usability quantitatively is an extremely hard problem. I tackled this as a part of my doctoral work. The short answer is, yes, you can measure it; no, you can't use the results in a vacuum. You have to understand why something took longer or shorter; simply comparing numbers is worse than useless, because it's misleading.
For comparing alternate interfaces it works okay. In a longitudinal study, where users are bringing their past expertise with version 1 into their use of version 2, it's not going to be as useful. You will also need to take into account time to learn the interface, including time to re-understand the interface if the user's been away from it. Finally, if the task is of variable difficulty (and this is the usual case in the real world) then your numbers will be all over the map unless you have some way to factor out this difficulty.
GOMS (mentioned above) is a good method to use during the design phase to get an intuition about whether interface A is better than B at doing a specific task. However, it only addresses error-free performance by expert users, and only measures low-level task execution time. If the user figures out a more efficient way to do their work that you haven't thought of, you won't have a GOMS estimate for it and will have to draft one up.
Some specific measures that you could look into:
Measuring clock time for a standard task is good if you want to know what takes a long time. However, lab tests generally involve test subjects working much harder and concentrating much more than they do in everyday work, so comparing results from the lab to real users is going to be misleading.
Error rate: how often the user makes mistakes or backtracks. Especially if you notice the same sort of error occurring over and over again.
Appearance of workarounds: if your users are working around a feature, or taking a bunch of steps that you think are dumb, it may be a sign that your interface doesn't give them the tools they need to solve their problems.
Don't underestimate simply asking users how well they thought things went. Subjective usability is finicky but can be revealing.