In discussions for a next generation scientific data format a need for some kind of JSON-like data structures (logical grouping of fieldshas been identified. Additionally, it would be preferable to leverage an existing encoding instead of using a custom binary structure. For serialization formats there are many options. Any guidance or insight from those that have experience with these kinds of encodings is appreciated.
Requirements: In our format, data need to be packed in records, normally no bigger than 4096-bytes. Each record must be independently usable. The data must be readable for decades to come. Data archiving and exchange is done by storing and transmitting a sequence of records. Data corruption must only effect the corrupted records, leaving all others in the file/stream/object readable.
Priorities (roughly in order) are:
stability, long term archive usage
performance, mostly read
ability to store opaque blobs
size
simplicity
broad software (aka library) support
stream-ability, transmitted and readable as a record is generated (if possible)
We have started to look at Protobuf (Protocol Buffers RFC), CBOR (RFC) and a bit at MessagePack.
Any information from those with experience that would help us determine the best fit or, more importantly, avoid pitfalls and dead-ends, would be greatly appreciated.
Thanks in advance!
Late answer but: You may want to decide if you want a schema-based or self-describing format. Amazon Ion overview talks about some of the pros and cons of these design decisions, plus this other ION ( completely different from Amazon Ion ).
Neither of those fully meet your criteria, But these articles should list a few criteria you might want to consider. Obviously actually being a standard and being adopted are far higher guarantees of longevity than any technical design criteria
Your goal of recovery from data corruption almost certainly something that should be addressed in a separate architectural layer from the matter of encoding of the records. How many records to pack in to a blob/file/stream is really more related to how many records you can afford to sequentially read through before finding the one you might need.
An optimal solution to storage corruption depends on what kind of corruption you consider likely. For example, if you store data on spinning disks your best protection might be different from if you store data on tape. But the details of that are really not an application-level concern. It's better to abstract/outsource that sort of concern.
Modern cloud-based data storage services provide extremely robust protection against corruption, measured in the industry as "durability". For example, even the Microsoft Azure's lowest-cost storage option, Locally Redundant Storage (LRS), stores at least three different copies of any data received, and maintains at least that level of protection for as long as you want. If any copy gets damaged, another is made from one of undamaged ones ASAP. That results in an annual "durability" of 11 nines (99.999999999% durability), and that's the "low-cost" option at Microsoft. The normal redundancy plan, Geo Redundant Storage (GRS), offers durability exceeding 16 nines. See Azure Storage redundancy.
According to Wasabi, eleven-nines durability means that if you have 1 million files stored, you might lose one file every 659,000 years. You are about 411 times more likely to get hit by a meteor than losing a file.
P.S. I previously worked on the Microsoft Azure Storage team, so that's the service I know the best. However, I trust that other cloud-storage options (e.g. Wasabi and Amazon's S3) offer similar durability protection, e.g. Amazon S3 Standard and Wasabi hot storage are like Azure LRS: eleven nines durability. If you are not worried about a meteor strike, you can rest assured you these services won't lose your data anytime soon.
According to this question, a benchmark run on the same machine had very varying results.
I'm not asking about how to use microtime or whichever framework, but rather, how do you make sure that your benchmarks are not biased in any way? Any machine setup, software setup, process setup? Is there a way to make sure your benchmarks can be safely used as a reference?
Basically benchmarking is kind of like a scientific study, so the same rules apply. A benchmark is usually done to answer some kind of question, so start with formulating a good question. After that it is practice and experience to eliminate all the wrong bias.
Make sure you know and document the runtime environment in detail(e.g. switch off power management and other background tasks that might disturb measurements).
Make sure you repeat the experiment (benchmark run) often enough to get good and stable averages and document it.
Make sure you know what you are measuring (e.g. use a working set thats larger than all caches if you want to measure memory performance etc., or using as many threads as you have cores and so on).
In some cases this involves getting caches filled and datasets cached, in other cases you need to do the exact opposite. Depends on the question you want to answer with your benchmark.
It seems to me that any data tier consistency/integrity updates should almost always be handled by a trigger.
I've been told in the past they can reduce performance, but I'm not sure under what circumstances. On one hand I could see increased locking contention when further actions are chained by triggers, but it seems as though the aggregate performance should still be improved by reducing the need for multiple round-trip queries. One counterexample would be logging, which might be better handled asynchronously outside the application critical path for performance. It also seems that one would not want too much application-specific algorithmic complexity implemented in the data tier.
I've read the docs, FAQs, Forum and other sites and witnessed plenty of use cases, but haven't come across a discussion of best practices or anti-patterns.
Are there general rules-of-thumb or specific cases where triggers are not a good idea?
One of the most common mantras in computer science and programming is to never optimize prematurely, meaning that you should not optimize anything until a problem has been identified, since code readability/maintainability is likely to suffer.
However, sometimes you might know that a particular way of doing things will perform poorly. When is it OK to optimize before identifying a problem? What sorts of optimizations are allowable right from the beginning?
For example, using as few DB connections as possible, and paying close attention to that while developing, rather than using a new connection as needed and worrying about the performance cost later
I think you are missing the point of that dictum. There's nothing wrong with doing something the most efficient way possible right from the start, provided it's also clear, straight forward, etc.
The point is that you should not tie yourself (and worse, your code) in knots trying to solve problems that may not even exist. Save that level of extreme optimizations, which are often costly in terms of development, maintenance, technical debt, bug breeding grounds, portability, etc. for cases where you really need it.
I think you're looking at this the wrong way. The point of avoiding premature optimization isn't to avoid optimizing, it's to avoid the mindset you can fall into.
Write your algorithm in the clearest way that you can first. Then make sure it's correct. Then (and only then) worry about performance. But also think about maintenance etc.
If you follow this approach, then your question answers itself. The only "optimizations" that are allowable right from the beginning are those that are at least as clear as the straightforward approach.
The best optimization you can make at any time is to pick the correct algorithm for the problem. It's amazing how often a little thought yields a better approach that will save orders of magnitude, rather than a few percent. It's a complete win.
Things to look for:
Mathematical formulas rather than iteration.
Patterns that are well known and documented.
Existing code / components
IMHO, none. Write your code without ever thinking about "optimisation". Instead, think "clarity", "correctness", "maintainability" and "testability".
From wikipedia:
We should forget about small
efficiencies, say about 97% of the
time: premature optimization is the
root of all evil. Yet we should not
pass up our opportunities in that
critical 3%.
- Donald Knuth
I think that sums it up. The question is knowing if you are in the 3% and what route to take. Personally I ignore most optimizations until I at least get my code working. Usually as a separate pass with a profiler so I can make sure I am optimizing things that actually matter. Often times code simply runs fast enough that anything you do will have little or no effect.
If you don't have a performance problem, then you should not sacrifice readability for performance. However, when choosing a way to implement some functionality, you should avoid using code you know is problematic from a performance point of view. So if there are 2 ways to implement a function, choose the one likely to perform better, but if it's not the most intuitive solution, make sure you put in some comments as to why you coded it that way.
As you develop in your career as a developer, you'll simply grow in awareness of better, more reasonable approaches to various problems. In most cases I can think of,
performance enhancement work resulted in code that was actually smaller and simpler than some complex tangle that evolved from working through a problem. As you get better, such simpler, faster solutions just become easier and more natural to generate.
Update: I'm voting +1 for everyone on the thread so far because the answers are so good. In particular, DWC has captured the essence of my position with some wonderful examples.
Documentation
Documenting your code is the #1 optimization (of the development process) that you can do right from the get go. As projects grow, the more people you interact with and the more people need to understand what you wrote, the more time you will spend
Toolkits
Make sure your toolkit is appropriate for the application you're developing. If you're making a small app, there's no reason to invoke the mighty power of an Eclipse based GUI system.
Complilers
Let the compiler do the tough work. Most of the time, optimization switches on a compiler will do most of the important things you need.
System Specific Optimizations
Especially in the embedded world, gain an understanding of the underlying architecture of the CPU and system you're interacting with. For example, on a Coldfire CPU, you can gain large performance improvements by ensuring that your data lies on the proper byte boundary.
Algorithms
Strive to make access algorithms O(1) or O(Log N). Strive to make iteration over a list no more than O(N). If you're dealing with large amounts of data, avoid anything more than O(N^2) if it's at all possible.
Code Tricks
Avoid, if possible. This is an optimization in itself - an optimization to make your application more maintainable in the long run.
You should avoid all optimizations if the only belief that the code you are optimizing will be slow. The only code you should optimize is when you know it is slow (preferably through a profiler).
If you write clear, easy to understand code then odds are it'll be fast enough, and if it isn't then when you go to speed it up it should be easier to do.
That being said, common sense should apply (!). Should you read a file over and over again or should you cache the results? Probably cache the results. So from a high level architecture point of view you should be thinking of optimization.
The "evil" part of optimization is the "sins" that are committed in the name of making something faster - those sins generally result in the code being very hard to understand. I am not 100% sure this is one of them.. but look at this question here, this may or may not be an example of optimization (could be the way the person thought to do it), but there are more obvious ways to solve the problem than what was chosen.
Another thing you can do, which I recently did do, is when you are writing the code and you need to decide how to do something write it both ways and run it through a profiler. Then pick the clearest way to code it unless there is a large difference in speed/memory (depending on what you are after). That way you are not guessing at what is "better" and you can document why you did it that way so that someone doesn't change it later.
The case that I was doing was using memory mapped files -vs- stream I/O... the memory mapped file was significantly faster than the other way, so I wasn't concerned if the code was harder to follow (it wasn't) because the speed up was significant.
Another case I had was deciding to "intern" String in Java or not. Doing so should save space, but at a cost of time. In my case the space savings wasn't huge, and the time was double, so I didn't do the interning. Documenting it lets someone else know not to bother interning it (or if they want to see if a newer version of Java makes it faster then they can try).
In addition to being clear and straightforward, you also have to take a reasonable amount of time to implement the code correctly. If it takes you a day to get the code to work right, instead of the two hours it would have taken if you'd just written it, then you've quite possibly wasted time you could have spent on fixing the real performance problem (Knuth's 3%).
Agree with Neil's opinion here, doing performance optimizations in code right away is a bad development practice.
IMHO, performance optimization is dependent on your system design. If your system has been designed poorly, from the perspective of performance, no amount of code optimization will get you 'good' performance - you may get relatively better performance, but not good performance.
For instance, if one intends to build an application that accesses a database, a well designed data model, that has been de-normalized just enough, if likely to yield better performance characteristics than its opposite - a poorly designed data model that has been optimized/tuned to obtain relatively better performance.
Of course, one must not forget requirements in this mix. There are implicit performance requirements that one must consider during design - designing a public facing web site often requires that you reduce server-side trips to ensure a 'high-performance' feel to the end user. That doesn't mean that you rebuild the DOM on the browser on every action and repaint the same (I've seen this in reality), but that you rebuild a portion of the DOM and let the browser do the rest (which would have been handled by a sensible designer who understood the implicit requirements).
Picking appropriate data structures. I'm not even sure it counts as optimizing but it can affect the structure of your app (thus good to do early on) and greatly increase performance.
Don't call Collection.ElementCount directly in the loop check expression if you know for sure this value will be calculated on each pass.
Instead of:
for (int i = 0; i < myArray.Count; ++i)
{
// Do something
}
Do:
int elementCount = myArray.Count;
for (int i = 0; i < elementCount ; ++i)
{
// Do something
}
A classical case.
Of course, you have to know what kind of collection it is (actually, how the Count property/method is implemented). May not necessarily be costy.
If one has a peer-to-peer system that can be queried, one would like to
reduce the total number of queries across the network (by distributing "popular" items widely and "similar" items together)
avoid excess storage at each node
assure good availability to even moderately rare items in the face of client downtime, hardware failure, and users leaving (possibly detecting rare items for archivists/historians)
avoid queries failing to find matches in the event of network partitions
Given these requirements:
Are there any standard approaches? If not, is there any respected, but experimental, research? I'm familiar some with distribution schemes, but I haven't seen anything really address learning for robustness.
Am I missing any obvious criteria?
Is anybody interested in working on/solving this problem? (If so, I'm happy to open-source part of a very lame simulator I threw together this weekend, and generally offer unhelpful advice).
#cdv: I've now watched the video and it is very good, and although I don't feel it quite gets to a pluggable distribution strategy, it's definitely 90% of the way there. The questions, however, highlight useful differences with this approach that address some of my further concerns, and gives me some references to follow up on. Thus, I'm provisionally accepting your answer, although I consider the question open.
There are multiple systems out there with various aspects of what you seek and each making different compromises, including but not limited to:
Amazon's Dynamo: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
Kai: http://www.slideshare.net/takemaru/kai-an-open-source-implementation-of-amazons-dynamo-472179
Hadoop: http://hadoop.apache.org/core/docs/current/hdfs_design.html
Chord: http://pdos.csail.mit.edu/chord/
Beehive: http://www.cs.cornell.edu/People/egs/beehive/
and many others. After building a custom system along those lines, I let some of the building blocks out in open source form as well: http://code.google.com/p/distributerl/
(that's not a whole system, but a few libraries useful in building one)